arxiv:2402.01878

LiPO: Listwise Preference Optimization through Learning-to-Rank

Published on Feb 2
Submitted by akhaliq on Feb 6
Authors:

Abstract

Aligning language models (LMs) with curated human feedback is critical to control their behaviors in real-world applications. Several recent policy optimization methods, such as DPO and SLiC, serve as promising alternatives to the traditional Reinforcement Learning from Human Feedback (RLHF) approach. In practice, human feedback often comes in the format of a ranked list over multiple responses to amortize the cost of reading the prompt. Multiple responses can also be ranked by reward models or AI feedback. However, there is a lack of studies on directly fitting policies to a ranked list of responses. In this work, we formulate LM alignment as a listwise ranking problem and describe the Listwise Preference Optimization (LiPO) framework, where the policy can potentially learn more effectively from a ranked list of plausible responses given the prompt. This view draws an explicit connection to Learning-to-Rank (LTR), where most existing preference optimization work can be mapped to existing ranking objectives, especially pairwise ones. Following this connection, we examine ranking objectives that are not well studied for LM alignment, with DPO and SLiC as special cases when the list size is two. In particular, we highlight a specific method, LiPO-λ, which leverages a state-of-the-art listwise ranking objective and weights each preference pair in a more advanced manner. We show that LiPO-λ can outperform DPO and SLiC by a clear margin on two preference alignment tasks.
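
The abstract does not spell out the loss, so the following is only a hedged sketch of what a LiPO-λ-style objective could look like based on the description above: each response in a ranked list gets a DPO-style log-ratio score against a reference model, and every preference pair is weighted by a LambdaRank/NDCG-style delta. All function and variable names (`lipo_lambda_loss`, `policy_logps`, `labels`, the gain/discount choices) are illustrative assumptions, not the authors' implementation; the paper's own reference code is in its Appendix C.

```python
# Hedged sketch of a LiPO-lambda-style listwise loss (assumptions noted below),
# reconstructed from the abstract, not from the paper's Appendix C code.
import torch
import torch.nn.functional as F

def lipo_lambda_loss(policy_logps, ref_logps, labels, beta=0.1):
    """policy_logps, ref_logps: (B, K) summed log-probs of K responses per prompt.
    labels: (B, K) graded preference labels for each response (higher = better)."""
    # DPO-style implicit reward per response: scaled log-ratio vs. the reference model.
    s = beta * (policy_logps - ref_logps)                          # (B, K)

    # All pairwise score and label differences within each list.
    s_diff = s.unsqueeze(2) - s.unsqueeze(1)                       # (B, K, K)
    label_diff = labels.unsqueeze(2) - labels.unsqueeze(1)         # (B, K, K)
    pair_mask = (label_diff > 0).float()                           # keep only "i preferred over j" pairs

    # LambdaRank-style pair weights: |gain_i - gain_j| * |1/D(rank_i) - 1/D(rank_j)|,
    # with ranks taken from the current scores. The exact gain/discount is an assumption.
    gains = torch.pow(2.0, labels) - 1.0
    ranks = torch.argsort(torch.argsort(s, dim=1, descending=True), dim=1) + 1
    inv_discount = 1.0 / torch.log2(ranks.float() + 1.0)
    delta = torch.abs(gains.unsqueeze(2) - gains.unsqueeze(1)) * \
            torch.abs(inv_discount.unsqueeze(2) - inv_discount.unsqueeze(1))

    # Weighted pairwise logistic loss; with K = 2 and unit weights this reduces to DPO.
    pair_loss = -F.logsigmoid(s_diff) * delta * pair_mask
    return pair_loss.sum() / pair_mask.sum().clamp(min=1.0)
```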

Community

Is the code available?

Paper author

Please check Appendix C for sample code on computing the different loss functions.


Code, please.

The article is interesting, but without code and validation it remains theory.

The paper is also very similar to this one:
RS-DPO: A Hybrid Rejection Sampling and Direct Preference Optimization Method for Alignment of Large Language Models

https://huggingface.co./papers/2402.10038


