AI Papers Reader

Personalized digests of latest AI research

Preference Tuning with Human Feedback on Language, Speech, and Vision Tasks: A Survey

Deep learning models are increasingly capable of generating human-quality text, speech, and images, but these models can also exhibit undesirable behaviors, such as making factual errors, generating offensive content, or failing to follow instructions. To address these limitations, researchers have developed a process called preference tuning, which involves incorporating human feedback into the training process to align the model’s outputs with human preferences.

A new paper, “Preference Tuning with Human Feedback on Language, Speech, and Vision Tasks: A Survey,” by Genta Indra Winata, Hanyang Zhao, Anirban Das, Wenpin Tang, David D. Yao, Shi-Xiong Zhang, and Sambit Sahu, provides a comprehensive overview of the recent advancements in preference tuning and model alignment.

The paper is organized into three main sections:

  1. Introduction and Preliminaries: This section provides a brief introduction to reinforcement learning frameworks and preference tuning tasks. It also outlines the different modalities that are involved in preference tuning (language, speech, and vision), and the various datasets that are commonly used to train these models.
  2. Preference Tuning: This section surveys the main preference tuning approaches, including supervised fine-tuning (SFT), reward modeling, and reinforcement learning from human feedback (RLHF). Each approach is broken down by the datasets used, the models trained, and the optimization methods applied (a sketch of a typical reward-modeling loss follows this list).
  3. Online Alignment: This section explores the different online alignment methods, which involve continuously updating the model as new data becomes available. It covers several methods, such as:
    • Reinforcement Learning from Human Feedback (RLHF): This method uses a reward model to score the model’s outputs and then applies reinforcement learning to update the model’s parameters so that the expected reward is maximized.
    • Online Direct Preference Optimization (Online DPO): This method collects human preference data online and uses it to update the model’s parameters directly.
    • SFT-like methods: This section covers methods that resemble SFT but also incorporate aspects of preference tuning.
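
To make the reward-modeling component concrete, here is a minimal sketch of the pairwise Bradley–Terry loss commonly used to train reward models on human preference pairs; in RLHF, the resulting reward model is what the policy is then optimized against. The function name and the toy scores below are illustrative assumptions, not taken from the paper.

```python
import torch
import torch.nn.functional as F

def reward_model_loss(reward_chosen: torch.Tensor,
                      reward_rejected: torch.Tensor) -> torch.Tensor:
    """Pairwise Bradley-Terry loss: push the scalar reward of the
    human-preferred response above that of the dispreferred response."""
    return -F.logsigmoid(reward_chosen - reward_rejected).mean()

# Toy scalar rewards for a batch of three preference pairs (made-up values).
chosen = torch.tensor([1.2, 0.3, 2.0])
rejected = torch.tensor([0.4, 0.9, -0.5])
print(reward_model_loss(chosen, rejected))  # scalar loss to backpropagate
```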

The paper also discusses offline alignment methods, which train the model on a fixed dataset of human preference data; Direct Preference Optimization (DPO) is the best-known example, as sketched below. These methods are generally less computationally expensive than online methods, but they may adapt less readily to shifting preferences.
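
As a concrete illustration of the offline setting, the following is a minimal sketch of the standard DPO loss, which optimizes the policy directly on preference pairs using log-probability ratios against a frozen reference model, with no explicit reward model or RL loop. The inputs are per-example sequence log-probabilities; the values and the beta coefficient below are illustrative assumptions.

```python
import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps: torch.Tensor,
             policy_rejected_logps: torch.Tensor,
             ref_chosen_logps: torch.Tensor,
             ref_rejected_logps: torch.Tensor,
             beta: float = 0.1) -> torch.Tensor:
    """Direct Preference Optimization loss over a batch of preference pairs."""
    chosen_logratio = policy_chosen_logps - ref_chosen_logps
    rejected_logratio = policy_rejected_logps - ref_rejected_logps
    # Classify the preferred response via the scaled difference of log-ratios.
    return -F.logsigmoid(beta * (chosen_logratio - rejected_logratio)).mean()

# Toy sequence log-probabilities for two preference pairs (made-up values).
pc, pr = torch.tensor([-12.0, -9.5]), torch.tensor([-13.5, -8.0])
rc, rr = torch.tensor([-12.5, -9.0]), torch.tensor([-13.0, -8.5])
print(dpo_loss(pc, pr, rc, rr))
```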

The paper concludes with a discussion of open problems and promising directions for future research in preference tuning.

The survey is a valuable resource for researchers and practitioners interested in the latest advancements in preference tuning and model alignment. It offers a comprehensive overview of the field and highlights several promising research directions that could lead to more powerful and reliable deep learning models.