Accelerating Direct Preference Optimization with Prefix Sharing

Direct preference optimization (DPO) algorithms are a popular way to fine-tune large language models (LLMs) on preference data, which consists of pairs of responses to a shared prompt together with a label indicating which response is preferred. However, traditional DPO implementations can be computationally inefficient, especially for tasks with long shared prompts, because the model processes each shared prompt twice: once for each response in the pair.
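
To make the redundancy concrete, here is a minimal sketch of how a preference pair is laid out for standard DPO training. The token IDs are invented for illustration and would normally come from a tokenizer and a preference dataset:

```python
# Toy token IDs standing in for tokenizer output (purely illustrative).
prompt   = [101, 7592, 2129, 2079]   # shared prompt tokens
chosen   = [2023, 2003, 2204, 102]   # preferred response tokens
rejected = [2023, 2003, 2919, 102]   # dispreferred response tokens

# Standard DPO builds two full sequences per pair, so the shared prompt
# is encoded twice, once in each forward pass.
chosen_input   = prompt + chosen
rejected_input = prompt + rejected
```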

A new research paper titled “Accelerating Direct Preference Optimization with Prefix Sharing” proposes a technique called “prefix sharing” to eliminate this redundancy. The technique combines the chosen and rejected responses into a single sequence with a shared prefix, so the shared prompt is processed only once per pair.
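
Continuing the toy example above, a prefix-shared layout might look like the following sketch; the exact data layout in the authors' implementation may differ:

```python
# Prefix-sharing layout: the shared prompt appears once, followed by the
# chosen and then the rejected response, all in a single sequence.
shared_input = prompt + chosen + rejected

# Region lengths, used below to build the attention mask and to gather
# per-response log-probabilities for the DPO loss.
prompt_len, chosen_len, rejected_len = len(prompt), len(chosen), len(rejected)
```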

To prevent cross-response contamination, the authors use a custom block-sparse attention mask: the rejected response is blocked from attending to the chosen response, while both responses can still attend to the shared prompt.
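
The sketch below builds such a mask as a dense boolean tensor in PyTorch. The paper realizes the same pattern with an efficient block-sparse attention kernel rather than a dense mask, so treat this only as an illustration of the masking pattern:

```python
import torch

def prefix_sharing_mask(prompt_len: int, chosen_len: int, rejected_len: int) -> torch.Tensor:
    """Boolean attention mask (True = may attend) for a sequence laid out as
    [prompt | chosen | rejected]: causal everywhere, with the rejected
    region additionally blocked from attending to the chosen region."""
    total = prompt_len + chosen_len + rejected_len
    # Start from an ordinary causal (lower-triangular) mask.
    mask = torch.ones(total, total).tril().bool()
    # Block rejected-response rows from seeing chosen-response columns.
    chosen_start, chosen_end = prompt_len, prompt_len + chosen_len
    mask[chosen_end:, chosen_start:chosen_end] = False
    return mask

# Example: 4-token prompt, 3-token responses. The resulting mask can be
# passed as `attn_mask` to torch.nn.functional.scaled_dot_product_attention.
mask = prefix_sharing_mask(4, 3, 3)
```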

The researchers conducted extensive experiments showing that prefix sharing consistently improves training throughput and computational efficiency for DPO algorithms. The gains are largest on datasets with long shared prompts and high prefix-to-completion length ratios.

For example, in experiments on the Capybara dataset, a multi-turn chat dataset with long prompts, the authors observed a 1.42× improvement in training throughput from prefix sharing. The gain grows further when prefix sharing is combined with sequence packing, a technique that concatenates multiple training examples into a single fixed-length sequence to reduce wasted padding.
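
Sequence packing itself is independent of prefix sharing: it simply groups several (prefix-shared) examples into fixed-length buckets so that short sequences do not waste padding. Here is a minimal greedy sketch assuming a simple first-fit-decreasing strategy; the paper's exact packing scheme may differ:

```python
def pack_sequences(lengths: list[int], max_len: int) -> list[list[int]]:
    """Greedy first-fit-decreasing packing: group example indices into bins
    whose total token count stays within max_len."""
    order = sorted(range(len(lengths)), key=lambda i: lengths[i], reverse=True)
    bins, bin_totals = [], []
    for i in order:
        for b, total in enumerate(bin_totals):
            if total + lengths[i] <= max_len:
                bins[b].append(i)
                bin_totals[b] += lengths[i]
                break
        else:
            bins.append([i])
            bin_totals.append(lengths[i])
    return bins

# Example: pack four prefix-shared examples into 4096-token buckets.
print(pack_sequences([3000, 1200, 900, 700], max_len=4096))  # [[0, 2], [1, 3]]
```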

The authors also found that their prefix sharing approach is generalizable and can be used with other paired preference tuning methods, such as reinforcement learning from human feedback (RLHF).

Overall, the paper presents a valuable contribution to the field of preference optimization. It introduces a simple yet effective technique for increasing training throughput, making DPO training more efficient and accessible across a wider range of applications and model sizes, which should help lower the cost of developing high-quality LLMs.