
LLMs Can Now Generate Code Better Than Ever, Thanks To This New Technique

Large Language Models (LLMs) are rapidly changing the world, but they still struggle with complex tasks like code generation. One major issue is the imperfect reward model used in reinforcement learning from human feedback (RLHF): this model, trained on human preference data, scores the quality of each generated response, and its inaccuracies feed noisy signals into training. A new technique called Policy Filtration for Proximal Policy Optimization (PF-PPO) aims to address this problem.

PF-PPO is based on the observation that even an inaccurate reward model can be reliable in specific score regions. For example, it might give one response a score of 0.8 (relatively good) and another a score of 0.5 (mediocre). The researchers found that the reward model's judgments are generally more trustworthy at the extremes: responses it scores very high or very low are usually judged correctly, while mid-range scores are much noisier.
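To make the observation concrete, here is a minimal, self-contained sketch; it is not from the paper, and the scores and pass/fail outcomes are made up. It buckets reward-model scores and checks how often the responses in each bucket actually pass their unit tests. The pattern PF-PPO relies on is that the high and low buckets track ground truth well, while the middle bucket is close to a coin flip:

```python
from collections import defaultdict

# Illustrative (reward_score, passed_unit_tests) pairs; in practice these
# would come from scoring sampled code responses and running their tests.
scored = [
    (0.92, True), (0.85, True), (0.80, True), (0.78, False),
    (0.55, True), (0.50, False), (0.45, False), (0.40, True),
    (0.20, False), (0.15, False), (0.10, False), (0.05, False),
]

def bucket(score):
    """Assign a score to a high / mid / low region."""
    if score >= 0.7:
        return "high"
    if score >= 0.3:
        return "mid"
    return "low"

outcomes = defaultdict(list)
for score, passed in scored:
    outcomes[bucket(score)].append(passed)

for name in ("high", "mid", "low"):
    results = outcomes[name]
    rate = sum(results) / len(results)
    print(f"{name}-score bucket: {rate:.0%} pass rate over {len(results)} responses")

# Expected output with the made-up data above:
#   high-score bucket: 75% pass rate -> high scores are trustworthy
#   mid-score bucket: 50% pass rate  -> a coin flip, unreliable
#   low-score bucket: 0% pass rate   -> low scores are trustworthy too
```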

PF-PPO leverages this insight by filtering out responses with mediocre scores during training. By keeping only the responses whose scores fall in the reliable regions, the policy is optimized against a cleaner reward signal, which ultimately improves the LLM's performance. A sketch of this filtering step follows.
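Here is a minimal sketch of how such a filtering step could slot into a PPO training loop. This is not the authors' implementation; `policy.generate`, `reward_model.score`, and `ppo_update` are hypothetical stand-ins for whatever RLHF stack you use, and keeping only the top-scored samples is the simplest possible filtering strategy:

```python
def filtered_rollouts(prompts, policy, reward_model,
                      samples_per_prompt=5, keep_fraction=0.4):
    """Sample several candidate responses per prompt, score them with the
    reward model, and keep only the highest-scored ones -- the region
    where the reward signal is most reliable."""
    kept = []
    for prompt in prompts:
        candidates = [policy.generate(prompt) for _ in range(samples_per_prompt)]
        ranked = sorted(candidates,
                        key=lambda response: reward_model.score(prompt, response),
                        reverse=True)
        n_keep = max(1, int(len(ranked) * keep_fraction))
        # Mediocre- and low-scored responses never reach the PPO update.
        kept.extend((prompt, response) for response in ranked[:n_keep])
    return kept

# Hypothetical training loop: only filtered rollouts drive the policy update.
# for prompt_batch in prompt_loader:
#     rollouts = filtered_rollouts(prompt_batch, policy, reward_model)
#     ppo_update(policy, rollouts, reward_model)
```

Because several candidates are drawn per prompt, the filter trades extra sampling compute for a cleaner training signal. A variant of this idea could also retain the lowest-scored responses alongside the best ones, since, per the observation above, very low scores are reliable as well.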

The research paper, titled “Policy Filtration in RLHF to Fine-Tune LLM for Code Generation” and available on arXiv, provides extensive experimental evidence for the effectiveness of PF-PPO. The researchers fine-tuned LLMs with 7 billion parameters, demonstrating that PF-PPO significantly outperforms existing techniques on code generation tasks. For example, PF-PPO achieved state-of-the-art results on the HumanEval and MBPP benchmarks, both widely used to evaluate code generation. The researchers also built a new, more challenging LeetCode Contest benchmark, and PF-PPO performed strongly there as well.

The research suggests that PF-PPO represents a promising approach to improving code generation with LLMs. By better utilizing human feedback, LLMs can generate even more impressive code and further accelerate the progress of software development.