AI Papers Reader

Personalized digests of latest AI research

String Matching Makes a Comeback: BLEU Shows Surprising Power in Aligning Language Models

Researchers at the University of Maryland, College Park, Kensho, and Lambda AIO have discovered that a simple, often-dismissed metric called BLEU (Bilingual Evaluation Understudy) can be surprisingly effective in aligning large language models (LLMs) to follow instructions. This finding could significantly reduce the cost and complexity of LLM alignment, a crucial step in ensuring these models are helpful and harmless.

Traditionally, aligning LLMs has relied on reinforcement learning guided by reward models, which are trained on vast amounts of human-labeled preference data. This process is expensive and computationally demanding. The new research, however, suggests that a readily available, reference-based metric like BLEU can achieve comparable results.

BLEU, originally designed for evaluating machine translation, measures the overlap between a generated text and one or more reference texts. It is essentially a string-matching algorithm: it counts how many words and short phrases (n-grams) in the generated text also appear in the references. It has long been considered too simplistic to capture the nuances of open-ended language generation.
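
To make the string-matching idea concrete, here is a minimal, self-contained sketch of sentence-level BLEU: clipped n-gram precisions up to 4-grams, combined by a geometric mean and scaled by a brevity penalty. Production implementations such as sacrebleu add smoothing and other refinements omitted here.

```python
from collections import Counter
import math

def ngram_counts(tokens, n):
    """Multiset of n-grams appearing in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def bleu(candidate: str, reference: str, max_n: int = 4) -> float:
    """Simplified sentence-level BLEU: geometric mean of clipped n-gram
    precisions (n = 1..4) times a brevity penalty. Single reference, no smoothing."""
    cand, ref = candidate.split(), reference.split()
    precisions = []
    for n in range(1, max_n + 1):
        cand_ngrams = ngram_counts(cand, n)
        ref_ngrams = ngram_counts(ref, n)
        # Clip each candidate n-gram count by its count in the reference.
        overlap = sum(min(count, ref_ngrams[g]) for g, count in cand_ngrams.items())
        precisions.append(overlap / max(sum(cand_ngrams.values()), 1))
    if min(precisions) == 0:
        return 0.0  # real BLEU implementations smooth zero counts instead
    geo_mean = math.exp(sum(math.log(p) for p in precisions) / max_n)
    # Brevity penalty discourages overly short candidates.
    brevity = min(1.0, math.exp(1 - len(ref) / max(len(cand), 1)))
    return brevity * geo_mean

print(bleu("the cat sat on the mat", "the cat sat on the mat"))  # 1.0
print(bleu("the cat sat on a mat", "the cat sat on the mat"))    # ~0.54
```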

The researchers found that BLEU, when combined with a clever training strategy, can guide LLMs to produce more helpful, factual, and well-formatted responses. They developed a method called BLEUBERI, which first identifies challenging instructions, those where the LLM's initial responses score poorly under BLEU against high-quality reference outputs. It then uses reinforcement learning to optimize the LLM's responses, with BLEU itself serving as the reward function.
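
The reward computation itself is cheap. Below is a minimal sketch, assuming the sacrebleu package and hypothetical function names (an illustration, not the authors' released implementation): each sampled completion is scored against the reference response, and a group of samples for the same prompt is scored together, the shape of reward a GRPO-style trainer would consume.

```python
import sacrebleu  # pip install sacrebleu

def bleu_reward(completion: str, references: list[str]) -> float:
    """Hypothetical reward: sentence-level BLEU of one sampled completion
    against the reference response(s), rescaled from 0-100 to [0, 1]."""
    return sacrebleu.sentence_bleu(completion, references).score / 100.0

def group_rewards(completions: list[str], references: list[str]) -> list[float]:
    """Score a group of sampled completions for the same prompt, as a
    group-based RL trainer would before computing relative advantages."""
    return [bleu_reward(c, references) for c in completions]
```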

To see why BLEU can work here, consider an instruction that asks the LLM to translate a sentence into both Ukrainian and English. A good response includes both translations under clear headers. Because the reference output contains the headers "Ukrainian" and "English" along with the correct translations, BLEU ends up rewarding responses that match both the formatting and the content, as the sketch below illustrates.
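
A rough illustration with sacrebleu (the translation pair below is an invented stand-in, not an example taken from the paper):

```python
import sacrebleu  # pip install sacrebleu

# Hypothetical reference output for: "Translate 'Hello, world!' into Ukrainian and English."
reference = "Ukrainian: Привіт, світе!\nEnglish: Hello, world!"

well_formatted = "Ukrainian: Привіт, світе!\nEnglish: Hello, world!"
missing_headers = "Привіт, світе! Hello, world!"

# More n-gram overlap with the reference (headers plus translations) means a higher reward.
print(sacrebleu.sentence_bleu(well_formatted, [reference]).score)   # 100.0 (exact match)
print(sacrebleu.sentence_bleu(missing_headers, [reference]).score)  # much lower
```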

The researchers tested BLEUBERI on challenging instruction-following benchmarks such as Arena-Hard and WildBench, using several base language models, including Llama-3.1-8B and Qwen2.5-7B. BLEUBERI-trained models were competitive with, and sometimes exceeded, models trained with traditional reward-model-guided reinforcement learning. Human evaluations also found BLEUBERI outputs on par in quality with those from reward-model-aligned models, and even more factually grounded.

“Overall, we show that given access to high-quality reference outputs (easily obtained via existing instruction-following datasets or synthetic data generation), string matching-based metrics are cheap yet effective proxies for reward models during alignment,” the researchers state.

This research has significant implications for the field of LLM alignment. By demonstrating the effectiveness of BLEU as a reward signal, it offers a more accessible and cost-effective alternative to traditional methods, potentially democratizing access to aligned LLMs. The code and data for BLEUBERI are publicly available, enabling other researchers to explore and build upon these findings.