Answer Matching: A New Standard for Evaluating Language Models

New research suggests that traditional multiple-choice tests may not be the best way to evaluate the capabilities of advanced language models (LMs). Instead, a method called “answer matching,” in which a separate LM assesses whether a freely generated response correctly answers the question, aligns with human judgment significantly better than multiple-choice grading.

For years, multiple-choice questions (MCQs) have been the go-to format for evaluating LMs. They are objective and easy to automate, making them a convenient choice for the rapidly expanding field of AI. However, the paper argues that MCQs often allow LMs to “cheat” by identifying the correct answer from patterns or cues within the choices themselves, rather than from a genuine understanding of the question. For example, a model might learn to pick the longest or most detailed option without fully grasping the question’s content. These “discriminative shortcuts” mean that MCQ performance may not accurately reflect an LM’s generative abilities.
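
To make the concern concrete, here is a minimal sketch of one way to probe for such shortcuts (an illustration, not the paper’s exact protocol): show a model only the answer choices, with the question removed, and check whether it still beats chance. The pick_option helper is hypothetical and stands in for whatever LM client is used.

```python
# Choices-only probe for discriminative shortcuts (illustrative sketch).
# `pick_option` is a hypothetical callable that asks an LM to choose among
# the given options while seeing no question at all.

def choices_only_accuracy(items, pick_option) -> float:
    """items: list of dicts with 'choices' (list[str]) and 'answer' (correct index)."""
    correct = 0
    for item in items:
        guess = pick_option(item["choices"])      # question deliberately withheld
        correct += int(guess == item["answer"])
    return correct / len(items)

# With four options, chance accuracy is 0.25; a score well above that suggests
# the choices themselves are leaking the answer.
```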

The researchers propose “answer matching” as a more robust alternative. In this method, an LM is given a question without any answer choices and asked to generate a free-form response. Then, another, more capable LM acts as a “matcher.” This matcher is provided with the original question, the LM’s generated response, and the correct answer. Its task is to determine if the generated response semantically or functionally matches the correct answer.
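
A minimal sketch of this protocol might look like the following, assuming a generic call_lm(model, prompt) helper (hypothetical here) that wraps whatever LM API is in use; the matcher prompt is my own illustration, not the paper’s.

```python
# Sketch of the answer-matching protocol: generate a free-form response,
# then let a matcher LM compare it against the reference answer.

def call_lm(model: str, prompt: str) -> str:
    """Hypothetical wrapper around an LM API; plug in your preferred client."""
    raise NotImplementedError

MATCHER_PROMPT = (
    "You are grading a free-form answer.\n"
    "Question: {question}\n"
    "Reference answer: {reference}\n"
    "Candidate response: {response}\n"
    "Does the candidate response convey the same answer as the reference?\n"
    "Reply with exactly one word: MATCH or NO_MATCH."
)

def answer_matching(question: str, reference: str,
                    candidate_model: str, matcher_model: str) -> bool:
    # 1. The evaluated model sees only the question -- no answer choices.
    response = call_lm(candidate_model, question)
    # 2. A (typically more capable) matcher LM judges whether the response
    #    matches the reference answer in meaning.
    verdict = call_lm(matcher_model, MATCHER_PROMPT.format(
        question=question, reference=reference, response=response))
    return verdict.strip().upper().startswith("MATCH")
```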

To validate their approach, the researchers conducted human evaluations on subsets of popular benchmarks like MMLU-Pro and GPQA-Diamond. They found that answer matching, even when the matcher is a relatively small, recent LM, achieved agreement with human graders that was nearly indistinguishable from human-to-human agreement. In contrast, traditional MCQ grading and “LLM-as-a-judge” methods that do not use a reference answer showed significantly lower alignment with human grading.
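
As a rough illustration of what “agreement with human graders” means, one could compare a chance-corrected agreement statistic such as Cohen’s kappa (my choice here; the paper may report a different statistic), computed once between two humans and once between the matcher and a human. The grade lists below are invented purely for demonstration.

```python
# Illustrative comparison of grader agreement using Cohen's kappa.
from sklearn.metrics import cohen_kappa_score

human_a = [1, 1, 0, 1, 0, 0, 1]   # correct/incorrect grades from annotator A
human_b = [1, 1, 0, 1, 0, 1, 1]   # grades from annotator B
matcher = [1, 1, 0, 1, 0, 0, 1]   # grades from the answer-matching LM

human_human   = cohen_kappa_score(human_a, human_b)
matcher_human = cohen_kappa_score(matcher, human_a)

# Answer matching is doing well when matcher_human is close to human_human.
print(f"human-human kappa: {human_human:.2f}, matcher-human kappa: {matcher_human:.2f}")
```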

One compelling example from the paper illustrates this difference: for a question about which cell components bacteria lack, one model identified the correct answer when shown the multiple-choice options, yet failed to generate a comparable response when given the question alone. This highlights how MCQs can mask underlying weaknesses in generative capabilities.

The study also found that answer matching can lead to different model rankings compared to MCQs. Models that excel in chat-based applications, for instance, often perform better when evaluated with answer matching, suggesting that this method better captures their strengths.
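
One simple way to quantify such ranking shifts (again, my own illustration with made-up scores) is a rank correlation between the two leaderboards:

```python
# Compare model rankings under MCQ vs. answer-matching evaluation.
# Scores are invented for illustration only.
from scipy.stats import spearmanr

models       = ["model_a", "model_b", "model_c", "model_d"]
mcq_scores   = [0.71, 0.68, 0.65, 0.60]   # hypothetical MCQ accuracies
match_scores = [0.52, 0.61, 0.58, 0.40]   # hypothetical answer-matching accuracies

rho, _ = spearmanr(mcq_scores, match_scores)
print("ranking under MCQ:            ", [m for _, m in sorted(zip(mcq_scores, models), reverse=True)])
print("ranking under answer matching:", [m for _, m in sorted(zip(match_scores, models), reverse=True)])
# rho close to 1 means both evaluations order models the same way;
# a lower value means answer matching reshuffles the leaderboard.
print(f"Spearman rank correlation: {rho:.2f}")
```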

The paper concludes that answer matching represents a significant step forward in evaluating LMs, offering a more accurate and reliable way to measure their generative abilities. This shift could influence how future AI benchmarks are designed, moving beyond the limitations of multiple-choice formats to better assess the true capabilities of advanced language models.