AI Papers Reader

Personalized digests of latest AI research

Rethinking Prompt Sensitivity: Are LLMs Really That Fickle?

New research suggests that the perceived sensitivity of large language models (LLMs) to how questions are phrased might be more of an evaluation issue than a flaw in the models themselves.

For years, developers and researchers have grappled with “prompt sensitivity” – the frustrating phenomenon where a slight rephrasing of a question can lead to drastically different answers or performance from an LLM. This has cast a shadow on the reliability of LLM evaluations, with the same models sometimes appearing to rank very differently depending on the exact wording of the prompt. For instance, a study mentioned in the paper found that simply changing answer options from letters (A, B, C) to numbers (1, 2, 3) completely flipped the ranking of four open-source models on a specific benchmark.

However, a new paper challenges this long-held assumption. The researchers argue that much of this reported prompt sensitivity isn’t an inherent weakness of LLMs but rather an artifact of the evaluation methods used. They propose a more robust evaluation strategy: using LLMs themselves as judges.

The researchers systematically tested seven LLMs, including popular models from the GPT and Gemini families, across six benchmarks, using a wide range of prompt templates – more than a dozen per benchmark – to see how performance varied.
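
In code, that evaluation grid amounts to scoring every model on every benchmark under every template. A rough Python sketch of the setup (the model and benchmark names are placeholders, and `run_eval` is a stub, not the paper's actual harness):

```python
from itertools import product
import random

MODELS = ["model_a", "model_b"]        # placeholder names; the paper evaluates 7 LLMs
BENCHMARKS = ["bench_1", "bench_2"]    # placeholder names; the paper uses 6 benchmarks
TEMPLATES = [f"template_{i}" for i in range(1, 13)]  # 12+ prompt templates per benchmark

def run_eval(model: str, benchmark: str, template: str) -> float:
    """Stub: return the model's accuracy on the benchmark under this template.

    Replace with a real evaluation call; here it just returns a random number.
    """
    return random.random()

# Score every (model, benchmark, template) combination so the spread of
# scores across templates can be compared later.
results = {
    (m, b, t): run_eval(m, b, t)
    for m, b, t in product(MODELS, BENCHMARKS, TEMPLATES)
}
```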

The core of their investigation lies in the difference between traditional “heuristic” evaluation methods and the “LLM-as-a-Judge” approach. Heuristic methods often rely on rigid rules, like checking for exact keyword matches or simple log-likelihood scores. This can penalize correct answers that are phrased differently, even if they convey the same meaning.
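
To make the contrast concrete, here is a minimal sketch of the kind of rigid exact-match scorer described above (the normalization rules are illustrative assumptions, not the paper's exact implementation):

```python
import re

def normalize(text: str) -> str:
    """Lowercase, drop punctuation, and collapse whitespace before comparing."""
    text = re.sub(r"[^\w\s]", "", text.lower())
    return re.sub(r"\s+", " ", text).strip()

def heuristic_exact_match(prediction: str, gold: str) -> bool:
    """Score as correct only if the normalized prediction equals the normalized gold answer."""
    return normalize(prediction) == normalize(gold)

# A correct but differently worded answer still scores as wrong:
print(heuristic_exact_match("Paris", "Paris"))              # True
print(heuristic_exact_match("The city of Paris", "Paris"))  # False
```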

Consider a simple question: “In what war did Rogers learn his battle knowledge?” A heuristic evaluation might fail to recognize “The Great War” as a correct answer if it is only programmed to look for “World War I.” An LLM judge, on the other hand, readily recognizes that the two phrases refer to the same thing.
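
A minimal sketch of the LLM-as-a-Judge alternative might look like this (the judge prompt and the `call_llm` helper are hypothetical, not the paper's actual setup):

```python
# Hypothetical judge prompt: ask a model whether two answers mean the same thing.
JUDGE_PROMPT = """You are grading a question-answering system.
Question: {question}
Reference answer: {gold}
Candidate answer: {prediction}
Does the candidate answer mean the same thing as the reference answer?
Reply with exactly YES or NO."""

def call_llm(prompt: str) -> str:
    """Hypothetical helper that sends `prompt` to a judge model and returns its reply."""
    raise NotImplementedError  # wire this up to whichever LLM API you use

def llm_judge(question: str, prediction: str, gold: str) -> bool:
    """Return True if the judge model deems the prediction equivalent to the gold answer."""
    reply = call_llm(JUDGE_PROMPT.format(question=question, gold=gold, prediction=prediction))
    return reply.strip().upper().startswith("YES")

# Such a judge can accept "The Great War" as equivalent to "World War I",
# where the exact-match heuristic above would mark it wrong.
```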

When the researchers switched to using LLMs as judges, they observed a significant reduction in performance variance across different prompts. In other words, the models’ performance became much more stable regardless of how the question was worded. For example, on one benchmark, the accuracy of a model called Gemma-2.0 ranged from a low of 0.25 to a high of 0.90 under traditional heuristic evaluation; when scored by an LLM judge, its accuracy varied across a range of only about 0.17, and its ranking relative to the other models became far more consistent.
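
The stability claim boils down to a simple quantity: for each model, how far its accuracy swings across prompt templates. A small sketch of that computation, using made-up numbers purely for illustration (not the paper's results):

```python
# Dummy accuracies of one model under three prompt templates, purely to
# illustrate how the spread is computed; these are not the paper's numbers.
heuristic_acc = {"template_1": 0.30, "template_2": 0.55, "template_3": 0.85}
judge_acc = {"template_1": 0.71, "template_2": 0.76, "template_3": 0.80}

def spread(acc_by_template: dict) -> float:
    """Max minus min accuracy across prompt templates."""
    return max(acc_by_template.values()) - min(acc_by_template.values())

print(f"heuristic spread: {spread(heuristic_acc):.2f}")  # large spread: prompt-sensitive
print(f"judge spread:     {spread(judge_acc):.2f}")      # small spread: far more stable
```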

The study also found that the LLM-as-a-Judge method produced rankings of models that were far more stable and reliable than those from heuristic evaluations. Spearman’s rank correlation, which measures how closely two rankings agree, improved dramatically with LLM judges, rising from around 0.31 to over 0.92 in some cases.
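
For reference, Spearman’s rank correlation simply compares how two evaluation setups order the same set of models. A quick sketch with scipy, using placeholder ranks rather than the paper's data:

```python
from scipy.stats import spearmanr

# Ranks assigned to the same seven models by two evaluation setups (1 = best).
# Placeholder values for illustration only.
ranks_setup_a = [1, 2, 3, 4, 5, 6, 7]
ranks_setup_b = [2, 1, 3, 5, 4, 6, 7]

rho, p_value = spearmanr(ranks_setup_a, ranks_setup_b)
print(f"Spearman's rho = {rho:.2f}")  # 1.0 would mean the two setups rank the models identically
```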

This suggests that LLMs are likely more robust to variations in prompt phrasing than previously thought. The paper concludes that prompt sensitivity is less of a fundamental flaw in LLMs and more of a byproduct of how we’ve been testing them. The authors encourage the wider adoption of LLM-as-a-Judge evaluations to get a clearer and more accurate picture of LLMs’ true capabilities.