A Single Character Can Make or Break Your LLM Evaluations

Researchers have discovered that the seemingly minor choice of which character separates the examples in a prompt can drastically affect the performance of large language models (LLMs), and can even be exploited to manipulate model rankings.

In the realm of artificial intelligence, particularly with the rise of sophisticated large language models (LLMs), how we communicate with these systems – through “prompts” – is crucial. A common technique to guide LLMs towards desired outputs involves providing a few examples within the prompt itself. This method, known as few-shot prompting, has been widely adopted in evaluating and using LLMs. However, a new study reveals a surprisingly overlooked factor that can profoundly influence LLM performance: the character used to separate these examples.
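To make the setup concrete, here is a minimal sketch of how such a few-shot prompt is typically assembled. The helper names (format_example, build_prompt) and the exact QUESTION/ANSWER format are illustrative assumptions rather than the paper's own code; the point is that a single delimiter string is all that separates one in-context example from the next.

```python
# Minimal sketch of few-shot prompt assembly (illustrative, not the paper's code).
# The `delimiter` argument is the single character under study: it is all that
# separates one in-context example from the next.

def format_example(question, choices, answer):
    """Render one in-context example in a QUESTION/ANSWER format."""
    options = ", ".join(f"{label}: {text}" for label, text in choices.items())
    return f"QUESTION: {question} {options} ANSWER: {answer}"

def build_prompt(examples, query, delimiter="\n"):
    """Join the worked examples with `delimiter`, then append the new query."""
    return delimiter.join(examples) + delimiter + query

examples = [
    format_example("What is the capital of Japan?",
                   {"A": "Kyoto", "B": "Osaka", "C": "Tokyo", "D": "Nagoya"}, "C"),
    format_example("What is 6*7?",
                   {"A": "36", "B": "42", "C": "48", "D": "67"}, "B"),
]
query = "QUESTION: What is 9*9? A: 18, B: 72, C: 81, D: 99 ANSWER:"

print(build_prompt(examples, query, delimiter="\n"))  # newline-separated examples
print(build_prompt(examples, query, delimiter="#"))   # hash-separated examples
```

Swapping the delimiter argument is the only change under study; everything else in the prompt stays identical.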

The research, published on September 30, 2025, found that changing the single character used to separate in-context examples, such as a comma, a newline, or a hash character (#), can lead to dramatic shifts in an LLM’s accuracy. For instance, on the MMLU benchmark, a popular test of LLM knowledge, performance differences of up to 29.4% were observed simply by altering the delimiter, a gap equivalent to several years of progress in LLM development.

Examples of the Impact

Imagine you are asking an LLM to identify the capital of France. You might provide it with a few examples like this:

Example 1: QUESTION: What is the capital of France? A: Nice, B: Paris, C: Bordeaux, D: Lyon ANSWER: B

Example 2: QUESTION: What is 22*3? A: 223, B: 62, C: 22, D: 66 ANSWER: D

The study demonstrates that if the character separating these examples were changed, say from a newline to a comma or a hash character, the LLM’s ability to correctly answer subsequent questions could be significantly impaired. This sensitivity is not limited to specific LLM families such as Llama or Qwen; it appears across leading models and a variety of benchmark tasks, including those testing knowledge, reasoning, and even simple dictionary lookups.
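To see how such a sensitivity would be measured, here is a rough sketch of a delimiter sweep. The query_model function is a hypothetical placeholder for whatever model or API is being evaluated, and the delimiter list and scoring are illustrative rather than the paper's exact protocol.

```python
# Sketch of a delimiter sweep: run the same benchmark questions with the
# in-context examples joined by different characters and compare accuracy.
# `query_model` is a hypothetical stand-in for the model or API under test.

DELIMITERS = ["\n", ",", "#", "!", ";"]

def query_model(prompt):
    """Placeholder: send `prompt` to an LLM and return its answer letter."""
    raise NotImplementedError("wire this up to your model or API of choice")

def accuracy_per_delimiter(examples, benchmark):
    """`benchmark` is a list of (query_text, gold_letter) pairs."""
    results = {}
    for delim in DELIMITERS:
        correct = 0
        for query, gold in benchmark:
            prompt = delim.join(examples) + delim + query
            if query_model(prompt).strip().upper().startswith(gold):
                correct += 1
        results[delim] = correct / len(benchmark)
    return results
```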

Manipulating Rankings

Perhaps the most striking finding is that by strategically choosing the delimiter character, researchers can manipulate the rankings of different LLMs. This means that a model that performs poorly with one delimiter could be made to appear superior by simply changing this single character. This raises serious questions about the reliability and fairness of current LLM evaluation protocols.
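A toy illustration of why this matters for leaderboards: if per-delimiter scores differ enough between models, the delimiter choice alone decides who ranks first. The accuracy numbers below are made-up placeholders, not results from the paper.

```python
# Toy illustration of ranking manipulation via delimiter choice.
# The accuracy numbers are made-up placeholders, not figures from the study.
scores = {
    "model_A": {"\n": 0.71, ",": 0.55, "#": 0.62},
    "model_B": {"\n": 0.66, ",": 0.64, "#": 0.59},
}

for delim in ["\n", ",", "#"]:
    ranking = sorted(scores, key=lambda m: scores[m][delim], reverse=True)
    print(repr(delim), "->", ranking)
# With a newline, model_A leads; with a comma, model_B overtakes it.
```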

Understanding the Mechanism

To delve deeper into why this seemingly minor detail has such a significant impact, the researchers investigated how LLMs process information. They found that effective delimiters help steer the model’s “attention” – its internal focus mechanism – towards the most relevant parts of the provided examples. When the right delimiter is used, the LLM is more likely to concentrate on the crucial “key tokens” in the input, leading to better performance. Conversely, a poor delimiter can scatter the model’s attention, hindering its ability to grasp the task.
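One way to probe a claim like this, sketched below under clearly stated assumptions, is to read the attention maps of an open model and measure how much of the final token’s attention lands on a chosen set of “key token” positions. The model name, the averaging over layers and heads, and the choice of key positions are all illustrative assumptions, not the paper’s methodology.

```python
# Rough probe of where a model's attention goes, using HuggingFace transformers.
# The model, the averaging scheme, and the "key" positions are illustrative
# assumptions, not the paper's analysis.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"  # any open causal LM that can return attention maps
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)
model.eval()

def attention_on_positions(prompt, key_positions):
    """Fraction of the last token's attention (averaged over layers and heads)
    that falls on the token positions in `key_positions`."""
    inputs = tokenizer(prompt, return_tensors="pt")
    with torch.no_grad():
        out = model(**inputs, output_attentions=True)
    # out.attentions: one (batch, heads, seq, seq) tensor per layer
    att = torch.stack(out.attentions).mean(dim=(0, 2))  # average layers and heads
    return att[0, -1, key_positions].sum().item()       # last token's attention mass

prompt = ("QUESTION: What is the capital of France? A: Nice, B: Paris, "
          "C: Bordeaux, D: Lyon ANSWER: B\nQUESTION: What is 22*3?")
print(attention_on_positions(prompt, key_positions=[0, 1, 2]))
```

Comparing this quantity across delimiters, with the same examples and query, is one concrete way to make the attention-steering claim measurable.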

Improving Robustness

Recognizing this “brittleness,” the study explored ways to make LLMs more resilient to such variations. One effective method is to explicitly tell the model which delimiter to expect within the prompt itself. By adding a sentence like, “The following examples are separated by [delimiter character],” LLMs show significantly improved and more consistent performance across different delimiters. The researchers also identified that newline characters (\n) and exclamation marks (!) tend to be strong performers on average, offering a practical recommendation for users and evaluators.
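As a practical sketch of that mitigation, declaring the delimiter amounts to one extra sentence at the top of the prompt. The exact wording of the declaration and the helper below are assumptions for illustration, not text quoted from the paper.

```python
# Sketch of the "declare the delimiter" mitigation: say up front which character
# separates the examples. The declaration wording here is an illustrative
# assumption, not a quote from the paper.

def build_prompt_with_declaration(examples, query, delimiter="\n"):
    readable = {
        "\n": "a newline character",
        ",": "a comma",
        "#": "a hash character",
        "!": "an exclamation mark",
    }.get(delimiter, repr(delimiter))
    declaration = f"The following examples are separated by {readable}."
    return declaration + "\n" + delimiter.join(examples) + delimiter + query
```

Combined with the study’s observation that newlines and exclamation marks are strong default choices, this gives evaluators a cheap way to reduce delimiter-induced variance.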

The findings underscore a critical gap in our understanding of how LLMs learn and process information. The study concludes that while LLM evaluations have standardized many aspects, the subtle yet powerful influence of formatting details like example delimiters remains largely unexplored, highlighting the need for more robust and nuanced evaluation methodologies.