LLMs are Sensitive to How You Ask the Question: A Framework for Understanding Prompt Sensitivity in Large Language Models
Large Language Models (LLMs) are powerful, but they can be surprisingly sensitive to how you phrase a question. This sensitivity can make it hard to evaluate their performance accurately and can lead to inconsistent results for users.
A new research paper titled "ProSA: Assessing and Understanding the Prompt Sensitivity of LLMs" introduces a novel framework called ProSA to understand and quantify prompt sensitivity in LLMs.
Imagine asking an LLM to solve a math problem. One user might ask, "Solve the following problem: {problem}. Include your answer after the line 'Final Answer:'". Another user might ask, "Please provide a solution to the following problem: {problem}". Even though the underlying task is the same, these slight variations in phrasing can drastically affect the LLM's performance.
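To make the idea concrete, here is a minimal sketch of what "the same instance under different prompts" looks like in practice. The templates and the `ask_llm` callable are hypothetical placeholders for illustration, not code from the paper:

```python
# Illustrative only: render the same math problem under several prompt
# templates and collect one answer per template. `ask_llm` is a placeholder
# for whatever model-calling function you already use.
PROMPT_TEMPLATES = [
    "Solve the following problem: {problem}. Include your answer after the line 'Final Answer:'",
    "Please provide a solution to the following problem: {problem}",
    "{problem}\nExplain your reasoning, then give the final answer.",
]

def answers_for_instance(problem: str, ask_llm) -> list[str]:
    """Return one answer per prompt template for a single question ("instance")."""
    return [ask_llm(template.format(problem=problem)) for template in PROMPT_TEMPLATES]
```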
ProSA, the framework proposed by the paper, addresses this problem by focusing on instance-level prompt variations. Instead of measuring sensitivity only at the dataset level, ProSA examines how LLMs respond to variations in the phrasing of the same question, or "instance," within the same dataset.
This new level of analysis allows ProSA to identify several important insights:
- Prompt Sensitivity Varies: LLMs exhibit different levels of prompt sensitivity across different tasks and datasets. For example, LLMs may be more robust to variations in phrasing when solving simple reading comprehension questions compared to complex math problems.
- Larger Models are More Robust: Larger language models generally exhibit greater robustness to prompt sensitivity than smaller ones. This suggests that increased model size can help alleviate sensitivity issues.
- Few-Shot Examples Help: Including a few worked examples of how to phrase the request and answer before asking the LLM a new question can significantly reduce prompt sensitivity (see the sketch after this list).
- Confidence Matters: Higher confidence in an LLM's response is correlated with greater robustness to prompt variations.
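As a rough illustration of the few-shot point above, here is a minimal sketch of prepending worked examples to a prompt. The example problems and the formatting are made up for illustration; the paper does not prescribe this exact layout:

```python
# Hypothetical few-shot prompt construction: worked examples are prepended so
# the model sees the expected phrasing and answer format before the new question.
FEW_SHOT_EXAMPLES = [
    ("What is 12 + 7?", "Final Answer: 19"),
    ("A train travels 60 km in 1.5 hours. What is its average speed in km/h?", "Final Answer: 40"),
]

def build_few_shot_prompt(problem: str) -> str:
    shots = "\n\n".join(f"Problem: {q}\n{a}" for q, a in FEW_SHOT_EXAMPLES)
    return f"{shots}\n\nProblem: {problem}\nFinal Answer:"
```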
ProSA also proposes a novel metric, PromptSensiScore (PSS), to quantify prompt sensitivity at the instance level. PSS is calculated by comparing the differences in an LLM's responses to different versions of the same question.
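The paper gives a precise definition of PSS; the sketch below is only an assumed, simplified reading of "compare responses to different versions of the same question." It scores each instance by how often pairs of prompt variants disagree (using per-instance answers like those collected above), then averages over instances:

```python
from itertools import combinations

def prompt_sensi_score(answers_per_instance: list[list[str]]) -> float:
    """Simplified, assumed PSS-style score: for each instance, the fraction of
    prompt-variant pairs whose answers differ; averaged over all instances.
    0.0 means fully consistent answers, 1.0 means every pair disagrees.
    Each instance needs answers from at least two prompt variants."""
    per_instance = []
    for answers in answers_per_instance:
        pairs = list(combinations(answers, 2))
        disagreements = sum(a != b for a, b in pairs)
        per_instance.append(disagreements / len(pairs))
    return sum(per_instance) / len(per_instance)
```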
The authors also investigate the underlying reasons for prompt sensitivity. They find that the model's decoding confidence, or its certainty in the answer, is a major factor. Higher confidence is generally correlated with lower sensitivity.
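The paper measures decoding confidence from the model's own output probabilities; the exact formulation is not reproduced here. As a hypothetical proxy, the sketch below turns per-token log-probabilities into a single confidence value via the geometric mean of token probabilities:

```python
import math

def decoding_confidence(token_logprobs: list[float]) -> float:
    """Assumed proxy for decoding confidence: geometric-mean probability of the
    generated answer tokens, computed from their log-probabilities.
    Expects a non-empty list, e.g. log-probs returned by the model API."""
    return math.exp(sum(token_logprobs) / len(token_logprobs))
```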
The work presented in this paper is a crucial step towards understanding and mitigating the challenges posed by prompt sensitivity in LLMs. The ProSA framework and the new insights into the underlying mechanisms of prompt sensitivity will help researchers and developers to build more robust and reliable LLMs in the future.