LLMs are Sensitive to How You Ask the Question: A Framework for Understanding Prompt Sensitivity in Large Language Models
Large Language Models (LLMs) are powerful, but they can be surprisingly sensitive to how you phrase a question. This sensitivity can make it hard to evaluate their performance accurately and can lead to inconsistent results for users.
A new research paper titled "ProSA: Assessing and Understanding the Prompt Sensitivity of LLMs" introduces a novel framework called ProSA to understand and quantify prompt sensitivity in LLMs.
Imagine asking an LLM to solve a math problem. One user might ask, "Solve the following problem: {problem}. Include your answer after the line 'Final Answer:'". Another user might ask, "Please provide a solution to the following problem: {problem}". Even though the underlying task is the same, these slight variations in phrasing can drastically affect the LLM's performance.
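To make the idea concrete, here is a minimal sketch of what "the same instance under different prompts" looks like in practice. The templates and the `ask_llm` callable are hypothetical placeholders for illustration, not code from the paper:

```python
# Illustrative only: render the same math problem under several prompt
# templates and collect one answer per template. `ask_llm` is a placeholder
# for whatever model-calling function you already use.
PROMPT_TEMPLATES = [
    "Solve the following problem: {problem}. Include your answer after the line 'Final Answer:'",
    "Please provide a solution to the following problem: {problem}",
    "{problem}\nExplain your reasoning, then give the final answer.",
]

def answers_for_instance(problem: str, ask_llm) -> list[str]:
    """Return one answer per prompt template for a single question ("instance")."""
    return [ask_llm(template.format(problem=problem)) for template in PROMPT_TEMPLATES]
```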
ProSA, the framework proposed by the paper, addresses this problem by focusing on instance-level prompt variations. Instead of measuring sensitivity only at the dataset level, ProSA examines how LLMs respond to variations in the phrasing of the same question, or "instance," within the same dataset.
This new level of analysis allows ProSA to identify several important insights:
- Prompt Sensitivity Varies: LLMs exhibit different levels of prompt sensitivity across different tasks and datasets. For example, LLMs may be more robust to variations in phrasing when solving simple reading comprehension questions compared to complex math problems.
- Larger Models are More Robust: Larger language models generally exhibit greater robustness to prompt sensitivity than smaller ones. This suggests that increased model size can help alleviate sensitivity issues.
- Few-Shot Examples Help: Including a few worked examples of how to phrase the request and answer before asking the LLM a new question can significantly reduce prompt sensitivity (see the sketch after this list).
- Confidence Matters: Higher confidence in an LLM's response is correlated with greater robustness to prompt variations.
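As a rough illustration of the few-shot point above, here is a minimal sketch of prepending worked examples to a prompt. The example problems and the formatting are made up for illustration; the paper does not prescribe this exact layout:

```python
# Hypothetical few-shot prompt construction: worked examples are prepended so
# the model sees the expected phrasing and answer format before the new question.
FEW_SHOT_EXAMPLES = [
    ("What is 12 + 7?", "Final Answer: 19"),
    ("A train travels 60 km in 1.5 hours. What is its average speed in km/h?", "Final Answer: 40"),
]

def build_few_shot_prompt(problem: str) -> str:
    shots = "\n\n".join(f"Problem: {q}\n{a}" for q, a in FEW_SHOT_EXAMPLES)
    return f"{shots}\n\nProblem: {problem}\nFinal Answer:"
```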
ProSA also proposes a novel metric, PromptSensiScore (PSS), to quantify prompt sensitivity at the instance level. PSS is calculated by comparing the differences in an LLM's responses to different versions of the same question.
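The paper gives a precise definition of PSS; the sketch below is only an assumed, simplified reading of "compare responses to different versions of the same question." It scores each instance by how often pairs of prompt variants disagree (using per-instance answers like those collected above), then averages over instances:

```python
from itertools import combinations

def prompt_sensi_score(answers_per_instance: list[list[str]]) -> float:
    """Simplified, assumed PSS-style score: for each instance, the fraction of
    prompt-variant pairs whose answers differ; averaged over all instances.
    0.0 means fully consistent answers, 1.0 means every pair disagrees.
    Each instance needs answers from at least two prompt variants."""
    per_instance = []
    for answers in answers_per_instance:
        pairs = list(combinations(answers, 2))
        disagreements = sum(a != b for a, b in pairs)
        per_instance.append(disagreements / len(pairs))
    return sum(per_instance) / len(per_instance)
```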
The authors also investigate the underlying reasons for prompt sensitivity. They find that the model's decoding confidence, or its certainty in the answer, is a major factor. Higher confidence is generally correlated with lower sensitivity.
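The paper measures decoding confidence from the model's own output probabilities; the exact formulation is not reproduced here. As a hypothetical proxy, the sketch below turns per-token log-probabilities into a single confidence value via the geometric mean of token probabilities:

```python
import math

def decoding_confidence(token_logprobs: list[float]) -> float:
    """Assumed proxy for decoding confidence: geometric-mean probability of the
    generated answer tokens, computed from their log-probabilities.
    Expects a non-empty list, e.g. log-probs returned by the model API."""
    return math.exp(sum(token_logprobs) / len(token_logprobs))
```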
The work presented in this paper is a crucial step towards understanding and mitigating the challenges posed by prompt sensitivity in LLMs. The ProSA framework and the new insights into the underlying mechanisms of prompt sensitivity will help researchers and developers to build more robust and reliable LLMs in the future.