LLMs are Sensitive to How You Ask the Question: A Framework for Understanding Prompt Sensitivity in Large Language Models

Large Language Models (LLMs) are powerful, but they can be surprisingly sensitive to how you phrase a question. This sensitivity can make it hard to evaluate their performance accurately and can lead to inconsistent results for users.

A new research paper titled "ProSA: Assessing and Understanding the Prompt Sensitivity of LLMs" introduces a novel framework called ProSA to understand and quantify prompt sensitivity in LLMs.

Imagine asking an LLM to solve a math problem. One user might ask, "Solve the following problem: {problem}. Include your answer after the line 'Final Answer:'". Another user might ask, "Please provide a solution to the following problem: {problem}". Even though the underlying task is the same, these slight variations in phrasing can drastically affect the LLM's performance.
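To make the setup concrete, here is a minimal sketch of how prompt variants for a single instance could be generated. The templates mirror the examples above; the function and variable names are illustrative, not taken from the paper.

```python
# Sketch: wrap one math "instance" in several prompt templates.
# Templates mirror the examples above; everything here is illustrative.
TEMPLATES = [
    "Solve the following problem: {problem}. Include your answer after the line 'Final Answer:'",
    "Please provide a solution to the following problem: {problem}",
]

def build_prompt_variants(problem: str) -> list[str]:
    """Return the same instance phrased under each prompt template."""
    return [template.format(problem=problem) for template in TEMPLATES]

if __name__ == "__main__":
    for prompt in build_prompt_variants("What is 17 * 24?"):
        print(prompt, end="\n\n")
```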

ProSA, the framework proposed in the paper, tackles this by focusing on instance-level prompt variations. Instead of comparing performance across different datasets, ProSA examines how LLMs respond to variations in the phrasing of the same question, or "instance," within the same dataset.

This finer-grained level of analysis allows ProSA to surface insights into how and when prompt sensitivity arises.

The paper also proposes a novel metric, PromptSensiScore (PSS), to quantify prompt sensitivity at the instance level. PSS is computed by comparing an LLM's responses across different phrasings of the same question and measuring how much they diverge.
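The paper gives the precise definition of PSS; as a rough illustration only, the sketch below scores each instance by how often answers from different prompt variants disagree and then averages over instances. The helper name and the toy answers are hypothetical.

```python
from itertools import combinations

def prompt_sensitivity_score(answers_per_instance: list[list[str]]) -> float:
    """Simplified PSS-style score: per instance, the fraction of
    prompt-variant pairs whose answers disagree, averaged over instances.
    (The paper's exact formula may differ; this is only an illustration.)"""
    instance_scores = []
    for answers in answers_per_instance:
        pairs = list(combinations(answers, 2))
        if not pairs:
            continue
        disagreements = sum(a != b for a, b in pairs)
        instance_scores.append(disagreements / len(pairs))
    return sum(instance_scores) / len(instance_scores) if instance_scores else 0.0

# Two instances, each answered under three prompt variants (toy data).
answers = [
    ["408", "408", "408"],  # stable across phrasings -> low sensitivity
    ["12", "15", "12"],     # flips with phrasing -> higher sensitivity
]
print(prompt_sensitivity_score(answers))  # (0.0 + 2/3) / 2 = 0.333...
```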

The authors also investigate the underlying reasons for prompt sensitivity. They find that the model's decoding confidence, or its certainty in the answer, is a major factor. Higher confidence is generally correlated with lower sensitivity.
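One common proxy for decoding confidence is the geometric-mean probability the model assigns to the tokens of its answer (the exponential of the mean token log-probability). The sketch below uses that proxy; the paper's exact confidence measure may differ, and the log-probabilities are made up.

```python
import math

def decoding_confidence(token_logprobs: list[float]) -> float:
    """Confidence proxy: exp of the mean token log-probability, i.e. the
    geometric-mean probability of the generated tokens."""
    if not token_logprobs:
        return 0.0
    return math.exp(sum(token_logprobs) / len(token_logprobs))

# Hypothetical per-token log-probabilities for two generated answers.
confident_answer = [-0.05, -0.02, -0.10]  # model is fairly sure of each token
uncertain_answer = [-1.20, -0.90, -2.10]  # probability mass is spread out
print(decoding_confidence(confident_answer))  # ~0.94
print(decoding_confidence(uncertain_answer))  # ~0.25
```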

The work presented in this paper is a meaningful step toward understanding and mitigating the challenges posed by prompt sensitivity in LLMs. The ProSA framework, together with its insights into the mechanisms behind prompt sensitivity, should help researchers and developers build more robust and reliable LLMs.