LLMs in RAG: A New Metric to Assess Trustworthiness and Techniques to Improve It

Large language models (LLMs) are increasingly used in retrieval-augmented generation (RAG) systems, where the model's role shifts from providing information itself to consolidating information from retrieved documents. However, LLMs are prone to "hallucinations," producing claims that the retrieved evidence does not support, which makes their use in RAG systems risky.
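
To make the setting concrete, the sketch below (not the paper's code; the function and prompt wording are illustrative) shows how a RAG system typically packages retrieved documents into the LLM's prompt, so the model's job becomes consolidating and citing evidence rather than recalling facts on its own.

```python
# Minimal sketch of RAG prompt assembly. Names and wording are
# illustrative assumptions, not the paper's implementation.

def build_rag_prompt(question: str, documents: list[str]) -> str:
    """Assemble a prompt asking the LLM to answer only from the
    retrieved documents, citing them by index, or to refuse."""
    doc_block = "\n".join(
        f"[{i + 1}] {doc}" for i, doc in enumerate(documents)
    )
    return (
        "Answer the question using only the documents below. "
        "Cite supporting documents as [1], [2], ... "
        "If the documents do not contain the answer, refuse.\n\n"
        f"Documents:\n{doc_block}\n\nQuestion: {question}\nAnswer:"
    )

prompt = build_rag_prompt(
    "Who wrote 'The Selfish Gene'?",
    ["Richard Dawkins published 'The Selfish Gene' in 1976.",
     "The book popularised the gene-centred view of evolution."],
)
print(prompt)
```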

To address this issue, researchers from the Singapore University of Technology and Design developed a new metric, TRUST-SCORE, to assess the trustworthiness of LLMs in RAG systems. TRUST-SCORE evaluates an LLM across four key dimensions (a toy sketch of how these components might combine follows the list):

  1. Grounded Refusals: The LLM's ability to answer questions that the retrieved documents support and to refuse those they do not.
  2. Exact Match Recall: The proportion of gold-standard claims that the LLM's response actually covers.
  3. Citation Recall: The extent to which the generated claims are supported by the documents they cite.
  4. Citation Precision: The fraction of cited documents that are actually relevant to the claims they are attached to.
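
As promised above, here is a minimal, illustrative aggregation that simply averages the four components, each scaled to [0, 1]; the paper's exact definitions and weighting may differ, and the refusal scorer below is an assumed F1-style formulation.

```python
# Toy aggregation of the four TRUST-SCORE dimensions. Illustrative only:
# the paper's exact formulas and weighting may differ.

def grounded_refusal_f1(should_refuse: list[bool], did_refuse: list[bool]) -> float:
    """F1-style score for refusing exactly the unanswerable questions
    (assumed formulation, not taken from the paper)."""
    tp = sum(s and d for s, d in zip(should_refuse, did_refuse))
    fp = sum((not s) and d for s, d in zip(should_refuse, did_refuse))
    fn = sum(s and (not d) for s, d in zip(should_refuse, did_refuse))
    if tp == 0:
        return 0.0
    precision, recall = tp / (tp + fp), tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

def trust_score(refusal: float, em_recall: float,
                citation_recall: float, citation_precision: float) -> float:
    """Average the four components, each assumed to lie in [0, 1]."""
    return (refusal + em_recall + citation_recall + citation_precision) / 4

print(trust_score(0.8, 0.6, 0.7, 0.75))  # -> 0.7125
```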

The researchers found that commonly used prompting techniques for improving trustworthiness, such as in-context learning, fail to adapt LLMs effectively for RAG tasks. They therefore propose TRUST-ALIGN, a framework that aligns LLMs toward a higher TRUST-SCORE. TRUST-ALIGN builds an alignment dataset of 19,000 question-document-response triplets in which responses are labeled as preferred (correct) or unpreferred (incorrect). Training LLMs on this dataset substantially improves their ability to produce trustworthy responses in RAG systems.
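
The paper's released data format is not reproduced here; the sketch below shows one plausible layout for such preference records, using a hypothetical helper, in the prompt/chosen/rejected shape that preference-tuning methods such as DPO commonly consume. How TRUST-ALIGN actually constructs and trains on its pairs may differ.

```python
# Hypothetical record layout for alignment data: each example pairs a
# question and retrieved documents with a preferred response (grounded,
# cited, or a correct refusal) and an unpreferred one (unsupported).
# This is an assumed format, not the paper's released schema.

def make_alignment_record(question: str, documents: list[str],
                          preferred: str, unpreferred: str) -> dict:
    doc_block = "\n".join(f"[{i + 1}] {d}" for i, d in enumerate(documents))
    prompt = f"Documents:\n{doc_block}\n\nQuestion: {question}\nAnswer:"
    return {"prompt": prompt, "chosen": preferred, "rejected": unpreferred}

record = make_alignment_record(
    question="When was the Eiffel Tower completed?",
    documents=["The Eiffel Tower was completed in 1889 for the World's Fair."],
    preferred="It was completed in 1889 [1].",
    unpreferred="It was completed in 1901.",
)
print(record["prompt"])
```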

The researchers evaluated this approach on the ASQA, QAMPARI, and ELI5 datasets. LLaMA-3-8b aligned with TRUST-ALIGN significantly outperformed other open-source LLMs of comparable size on TRUST-SCORE, with gains of 10.73% on ASQA, 29.24% on QAMPARI, and 14.88% on ELI5.

This work contributes a new metric for assessing the trustworthiness of LLMs in RAG systems and a framework for improving it. The findings underscore the importance of aligning LLMs for the RAG setting and of curbing hallucinations so that LLMs can be used reliably in such systems.