
Long-context LLMs can now cite their sources

Large language models (LLMs) are getting better at understanding and responding to complex questions, but they still struggle with hallucinations, where they invent information or cite sources incorrectly. A new paper by researchers at Tsinghua University and Zhipu AI proposes a novel approach called LongCite that enables LLMs to generate responses with fine-grained, sentence-level citations, making them more reliable and trustworthy.

The paper first introduces LongBench-Cite, a new benchmark for assessing LLMs’ performance on long-context question answering with citations (LQAC). LongBench-Cite reveals that existing LLMs struggle with this task, often producing citations that are irrelevant, incomplete, or too coarse-grained. The researchers also found that prompting models to produce citations alongside their answers via in-context learning generally yields lower-quality responses than vanilla long-context question answering, where the model simply answers without citing.

To address these shortcomings, the researchers propose CoF (Coarse to Fine), a pipeline for automatically constructing high-quality SFT (supervised fine-tuning) datasets for LQAC. CoF leverages off-the-shelf LLMs to generate long-context QA instances with precise sentence-level citations in four steps (a code sketch follows the list):

  1. Generating long-context QA instances: CoF uses the Self-Instruct method to generate queries and answers from a given long text material.
  2. Generating chunk-level citations: The LLM is then used to identify relevant chunks of text from the given material, providing coarse-grained citations for each statement in the answer.
  3. Extracting sentence-level citations: The LLM then extracts precise sentence-level citations from each chunk by identifying supporting sentences.
  4. Filtering: The researchers discard instances with too few citations, ensuring that the resulting responses are factually grounded in the source text.
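
To make the pipeline concrete, here is a minimal sketch of the four CoF stages in Python. The `llm()` helper, the prompt wording, and the character-based chunking are hypothetical stand-ins, not the authors’ actual implementation:

```python
import re

def llm(prompt: str) -> str:
    """Hypothetical wrapper around an off-the-shelf LLM API (stand-in only)."""
    raise NotImplementedError

def cof_pipeline(document: str, chunk_size: int = 512, min_citations: int = 1):
    # Step 1: Self-Instruct-style generation of a query and answer
    # from the long source document.
    query = llm(f"Read the document and write one question it answers:\n{document}")
    answer = llm(f"Document:\n{document}\n\nQuestion: {query}\nAnswer:")

    # Step 2: coarse, chunk-level citations -- for each statement in the
    # answer, ask which chunks of the document support it.
    chunks = [document[i:i + chunk_size] for i in range(0, len(document), chunk_size)]
    statements = re.split(r"(?<=[.!?])\s+", answer)
    coarse = {}
    for stmt in statements:
        reply = llm("Which chunk indices support this statement? Reply like '3,7'.\n"
                    + "\n".join(f"[{i}] {c}" for i, c in enumerate(chunks))
                    + f"\nStatement: {stmt}")
        coarse[stmt] = [int(i) for i in re.findall(r"\d+", reply) if int(i) < len(chunks)]

    # Step 3: fine, sentence-level citations -- extract the exact supporting
    # sentences from each cited chunk.
    fine = {}
    for stmt, chunk_ids in coarse.items():
        fine[stmt] = [(ci, llm(f"Quote the sentences in this chunk that support "
                               f"the statement.\nChunk: {chunks[ci]}\nStatement: {stmt}"))
                      for ci in chunk_ids]

    # Step 4: filtering -- drop instances whose answers carry too few
    # citations to count as factually grounded.
    if sum(len(v) for v in fine.values()) < min_citations:
        return None
    return {"query": query, "answer": answer, "citations": fine}
```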

Using CoF, the researchers constructed LongCite-45k, a large-scale SFT dataset for LQAC, containing 44,600 high-quality instances with contexts up to 128,000 tokens. They then fine-tuned two open-source long-context models, GLM-4-9B and Llama-3.1-8B, using LongCite-45k, resulting in LongCite-9B and LongCite-8B, respectively.
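
Fine-tuning on such a dataset amounts to standard supervised training on (context, query, cited answer) examples. Below is a minimal sketch using Hugging Face transformers; the dataset filename, field names, sequence length, and hyperparameters are illustrative assumptions rather than the paper’s actual training recipe (which handles contexts up to 128K tokens):

```python
from datasets import load_dataset
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          DataCollatorForLanguageModeling, Trainer, TrainingArguments)

base = "meta-llama/Meta-Llama-3.1-8B"                  # one of the two base models
tokenizer = AutoTokenizer.from_pretrained(base)
tokenizer.pad_token = tokenizer.eos_token              # Llama tokenizers ship without one
model = AutoModelForCausalLM.from_pretrained(base)

# Hypothetical JSONL with "context", "query", and "cited_answer" fields;
# the real LongCite-45k schema may differ.
raw = load_dataset("json", data_files="longcite_45k.jsonl", split="train")

def to_features(ex):
    prompt = f"{ex['context']}\n\nQuestion: {ex['query']}\nAnswer with citations:"
    return tokenizer(prompt + " " + ex["cited_answer"] + tokenizer.eos_token,
                     truncation=True, max_length=8192)  # far below 128K, for the sketch

train = raw.map(to_features, remove_columns=raw.column_names)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="longcite-sft", per_device_train_batch_size=1,
                           gradient_accumulation_steps=16, num_train_epochs=1,
                           learning_rate=1e-5, bf16=True),
    train_dataset=train,
    # mlm=False gives standard next-token (causal LM) labels with padding masked out.
    data_collator=DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm=False),
)
trainer.train()
```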

The evaluation results show that LongCite-9B and LongCite-8B outperform even advanced proprietary models like GPT-4o on LongBench-Cite. The two models deliver significantly better citation quality, with higher F1 scores and finer granularity, while also reaching state-of-the-art correctness on long-context question answering.
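
Citation quality here is an F1 over per-statement citation recall (do the cited sentences, taken together, support the statement?) and citation precision (is each individual citation relevant?). The aggregation might look like the sketch below, where `supports()` is a hypothetical judge standing in for the paper’s LLM-based judging:

```python
def citation_f1(statements, citations, supports):
    """Aggregate per-statement citation recall and precision into an F1 score.

    statements: list of answer statements.
    citations[i]: list of cited source sentences for statements[i].
    supports(evidence, statement): hypothetical judge returning True if the
    evidence entails the statement (an LLM or NLI model could stand in here).
    """
    recalls, precisions = [], []
    for stmt, cited in zip(statements, citations):
        if not cited:                       # an uncited statement scores zero
            recalls.append(0.0)
            precisions.append(0.0)
            continue
        # Recall: the cited sentences jointly support the statement, or not.
        recalls.append(1.0 if supports(" ".join(cited), stmt) else 0.0)
        # Precision: the fraction of individual citations that are relevant.
        precisions.append(sum(supports(c, stmt) for c in cited) / len(cited))
    r = sum(recalls) / len(recalls)
    p = sum(precisions) / len(precisions)
    return 2 * p * r / (p + r) if p + r else 0.0
```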

The researchers also found that SFT with citation information effectively reduces hallucinations and enables a more uniform utilization of context, leading to further improvements in response correctness. This suggests that SFT on LQAC data can significantly improve the reliability and trustworthiness of long-context LLMs.

LongCite is a significant step forward in the development of reliable and trustworthy long-context LLMs. By enabling LLMs to generate accurate responses with fine-grained citations, LongCite paves the way for more transparent and verifiable AI systems.

For example, imagine you’re researching a complex topic like the history of the internet. You could ask a LongCite-powered LLM a question like: “What were the major technological advancements that led to the development of the World Wide Web?” LongCite would not only provide a well-structured and informative answer, but it would also include citations for each sentence, allowing you to easily verify the information and explore the sources in more detail. This would help you avoid potential misinformation and build a deeper understanding of the subject.
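
Concretely, querying a citation-capable model and surfacing its citations might look like the sketch below. The checkpoint name follows the paper’s releases but should be verified on Hugging Face, and the prompt wording and the `<statement>...<cite>...</cite>` output markup are assumptions about the interface; consult the model card for the exact usage:

```python
import re
from transformers import AutoModelForCausalLM, AutoTokenizer

# Checkpoint name follows the paper's releases; verify on Hugging Face first.
model_id = "THUDM/LongCite-llama3.1-8b"
tok = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(model_id, trust_remote_code=True,
                                             device_map="auto")

context = open("history_of_the_internet.txt").read()   # your long source document
question = ("What were the major technological advancements that led to "
            "the development of the World Wide Web?")

prompt = f"{context}\n\nQuestion: {question}\nAnswer with sentence-level citations:"
inputs = tok(prompt, return_tensors="pt").to(model.device)
out = model.generate(**inputs, max_new_tokens=512)
answer = tok.decode(out[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True)

# Assumed markup: each statement wrapped in <statement>...</statement> with a
# <cite>...</cite> span listing the supporting source sentences or their indices.
for stmt, cite in re.findall(r"<statement>(.*?)<cite>(.*?)</cite>", answer, re.S):
    print(stmt.strip(), "->", cite.strip())
```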

The paper highlights the importance of citation quality in evaluating LLMs. The ability to generate accurate citations is not only crucial for ensuring the veracity of the information but also for making LLMs more transparent and accountable. This is particularly important in domains like law, finance, and healthcare, where trust and accuracy are paramount.

As LLMs become increasingly integrated into our lives, it is essential that they provide accurate and verifiable information. By making sentence-level citation a trainable capability, LongCite brings that goal closer and is likely to shape how future long-context systems are built and evaluated.