Large Language Models Struggle to Master Analogies, But Targeted Knowledge Helps

Large language models (LLMs), the powerful AI systems behind chatbots like ChatGPT, are remarkably adept at many tasks. However, a new study posted on arXiv reveals a surprising weakness: solving proportional analogies, a cornerstone of human cognitive ability. Researchers from the University of South Carolina, IIIT Hyderabad, Amazon, and Meta found that even the most advanced LLMs struggle with these tests, achieving accuracy far below human performance. Encouragingly, the study also demonstrates that carefully crafted prompts, enhanced with specific types of knowledge, can significantly boost the models’ abilities.

The study centers on proportional analogies, which are presented in the format A : B :: C : D (“A is to B as C is to D”). For example, the analogy “Oxygen : Gas :: Aluminum : ?” requires identifying the relationship between “Oxygen” and “Gas” (oxygen is a type of gas) and choosing the word that bears the same relationship to “Aluminum” (“Metal”). The researchers compiled a 15,000-question multiple-choice question answering (MCQA) dataset for evaluating LLM performance. This dataset is substantially larger than previously used benchmarks, offering a more robust evaluation.
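To make the setup concrete, here is a minimal, hypothetical sketch of what a single MCQA item in such a dataset might look like. The field names and the distractor options are illustrative assumptions, not the authors’ actual schema.

```python
# Hypothetical MCQA item for the proportional-analogy task.
# Field names and distractor choices are illustrative assumptions.
item = {
    "stem": ("Oxygen", "Gas"),                    # the A:B pair
    "query": "Aluminum",                          # the C term
    "choices": ["Metal", "Foil", "Ore", "Can"],   # candidate D terms
    "answer": "Metal",                            # the D sharing the A:B relation
}

# Render the analogy in the A : B :: C : ? format described above.
print(f"{item['stem'][0]} : {item['stem'][1]} :: {item['query']} : ?")
```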

Nine LLMs, spanning open-source and proprietary models such as GPT-3.5-Turbo, were tested, and the results were sobering. Even the most advanced models achieved only around 55% accuracy with the best prompting technique, a far cry from the near-perfect accuracy of humans. The simple zero-shot approach (presenting the analogy without any additional context) yielded even lower accuracy, highlighting the models’ inherent difficulty with the task.
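As a rough illustration, a zero-shot prompt for this task might look like the sketch below; the paper’s exact template is not reproduced here, so the wording is an assumption.

```python
# Minimal zero-shot prompt sketch; the paper's exact template may differ.
question = "Oxygen : Gas :: Aluminum : ?"
choices = ["Metal", "Foil", "Ore", "Can"]

zero_shot_prompt = (
    "Complete the proportional analogy by choosing the best option.\n"
    f"{question}\n"
    f"Options: {', '.join(choices)}\n"
    "Answer:"
)
print(zero_shot_prompt)
```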

However, the researchers found that how they prompted the LLMs made a considerable difference. Simple few-shot prompting (providing a few worked examples before the test question) brought some improvement, but a dramatically more successful approach was “targeted knowledge prompting,” in which the prompt explicitly states the semantic relationship (e.g., “is a type of”) between the first pair of terms. This strategy led to a significant jump in accuracy, underscoring the value of context tailored to the specific task. In contrast, providing general structured knowledge, such as definitions from WordNet or ConceptNet, did not yield similar improvements, suggesting that the information must be precisely targeted rather than merely related. Both prompt styles are sketched below.
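The sketch below contrasts a simple few-shot prompt with a targeted-knowledge prompt that spells out the A:B relation. Both templates are hypothetical wordings consistent with the paper’s description, not its exact prompts.

```python
# Hypothetical prompt templates; wording is assumed, not the authors'.
target = "Oxygen : Gas :: Aluminum : ?"
choices = "Metal, Foil, Ore, Can"

# Few-shot: prepend a worked example before the test question.
few_shot_prompt = (
    "Dog : Mammal :: Sparrow : ?\n"
    "Answer: Bird\n\n"
    f"{target}\n"
    f"Options: {choices}\n"
    "Answer:"
)

# Targeted knowledge: state the semantic relation of the A:B pair
# explicitly, so the model only has to apply it to the C term.
targeted_prompt = (
    "Hint: 'Oxygen' is a type of 'Gas'. Apply the same relation.\n"
    f"{target}\n"
    f"Options: {choices}\n"
    "Answer:"
)
print(targeted_prompt)
```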

The study’s findings matter for several reasons. First, they reveal a significant gap in the current capabilities of LLMs: while impressive in many areas, these models still struggle with tasks that are relatively straightforward for humans. Second, they highlight the power of effective prompting in shaping LLM performance; the right prompt, tailored to the task and infused with the appropriate knowledge, can unlock capabilities that are otherwise latent. Finally, they underscore the importance of continued research on LLM reasoning and problem-solving, especially where human intuition and background knowledge play a crucial role. Future work will likely focus on developing more sophisticated prompting strategies and training methods to improve LLMs’ analogical reasoning abilities.