New Language, Old Problems: LLMs Struggle with Explicit Linguistic Reasoning

Large Language Models (LLMs) have demonstrated remarkable abilities in understanding and generating human language, often achieving top scores on various benchmarks. However, a new study, “The Gold Medals in an Empty Room: Diagnosing Metalinguistic Reasoning in LLMs with Camlang,” suggests that this success may stem more from sophisticated pattern matching than genuine reasoning. Researchers have developed a novel constructed language, “Camlang,” to test LLMs’ ability to learn and apply grammatical rules and vocabulary from explicit instruction, mimicking how humans learn new languages.

The core of the research is Camlang, a constructed language designed to be typologically plausible (each of its features exists in some real-world language) yet novel in how those features are combined, ensuring that LLMs cannot rely on pre-existing knowledge from their training data. Camlang comes with a detailed grammar book and a bilingual dictionary, the same kind of explicit learning resources that human second-language learners use.

To evaluate LLMs, the researchers adapted the CommonsenseQA dataset into Camlang, creating “Camlang-CSQA-v0.” The task requires a model to first interpret the Camlang input correctly using the provided grammar and dictionary, and only then apply commonsense reasoning to answer the question.
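
To make the setup concrete, here is a minimal sketch of how such an evaluation loop might be wired up. The prompt layout, record fields, and the `ask_model` callable are illustrative assumptions, not the paper’s actual harness:

```python
# Minimal sketch of a Camlang-CSQA-style evaluation loop.
# The prompt layout, record fields, and `ask_model` callable are
# illustrative assumptions, not the paper's actual harness.
from typing import Callable

def build_prompt(grammar: str, dictionary: str, item: dict) -> str:
    """Bundle the explicit learning resources with one multiple-choice item."""
    options = "\n".join(f"{label}. {text}" for label, text in item["options"].items())
    return (
        "You are given a grammar book and a bilingual dictionary for the "
        "constructed language Camlang.\n\n"
        f"GRAMMAR BOOK:\n{grammar}\n\n"
        f"DICTIONARY:\n{dictionary}\n\n"
        "Using only these resources, answer the Camlang question below. "
        "Reply with a single option letter.\n\n"
        f"QUESTION: {item['question']}\n{options}\nANSWER:"
    )

def accuracy(items: list[dict], grammar: str, dictionary: str,
             ask_model: Callable[[str], str]) -> float:
    """Score exact-match accuracy, reading the first option letter in each reply."""
    correct = 0
    for item in items:
        reply = ask_model(build_prompt(grammar, dictionary, item))
        predicted = next((ch for ch in reply if ch in item["options"]), None)
        correct += predicted == item["answer"]
    return correct / len(items)
```

Any chat-completion client can be dropped in as `ask_model`; the interesting variable is what goes into the prompt, since the grammar book and dictionary are the model’s only route to the meaning of the question.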

The results are striking: GPT-5, a state-of-the-art LLM, scores nearly 98% on the English version of CommonsenseQA, but its accuracy plummets to just 47% on Camlang, and other leading LLMs fare even worse. Human participants, given the same Camlang resources, reach 87% accuracy, demonstrating that the language is indeed learnable.

Human verification of the models’ reasoning traces further revealed that many of the LLMs’ “correct” answers on Camlang likely resulted from shallow lexical matching or reliance on implicit English knowledge, rather than genuine understanding of Camlang’s grammatical structure. While GPT-5 showed some signs of “emerging metalinguistic awareness,” it still fell short of the systematic grammatical mastery exhibited by humans.
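
The shallow-matching failure mode is easy to operationalize. As a purely illustrative proxy (the paper’s verification was done by human annotators, not by this heuristic), one could check whether word-by-word dictionary glosses alone, with no grammatical analysis, already point to the gold answer:

```python
# Illustrative lexical-overlap baseline: if glossing tokens word-by-word
# already selects the gold answer, the item is solvable without any
# grammatical analysis. A hypothetical proxy, not the paper's human
# verification protocol.
def gloss(tokens: list[str], dictionary: dict[str, str]) -> set[str]:
    """Map Camlang tokens to English glosses by dictionary lookup alone."""
    return {dictionary[t] for t in tokens if t in dictionary}

def shallow_guess(question_tokens: list[str], options: dict[str, str],
                  dictionary: dict[str, str]) -> str:
    """Pick the option sharing the most words with the glossed question."""
    glosses = gloss(question_tokens, dictionary)
    return max(options, key=lambda k: len(glosses & set(options[k].lower().split())))
```

Items that such a baseline gets right provide no evidence of grammatical competence, which is exactly the concern the trace verification raised.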

The study concludes that Camlang provides a valuable new benchmark for assessing “metalinguistic deductive learning”: using explicit rules about a language to understand and reason in it, a capacity central to genuine linguistic competence. The findings highlight a significant gap between current LLMs and human-like reasoning, particularly on novel linguistic systems that cannot be solved through pattern recognition alone. The researchers plan to extend Camlang to a wider range of tasks to probe these limitations further.