New Benchmark Forces AI Models to ‘Draw,’ Revealing Major Spatial Reasoning Deficiencies
Tencent Youtu Lab Introduces LTD-Bench, Bridging the Gap Between Abstract LLM Scores and Intuitive Physical World Understanding
A critical blind spot in the evaluation of large language models (LLMs)—their inability to robustly reason about the physical world—has been exposed by a novel benchmark that forces AI to generate visual outputs by “drawing.”
Introduced by Tencent Youtu Lab, the LTD-Bench (Let Them Draw Benchmark) moves away from relying solely on opaque numerical metrics. Instead, it requires LLMs to translate linguistic descriptions into observable visual artifacts, such as dot matrices or executable drawing code, making fundamental limitations in spatial understanding immediately apparent to researchers and non-experts alike.
The developers warn that current evaluation scores often conceal a “dangerous disconnect” between reported statistical performance and the practical abilities required for applications like robotics and autonomous systems, where spatial awareness is paramount.
From Text to Visual Articulation
LTD-Bench implements a comprehensive, dual-path evaluation across three progressively challenging levels, testing both spatial imagination (generation tasks) and spatial perception (recognition tasks).
To build an intuitive understanding of a model’s grasp of space, LTD-Bench transforms abstract concepts into concrete drawing challenges:
- Easy Level (Discrete Grid): Models must perform basic character representation in a finite grid, such as generating the dot matrix [[1, 0, 1], [1, 1, 1], [1, 0, 1]] when asked to draw the letter ‘H’ in a 3x3 matrix.
- Normal Level (Continuous Space): Complexity increases by requiring models to generate executable code that draws specified characters (like a letter ‘W’ or a number ‘9’) using continuous curves in an infinite 2D coordinate system. This tests the ability to map linguistic concepts onto mathematical functions.
- Hard Level (Real-World Objects): This is the ultimate test of spatial imagination, requiring models to draw complex objects with specific attributes, such as “Draw a cat with pointed ears, long whiskers and round eyes.” These generation tasks demand advanced conceptualization and compositional understanding.
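To make the Easy-level setup concrete, here is a minimal sketch of how a generated dot matrix could be scored against a reference grid by cell-wise agreement. The reference pattern comes from the example above; the `grid_accuracy` helper and its scoring rule are our illustration, not LTD-Bench’s actual metric.

```python
def grid_accuracy(predicted, reference):
    """Fraction of cells where the predicted dot matrix matches the reference."""
    cells = [(p, r)
             for prow, rrow in zip(predicted, reference)
             for p, r in zip(prow, rrow)]
    return sum(p == r for p, r in cells) / len(cells)

# Reference 3x3 dot matrix for the letter 'H' (from the Easy-level example).
reference_h = [[1, 0, 1],
               [1, 1, 1],
               [1, 0, 1]]

# A hypothetical model answer with one wrong cell (bottom-middle).
model_answer = [[1, 0, 1],
                [1, 1, 1],
                [1, 1, 1]]

print(grid_accuracy(model_answer, reference_h))  # 8 of 9 cells agree
```

The same idea extends to the Normal and Hard levels, except there the model’s output is executed drawing code and the comparison happens on the rendered image rather than a discrete grid.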
State-of-the-Art Models Fall Short
Testing state-of-the-art models, including Deepseek-r1, GPT-4o, and Llama3.3-70B-Instruct, the benchmark revealed an “alarming capability gap.”
Experimental results show that even advanced LLMs, which perform strongly on traditional reasoning tests, exhibit profound deficiencies in establishing reliable bidirectional mappings between language and spatial concepts. Models like Qwen2.5-72B-Instruct and Llama3.3-70B-Instruct achieved overall average accuracy scores of around 30%, indicating poor spatial imagination and perception. Even the highest-performing model, Deepseek-r1, only managed an average of 71.54%.
The case studies of failed attempts were particularly revealing. For example, when asked on the Hard level to draw a clock with its hands pointing to 9:30, models often produced wildly inaccurate or chaotic representations, suggesting they lack the spatial imagination to translate abstract temporal instructions into accurate geometric relations.
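The clock task illustrates exactly the kind of mapping these models fumble: a time such as 9:30 must first be converted into hand angles before anything can be drawn. A minimal sketch of that temporal-to-geometric step (the helper below is ours, not part of the benchmark):

```python
def clock_hand_angles(hour, minute):
    """Return (hour_hand_deg, minute_hand_deg), measured clockwise from 12 o'clock."""
    minute_deg = minute * 6.0                      # 360 degrees / 60 minutes
    hour_deg = (hour % 12) * 30.0 + minute * 0.5   # 30 degrees/hour, plus drift toward next hour
    return hour_deg, minute_deg

print(clock_hand_angles(9, 30))  # (285.0, 180.0): hour hand halfway between 9 and 10
```

A drawing that places the hour hand squarely on the 9 at 9:30, or scatters the hands arbitrarily, fails precisely this conversion, which is what the benchmark’s visual outputs make immediately visible.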
Furthermore, the visual outputs provide a powerful new diagnostic tool. By comparing the stylistic similarities of drawings generated by different LLMs on the Hard level—such as analyzing distinct ways different models render a house or a rabbit—researchers can gain insights into model similarity that are missed by traditional abstract metrics.
LTD-Bench represents a critical paradigm shift, laying the foundation for developing AI systems with genuinely robust spatial reasoning capabilities necessary for interacting with the physical world.