AI Papers Reader

Personalized digests of latest AI research


AI Can Solve Math, But Can It ‘MacGyver’ a Solution? New Benchmark Reveals a Creative Gap

In the world of artificial intelligence, Large Language Models (LLMs) are currently celebrated for their “analytical” and “practical” intelligence. They can pass the Bar exam, debug complex code, and even plan multi-step travel itineraries. However, a new research paper from a team of scientists at UIUC, Amazon, and Columbia suggests that these models are missing a vital third pillar of human intelligence: creativity.

The paper introduces CreativityBench, a first-of-its-kind benchmark designed to test whether AI can go beyond the instruction manual and engage in “creative tool use.” In human terms, this is the “MacGyver” ability—the capacity to repurpose everyday objects based on their physical properties rather than their intended names.

The “Affordance” Problem

To understand the research, one must understand “affordances.” An affordance is an action an object allows. For example, a heavy book’s canonical use is for reading, but its affordances include its ability to serve as a paperweight (due to mass) or a doorstop (due to shape and friction).

Currently, LLMs suffer from “functional fixedness.” They know a key is for opening a lock because that is what they have read in millions of training documents. But if you tell an AI it needs to open a sealed box and only has a key, it might struggle to realize that the key’s rigid, sharp tip affords prying or cutting.

Inside CreativityBench

To test this, the researchers built a massive Affordance Knowledge Base (KB) containing over 4,000 entities and 150,000 annotations. This database doesn’t just list objects; it breaks them down into parts and attributes. For instance, it knows a “vacuum bag” is usually for dust, but if it’s “dry and fully filled,” its attribute of being “dense and compressible” allows it to be repurposed as a temporary cushion.
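To make the structure concrete, here is a minimal sketch of what one KB entry might look like. The schema and field names are illustrative assumptions, not the paper’s actual data format; only the vacuum-bag example itself comes from the article.

```python
# Hypothetical shape of one Affordance KB entry: an entity broken down
# into parts, each with attributes and condition-dependent affordances.
vacuum_bag_entry = {
    "entity": "vacuum bag",
    "canonical_use": "collecting dust",
    "parts": {
        "bag body": {
            "attributes": ["dense", "compressible"],
            "affordances": [
                {
                    "action": "serve as a temporary cushion",
                    "condition": "dry and fully filled",
                    "mechanism": "compressible mass absorbs impact",
                }
            ],
        }
    },
}

def repurposings(entry):
    """List every (part, action, condition) triple the entry affords."""
    return [
        (part, aff["action"], aff["condition"])
        for part, info in entry["parts"].items()
        for aff in info["affordances"]
    ]
```

The key design point is that affordances hang off *parts* and *conditions*, not off the object name, which is exactly the grounding step the benchmark tests.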

Using this KB, the team generated 14,000 tasks that require identifying non-obvious solutions under constraints. In one example, a model might be asked to retrieve debris from the bottom of a deep pool without a long net. A creative solution might involve using long tongs to grasp a short pool skimmer, effectively extending the reach by combining the two tools.
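The pool task above hinges on tool *composition*: neither object alone reaches the debris, but one extends the other. A toy illustration, with entirely made-up reach numbers:

```python
# Toy model of the tool-chaining idea: grasping the short skimmer with
# the long tongs roughly sums their reaches. All lengths (in meters)
# are invented for illustration.
tools = {"long tongs": 1.2, "pool skimmer": 0.9}
debris_depth = 1.8

def single_tool_reaches(tools, depth):
    """Can any one tool reach the debris by itself?"""
    return any(reach >= depth for reach in tools.values())

def chained_reach(tools):
    """Approximate reach when one tool grips the other end-to-end."""
    return sum(tools.values())

print(single_tool_reaches(tools, debris_depth))   # False
print(chained_reach(tools) >= debris_depth)       # True
```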

The Findings: A 60% Performance Drop

The researchers evaluated 10 state-of-the-art models, including the GPT and Qwen families. The results were a wake-up call for the industry. While models were decent at picking a plausible object (the “Entity Correct Rate”), they failed miserably when asked to identify the specific part and physical mechanism needed to solve the task.

The performance plummeted by over 60% when models moved from general object selection to “gold-level” grounding. Essentially, the AI could “guess” that a screwdriver might help, but it couldn’t explain that the thin, rigid blade provides the specific leverage needed to pry a lid.
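To see what a “60% drop” means here, consider some hypothetical rates: if a model picks a plausible entity 70% of the time but grounds the correct part and mechanism only 25% of the time, the *relative* drop already exceeds 60%. Only the over-60% figure is from the paper; the individual rates below are invented for illustration.

```python
def relative_drop(entity_rate, grounded_rate):
    """Relative performance drop when moving from general entity
    selection to gold-level part/mechanism grounding."""
    return (entity_rate - grounded_rate) / entity_rate

# Hypothetical rates: 70% entity-correct vs. 25% fully grounded.
drop = relative_drop(0.70, 0.25)
print(f"{drop:.0%}")  # these made-up rates give a 64% relative drop
```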

Even more concerning, the researchers found that simply making models bigger or using “Chain-of-Thought” prompting (asking the AI to think step-by-step) provided almost no improvement. The models often “hallucinated” physical properties—suggesting, for example, that an inflatable pool toy could be used as a wedge to hold up a car axle, ignoring the reality that the material would simply pop under the weight.

Why This Matters

As we move toward a future of “embodied AI”—robots that live and work in our homes—this gap becomes a safety and utility issue. A robot that only knows how to use tools for their intended purposes will be useless in an emergency or an unscripted situation.

CreativityBench reveals that true intelligence isn’t just about processing text; it’s about understanding the physical reality of the world. For now, it seems the AI revolution still has a lot to learn from the humble ingenuity of a human with a paperclip and a dream.