Why Your AI Thinks You Should Walk to the Car Wash: The "Shortcut" Flaw in LLM Logic
If you told a friend you wanted to get your car washed and noted the car wash was only 50 meters away, they wouldn’t suggest you leave the car in the driveway and walk there. Yet, some of the world’s most advanced artificial intelligence models do exactly that.
A new study from researchers at Carnegie Mellon University reveals a systemic “reasoning vulnerability” in large language models (LLMs). The paper, titled “The Model Says Walk,” demonstrates how AI frequently ignores common-sense physical constraints in favor of statistical shortcuts, or “heuristics.”
The Heuristic Trap
The researchers began by analyzing a viral brain teaser: “I want to wash my car. The car wash is 50 meters away. Should I walk or drive?”
Logically, you must drive; you cannot wash a car that isn’t physically present at the car wash. However, across various models, the AI consistently recommended walking. Why? Because the models have been trained on vast amounts of data where “short distance” is almost always associated with “walking.” This is a surface heuristic—a rule of thumb that works 99% of the time but fails spectacularly when a specific, unstated requirement (an implicit constraint) is present.
To test how deep this rabbit hole goes, the team developed the Heuristic Override Benchmark (HOB), a set of 500 scenarios designed to bait AI into making these logical blunders.
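A benchmark of this kind boils down to pairing each scenario with both the tempting shortcut answer and the constraint-aware answer. The sketch below shows one plausible way to represent such a scenario; the field names and the scoring function are illustrative assumptions, not the paper's actual schema.

```python
# Minimal sketch of a HOB-style scenario record. The schema here is an
# assumption for illustration, not the benchmark's actual format.
from dataclasses import dataclass

@dataclass
class Scenario:
    prompt: str            # the question posed to the model
    heuristic_answer: str  # the statistically tempting but wrong answer
    correct_answer: str    # the answer that satisfies the implicit constraint
    constraint: str        # the physical requirement the heuristic ignores

car_wash = Scenario(
    prompt="I want to wash my car. The car wash is 50 meters away. "
           "Should I walk or drive?",
    heuristic_answer="walk",
    correct_answer="drive",
    constraint="The car must be physically present at the car wash.",
)

def took_the_bait(scenario: Scenario, model_answer: str) -> bool:
    """A model fails the scenario if it returns the shortcut answer."""
    return model_answer.strip().lower() == scenario.heuristic_answer
```

Scoring a model is then a matter of counting how often `took_the_bait` returns `True` across all 500 scenarios.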
Concrete Examples: When Speed Overrides Reality
The HOB benchmark covers four types of misleading cues: proximity, efficiency, cost, and semantic similarity. To understand how these trip up an AI, consider the “Heavy Safe” problem included in the study:
- The Scenario: A user needs to move a 500-pound gun safe to an upstairs bedroom and asks for the “quickest way.”
- The Heuristic: “Quickest” usually means doing it yourself rather than waiting for a professional service.
- The Reality: A human cannot physically lift 500 pounds alone.
- The AI Failure: Many models recommend carrying it yourself because the “efficiency” cue (saving time) overrides the “capability” constraint (human strength limits).
In another example targeting the semantic-similarity cue, an AI might suggest a gas station for tire repairs simply because the words "gas station" and "car" are semantically linked, even if the prompt implies a level of damage that only a specialized mechanic can fix.
The “Inference Bottleneck”
The study’s most revealing finding is that models don’t necessarily lack the knowledge to answer correctly; they simply fail to “activate” it.
When the researchers gave the models a tiny hint—such as bolding the words “get my car washed”—accuracy jumped by 15 percentage points. This suggests an “inference bottleneck.” The AI knows that cars need to be at car washes, but it prioritizes the statistical shortcut of “short distance = walk” before it even considers the mechanics of the goal.
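The hint manipulation described above is simple to picture in code: the same question, with the goal phrase emphasized so the model attends to the actual objective before pattern-matching on distance. The markdown-bold markup below is an assumption about how such emphasis might be applied, not the study's exact method.

```python
# Illustrative sketch of the hint manipulation: identical prompt text,
# with the goal phrase emphasized. The bold markup is an assumption.
def with_goal_hint(prompt: str, goal_phrase: str) -> str:
    """Emphasize the goal phrase so the model activates its knowledge
    of what the goal actually requires."""
    return prompt.replace(goal_phrase, f"**{goal_phrase}**")

baseline = ("I want to get my car washed. The car wash is 50 meters away. "
            "Should I walk or drive?")
hinted = with_goal_hint(baseline, "get my car washed")
```

Comparing model accuracy on `baseline` versus `hinted` prompts is what revealed the 15-point jump: the knowledge was there all along, waiting to be activated.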
Can We Fix It?
The researchers tested a potential cure called “goal-decomposition.” By forcing the model to list the necessary conditions for a goal before answering (e.g., “To wash a car, the car must be at the facility”), accuracy improved significantly for models like Llama 4 and GPT-5.4.
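Goal-decomposition can be sketched as a two-step prompting loop: first ask the model to enumerate the necessary conditions for the goal, then answer with those conditions in context. The wording of the prompts and the generic `ask` callable below are illustrative assumptions, not the paper's exact protocol.

```python
# A sketch of goal-decomposition prompting. `ask` is any function that
# sends a prompt to a model and returns its text reply; the prompt
# wording here is an illustrative assumption.
def goal_decomposition_answer(ask, question: str) -> str:
    # Step 1: force the model to list the necessary conditions
    # for achieving the goal before it commits to an answer.
    conditions = ask(
        "Before answering, list the necessary physical conditions for "
        f"achieving the goal in this question:\n{question}"
    )
    # Step 2: answer with those conditions in context, so the implicit
    # constraint is activated before the statistical shortcut can fire.
    return ask(
        f"Question: {question}\n"
        f"Necessary conditions:\n{conditions}\n"
        "Given these conditions, what is the correct answer?"
    )
```

The design point is ordering: by surfacing "the car must be at the facility" as explicit text in the second prompt, the constraint competes with the "short distance = walk" heuristic on equal footing instead of being skipped entirely.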
As AI moves into sensitive fields like medical triage or legal advice, these errors become more than just funny quirks. If a model prioritizes a “standard procedure” heuristic while ignoring a patient’s unique physical constraint, the consequences could be dire. For now, the study serves as a reminder: while AI can process a world of data, it still struggles to navigate the simple logic of the physical world.