AI Papers Reader

Personalized digests of latest AI research

Why AI Can’t Handle “Maybe”: New Research Shows LLMs Struggle with Human Uncertainty

Imagine you hear there was a major accident on the highway. Your brain immediately starts calculating: traffic is probably going to be a nightmare. But it’s not a certainty. Perhaps the vehicles were moved quickly, or the news broke early enough that everyone took a different route.

Humans navigate these “gray zones” of probability every day. We don’t just think in terms of True or False; we think in “likely,” “unlikely,” and “could go either way.” However, a new study from researchers at McGill University and several other institutions reveals that even the most advanced “reasoning” Large Language Models (LLMs) are failing to replicate this fundamental human trait.

The paper, titled “Humans and LLMs Diverge on Probabilistic Inferences,” introduces a new benchmark called PROBCOPA. This dataset consists of 210 everyday scenarios designed to test how we reason when information is limited. The researchers compared the judgments of hundreds of humans against eight state-of-the-art AI models, including GPT-5, Gemini-3, and DeepSeek-R1.

The Human Nuance

To build an intuition for the study, consider a simple premise: “A drought occurred in the region.”

When asked how likely it is that “the crops perished,” most humans might give it a high score, say an 85 or 90. But if asked how likely it is that “the water became contaminated,” the responses become much more varied. Some people might see a strong link; others might see it as a stretch.

In the study, humans rated these outcomes on a sliding scale from 0 to 100. The results showed that human reasoning is “graded.” For many scenarios, there was no single “correct” answer, but rather a spread of opinions that formed a “tri-modal” distribution: some people saw the outcome as very unlikely, some as very likely, and a significant group landed right in the middle, reflecting genuine uncertainty.
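The tri-modal spread described above can be illustrated with a toy simulation. This is a hedged sketch with invented numbers, not the study's data: three subgroups of raters center their answers near the low end, the middle, and the high end of the 0-100 scale, and a simple histogram recovers the three peaks.

```python
import random

random.seed(0)

def simulate_human_ratings(n=300):
    """Simulate n ratings on a 0-100 scale drawn from three subgroups,
    mirroring the tri-modal pattern: 'unlikely', 'maybe', 'likely'.
    (Group centers and spreads are invented for illustration.)"""
    ratings = []
    for _ in range(n):
        group = random.choice(["unlikely", "maybe", "likely"])
        center = {"unlikely": 10, "maybe": 50, "likely": 90}[group]
        ratings.append(min(100.0, max(0.0, random.gauss(center, 7))))
    return ratings

def bucket_counts(ratings, width=20):
    """Count ratings in five 20-point buckets: 0-19, 20-39, ..., 80-100."""
    counts = [0] * 5
    for r in ratings:
        counts[min(int(r // width), 4)] += 1
    return counts

counts = bucket_counts(simulate_human_ratings())
print(counts)  # three distinct peaks: the low, middle, and high buckets dominate
```

Plotting these bucket counts would show the three-humped shape the paper describes: no single "correct" answer, just clusters of opinion.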

The AI “Extremism”

AI models, the researchers found, are far less comfortable with the middle ground. While models like GPT-5 and Claude Sonnet-4.5 are excellent at formal logic and math, they struggle with the messy uncertainty of common sense.

When faced with the same 0-to-100 scale, the AI models were consistently overconfident. They tended to gravitate toward the extremes—viewing a hypothesis as either almost certain (100) or virtually impossible (0). While humans showed a healthy amount of disagreement and “maybe” votes, the models rarely produced responses in the middle of the scale.
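One simple way to quantify this "extremism" is the share of ratings that land within a small margin of either endpoint of the scale. A minimal sketch, using invented ratings for a single hypothetical scenario (not data from the paper):

```python
def extremity_rate(ratings, margin=10):
    """Fraction of 0-100 ratings within `margin` of either endpoint."""
    return sum(1 for r in ratings if r <= margin or r >= 100 - margin) / len(ratings)

# Invented ratings for one scenario, for illustration only:
human_ratings = [15, 30, 45, 50, 55, 60, 70, 85, 90, 95]  # spread across the scale
model_ratings = [0, 0, 5, 95, 100, 100, 100, 0, 5, 95]    # clustered at the poles

print(extremity_rate(human_ratings))  # 0.2
print(extremity_rate(model_ratings))  # 1.0
```

A high extremity rate with little mass in the middle is exactly the signature the researchers observed in the models' responses.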

Even when the researchers tried to “nudge” the models to be more human—by giving them specific personas (like a 23-year-old barista or a 58-year-old factory worker) or increasing the “randomness” of their outputs (the sampling temperature)—the models still couldn’t replicate the organic spread of human judgment.


Thinking Harder, Not Smarter

The study also analyzed “reasoning chains”—the “thinking out loud” process that modern models use to show their work.

The researchers discovered a fascinating correlation: when a scenario was difficult for humans (meaning participants disagreed more and took longer to answer), the AI models also generated longer reasoning chains. It seems the models “know” when a problem is complex, but they still can’t translate that complexity into a human-like distribution of likelihood.
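A relationship like this can be checked with a plain Pearson correlation between per-scenario human disagreement (e.g., the standard deviation of ratings) and the model's reasoning-chain length. The data below is invented for illustration, not taken from the paper:

```python
from statistics import mean, stdev

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    return cov / ((len(xs) - 1) * stdev(xs) * stdev(ys))

# Hypothetical per-scenario measurements (invented numbers):
# human_disagreement = stdev of human ratings for each scenario
# chain_tokens       = length of the model's reasoning chain for the same scenario
human_disagreement = [5, 12, 18, 25, 30, 33]
chain_tokens = [120, 180, 260, 310, 420, 450]

r = pearson(human_disagreement, chain_tokens)
print(round(r, 2))  # close to 1.0: harder-for-humans scenarios get longer chains
```

A strongly positive coefficient would mean the models "work harder" on exactly the scenarios humans find contentious, even though that extra effort doesn't translate into human-like graded answers.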

For example, when an AI was asked about a girl who looked pale and whether her father read her a story, it might generate a long list of alternative possibilities—like the father giving her medicine or taking her to a doctor—to justify its final score. Yet, despite this internal deliberation, its final numerical estimate remained far more rigid than a human’s.

Why This Matters

As AI is increasingly integrated into human-focused settings—from medical advice to legal assistance—this divergence becomes a critical safety issue. If a model is inherently overconfident in situations where a human would be cautious and uncertain, it could lead to “hallucinated” certainty in high-stakes decisions.

The researchers conclude that to build truly intelligent machines, we must evaluate them on more than just math and logic. We need models that can understand the “maybe”—the vast, uncertain middle ground where most of human life takes place.