AI Papers Reader

Personalized digests of the latest AI research


LLM Trading Agents Fail Real-World Test: High Reasoning Scores Don't Guarantee Financial Alpha

URBANA, IL – A new evaluation benchmark, LiveTradeBench, has revealed a crucial disconnect between the general intelligence of large language models (LLMs) and their competence in real-world financial decision-making. The study, conducted by researchers at the University of Illinois Urbana-Champaign, found that top-performing LLMs on popular reasoning leaderboards like LMArena do not necessarily achieve superior trading outcomes when faced with live market uncertainty.

LiveTradeBench is designed to move beyond traditional, static backtesting environments by requiring LLM agents to manage multi-asset portfolios using real-time streaming data from two structurally distinct markets: U.S. stocks (equities) and Polymarket (prediction markets). This live, dynamic setup captures true uncertainty, volatility, and real-time news flow, eliminating common pitfalls like information leakage. Agents are tasked not with simple buy/sell decisions, but with strategic portfolio allocation across diverse assets, including tech stocks like Nvidia (NVDA) and high-stakes geopolitical betting contracts.
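To make the allocation task concrete, here is a minimal sketch (not the benchmark's actual API; the function name, ticker symbols, and contract label are illustrative) of the kind of action an agent must emit each step: target portfolio weights over assets plus cash, non-negative and summing to one.

```python
# Hypothetical illustration of a portfolio-allocation action, as
# described in the article: weights over assets plus a cash buffer.

def validate_allocation(weights: dict[str, float], tol: float = 1e-9) -> bool:
    """Check that an allocation is a valid portfolio: all weights
    non-negative and summing to 1 (within floating-point tolerance)."""
    if any(w < 0 for w in weights.values()):
        return False
    return abs(sum(weights.values()) - 1.0) <= tol

# Illustrative agent output mixing equities, a prediction-market
# contract, and cash. "CEASEFIRE_YES" is a made-up contract label.
allocation = {"NVDA": 0.30, "TSLA": 0.15, "CEASEFIRE_YES": 0.10, "CASH": 0.45}
print(validate_allocation(allocation))  # True
```

Framing the action space as full reallocation (rather than isolated buy/sell signals) is what forces the agent to reason about the portfolio as a whole, including how much risk to park in cash.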

The 50-day live evaluation across 21 mainstream LLMs yielded surprising results. Researchers found a near-zero correlation between a model’s general LMArena score and its financial performance, measured by the Sharpe ratio. In fact, for the high-volatility Polymarket, the correlation was slightly negative. This suggests that the reasoning mechanisms valued in conventional benchmarks—like math or coding—do not straightforwardly translate to sound financial judgment under pressure.

However, the study also confirmed that LLM agents are far from random guessers, exhibiting distinct and adaptive trading personalities. Models like Grok-4 displayed conservative strategies with low volatility and smaller drawdowns, while others, notably GPT-5, demonstrated aggressive, risk-seeking behavior, accepting higher volatility in pursuit of greater cumulative returns.

The agents proved highly adaptive, particularly during periods of market stress. For example, during a sharp U.S. stock market drawdown on October 10 (when Tesla, Amazon, and Nvidia prices fell significantly), the average cash ratio across all models jumped from roughly 7.5% to 17%. The top-performing model, Gemini-2.5-Pro, increased its cash position to 35%, successfully mitigating losses and demonstrating clear, defensive risk management guided by explicit internal reasoning about volatility protection.
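The "cash ratio" tracked above is simply cash as a fraction of total portfolio value. A minimal sketch, using made-up position sizes chosen to reproduce the article's before/after percentages:

```python
# Illustrative cash-ratio computation; the dollar figures below are
# invented to match the ~7.5% and ~35% levels cited in the article.

def cash_ratio(cash: float, positions: dict[str, float]) -> float:
    """Cash as a fraction of total portfolio value (cash + holdings)."""
    total = cash + sum(positions.values())
    return cash / total

# Before the October 10 drawdown: small defensive buffer.
print(round(cash_ratio(7_500, {"NVDA": 40_000, "TSLA": 30_000, "AMZN": 22_500}), 3))  # 0.075
# After de-risking into cash, as the top model did.
print(round(cash_ratio(35_000, {"NVDA": 30_000, "TSLA": 20_000, "AMZN": 15_000}), 3))  # 0.35
```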

Analysis of the agents’ decision-making rationales confirmed they rely heavily on real-time news (cited 98% of the time in Polymarket decisions) and price momentum. However, this reliance highlighted a key challenge: differentiating high-impact events from superficial noise. In the “Russia-Ukraine Ceasefire” Polymarket, agents initially overreacted to minor, attention-grabbing headlines but later made profitable strategic holds based on credible, high-impact diplomatic news, proving that timely, grounded adaptation is key.

LiveTradeBench provides an end-to-end framework demonstrating LLMs’ ability to perceive, reason, and act in uncertain financial environments. The findings stress the need for future LLM benchmarks to focus on continuous sequential decision-making, memory integration, and strategic adaptation, rather than isolated static reasoning tasks.