New Benchmark Aims to Rigorously Test Financial AI Models

🔊

💬 Ask

San Francisco, CA – June 19, 2025 – Researchers have introduced MultiFinBen, a comprehensive benchmark designed to push the boundaries of large language models (LLMs) in the complex and demanding financial domain. Unlike existing evaluations that often focus on single languages and data types, MultiFinBen embraces a multilingual, multimodal, and difficulty-aware approach, offering a more realistic assessment of LLMs’ capabilities.

The financial world is inherently global and data-rich, involving text, charts, tables, and even audio. Furthermore, financial information is disseminated and understood across numerous languages. Current LLM benchmarks, however, have largely overlooked these critical aspects, often relying on simplified tasks that don’t accurately reflect real-world financial complexities.

MultiFinBen aims to bridge this gap by incorporating a diverse range of challenges. It includes the first multilingual financial question-answering tasks, PolyFiQA-Easy and PolyFiQA-Expert. These tasks require models to process and reason over financial reports and news articles presented in multiple languages, such as English, Chinese, Japanese, Spanish, and Greek. For instance, a model might need to analyze a company’s revenue trends from an English financial report and cross-reference it with news articles in Spanish and Chinese to answer a question.

Beyond text, MultiFinBen introduces novel tasks that test LLMs’ ability to interpret visual and auditory financial data. The EnglishOCR and SpanishOCR tasks challenge models to extract and reason over information embedded within images of financial documents, such as balance sheets and charts. Imagine a model being presented with an image of a company’s financial statement and asked to identify key figures and trends. Similarly, the benchmark includes audio data from earnings calls, requiring models to transcribe and summarize spoken financial information.

A key innovation of MultiFinBen is its difficulty-aware selection mechanism. Instead of simply aggregating existing datasets, the benchmark prioritizes datasets that reveal the most significant differences in performance between LLMs, thereby highlighting model weaknesses and areas for improvement. This means that tasks which even top-performing models struggle with are given more weight, providing a more accurate picture of current AI capabilities.

An extensive evaluation of 22 state-of-the-art LLMs revealed that while leading models like GPT-4o show promising overall performance, they still face substantial limitations, particularly in complex multilingual and multimodal financial scenarios. Monolingual and unimodal models performed significantly worse, underscoring the necessity of cross-modal and cross-lingual reasoning for financial applications. The results also highlight a considerable performance gap between easy and difficult tasks, demonstrating that current models are not yet equipped to handle the full complexity of real-world financial analysis.

By providing a more comprehensive and challenging evaluation framework, MultiFinBen is poised to drive progress in financial AI, encouraging the development of more robust, adaptable, and globally competent language models for the financial industry. The benchmark and its associated datasets have been made publicly available to foster transparency and reproducible research.

AI Papers Reader

Personalized digests of latest AI research

New Benchmark Aims to Rigorously Test Financial AI Models

Chat about this paper