AI Papers Reader

Personalized digests of latest AI research

First Multimodal AI Benchmark for Russian, MERA Multi, Challenges LLMs with Culture-Specific Tasks

A consortium of researchers known as the MERA Team has introduced MERA Multi, the first open multimodal evaluation framework designed specifically for Russian-language models. Addressing a persistent gap in AI assessment, which remains dominated by English-centric benchmarks, MERA Multi comprises 18 new tasks spanning the image, audio, video, and text modalities, offering a rigorous, culturally grounded evaluation of multimodal large language models (MLLMs).

Existing global benchmarks often fail when directly applied to languages like Russian due to typological complexity (Cyrillic script, rich morphology) and significant cultural context unique to native speakers, such as references to Soviet-era media or specific folklore. MERA Multi directly tackles this challenge by constructing datasets from scratch, ensuring linguistic and cultural specificity across all tasks.

A New Taxonomy for Multimodal Abilities

The benchmark organizes its 18 tasks under a unified taxonomy covering three broad categories of human-like cognition: Perception, Knowledge, and Reasoning. This structure allows for a fine-grained assessment of MLLMs’ capabilities, ranging from simple object recognition to complex, multi-step logical inference.
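A taxonomy like this can be captured in a small data structure. The sketch below is illustrative: the category labels come from the article, and the category assignments for ruHHH-Image (Knowledge) and LabTabVQA (Reasoning) follow the examples given later, but the assignments for RuSLUn and RealVideoQA are assumptions for demonstration.

```python
from dataclasses import dataclass

# The three cognitive categories named in the article's taxonomy.
CATEGORIES = ("Perception", "Knowledge", "Reasoning")

@dataclass(frozen=True)
class Task:
    name: str
    modality: str   # "image", "audio", "video", or "text"
    category: str   # one of CATEGORIES

# Four of the 18 tasks mentioned in the article; the Perception/Reasoning
# labels for the audio and video tasks are illustrative guesses.
TASKS = [
    Task("ruHHH-Image", "image", "Knowledge"),
    Task("LabTabVQA", "image", "Reasoning"),
    Task("RuSLUn", "audio", "Perception"),
    Task("RealVideoQA", "video", "Reasoning"),
]

def tasks_by_category(tasks, category):
    """Return the names of tasks filed under one cognitive category."""
    return [t.name for t in tasks if t.category == category]
```

Organizing tasks this way makes the fine-grained reporting straightforward: per-category and per-modality scores are just filtered aggregations over the same task list.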

To ensure robustness and prevent data contamination, MERA Multi employs standardized block-prompting to mitigate evaluation bias, together with a dual-level scoring system: standard Exact Match (EM) for format adherence and an LLM-as-a-Judge Score (JS) for semantic correctness. Furthermore, private datasets within the benchmark are protected using digital watermarking and data-leakage detection methods.
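The dual-level idea can be sketched in a few lines. This is a minimal illustration, not MERA Multi's actual implementation: the normalization rule, the judge prompt, and how the two scores are aggregated are all assumptions here (the benchmark's exact procedure is not detailed in this summary).

```python
def exact_match(prediction: str, reference: str) -> float:
    """Format-level score: 1.0 iff the normalized strings are identical.
    Lowercasing and whitespace collapsing are an assumed normalization."""
    norm = lambda s: " ".join(s.strip().lower().split())
    return 1.0 if norm(prediction) == norm(reference) else 0.0

def judge_score(prediction: str, reference: str, judge) -> float:
    """Semantic-level score delegated to an LLM judge: any callable
    returning a value in [0, 1]. The real judge model and prompt
    used by MERA Multi are not specified in this summary."""
    return float(judge(prediction, reference))

def score_example(prediction, reference, judge):
    # Report both levels side by side: EM catches answers that are
    # semantically right but formatted wrong; JS credits answers that
    # are phrased differently but mean the same thing.
    return {
        "EM": exact_match(prediction, reference),
        "JS": judge_score(prediction, reference, judge),
    }
```

Keeping the two scores separate, rather than collapsing them, lets evaluators distinguish format-following failures from genuine comprehension failures.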

To build intuition for the framework’s scope, consider two examples:

  1. Multimodal Knowledge: The ruHHH-Image task evaluates ethical reasoning and safety. A model might be shown a picture related to a culturally sensitive event and must choose the most helpful and harmless response from several Russian-language options, demonstrating sensitivity to local societal norms.
  2. Multimodal Reasoning and Perception: The LabTabVQA task presents models with screenshots of medical tables (images) from telemedicine consultations. The model must perform visual text recognition (OCR) on the Russian text within the image, understand the table structure, and apply basic math to answer questions, such as calculating the sum of specific lab indicators.
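The LabTabVQA example combines three steps: transcribing the table from the screenshot, recovering its structure, and doing arithmetic over selected rows. The sketch below stubs out the OCR step (in practice a vision model would transcribe the Russian table from the image); the `indicator; value` row format and the indicator names are hypothetical.

```python
def parse_lab_table(ocr_lines):
    """Parse 'indicator; value' rows transcribed from a table image
    into a name -> value mapping. The row format is an assumption."""
    table = {}
    for line in ocr_lines:
        name, value = line.split(";")
        table[name.strip()] = float(value.strip())
    return table

def sum_indicators(table, indicators):
    """Answer a question like 'What is the sum of indicators X and Y?'"""
    return sum(table[name] for name in indicators)

# Stubbed OCR output; a real pipeline would produce this from the screenshot.
ocr_output = [
    "Гемоглобин; 135.0",   # hemoglobin
    "Глюкоза; 5.4",        # glucose
    "Холестерин; 4.8",     # cholesterol
]
table = parse_lab_table(ocr_output)
answer = sum_indicators(table, ["Глюкоза", "Холестерин"])
```

The point of the task is that every step is multimodal or language-specific: the OCR must handle Cyrillic, the parsing must recover table structure from pixels, and only then does the arithmetic become trivial.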

Omni-Models Lead, but Gaps Persist

The MERA team evaluated over 50 publicly available and proprietary MLLMs, including models from the Qwen, LLaVA, and GPT-4.1 families, establishing a comprehensive set of baselines. The results reveal that general-purpose “omni-models” tend to achieve the highest Total Scores, largely driven by their broad coverage across modalities.

However, the analysis highlighted significant disparities in model performance across different inputs. While image understanding is a relatively mature area, models struggle more acutely with complex audio and video tasks. For instance, audio tasks assessing spoken language understanding (RuSLUn) and video tasks requiring precise temporal localization and action sequence reasoning (RealVideoQA) consistently yielded lower scores, underscoring a critical need for further development in these underrepresented modalities for non-English models.

The MERA Team hopes the benchmark will serve as a foundational blueprint for developing culturally aware multimodal evaluations for other typologically diverse non-English languages, particularly within the Slavic family, accelerating transparent and reliable progress in global AI research.