Multimodal LLMs Show Promise in Art Evaluation, But Hallucinations Remain a Challenge
A new study from researchers at the Hong Kong Polytechnic University has explored the potential of multimodal large language models (MLLMs) to evaluate the aesthetic quality of artworks. Their findings, published in a preprint on arXiv, demonstrate that while MLLMs show promise in this domain, they are prone to hallucinations, producing subjective and often inaccurate assessments.
The researchers constructed MM-StyleBench, a new large-scale dataset of stylized images paired with text prompts. The dataset spans diverse content and styles across artistic domains and includes detailed attribute annotations, which allows human aesthetic preferences to be modeled in a principled way and serves as the basis for benchmarking.
To quantify how well MLLM judgments align with human aesthetic preferences, the researchers conducted a correlation study using a two-alternative forced-choice (2AFC) paradigm: human evaluators are shown a pair of stylized images and asked to choose the one they prefer, and the same pairs are then assessed by the MLLMs. They found that off-the-shelf MLLMs often struggled to align their judgments with those of humans.
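The article does not spell out the exact alignment metric, but the general recipe can be sketched as follows: collect human 2AFC choices, score the same images with the model, and check how consistently the model's scores reproduce the human preferences. The toy image IDs, the `model_scores` values, and the combination of pairwise agreement with a Spearman correlation below are illustrative assumptions, not the authors' implementation.

```python
# Hypothetical sketch: measuring how well model scores agree with human
# 2AFC choices. The aggregation used in the paper may differ; this only
# illustrates the idea of pairwise agreement plus a rank correlation.
from scipy.stats import spearmanr

# Toy data: per-image aesthetic scores (0-10) from a model, and human
# 2AFC outcomes recorded as (preferred, rejected) image-id pairs.
model_scores = {"img_a": 7.5, "img_b": 4.0, "img_c": 6.0}
human_choices = [("img_a", "img_b"), ("img_c", "img_b"), ("img_a", "img_c")]

# Pairwise agreement: fraction of human choices the model's scores reproduce.
agree = sum(model_scores[w] > model_scores[l] for w, l in human_choices)
pairwise_agreement = agree / len(human_choices)

# Turn the 2AFC outcomes into a crude per-image preference rate, then
# compute a Spearman rank correlation against the model's scores.
wins = {k: 0 for k in model_scores}
appearances = {k: 0 for k in model_scores}
for w, l in human_choices:
    wins[w] += 1
    appearances[w] += 1
    appearances[l] += 1
human_rate = [wins[k] / appearances[k] for k in model_scores]
rho, _ = spearmanr(human_rate, [model_scores[k] for k in model_scores])

print(f"pairwise agreement: {pairwise_agreement:.2f}, Spearman rho: {rho:.2f}")
```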
The core issue identified was the tendency of MLLMs to produce “hallucinations”—subjective and often unfounded statements about the artwork’s aesthetic qualities. For example, an MLLM might describe an image as “evocative” or “moving” without providing concrete evidence to support these claims.
To address this problem, the researchers developed a novel prompting method called ArtCoT (Art Chain-of-Thought). ArtCoT decomposes the art evaluation task into three distinct phases (a prompting sketch follows the list):
- Content/Style Analysis: The MLLM is prompted to analyze the visual features of the stylized image and the reference style, providing a detailed description of elements such as color schemes, brushstrokes, and composition.
- Art Critic: This phase aims to reduce hallucinations by prompting the MLLM to weigh the previously described visual features against established artistic principles and domain-specific knowledge, mirroring the formal analysis often employed by art critics.
- Summarizer: Finally, the MLLM summarizes the findings from the previous two phases and provides a final assessment.
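Assuming a single chat-style multimodal API, the three phases could be wired together roughly as follows. The `query_mllm` helper is a hypothetical placeholder, and the prompt texts are paraphrased from the description above rather than taken from the paper.

```python
# Illustrative sketch of an ArtCoT-style three-phase pipeline. Prompt
# wording is paraphrased and `query_mllm` stands in for whatever
# multimodal API is actually used.
def query_mllm(images, prompt, context=""):
    """Placeholder: send images plus a text prompt (and any prior
    context) to a multimodal LLM and return its text response."""
    raise NotImplementedError

def artcot_evaluate(stylized_image, style_reference):
    images = [stylized_image, style_reference]

    # Phase 1: content/style analysis - describe concrete visual features.
    analysis = query_mllm(
        images,
        "Describe the color scheme, brushstrokes, and composition of the "
        "stylized image, and compare them with the reference style.",
    )

    # Phase 2: art critic - check the description against artistic
    # principles instead of offering unsupported adjectives.
    critique = query_mllm(
        images,
        "Acting as an art critic, assess whether the features described "
        "below are consistent with the reference style's principles. "
        "Cite concrete visual evidence for every claim.",
        context=analysis,
    )

    # Phase 3: summarizer - condense the analysis and critique into a
    # final aesthetic judgment.
    verdict = query_mllm(
        images,
        "Summarize the analysis and critique below and give a final "
        "aesthetic rating from 1 to 10.",
        context=analysis + "\n" + critique,
    )
    return verdict
```

Keeping the phases as separate calls, each conditioned on the previous output, is what pushes the model to ground its final judgment in concrete visual evidence rather than free-floating adjectives.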
The researchers found that ArtCoT significantly improved the aesthetic alignment of MLLMs with human preferences across different models, consistently achieving a substantial improvement in correlation scores compared to base prompting and zero-shot chain-of-thought prompting.
For instance, consider an image stylized after Jackson Pollock. A standard MLLM might describe it as “powerful” or “expressive” without citing specifics about the brushstrokes or color palette. In contrast, ArtCoT prompts the MLLM to first identify and describe these visual elements (“large, dynamic brushstrokes,” “vibrant color contrasts”), then analyze them within the context of Pollock’s oeuvre before generating a final assessment. This step-by-step breakdown helps mitigate hallucinations.
Despite the significant progress, ArtCoT does not eliminate hallucinations entirely. The researchers suggest further research to investigate more advanced prompting techniques, along with the exploration of model architectures specifically designed for aesthetic reasoning.
The study highlights the potential of MLLMs in art evaluation, while emphasizing the critical need for carefully designed prompting strategies to mitigate the inherent risks of hallucinations. The creation of MM-StyleBench offers a valuable resource for future research in this domain. The work could also have broader implications, extending to other subjective domains that require both visual and linguistic understanding.