AI Papers Reader

Personalized digests of latest AI research

CIGEVAL: AI Framework Achieves Human-Level Reliability in Evaluating AI-Generated Images

Researchers at the Harbin Institute of Technology (Shenzhen) have introduced CIGEVAL, a novel AI framework designed to comprehensively evaluate the quality of conditional image generation. This technology addresses a critical gap in the field by offering a task-agnostic, reliable, and explainable evaluation metric that closely aligns with human assessment.

Conditional image generation, where images are created based on specific prompts or reference images, has seen rapid advancements. Think of creating an image of a cat wearing a detective hat, or changing the color of a car in a photograph based on a text instruction. However, accurately judging the quality of these AI-generated images has remained a challenge. Existing methods often fall short due to their narrow focus, limited explainability, and discrepancies with human judgment.

CIGEVAL tackles these issues by employing a unified agentic framework powered by large multimodal models (LMMs). At its core, CIGEVAL functions like an intelligent agent, integrating a diverse set of tools and establishing a fine-grained evaluation process. This “toolbox” includes functionalities for object grounding (locating specific items within an image), highlighting regions of interest, identifying pixel differences, and generating scene graphs (structured representations of objects and their relationships).
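To make the "identifying pixel differences" tool more concrete, here is a minimal sketch of what such a tool might compute. It is not CIGEVAL's implementation: the function names `pixel_difference` and `bounding_box`, the threshold, and the toy grayscale images (nested lists of integers) are all assumptions chosen for illustration.

```python
from typing import List, Optional, Tuple

# Toy stand-in for an image: rows of grayscale values in the range 0-255.
Image = List[List[int]]


def pixel_difference(a: Image, b: Image, threshold: int = 10) -> List[Tuple[int, int]]:
    """Return (row, col) positions where the two images differ by more than threshold."""
    if len(a) != len(b) or any(len(ra) != len(rb) for ra, rb in zip(a, b)):
        raise ValueError("images must have the same shape")
    return [
        (r, c)
        for r, (row_a, row_b) in enumerate(zip(a, b))
        for c, (pa, pb) in enumerate(zip(row_a, row_b))
        if abs(pa - pb) > threshold
    ]


def bounding_box(points: List[Tuple[int, int]]) -> Optional[Tuple[int, int, int, int]]:
    """Smallest (top, left, bottom, right) box covering the points, or None if empty."""
    if not points:
        return None
    rows = [r for r, _ in points]
    cols = [c for _, c in points]
    return (min(rows), min(cols), max(rows), max(cols))


# Hypothetical 4x4 source image and an edited copy with three changed pixels.
source = [[0] * 4 for _ in range(4)]
edited = [row[:] for row in source]
for r, c in [(1, 1), (1, 2), (2, 2)]:
    edited[r][c] = 200

changed = pixel_difference(source, edited)
box = bounding_box(changed)
```

An evaluator could pass a region like `box` to a highlighting step so that later analysis focuses on the part of the image that was actually edited.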

For example, suppose CIGEVAL is asked to evaluate an image produced from the instruction “replace the glasses with the subject”. It can use its grounding tool to locate the glasses in both the source and subject images, then apply other tools to compare the shapes and designs of the two pairs of glasses, arriving at a score consistent with that of the human annotators.
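The tool-calling pattern in this example can be pictured as a small dispatch loop. The sketch below is only an illustration under stated assumptions: the tool functions are stubs that return canned strings, the names `grounding`, `highlight`, `TOOLS`, and `run_plan` are hypothetical, and the plan is hard-coded, whereas in CIGEVAL the LMM itself decides which tools to call.

```python
from typing import Callable, Dict, List, Tuple


# Stub tools: real ones would call vision models; these return canned observations.
def grounding(image: str, query: str) -> str:
    return f"box for '{query}' in {image}"


def highlight(image: str, region: str) -> str:
    return f"{image} with {region} highlighted"


# Registry mapping tool names to callables (hypothetical names).
TOOLS: Dict[str, Callable[..., str]] = {
    "grounding": grounding,
    "highlight": highlight,
}


def run_plan(plan: List[Tuple[str, Dict[str, str]]]) -> List[str]:
    """Execute each (tool_name, kwargs) step in order and collect the observations."""
    observations = []
    for name, kwargs in plan:
        if name not in TOOLS:
            raise KeyError(f"unknown tool: {name}")
        observations.append(TOOLS[name](**kwargs))
    return observations


# Hard-coded plan standing in for the agent's own tool choices.
plan = [
    ("grounding", {"image": "source.png", "query": "glasses"}),
    ("grounding", {"image": "subject.png", "query": "glasses"}),
    ("highlight", {"image": "source.png", "region": "glasses"}),
]
observations = run_plan(plan)
```

In a full agent, the observations would be fed back to the model, which would either request another tool or produce a final score with an explanation.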

A key innovation is CIGEVAL’s ability to synthesize evaluation trajectories for fine-tuning smaller LMMs. This allows smaller models (7B-parameter open-source LMMs) to autonomously select appropriate tools and perform nuanced analyses, mirroring human-level reasoning.

Experiments across seven prominent conditional image generation tasks demonstrated that CIGEVAL, particularly the GPT-4o version, achieves a correlation of 0.4625 with human assessments, close to the inter-annotator correlation of 0.47. Impressively, when implemented with 7B open-source LMMs using only 2.3K training trajectories, CIGEVAL surpassed previous GPT-4o-based state-of-the-art methods.
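For readers who want to see how agreement between an automatic metric and human raters can be measured, here is a minimal sketch. The article does not say which correlation statistic is reported, so the use of Spearman rank correlation is an assumption, and the scores at the bottom are made up purely for illustration.

```python
from math import sqrt
from typing import List, Sequence


def ranks(values: Sequence[float]) -> List[float]:
    """Average 1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with the one at position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = avg
        i = j + 1
    return result


def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    if sx == 0 or sy == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / (sx * sy)


def spearman(x: Sequence[float], y: Sequence[float]) -> float:
    """Spearman rank correlation: Pearson correlation of the average ranks."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    return pearson(ranks(x), ranks(y))


# Made-up scores for five generated images (not data from the paper).
human_scores = [0.9, 0.4, 0.7, 0.2, 0.6]
metric_scores = [0.8, 0.5, 0.6, 0.1, 0.7]
rho = spearman(human_scores, metric_scores)
```

The same function applied to two human annotators' scores would give an inter-annotator correlation, which is the kind of ceiling the 0.47 figure represents.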

Case studies further highlight CIGEVAL’s capability in identifying subtle issues related to subject consistency and adherence to control guidance. This suggests great potential for automating the evaluation of image generation tasks with human-level reliability, making it easier to measure progress in the field.