AI Papers Reader

Personalized digests of latest AI research

View on GitHub

JarvisArt: An Intelligent Agent for Liberating Photo Retouching Creativity

JarvisArt is a novel, intelligent agent designed to empower users in photo retouching, bridging the gap between professional-grade editing capabilities and user-friendly accessibility. Developed by researchers, JarvisArt leverages a multi-modal large language model (MLLM) to understand user intent and orchestrate over 200 retouching tools within Adobe Lightroom, mimicking the nuanced decision-making process of professional artists.

Traditional photo editing tools like Adobe Lightroom offer extensive control but require significant expertise and time. While AI solutions have automated many processes, they often lack the flexibility and fine-grained control needed for personalized edits. JarvisArt aims to solve this by acting as an intelligent artistic assistant. It can interpret natural language instructions, even those with vague or complex requests, and translate them into precise editing actions within Lightroom.

The system’s innovation lies in its two-stage training process. First, it undergoes supervised fine-tuning using a “Chain-of-Thought” approach. This trains the agent to break down editing tasks into logical steps, much like a human artist would, ensuring it can reason through complex requests. This is followed by a reinforcement learning phase called Group Relative Policy Optimization for Retouching (GRPO-R). This phase further refines the agent’s decision-making and tool proficiency by rewarding accuracy in tool selection and parameter tuning, as well as overall perceptual quality of the edited image.

A key component of JarvisArt is the Agent-to-Lightroom (A2L) protocol, which enables seamless communication and execution of edits within Lightroom. This integration allows for non-destructive editing and provides feedback between the agent and the software.

To evaluate JarvisArt’s performance, the researchers created MMArt-Bench, a new benchmark dataset comprising real-world user edits and complex retouching scenarios. The results show that JarvisArt significantly outperforms existing methods, including advanced models like GPT-4o, by a considerable margin in terms of content fidelity and instruction following. For instance, it achieved a 60% improvement in average pixel-level metrics for content fidelity.

JarvisArt’s capabilities extend to both global adjustments across an entire image and localized edits to specific regions identified by the user, for example, by drawing a bounding box around a face or object. This allows for highly precise and personalized retouching. The system’s user-friendly interface and intuitive interaction model make advanced photo editing accessible to a wider audience, democratizing creative control over visual storytelling.