2025-11-21
AI for Software Development
LLM-Powered Fully Automated Chaos Engineering: Towards Enabling Anyone to Build Resilient Software Systems at Low Cost
Relevance: This paper is highly relevant as it uses LLM agents to automate the entire Chaos Engineering (CE) cycle, which spans complex software engineering tasks such as requirement definition, code generation, testing, and debugging. From an HCI perspective, ChaosEater significantly lowers the barrier to entry for building resilient systems, enabling non-experts to leverage sophisticated AI for critical infrastructure tasks. The automation of planning and post-experiment analysis, previously manual and labor-intensive, represents a major shift in how developers interact with resilience testing tools, moving towards fully autonomous AI assistance in the software lifecycle.
💡 Summary 📄 Full paper
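The paper's concrete pipeline is not reproduced here, but the overall shape of a fully automated CE cycle can be sketched as a loop over hypothesis definition, experiment generation, execution, and improvement. Every function below is a placeholder assumption, not ChaosEater's actual interface.

```python
# Minimal sketch of a fully automated Chaos Engineering loop driven by an LLM agent.
# The agent call and experiment execution are stubs; none of this is ChaosEater's API.

from dataclasses import dataclass

@dataclass
class Experiment:
    hypothesis: str   # steady-state hypothesis, e.g. "checkout survives loss of one replica"
    fault_plan: str   # fault-injection plan generated for that hypothesis
    passed: bool

def llm(prompt: str) -> str:
    """Placeholder for a call to an LLM agent."""
    return f"[agent output for: {prompt[:40]}...]"

def execute(fault_plan: str) -> bool:
    """Placeholder for running the experiment against the target system."""
    return False  # pretend the run exposes a weakness

def ce_cycle(system_spec: str, max_rounds: int = 3) -> list[Experiment]:
    history = []
    for _ in range(max_rounds):
        hypothesis = llm(f"Define a steady-state hypothesis for: {system_spec}")      # requirements
        fault_plan = llm(f"Generate a fault-injection experiment for: {hypothesis}")  # code generation
        passed = execute(fault_plan)                                                  # testing
        history.append(Experiment(hypothesis, fault_plan, passed))
        if passed:
            break
        # Debugging / improvement: fold the proposed fix back into the spec and re-verify.
        system_spec += "\n" + llm(f"Propose a resilience fix for the failure in: {fault_plan}")
    return history

print(len(ce_cycle("a two-service Kubernetes deployment")))
```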
Agent READMEs: An Empirical Study of Context Files for Agentic Coding
Relevance: This empirical study directly addresses the Human-Computer Interaction aspect of AI for Software Development by examining how developers instruct and provide context to agentic coding tools. The findings reveal a critical HCI gap: while developers provide functional context (e.g., build commands), they rarely specify non-functional requirements like security or performance. This highlights the need for better tooling and interaction design that guides developers to establish necessary guardrails, ensuring that AI-generated code is not just functional but also safe and robust, thus improving the reliability and trustworthiness of AI assistance.
💡 Summary 📄 Full paper
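The gap the study identifies, functional context without non-functional guardrails, can be illustrated with a small hypothetical audit script that checks whether a repository's agent context file mentions security, performance, or testing concerns. The file name and keyword lists are assumptions for exposition, not the paper's methodology.

```python
# Hypothetical audit: which non-functional concerns does an agent context file mention?
# File names and keyword lists are illustrative assumptions, not the paper's instrument.

from pathlib import Path

NON_FUNCTIONAL_KEYWORDS = {
    "security":    ["secret", "sanitize", "injection", "auth"],
    "performance": ["latency", "benchmark", "memory", "complexity"],
    "testing":     ["coverage", "unit test", "regression"],
}

def audit_context_file(path: str) -> dict[str, bool]:
    p = Path(path)
    text = p.read_text(encoding="utf-8").lower() if p.exists() else ""
    return {
        concern: any(keyword in text for keyword in keywords)
        for concern, keywords in NON_FUNCTIONAL_KEYWORDS.items()
    }

# e.g. an AGENTS.md or similar context file at the repository root
print(audit_context_file("AGENTS.md"))
```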
LoCoBench-Agent: An Interactive Benchmark for LLM Agents in Long-Context Software Engineering
Relevance: This paper introduces an interactive benchmark specifically designed for evaluating LLM agents in realistic, long-context software engineering tasks. It moves beyond single-turn evaluations to assess crucial HCI-relevant dimensions like multi-turn conversation, tool usage efficiency, and error recovery. This framework is vital for advancing usable AI in software development, as it provides metrics that measure how effectively agents handle complex, multi-step, and adaptive reasoning—qualities essential for reliable and collaborative AI assistants in real-world professional environments.
💡 Summary 📄 Full paper
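As a rough illustration of the interaction-level metrics such a benchmark might report, the sketch below computes a tool-usage efficiency and an error-recovery rate over one multi-turn session. The metric definitions are assumptions chosen for clarity, not LoCoBench-Agent's published formulas.

```python
# Illustrative multi-turn metrics of the kind an interactive agent benchmark might track.
# Definitions below are assumptions for exposition, not LoCoBench-Agent's exact metrics.

from dataclasses import dataclass

@dataclass
class Turn:
    tool_calls: int          # tool invocations issued in this turn
    useful_tool_calls: int   # calls whose output was actually used downstream
    error: bool              # did the turn end in a tool/compile/runtime error?
    recovered: bool          # if so, was the error fixed in a later turn?

def tool_usage_efficiency(turns: list[Turn]) -> float:
    total = sum(t.tool_calls for t in turns)
    useful = sum(t.useful_tool_calls for t in turns)
    return useful / total if total else 0.0

def error_recovery_rate(turns: list[Turn]) -> float:
    errors = [t for t in turns if t.error]
    return sum(t.recovered for t in errors) / len(errors) if errors else 1.0

session = [Turn(3, 2, False, False), Turn(4, 1, True, True), Turn(2, 2, False, False)]
print(tool_usage_efficiency(session), error_recovery_rate(session))
```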
AI Agents
MiroThinker: Pushing the Performance Boundaries of Open-Source Research Agents via Model, Context, and Interactive Scaling
Relevance: MiroThinker explores ‘interactive scaling’ as a third dimension of agent performance improvement, where the agent is trained via reinforcement learning to handle deeper and more frequent agent-environment interactions. This is highly relevant to HCI as it focuses on designing agents that actively leverage environmental feedback and external tool calls to refine complex, multi-turn reasoning trajectories. This research provides a roadmap for building more capable and robust agents that can sustain complex collaboration with humans in research or problem-solving workflows.
💡 Summary 📄 Full paper
FreeAskWorld: An Interactive and Closed-Loop Simulator for Human-Centric Embodied AI
Relevance: This paper introduces a simulation framework designed for human-centric embodied AI, integrating LLMs for high-level planning and interaction grounded in theories of social cognition. It extends tasks like Vision-and-Language Navigation (VLN) into interaction-enriched settings where agents actively seek and interpret navigational guidance. This work is fundamental for HCI research in embodied agents, emphasizing interaction itself as a crucial information modality necessary for creating agents capable of naturalistic and socially aware collaboration with human users.
💡 Summary 📄 Full paper
WebCoach: Self-Evolving Web Agents with Cross-Session Memory Guidance
Relevance: WebCoach addresses a major challenge in agent design: the lack of long-term robustness and learning across sessions. It introduces a model-agnostic framework that provides agents with persistent cross-session memory and a ‘Coach’ that injects task-specific advice retrieved from past experiences. This self-evolving capability is key for HCI, ensuring agents become more reliable over time without retraining, improving user trust, and reducing the cognitive load on humans who would otherwise have to correct repetitive agent errors.
💡 Summary 📄 Full paper
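A minimal sketch of the idea, assuming a simple episodic store and naive lexical retrieval rather than WebCoach's actual components: past episodes are recorded with a distilled lesson, and a coach injects the most relevant lessons into the prompt for a new task.

```python
# Sketch of cross-session memory plus a "coach" that injects retrieved advice into the
# agent's prompt. Retrieval and advice formats are simplified assumptions, not WebCoach's design.

from dataclasses import dataclass, field

@dataclass
class Episode:
    task: str
    outcome: str     # e.g. "success" or a short failure description
    lesson: str      # distilled advice, e.g. "dismiss the cookie banner before clicking"

@dataclass
class CrossSessionMemory:
    episodes: list[Episode] = field(default_factory=list)

    def record(self, episode: Episode) -> None:
        self.episodes.append(episode)

    def retrieve(self, task: str, k: int = 3) -> list[Episode]:
        # Naive lexical-overlap retrieval; a real system would use embeddings.
        def overlap(e: Episode) -> int:
            return len(set(e.task.lower().split()) & set(task.lower().split()))
        return sorted(self.episodes, key=overlap, reverse=True)[:k]

def coach_prompt(task: str, memory: CrossSessionMemory) -> str:
    advice = [e.lesson for e in memory.retrieve(task) if e.outcome != "success"]
    hints = "\n".join(f"- {a}" for a in advice) or "- (no prior lessons)"
    return f"Task: {task}\nLessons from earlier sessions:\n{hints}"

memory = CrossSessionMemory()
memory.record(Episode("book a flight on example.com", "failed", "close the login modal first"))
print(coach_prompt("book a hotel on example.com", memory))
```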
LLM Evaluation Methods
TopoPerception: A Shortcut-Free Evaluation of Global Visual Perception in Large Vision-Language Models
Relevance: This paper introduces TopoPerception, a novel benchmark leveraging topological properties for shortcut-free evaluation of Large Vision-Language Models (LVLMs). This evaluation method is crucial for robustness testing, as it exposes fundamental deficits in global visual perception that are often masked by local shortcuts in conventional benchmarks. Identifying such profound limitations, where models perform no better than random chance, is essential for guiding the development of reliable models and ensuring user trust in LVLMs deployed in high-stakes visual reasoning applications.
📄 Full paper
MVI-Bench: A Comprehensive Benchmark for Evaluating Robustness to Misleading Visual Inputs in LVLMs
Relevance: MVI-Bench is the first comprehensive benchmark designed specifically to evaluate LVLM robustness against misleading visual inputs, addressing a critical gap in trustworthy AI development. It introduces a hierarchical taxonomy of misleading inputs (Concept, Attribute, Relationship) and the MVI-Sensitivity metric for granular assessment. This focus on robustness directly relates to HCI concerns around model reliability and safety, providing necessary tools to identify vulnerabilities and guide the creation of more dependable models that maintain user trust under adversarial visual conditions.
💡 Summary 📄 Full paper
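A generic way to quantify this kind of robustness is to score paired clean and misleading inputs and report the accuracy drop per taxonomy category; the sketch below does exactly that, though it is not claimed to match MVI-Bench's exact MVI-Sensitivity definition.

```python
# Illustrative robustness measure: accuracy drop between paired clean and misleading inputs,
# aggregated per category. A generic formulation, not necessarily MVI-Bench's exact metric.

from dataclasses import dataclass

@dataclass
class PairedResult:
    category: str        # e.g. "concept", "attribute", "relationship"
    correct_clean: bool  # answer correct on the ordinary image
    correct_misled: bool # answer correct on the misleading variant

def sensitivity_by_category(results: list[PairedResult]) -> dict[str, float]:
    drops: dict[str, list[int]] = {}
    for r in results:
        drops.setdefault(r.category, []).append(int(r.correct_clean) - int(r.correct_misled))
    # Mean accuracy drop per category; higher means the model is more easily misled.
    return {cat: sum(d) / len(d) for cat, d in drops.items()}

results = [
    PairedResult("attribute", True, False),
    PairedResult("attribute", True, True),
    PairedResult("relationship", True, False),
]
print(sensitivity_by_category(results))  # {'attribute': 0.5, 'relationship': 1.0}
```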
Aligning Generative Music AI with Human Preferences: Methods and Challenges
Relevance: This paper advocates for the systematic use of preference alignment techniques to bridge the gap between computational optimization and subjective human musical appreciation. It discusses methods like MusicRL and preference optimization, which fundamentally rely on large-scale human-in-the-loop evaluation to capture nuanced preferences (e.g., temporal coherence and subjective quality). This work directly addresses the HCI goal of evaluation methods that prioritize user satisfaction and alignment with human values over mere computational fidelity, particularly in creative generative AI applications.
💡 Summary 📄 Full paper
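Preference optimization over paired generations can be made concrete with the standard DPO loss, shown below on toy sequence log-probabilities for a 'chosen' and a 'rejected' music clip. This is the generic text-domain formulation, offered only as a reference point; the paper's own methods (e.g., MusicRL) may differ.

```python
# Generic DPO-style preference loss over paired generations (chosen vs. rejected), computed
# from summed token log-probabilities. Standard formulation, not claimed to be the paper's method.

import torch

def dpo_loss(logp_chosen: torch.Tensor, logp_rejected: torch.Tensor,
             ref_logp_chosen: torch.Tensor, ref_logp_rejected: torch.Tensor,
             beta: float = 0.1) -> torch.Tensor:
    # Implicit reward margin: how much more the policy (relative to a frozen reference model)
    # prefers the human-chosen clip over the rejected one.
    margin = (logp_chosen - ref_logp_chosen) - (logp_rejected - ref_logp_rejected)
    return -torch.nn.functional.logsigmoid(beta * margin).mean()

# Toy batch: summed token log-probs for two music clips in one preference pair.
loss = dpo_loss(torch.tensor([-120.0]), torch.tensor([-135.0]),
                torch.tensor([-125.0]), torch.tensor([-130.0]))
print(float(loss))
```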
Reinforcement Learning
NORA-1.5: A Vision-Language-Action Model Trained using World Model- and Action-based Preference Rewards
Relevance: This paper applies Direct Preference Optimization (DPO) to Vision-Language-Action (VLA) models, utilizing novel reward models that combine an action-conditioned world model (WM) with a deviation-from-ground-truth heuristic. This reward-guided post-training is crucial for improving agent reliability and generalization in embodied tasks. From an HCI perspective, defining preference rewards based on verifiable outcomes (like goal progress and deviation) is key to designing training environments where human guidance, even implicitly via collected preferences, leads to dependable and predictable agent behavior.
💡 Summary 📄 Full paper
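A hedged sketch of how such a composite preference signal might be assembled, assuming a scalar world-model progress estimate and an L2 deviation penalty with illustrative weights; the paper's actual reward models and pair-construction procedure are not reproduced here.

```python
# Sketch: score candidate action chunks with (a) an action-conditioned world-model progress
# estimate and (b) deviation from a ground-truth trajectory, then rank them into preference
# pairs. Weights and functional forms are illustrative assumptions only.

import numpy as np

def preference_score(wm_goal_progress: float, actions: np.ndarray,
                     gt_actions: np.ndarray, alpha: float = 1.0, beta: float = 0.5) -> float:
    deviation = float(np.linalg.norm(actions - gt_actions))   # lower is better
    return alpha * wm_goal_progress - beta * deviation

# Two candidate action chunks for the same observation: build a (chosen, rejected) pair.
gt = np.array([0.1, 0.0, 0.2])
candidates = {
    "a": (0.8, np.array([0.1, 0.1, 0.2])),   # (world-model progress, predicted actions)
    "b": (0.4, np.array([0.5, -0.2, 0.0])),
}
scored = {k: preference_score(p, a, gt) for k, (p, a) in candidates.items()}
chosen, rejected = max(scored, key=scored.get), min(scored, key=scored.get)
print(scored, chosen, rejected)
```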
Agent-R1: Training Powerful LLM Agents with End-to-End Reinforcement Learning
Relevance: This foundational work systematically clarifies RL methodologies for LLM Agents by extending the Markov Decision Process (MDP) framework to define key agent components. It introduces Agent-R1, a modular framework for RL-based LLM Agents. This is highly relevant to advancing the application of RL to agents, the mechanism underpinning sophisticated agent behavior. Understanding and formalizing this framework is essential for future HCI research focused on how humans can effectively guide the learning process or interpret the complex policies developed by RL-trained agents.
💡 Summary 📄 Full paper
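The MDP framing can be illustrated with a generic rollout loop for a tool-using LLM agent: the state is the accumulated interaction history, actions are either tool calls or final answers, and an outcome-based reward is assigned at the end of the episode. All names below are placeholders, not Agent-R1's interfaces.

```python
# Generic MDP framing of a tool-using LLM agent. The policy and environment are stubs;
# a real setup would sample from the model and execute real tools.

def policy(state: list[str]) -> str:
    """Placeholder LLM policy; returns either 'CALL:<tool query>' or 'ANSWER:<text>'."""
    return "ANSWER:42" if any(s.startswith("OBS:") for s in state) else "CALL:search the docs"

def environment_step(action: str) -> str:
    """Placeholder tool execution returning an observation string."""
    return "OBS:result for " + action.removeprefix("CALL:")

def rollout(task: str, max_steps: int = 5) -> tuple[list[str], float]:
    state = [f"TASK:{task}"]
    for _ in range(max_steps):
        action = policy(state)
        state.append(action)
        if action.startswith("ANSWER:"):
            break
        state.append(environment_step(action))         # transition: append tool observation
    reward = 1.0 if state[-1] == "ANSWER:42" else 0.0  # terminal, outcome-based reward
    return state, reward

trajectory, reward = rollout("compute the answer")
print(trajectory, reward)
```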
SafeGRPO: Self-Rewarded Multimodal Safety Alignment via Rule-Governed Policy Optimization
Relevance: SafeGRPO introduces a self-rewarded safety alignment framework that integrates rule-governed reward construction into Group Relative Policy Optimization (GRPO). This technique tackles compositional safety risks in MLLMs by enforcing structured, step-guided safety reasoning. For HCI, this is vital for maintaining safety and alignment with human values. By making the reasoning process interpretable and verifiable through rule-governed rewards, SafeGRPO addresses the need for designing RL systems that are transparent and trustworthy when deployed in sensitive, multimodal environments.
💡 Summary 📄 Full paper
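GRPO's core mechanic, normalizing rewards within a group of responses sampled for the same prompt, is standard and easy to show; the rule-based reward components below (a format check plus a safety-compliance check) are illustrative stand-ins rather than SafeGRPO's actual rules.

```python
# Group-relative advantages as in GRPO: rewards are normalized within a group of responses
# sampled for the same prompt. The rule-based reward below is an illustrative stand-in.

import numpy as np

def rule_reward(response: str) -> float:
    follows_format = response.startswith("<think>") and "</think>" in response
    refuses_unsafe_request = "I can't help with that" in response
    return 0.5 * follows_format + 0.5 * refuses_unsafe_request

def group_relative_advantages(rewards: np.ndarray, eps: float = 1e-6) -> np.ndarray:
    # Advantage of each sampled response relative to its group's mean, in units of std.
    return (rewards - rewards.mean()) / (rewards.std() + eps)

group = [
    "<think>steps...</think> Sure, here is how to do it.",
    "<think>steps...</think> I can't help with that request.",
    "no structured reasoning, unsafe compliance",
]
rewards = np.array([rule_reward(r) for r in group])
print(rewards, group_relative_advantages(rewards))
```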
Explainable AI
Error-Driven Scene Editing for 3D Grounding in Large Language Models
Relevance: This paper proposes DEER-3D, an error-driven framework that uses 3D scene editing to generate precise visual counterfactuals, mitigating grounding biases in 3D-LLMs. By diagnosing predicate-level errors (e.g., attribute or spatial relation failures) and applying minimal edits to produce counterfactual supervision, the framework provides inherent explanations. This aligns directly with XAI goals, specifically counterfactual explanations, by showing users/developers exactly how small changes in the input (the 3D scene) would affect the model’s grounding output, helping to understand decision boundaries.
💡 Summary 📄 Full paper
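The counterfactual idea can be illustrated with a toy symbolic scene: minimally edit one predicate (here, an object's color), re-query a stub grounding model, and check whether the prediction changes. This is a generic probe under simplified assumptions, not DEER-3D's editing pipeline.

```python
# Generic counterfactual probe over a symbolic scene: flip one attribute and check whether
# the grounding prediction changes. Scene format and model stub are illustrative only.

from copy import deepcopy

scene = {
    "chair_1": {"color": "red",  "near": "window"},
    "chair_2": {"color": "blue", "near": "door"},
}

def ground(scene: dict, query: str) -> str:
    """Stub grounding model: picks the object whose attribute values all appear in the query."""
    for obj, attrs in scene.items():
        if all(value in query for value in attrs.values()):
            return obj
    return "none"

query = "the red chair near the window"
original = ground(scene, query)

# Counterfactual edit: change only the color attribute of the originally grounded object.
edited = deepcopy(scene)
edited[original]["color"] = "green"
counterfactual = ground(edited, query)

# If the prediction is unchanged, the model ignored the edited predicate (a grounding bias).
print(original, counterfactual, "predicate is causally used" if original != counterfactual else "bias")
```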
REVISOR: Beyond Textual Reflection, Towards Multimodal Introspective Reasoning in Long-Form Video Understanding
Relevance: REVISOR introduces a framework for tool-augmented multimodal reflection, enabling MLLMs to construct introspective reasoning processes across textual and visual modalities. Crucially, the Dual Attribution Decoupled Reward (DADR) mechanism enforces causal alignment between the model’s reasoning and the selected video evidence. This explicit alignment acts as a built-in explanation mechanism, promoting transparency by showing which visual segments informed the model’s final decision, thereby advancing interpretability in complex, long-form video understanding tasks.
💡 Summary 📄 Full paper
SafeGRPO: Self-Rewarded Multimodal Safety Alignment via Rule-Governed Policy Optimization
Relevance: While also relevant to RL and Agents, SafeGRPO has strong XAI implications. It enforces ‘step-guided safety thinking’ and uses rule-governed reward construction to achieve verifiable optimization. This structured reasoning trace acts as an explanation, offering transparency into how the MLLM processes and adheres to safety constraints during decision-making. This approach is superior to opaque safety filters, providing interpretable assurance that the model is aligned with human safety values through a transparent, rule-based reasoning process.
💡 Summary 📄 Full paper