2026-05-15
AI for Software Development
Orchard: An Open-Source Agentic Modeling Framework
Relevance: This paper introduces Orchard-SWE, a specialized agentic modeling recipe designed for coding tasks. It addresses the lack of scalable open-source infrastructure for software engineering agents by distilling trajectories from proprietary models and applying credit-assignment supervised fine-tuning. From an HCI perspective, Orchard enables more accessible development of autonomous coding assistants that can solve complex repository-level issues. By providing lightweight sandbox lifecycle management, it allows researchers to build and evaluate tools that assist developers in multi-turn interactions, reducing the cognitive load of manual code refactoring and bug fixing.
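As a rough illustration of credit-assignment supervised fine-tuning, the sketch below weights each token's cross-entropy loss by a per-step credit score, so the parts of a distilled trajectory judged most useful dominate the update. The weighting scheme and tensor shapes are assumptions for illustration, not Orchard-SWE's actual recipe.

```python
import torch
import torch.nn.functional as F

def credit_weighted_sft_loss(logits, target_ids, step_credit, step_ids):
    """Illustrative credit-assignment SFT loss.

    logits:      (T, V) model logits over the vocabulary for each token
    target_ids:  (T,)   target token ids from a distilled trajectory
    step_credit: (S,)   per-step credit in [0, 1], e.g. from outcome attribution
    step_ids:    (T,)   which trajectory step each token belongs to
    """
    token_loss = F.cross_entropy(logits, target_ids, reduction="none")  # (T,)
    token_weight = step_credit[step_ids]  # broadcast step credit to its tokens
    return (token_weight * token_loss).sum() / token_weight.sum().clamp(min=1e-8)

# Toy usage: 6 tokens spread over 3 steps, with the middle step judged most useful.
logits = torch.randn(6, 100)
targets = torch.randint(0, 100, (6,))
credit = torch.tensor([0.2, 1.0, 0.5])
steps = torch.tensor([0, 0, 1, 1, 2, 2])
print(credit_weighted_sft_loss(logits, targets, credit, steps))
```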
An Empirical Study of Automating Agent Evaluation
Relevance: While focused on evaluation, this study specifically examines the ability of frontier coding assistants to generate complex software artifacts, such as evaluation pipelines. It reveals a gap between general coding ability and domain-specific software engineering, showing that simple prompting leads to over-engineered and brittle solutions. The introduction of EvalAgent, which uses procedural instructions and reusable code templates, demonstrates how AI can be better structured to assist in the software development lifecycle, particularly for the specialized task of building robust testing and metric-generation infrastructure.
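The reusable-template idea can be pictured with a deliberately small sketch: a fixed evaluation skeleton where the coding assistant only supplies the metric, which limits how much it can over-engineer. Function names and structure here are assumptions, not EvalAgent's actual templates.

```python
from typing import Callable, Iterable

def run_eval_pipeline(
    examples: Iterable[dict],
    predict: Callable[[dict], str],
    metric: Callable[[str, str], float],
) -> float:
    """Reusable evaluation skeleton: iterate, predict, score, aggregate.

    Only `metric` has to be authored per task, so the generated code stays small.
    """
    scores = [metric(predict(ex), ex["reference"]) for ex in examples]
    return sum(scores) / max(len(scores), 1)

# Toy usage with an exact-match metric supplied as the "filled-in" piece.
examples = [{"input": "2+2", "reference": "4"}, {"input": "3+3", "reference": "6"}]
predict = lambda ex: str(eval(ex["input"]))          # stand-in for a model call
exact_match = lambda pred, ref: float(pred == ref)   # the authored metric
print(run_eval_pipeline(examples, predict, exact_match))
```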
AI Agents
PREPING: Building Agent Memory without Tasks
Relevance: PREPING addresses the ‘cold-start’ problem in agent deployment by allowing agents to build procedural memory through self-generated synthetic practice before encountering real tasks. This is a significant advancement for AI agents, as it reduces the reliance on curated demonstrations. From an HCI standpoint, this reduces the user’s burden of providing initial training data or guidance. The framework ensures that agents arrive in new digital environments with pre-established skills, making them more autonomous and reducing the deployment time and cost for the end user.
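One way to picture task-free practice is the loop below: the agent proposes its own synthetic tasks, attempts them, and stores the action sequences that worked as procedural memory before any real task arrives. All names are hypothetical; this is a hedged reading of the idea, not PREPING's API.

```python
import random

def build_procedural_memory(propose_task, attempt, n_practice=50, seed=0):
    """Sketch of cold-start practice: self-generated tasks -> stored procedures.

    propose_task(rng) -> a synthetic task description (string)
    attempt(task)     -> (success: bool, actions: list[str])
    """
    rng = random.Random(seed)
    memory = {}  # task description -> action sequence that succeeded
    for _ in range(n_practice):
        task = propose_task(rng)
        success, actions = attempt(task)
        if success:
            memory[task] = actions
    return memory

# Toy environment: tasks are "open <app>", and roughly half the attempts succeed.
apps = ["mail", "files", "browser"]
propose = lambda rng: f"open {rng.choice(apps)}"
attempt = lambda task: (random.random() > 0.5, ["click_menu", task.split()[-1]])
memory = build_procedural_memory(propose, attempt)
print(len(memory), "procedures stored before any real task arrives")
```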
EvolveMem: Self-Evolving Memory Architecture via AutoResearch for LLM Agents
Relevance: EvolveMem introduces a self-evolving memory architecture where the agent autonomously optimizes its own retrieval configurations and strategies. This move from fixed to adaptive infrastructure allows agents to better handle long-term, multi-session interactions. In HCI, this is crucial for building personalized agents that grow more effective over time. By diagnosing its own failures and proposing architectural adjustments, the agent minimizes the need for manual tuning by human developers, leading to a more seamless and intelligent user experience in long-horizon tasks.
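The diagnose-and-adjust loop can be approximated as a search over retrieval configurations scored against a small diagnostic set, as in the sketch below; EvolveMem's actual search space and failure diagnosis are richer than this assumed simplification.

```python
import itertools

def evolve_retrieval_config(evaluate, chunk_sizes=(128, 256, 512), top_ks=(2, 4, 8)):
    """Pick the retrieval configuration with the best diagnostic score.

    evaluate(config) -> float, e.g. answer accuracy when the agent retrieves
    from its own memory using that configuration.
    """
    best_config, best_score = None, float("-inf")
    for chunk, k in itertools.product(chunk_sizes, top_ks):
        config = {"chunk_size": chunk, "top_k": k}
        score = evaluate(config)
        if score > best_score:
            best_config, best_score = config, score
    return best_config, best_score

# Toy evaluator that happens to prefer mid-sized chunks and a moderate top-k.
fake_eval = lambda c: 1.0 - abs(c["chunk_size"] - 256) / 512 - abs(c["top_k"] - 4) / 16
print(evolve_retrieval_config(fake_eval))
```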
Beyond Individual Intelligence: Surveying Collaboration, Failure Attribution, and Self-Evolution in LLM-based Multi-Agent Systems
Relevance: This survey provides a comprehensive roadmap for multi-agent systems, focusing on how specialized agents can coordinate and self-evolve. It identifies the ‘LIFE’ progression (Foundation, Integration, Fault Attribution, Evolution) as a framework for building collective intelligence. For HCI researchers, this paper highlights the causal dependencies in agent collaboration and the risks of error propagation. Understanding these dynamics is essential for designing multi-agent interfaces where humans can effectively monitor, diagnose, and collaborate with groups of autonomous software entities.
LLM Evaluation Methods
MemEye: A Visual-Centric Evaluation Framework for Multimodal Agent Memory
Relevance: MemEye addresses a critical gap in multimodal agent evaluation: the preservation of fine-grained visual evidence. Many current benchmarks allow agents to ‘cheat’ using text-based traces. MemEye forces agents to reason over pixel-level details and visual state changes over time. From an HCI perspective, this is vital for ensuring that visual assistants (e.g., for navigation or accessibility) truly understand the visual environment rather than relying on superficial textual summaries, thereby improving user trust and system reliability in visually intensive tasks.
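The ‘cheating with text traces’ concern suggests a simple diagnostic, sketched here as an assumption rather than MemEye's protocol: score the same memory questions with and without the visual frames, and treat a small gap as evidence that the visual evidence was never really needed.

```python
def visual_grounding_gap(questions, answer_with_vision, answer_text_only):
    """Accuracy with vision minus accuracy from text traces alone.

    A near-zero gap suggests the questions are answerable without pixels.
    """
    def accuracy(answer_fn):
        return sum(answer_fn(q) == q["gold"] for q in questions) / len(questions)
    return accuracy(answer_with_vision) - accuracy(answer_text_only)

# Toy usage: one question genuinely needs the frame, one does not.
questions = [
    {"id": "q1", "gold": "red", "text_answer": "unknown", "vision_answer": "red"},
    {"id": "q2", "gold": "3",   "text_answer": "3",       "vision_answer": "3"},
]
with_vision = lambda q: q["vision_answer"]
text_only = lambda q: q["text_answer"]
print(visual_grounding_gap(questions, with_vision, text_only))  # 0.5
```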
FutureSim: Replaying World Events to Evaluate Adaptive Agents
Relevance: FutureSim proposes a novel evaluation method by replaying real-world events in chronological order to test how agents adapt to new information. This shifts evaluation from static benchmarks to dynamic, open-ended simulations. This is highly relevant to HCI because real-world AI deployment happens in non-stationary environments. By measuring Brier skill scores and adaptation over long horizons, FutureSim provides a more realistic assessment of how agents will perform when assisting users with volatile, real-world information, such as news or financial events.
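Since the entry leans on Brier skill scores, here is the standard computation for binary outcomes; using the running base rate as the reference forecaster is my assumption, not necessarily FutureSim's choice.

```python
def brier_score(probs, outcomes):
    """Mean squared error between forecast probabilities and 0/1 outcomes."""
    return sum((p - o) ** 2 for p, o in zip(probs, outcomes)) / len(outcomes)

def brier_skill_score(probs, outcomes, ref_probs):
    """BSS = 1 - BS_model / BS_reference; values above 0 beat the reference."""
    return 1.0 - brier_score(probs, outcomes) / brier_score(ref_probs, outcomes)

# Toy usage: the agent's forecasts against a constant base-rate reference.
outcomes = [1, 0, 1, 1, 0]
agent_probs = [0.8, 0.2, 0.7, 0.9, 0.3]
base_rate = sum(outcomes) / len(outcomes)
reference = [base_rate] * len(outcomes)
print(brier_skill_score(agent_probs, outcomes, reference))  # ~0.78
```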
RealICU: Do LLM Agents Understand Long-Context ICU Data? A Benchmark Beyond Behavior Imitation
Relevance: RealICU challenges the standard evaluation practice of imitating historical clinician actions, which may be suboptimal. Instead, it uses hindsight-annotated ‘gold’ labels from expert reviews of full patient trajectories. This is a landmark for evaluation in high-stakes HCI domains. It exposes critical failure modes like ‘anchoring bias’ in AI, where models stick to early, potentially wrong interpretations. This benchmark provides the rigorous, clinically grounded testing necessary to develop AI decision-support systems that humans can safely rely on in intensive care settings.
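The contrast with behavior imitation can be made concrete by scoring a model's actions against both the historical clinician action and the hindsight-annotated gold label; the field names and cases below are illustrative assumptions, not RealICU's schema.

```python
def agreement(decisions, key):
    """Fraction of cases where the model's action matches the given label."""
    return sum(d["model"] == d[key] for d in decisions) / len(decisions)

# Toy cases: the model sometimes matches what was done historically but not
# what full-trajectory review later judged correct, and vice versa.
decisions = [
    {"model": "increase_fluids",   "clinician": "increase_fluids", "hindsight_gold": "start_vasopressor"},
    {"model": "start_vasopressor", "clinician": "increase_fluids", "hindsight_gold": "start_vasopressor"},
    {"model": "start_vasopressor", "clinician": "no_change",       "hindsight_gold": "start_vasopressor"},
    {"model": "no_change",         "clinician": "no_change",       "hindsight_gold": "no_change"},
]
print("imitation score:", agreement(decisions, "clinician"))       # 0.5
print("hindsight score:", agreement(decisions, "hindsight_gold"))  # 0.75
```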
Reinforcement Learning
Self-Distilled Agentic Reinforcement Learning
Relevance: SDAR addresses the difficulty of applying RL to multi-turn agents where trajectory-level rewards are too coarse. By introducing token-level guidance through self-distillation, it provides a denser supervision signal. This is relevant to HCI because multi-turn interactions are the bedrock of human-agent collaboration. SDAR’s ability to improve performance across diverse environments like search and web shopping suggests it can help create more responsive and accurate agents that learn more effectively from complex, interactive user feedback than standard RL methods.
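A plausible reading of ‘token-level guidance through self-distillation’, with assumed shapes: a sparse trajectory-level reward term plus a dense per-token KL toward a frozen self-distilled teacher. This illustrates how the supervision becomes denser; it is not SDAR's actual objective.

```python
import torch
import torch.nn.functional as F

def dense_agentic_loss(policy_logits, teacher_logits, trajectory_reward,
                       logprob_taken, kl_weight=0.1):
    """Sparse trajectory reward + dense per-token guidance (illustrative).

    policy_logits, teacher_logits: (T, V) logits over the vocabulary
    trajectory_reward:             scalar outcome reward for the whole rollout
    logprob_taken:                 (T,) log-probs of the tokens actually taken
    """
    # REINFORCE-style term: one coarse reward spread over the whole trajectory.
    pg_loss = -trajectory_reward * logprob_taken.sum()
    # Dense term: per-token KL to a frozen self-distilled teacher.
    kl = F.kl_div(F.log_softmax(policy_logits, dim=-1),
                  F.log_softmax(teacher_logits, dim=-1),
                  log_target=True, reduction="batchmean")
    return pg_loss + kl_weight * kl

# Toy usage with random tensors standing in for a short rollout.
T, V = 8, 50
loss = dense_agentic_loss(torch.randn(T, V), torch.randn(T, V),
                          trajectory_reward=1.0,
                          logprob_taken=torch.log(torch.rand(T)))
print(loss)
```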
Learning to Explore: Scaling Agentic Reasoning via Exploration-Aware Policy Optimization
Relevance: This paper proposes an RL framework that teaches agents when to explore and when to execute, based on uncertainty. This is a fundamental challenge in the exploration-exploitation trade-off. From an HCI perspective, an agent that knows when it lacks information and needs to ‘explore’ (e.g., by asking a user a question or searching) is far more useful than one that blindly executes. This RL approach scales the agent’s ability to gather environmental feedback, leading to more efficient task completion and reduced user frustration.
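The explore-versus-execute decision can be illustrated with a simple entropy gate: when the policy's uncertainty over candidate actions is high, the agent takes an information-gathering step instead of committing. The threshold and action set are invented for illustration; the paper learns this behavior via RL rather than hard-coding it.

```python
import math

def entropy(probs):
    """Shannon entropy (in nats) of a discrete distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0)

def explore_or_execute(action_probs, threshold=0.8):
    """Return ('explore', None) when uncertain, else commit to the argmax action."""
    if entropy(action_probs.values()) > threshold:
        return "explore", None
    best = max(action_probs, key=action_probs.get)
    return "execute", best

# Confident case: commit. Uncertain case: gather more information first.
print(explore_or_execute({"click_buy": 0.9, "open_reviews": 0.05, "ask_user": 0.05}))
print(explore_or_execute({"click_buy": 0.4, "open_reviews": 0.35, "ask_user": 0.25}))
```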
PhyMotion: Structured 3D Motion Reward for Physics-Grounded Human Video Generation
Relevance: PhyMotion uses a physics simulator to provide structured rewards for RL-based video generation, focusing on kinematic plausibility and balance. In the context of HCI, this is essential for generating realistic human movements in virtual environments or simulations. By grounding the RL reward in physical laws rather than just 2D perceptual metrics, the model produces motion that is more convincing and less likely to exhibit ‘uncanny valley’ effects, which is critical for user immersion and the training of vision systems.
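To make a physics-grounded motion reward concrete, the sketch below penalizes implausible joint speeds and a center of mass drifting away from its support; the terms, thresholds, and weights are invented stand-ins for PhyMotion's simulator-based reward.

```python
import numpy as np

def physics_motion_reward(joints, com, support_center, max_joint_speed=2.0,
                          balance_radius=0.15, w_kin=1.0, w_bal=1.0):
    """Illustrative physics-grounded reward for a generated motion clip.

    joints:         (T, J, 3) joint positions over T frames
    com:            (T, 2)    horizontal center-of-mass trajectory
    support_center: (T, 2)    horizontal center of the support region (approx.)
    """
    # Kinematic plausibility: penalize per-frame joint displacements above a cap.
    speeds = np.linalg.norm(np.diff(joints, axis=0), axis=-1)   # (T-1, J)
    kin_penalty = np.clip(speeds - max_joint_speed, 0.0, None).mean()
    # Balance: penalize the center of mass straying far from the support center.
    drift = np.linalg.norm(com - support_center, axis=-1)       # (T,)
    bal_penalty = np.clip(drift - balance_radius, 0.0, None).mean()
    return -(w_kin * kin_penalty + w_bal * bal_penalty)

# Toy clip: 16 frames, 24 joints, small random motion that stays balanced.
T, J = 16, 24
reward = physics_motion_reward(np.cumsum(0.01 * np.random.randn(T, J, 3), axis=0),
                               com=np.zeros((T, 2)), support_center=np.zeros((T, 2)))
print(reward)
```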
Explainable AI
Nexus: An Agentic Framework for Time Series Forecasting
Relevance: Nexus bridges the gap between numerical forecasting and contextual reasoning. Crucially for XAI, it produces high-quality reasoning traces that explicitly detail the fundamental drivers behind its forecasts (e.g., identifying specific news events affecting a stock). For human users, especially in high-stakes financial or real estate domains, this ‘explanation’ of a prediction is often more valuable than the number itself. It allows human experts to audit the AI’s logic, fostering trust and better-informed human-in-the-loop decision-making.
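From the XAI angle, the useful artifact is a forecast bundled with the drivers it cites so a reviewer can audit it. Below is a minimal sketch of such a record; the field names are assumptions rather than Nexus's actual schema.

```python
from dataclasses import dataclass, field

@dataclass
class AuditableForecast:
    """A point forecast packaged with the reasoning a human reviewer can check."""
    series_id: str
    horizon: str
    point_estimate: float
    drivers: list[str] = field(default_factory=list)  # e.g. news events cited as fundamentals
    reasoning_trace: str = ""                         # free-text chain the expert audits

forecast = AuditableForecast(
    series_id="ACME_stock_close",
    horizon="next_trading_day",
    point_estimate=102.4,
    drivers=["earnings beat announced after close", "sector-wide rally"],
    reasoning_trace="Baseline from last 5 closes, adjusted upward for the earnings surprise.",
)
print(forecast.drivers)
```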
From Pixels to Concepts: Do Segmentation Models Understand What They Segment?
Relevance: This paper introduces the CAFE benchmark, which uses counterfactual attribute manipulation to determine if segmentation models are truly grounding concepts or just finding visual shortcuts. This is a classic XAI methodology applied to computer vision. By modifying attributes like material or context while keeping the mask the same, it reveals whether a model’s decision-making is based on superficial mimicry. This is essential for HCI applications like autonomous driving or medical imaging, where it is critical to know if the AI ‘understands’ the object it is identifying.
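The counterfactual methodology can be sketched as a consistency check: segment an image and its attribute-edited counterpart (same object extent) and measure how far the predicted mask moves. The IoU-based score below illustrates the idea, not CAFE's exact metric.

```python
import numpy as np

def mask_iou(a, b):
    """Intersection over union of two boolean masks."""
    inter = np.logical_and(a, b).sum()
    union = np.logical_or(a, b).sum()
    return inter / union if union else 1.0

def counterfactual_consistency(segment, original, edited):
    """IoU between predicted masks on an image and its attribute-edited counterfactual.

    High consistency suggests the model grounds the object itself; a sharp drop
    suggests it keyed on the manipulated attribute (material, context, ...).
    """
    return mask_iou(segment(original), segment(edited))

# Toy "model" that keys on pixel intensity, so a darkening edit breaks it.
segment = lambda img: img > 0.5
original = np.full((4, 4), 0.9)
original[0, 0] = 0.1          # one background pixel
edited = original * 0.5       # counterfactual: darker "material", same object extent
print(counterfactual_consistency(segment, original, edited))  # 0.0: the model used a shortcut
```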