AI Papers Reader

Personalized digests of the latest AI research

2025-10-31

AI for Software Development

JanusCoder: Towards a Foundational Visual-Programmatic Interface for Code Intelligence

Relevance: This paper directly addresses the use of AI for code intelligence by generating code from textual, visual, or combined inputs. It proposes a "visual-programmatic interface," a core HCI concern for developers interacting with AI tools. The JanusCoder models are designed to assist software development by unifying code generation and understanding, integrating visual feedback to enhance developer productivity and transform how users interact with coding tools.

Agent Data Protocol: Unifying Datasets for Diverse, Effective Fine-tuning of LLM Agents

Relevance: This paper introduces a protocol to unify agent training datasets, explicitly mentioning “coding” and “software engineering” tasks for LLM agents. It’s crucial for building robust AI assistants that can perform complex development workflows. From an HCI viewpoint, improving agent fine-tuning for these tasks directly impacts the reliability, capability, and therefore the usability and trustworthiness of AI tools for software developers, making them more effective in real-world scenarios.

AI Agents

TheraMind: A Strategic and Adaptive Agent for Longitudinal Psychological Counseling

Relevance: This paper presents an AI agent, TheraMind, designed for complex, long-term human-centric interaction in psychological counseling. Its dual-loop architecture for strategic planning and adaptive dialogue demonstrates key agentic capabilities like reasoning, planning, and learning from experience. From an HCI perspective, it addresses critical challenges in designing highly sensitive human-AI collaboration, focusing on emotional understanding, adaptive strategies, and ensuring continuity and ethical alignment in therapeutic relationships.

The Tool Decathlon: Benchmarking Language Agents for Diverse, Realistic, and Long-Horizon Task Execution

Relevance: This paper introduces Toolathlon, a vital benchmark for evaluating language agents in real-world, long-horizon tasks across diverse applications and tools. It directly assesses agents’ abilities to break down complex tasks, plan actions, and interact with digital environments. From an HCI perspective, this benchmark is critical for understanding the practical capabilities and limitations of AI agents in scenarios where human users rely on them for multi-step workflows, highlighting areas for improved human-agent collaboration and trust.

Evolving Diagnostic Agents in a Virtual Clinical Environment

Relevance: This paper presents DiagAgent, an LLM-based diagnostic agent trained with reinforcement learning in a virtual clinical environment. It demonstrates agents’ ability to manage multi-turn processes, adaptively select examinations, and make reasoned decisions. This directly relates to AI agents performing complex, goal-oriented tasks. From an HCI standpoint, understanding how such agents learn, adapt, and interact in a high-stakes domain like diagnostics is crucial for designing trustworthy, interpretable, and effective AI support for human professionals.

LLM Evaluation Methods

Automating Benchmark Design

Relevance: This paper introduces BeTaL, a framework that leverages LLMs to automate the design of dynamic benchmarks for evaluating LLMs and LLM-powered agents. This directly addresses the limitations of static benchmarks. From an HCI perspective, automated and dynamic benchmarks can more efficiently assess model capabilities, limitations, and realism, ultimately helping to understand how models perform in diverse real-world scenarios, which is crucial for informing user experience, trust, and alignment with user needs.

MMPersuade: A Dataset and Evaluation Framework for Multimodal Persuasion

Relevance: This paper provides a crucial framework for evaluating LVLMs’ susceptibility to multimodal persuasion, addressing how models might adopt misleading beliefs or generate unsafe outputs. This is a direct ethical and bias evaluation. From an HCI standpoint, understanding and mitigating model persuadability is vital for developing responsible AI that maintains user preferences, fosters trust, and prevents the generation of unethical content, ensuring alignment with human values and safety.

VisJudge-Bench: Aesthetics and Quality Assessment of Visualizations

Relevance: This paper introduces the first comprehensive benchmark, VisJudge-Bench, for evaluating MLLMs’ ability to assess visualization aesthetics and quality. This extends LLM evaluation beyond text-centric tasks into a domain highly relevant to human comprehension and user experience. From an HCI perspective, evaluating AI’s judgment of visualization quality is essential for designing AI tools that generate or critique visual data, ensuring that outputs are clear, faithful, and aesthetically pleasing, thereby reducing cognitive load and enhancing user trust.

Reinforcement Learning

FAPO: Flawed-Aware Policy Optimization for Efficient and Reliable Reasoning

Relevance: This paper proposes FAPO, a method to improve Reinforcement Learning (RL) for LLM reasoning by addressing “flawed-positive rollouts” and internalizing reliable reasoning patterns. From an HCI perspective, ensuring efficient and reliable reasoning in RL-trained models directly enhances user trust and predictability when interacting with AI systems. By focusing on process-level rewards and mitigating unreliable patterns, it contributes to developing AI agents whose behaviors are more robust and aligned with human expectations.

Reasoning-Aware GRPO using Process Mining

Relevance: This paper augments RL-based post-training for Large Reasoning Models with "reasoning procedure" signals derived via process mining, moving beyond outcome-centric rewards. This is a significant advancement in RL for complex AI. From an HCI perspective, enabling models to learn from, and be rewarded for, how they reason rather than only the final answer can lead to more interpretable and robust AI. This facilitates human understanding of agent behavior and the design of environments for intuitive human-agent collaboration.

Beyond Reasoning Gains: Mitigating General Capabilities Forgetting in Large Reasoning Models

Relevance: This paper tackles the critical problem of “capability regression” where RLVR training causes models to forget foundational skills. It proposes RECAP to preserve general knowledge. From an HCI perspective, preventing such forgetting is vital for maintaining user trust and ensuring the long-term utility of AI agents. Users expect consistent, broad capabilities, and this research directly supports the development of more stable and reliable RL-trained models that can sustain their general intelligence over time.

Explainable AI

S-Chain: Structured Visual Chain-of-Thought For Medicine

Relevance: This paper introduces S-Chain, a dataset with “structured visual CoT” that explicitly links visual regions to reasoning steps in medical VLMs, aiming for improved interpretability and grounding fidelity. This is a direct contribution to XAI, especially in high-stakes domains. From an HCI perspective, providing transparent alignment between visual evidence and textual rationales is crucial for building trustworthy medical AI, allowing human experts to understand, verify, and ultimately rely on AI decisions with greater confidence.

Latent Chain-of-Thought for Visual Reasoning

Relevance: This paper reformulates reasoning in LVLMs and introduces a training algorithm for “latent Chain-of-Thought” (CoT), explicitly aiming for enhanced interpretability and generalization. By encouraging diverse, high-likelihood latent CoT, it seeks to make the internal reasoning process more comprehensible. From an HCI perspective, improving the interpretability of visual reasoning is fundamental for users to understand why an LVLM makes a certain prediction, fostering trust and enabling more effective human oversight and decision-making.
