2025-09-19
AI for Software Development
Scrub It Out! Erasing Sensitive Memorization in Code Language Models via Machine Unlearning
Relevance: This paper focuses on erasing sensitive memorization in Code Language Models (CLMs), addressing critical privacy and security vulnerabilities. From an HCI perspective, ensuring the confidentiality of code and data processed by AI development tools is paramount for developer trust and ethical tool deployment. Unlearning methods make AI programming assistants more reliable and safe; this directly shapes whether developers trust and adopt these tools in sensitive software development workflows, and it fosters more secure human-AI collaboration.
💡 Summary 📄 Full paper
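To make the idea concrete, here is a minimal sketch of one common unlearning recipe: gradient *ascent* on a forget set of memorized secrets, paired with gradient *descent* on a retain set so general capability is preserved. This is a generic illustration on a toy linear model, not the paper's specific objective.

```python
# Toy unlearning sketch: ascend the loss on "forget" examples (secrets),
# descend on "retain" examples (ordinary behavior). A generic recipe, not
# the method proposed in the paper.

def predict(w, x):
    return sum(wi * xi for wi, xi in zip(w, x))

def unlearn_step(w, forget_batch, retain_batch, lr=0.1):
    """One combined SGD step on a linear scorer w (squared loss)."""
    grad = [0.0] * len(w)
    for x, y in forget_batch:              # gradient ASCENT: erase the secret
        err = predict(w, x) - y
        for i, xi in enumerate(x):
            grad[i] -= err * xi
    for x, y in retain_batch:              # gradient DESCENT: keep utility
        err = predict(w, x) - y
        for i, xi in enumerate(x):
            grad[i] += err * xi
    return [wi - lr * g for wi, g in zip(w, grad)]

w = [0.5, -0.2]
w = unlearn_step(w, [([1.0, 0.0], 1.0)], [([0.0, 1.0], 1.0)])
print(w)   # moves AWAY from the forget target, TOWARD the retain target
```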
THOR: Tool-Integrated Hierarchical Optimization via RL for Mathematical Reasoning
Relevance: This paper develops a Reinforcement Learning (RL) strategy that lets Large Language Models (LLMs) solve mathematical problems by integrating external tools and generating code. For HCI in software development, this informs the design of intelligent programming assistants that leverage tools (such as interpreters and compilers) to generate correct code, improving developer productivity. Understanding how agents reason through tool use can shape intuitive interfaces for guided code generation, debugging, and verification, making complex AI assistance more accessible and controllable for developers.
💡 Summary 📄 Full paper
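As a concrete picture of tool-integrated reasoning, the sketch below alternates model output with interpreter feedback until the model commits to an answer. `generate` is a hypothetical stand-in for any LLM call; THOR's actual hierarchical RL training is not reproduced here.

```python
import contextlib, io

# Illustrative tool-integration loop: the model alternates free-form reasoning
# with code; an interpreter tool runs the code and its output is appended to
# the context. `generate` is a hypothetical LLM call, not an API from the paper.

def run_python(snippet: str) -> str:
    """Execute a snippet and capture stdout, as an interpreter tool would."""
    buf = io.StringIO()
    try:
        with contextlib.redirect_stdout(buf):
            exec(snippet, {})
    except Exception as e:
        return f"Error: {e}"
    return buf.getvalue().strip()

def solve(question: str, generate, max_steps: int = 8) -> str:
    context = question
    for _ in range(max_steps):
        step = generate(context)                 # model emits the next step
        if step.startswith("CODE:"):             # tool call requested
            context += "\n[tool output] " + run_python(step[len("CODE:"):])
        elif step.startswith("FINAL:"):          # model commits to an answer
            return step[len("FINAL:"):].strip()
        else:
            context += "\n" + step               # plain reasoning text
    return "no answer within budget"

# e.g. solve("What is 17**2 - 3?", generate=my_llm), where my_llm eventually
# emits "CODE: print(17**2 - 3)" and then "FINAL: 286".
```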
AI Agents
The Sum Leaks More Than Its Parts: Compositional Privacy Risks and Mitigations in Multi-Agent Collaboration
Relevance: This work systematically studies compositional privacy risks and proposes mitigation strategies in multi-agent LLM systems. From an HCI perspective, understanding and addressing such privacy leaks is crucial for building trustworthy AI agents. It directly informs the design of multi-agent systems where human users can be confident that sensitive information is protected during collaboration, enhancing user acceptance, fostering alignment with human values, and promoting ethical deployment of collaborative AI agents.
💡 Summary 📄 Full paper
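The core risk is easy to state concretely: answers that are individually harmless can jointly identify someone. The toy sketch below counts how many records remain consistent as attributes released by different agents are composed; the population and attributes are invented, and the paper's threat model and defenses go well beyond this.

```python
# Toy illustration of compositional leakage: attributes that are individually
# ambiguous can jointly single out one record.

population = [
    {"zip": "94110", "age": 34, "job": "nurse"},
    {"zip": "94110", "age": 34, "job": "teacher"},
    {"zip": "94110", "age": 51, "job": "nurse"},
    {"zip": "02139", "age": 34, "job": "nurse"},
]
target = population[0]   # the record an adversary is trying to pin down

def candidates(released):
    """Records still consistent with every attribute released so far."""
    return [p for p in population if all(p[k] == target[k] for k in released)]

for k in ("zip", "age", "job"):
    print(f"{k} alone leaves {len(candidates([k]))} candidates")
print("composed:", len(candidates(["zip", "age", "job"])), "candidate")
# Each agent's answer alone is ambiguous; the composition identifies the target.
```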
MedReseacher-R1: Expert-Level Medical Deep Researcher via A Knowledge-Informed Trajectory Synthesis Framework
Relevance: This paper presents a medical deep research agent that addresses limitations in specialized domains through knowledge graphs and tailored retrieval tools. For HCI, this demonstrates how expert agents can augment human capabilities by providing accurate, domain-specific information. Designing effective interfaces for such specialized agents will require understanding how humans interact with and verify complex, multi-hop medical reasoning, fostering intuitive human-agent collaboration in high-stakes environments and ensuring alignment with user goals.
💡 Summary 📄 Full paper
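As a rough picture of the knowledge-informed retrieval such an agent relies on, the sketch below walks a toy medical knowledge graph to assemble a multi-hop evidence chain. The triples and query are invented for illustration; the paper's graph construction and retriever are far more elaborate.

```python
# Minimal multi-hop traversal over a toy knowledge graph: start from a seed
# entity and collect an evidence chain a user could inspect and verify.

TRIPLES = [
    ("metformin", "treats", "type 2 diabetes"),
    ("type 2 diabetes", "risk_factor_for", "diabetic nephropathy"),
    ("diabetic nephropathy", "monitored_by", "eGFR test"),
]

def neighbors(entity):
    return [(r, o) for s, r, o in TRIPLES if s == entity]

def multi_hop(start, hops=2):
    """Walk outward from a seed entity, collecting (subject, relation, object)."""
    frontier, chain = [start], []
    for _ in range(hops):
        nxt = []
        for e in frontier:
            for rel, obj in neighbors(e):
                chain.append((e, rel, obj))
                nxt.append(obj)
        frontier = nxt
    return chain

for hop in multi_hop("metformin", hops=3):
    print(" -> ".join(hop))
```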
Towards General Agentic Intelligence via Environment Scaling
Relevance: This research directly focuses on advancing ‘general agentic intelligence’ by scaling environments for training robust function-calling capabilities. This is vital for creating adaptable AI agents that can operate effectively in various real-world scenarios. From an HCI perspective, generalizable agentic intelligence enables the development of more versatile and reliable AI assistants that can seamlessly integrate into diverse user workflows, promoting intuitive human-agent interaction and collaboration across many domains by reducing the need for domain-specific fine-tuning.
💡 Summary 📄 Full paper
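A minimal sketch of the "environment as simulated tools" idea: rather than hand-building one domain, many synthetic tool backends can be generated and an agent trained against them. The schema below loosely follows common function-calling conventions; the tool, its fields, and the episode are invented, not the paper's pipeline.

```python
import json, random

# A synthetic tool environment: the agent emits a structured call, the
# simulated backend answers it, and that exchange becomes training signal.

WEATHER_TOOL = {
    "name": "get_weather",
    "parameters": {"city": "string"},
}

def simulate_tool(name, args):
    """Stand-in backend: returns plausible structured output for training."""
    if name == "get_weather":
        return {"city": args["city"], "temp_c": random.randint(-5, 35)}
    raise ValueError(f"unknown tool: {name}")

# One training episode: the agent's call, then the environment's observation.
call = {"name": "get_weather", "arguments": {"city": "Oslo"}}
observation = simulate_tool(call["name"], call["arguments"])
print(json.dumps(observation))
```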
LLM Evaluation Methods
SteeringControl: Holistic Evaluation of Alignment Steering in LLMs
Relevance: This paper introduces SteeringControl, a benchmark for holistically evaluating LLM alignment across core objectives like bias, harmful generation, and hallucination. For HCI, comprehensive evaluation of alignment is crucial for building trustworthy and safe LLMs. This benchmark helps identify trade-offs in steering methods and informs the development of models that align better with human values, user expectations, and ethical interaction guidelines, which are fundamental to user satisfaction, trust, and responsible AI deployment.
💡 Summary 📄 Full paper
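The "holistic" part can be pictured as a harness that scores a steered model on every behavior at once, so side effects surface. The probes and judge below are placeholders, not SteeringControl's datasets or metrics.

```python
# Sketch of holistic alignment-steering evaluation: score a (possibly steered)
# model on ALL behaviors, not just the one being steered.

BEHAVIORS = {
    "bias":          ["probe about stereotypes ..."],
    "harmfulness":   ["probe requesting unsafe help ..."],
    "hallucination": ["probe with a false premise ..."],
}

def evaluate(model, judge):
    """Mean judge score per behavior for one model."""
    return {
        name: sum(judge(name, p, model(p)) for p in probes) / len(probes)
        for name, probes in BEHAVIORS.items()
    }

# Dummy stand-ins so the harness runs; swap in a real model and judge.
scores = evaluate(model=lambda p: "response", judge=lambda n, p, r: 1.0)
print(scores)
# Comparing steered vs. base scores exposes trade-offs, e.g. a bias-reduction
# vector that quietly raises hallucination rates.
```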
LongEmotion: Measuring Emotional Intelligence of Large Language Models in Long-Context Interaction
Relevance: This paper develops LongEmotion, a benchmark for assessing LLMs’ Emotional Intelligence (EI) in long-context, realistic interactions. From an HCI perspective, evaluating EI is essential for designing human-LLM communication that is empathetic, nuanced, and effective. Understanding how models handle emotions over extended dialogues can lead to more natural and satisfactory user experiences, particularly in applications like mental health support or customer service, where trust and rapport are critical for effective interaction.
💡 Summary 📄 Full paper
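One way to picture a long-context EI probe: an emotional cue appears early, is buried under many neutral turns, and the model is asked about it at the end. The dialogue and scoring below are invented; the benchmark's tasks are considerably more varied.

```python
# Toy long-context emotional-intelligence probe: does the model still track an
# early emotional cue after hundreds of filler turns?

def build_probe(filler_turns=300):
    turns = ["User: I've been dreading my performance review all month."]
    turns += [f"User: unrelated small talk, turn {i}." for i in range(filler_turns)]
    turns.append("User: Given everything, how am I probably feeling right now?")
    return "\n".join(turns), "anxious"

prompt, expected = build_probe()
# score = 1.0 if expected in model(prompt).lower() else 0.0
print(len(prompt.split("\n")), "turns; expected emotion:", expected)
```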
ToolRM: Outcome Reward Models for Tool-Calling Large Language Models
Relevance: This paper introduces FC-RewardBench, the first benchmark for systematically evaluating reward models in tool-calling scenarios for LLMs. This is critical for HCI because robust evaluation of tool-use agents ensures their reliability and alignment with user intent. Understanding and improving how agents are rewarded for tool actions directly impacts the trustworthiness and effectiveness of interactive AI systems that augment human capabilities with external tools, thereby enhancing user satisfaction and reducing cognitive load during interaction.
💡 Summary 📄 Full paper
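The payoff of a tool-calling reward model is easiest to see in best-of-n reranking: sample several candidate calls, score each, execute the winner. The `reward_model` below is a stand-in schema heuristic, not ToolRM itself.

```python
# Sketch of outcome-reward reranking for tool calls: score candidates, keep
# the best. Candidates and schema are invented for illustration.

CANDIDATES = [
    {"name": "search_flights", "arguments": {"from": "JFK", "to": "SFO"}},
    {"name": "search_flights", "arguments": {"from": "JFK"}},               # missing arg
    {"name": "search_hotels",  "arguments": {"from": "JFK", "to": "SFO"}},  # wrong tool
]

SCHEMA = {"search_flights": {"from", "to"}}

def reward_model(call):
    """Toy stand-in: schema-valid calls to a known tool score highest."""
    params = SCHEMA.get(call["name"])
    if params is None:
        return 0.0                      # unknown tool
    return 1.0 if set(call["arguments"]) == params else 0.3

best = max(CANDIDATES, key=reward_model)
print(best)   # the complete, schema-valid search_flights call wins
```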
Reinforcement Learning
Single-stream Policy Optimization
Relevance: This paper introduces Single-stream Policy Optimization (SPO), a novel policy-gradient optimization method for LLMs that improves stability and efficiency over existing RL algorithms. For HCI in RL, faster and more robust agent learning through SPO could enable more effective human guidance and intervention during training. It facilitates the development of intelligent agents that can quickly adapt to user feedback, leading to more responsive and collaboratively designed AI systems that better align with human intentions and preferences.
💡 Summary 📄 Full paper
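The single-stream idea can be pictured as computing an advantage from one response against a persistent per-prompt baseline, instead of sampling a whole group per prompt as group-based methods do. The sketch below illustrates that pattern only; it is not SPO's exact estimator.

```python
# Single-sample advantage with a running per-prompt baseline (illustrative).

baselines = {}   # prompt -> running estimate of expected reward

def advantage(prompt, reward, beta=0.9):
    b = baselines.get(prompt, 0.0)
    baselines[prompt] = beta * b + (1 - beta) * reward   # persistent tracker
    return reward - b

# Training step (pseudocode around a real RL framework):
#   adv  = advantage(prompt, reward_fn(prompt, response))
#   loss = -adv * logprob(response | prompt)   # REINFORCE-style update
print(advantage("p1", 1.0), advantage("p1", 1.0))  # shrinks as baseline adapts
```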
THOR: Tool-Integrated Hierarchical Optimization via RL for Mathematical Reasoning
Relevance: This paper proposes an RL strategy for LLMs to solve mathematical problems by integrating external tools and generating code. From an HCI perspective, this research explores how agents learn complex reasoning processes involving tool use. Designing effective interfaces for humans to guide such learning, or to interpret the agent's step-by-step tool interactions, is crucial for collaborative problem-solving: it supports user understanding and fosters trust in agent-generated solutions, in line with the goal of designing environments for intuitive human-agent collaboration.
💡 Summary 📄 Full paper
RAPTOR: A Foundation Policy for Quadrotor Control
Relevance: This paper presents RAPTOR, a method for training a highly adaptive RL-based foundation policy for quadrotor control, demonstrating zero-shot adaptation to diverse real-world platforms. This has significant HCI implications for human-robot interaction and environment design. An adaptive policy means a human operator can rely on a single control system for various drones, reducing cognitive load and training effort, and enabling more intuitive and flexible human guidance in complex, dynamic environments, thereby improving safety and efficiency of human-agent collaboration.
💡 Summary 📄 Full paper
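The HCI claim rests on one interface property: the same policy object adapts online, so the operator-facing API does not change across drones. The sketch below shows that interface with a toy recurrent latent; the dynamics and update rule are placeholders, not RAPTOR's architecture.

```python
# Interface sketch for an adaptive foundation policy: a hidden state absorbs
# the unfamiliar platform's dynamics, so one controller serves many drones.

class AdaptivePolicy:
    def __init__(self):
        self.hidden = [0.0] * 8     # latent that adapts to the airframe

    def act(self, observation):
        # Fold the new observation into the latent (stand-in for an RNN cell),
        # then map it to motor commands.
        self.hidden = [0.9 * h + 0.1 * o for h, o in zip(self.hidden, observation)]
        return [sum(self.hidden)] * 4   # 4 rotor thrust commands (toy)

policy = AdaptivePolicy()
for obs in ([0.1] * 8, [0.3] * 8):      # same code, different platform responses
    print(policy.act(obs))
```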
Explainable AI
Dr.V: A Hierarchical Perception-Temporal-Cognition Framework to Diagnose Video Hallucination by Fine-grained Spatial-Temporal Grounding
Relevance: Dr.V diagnoses video hallucination by fine-grained spatial-temporal grounding, providing insights into why a video model hallucinates. From an XAI perspective, this method offers concrete, localized explanations for model failures, mirroring human-like comprehension. This transparency is crucial for building user trust and helping users understand model limitations, allowing for more informed decision-making and safer deployment of video understanding AI. It provides a means to interpret agent behaviors and identify areas where humans might need to intervene.
💡 Summary 📄 Full paper
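The kind of localized explanation this enables can be pictured as checking each claim in a video description against per-frame groundings, so a failure points to a specific object and time span. Detections and claims below are invented; the framework's perception, temporal, and cognition stages are much richer.

```python
# Toy grounding check: verify a caption's claims against per-frame detections,
# localizing any hallucination to an object and a set of frames.

detections = {   # frame index -> objects actually present
    0: {"cat"}, 1: {"cat"}, 2: {"cat", "ball"}, 3: {"ball"},
}

claims = [("cat", range(0, 2)), ("dog", range(1, 3))]

for obj, frames in claims:
    missing = [f for f in frames if obj not in detections[f]]
    if missing:
        print(f"hallucination: '{obj}' not grounded in frames {missing}")
    else:
        print(f"grounded: '{obj}' appears in frames {list(frames)}")
```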
Look Again, Think Slowly: Enhancing Visual Reflection in Vision-Language Models
Relevance: This research aims to enhance ‘visual reflection’ in Vision-Language Models (VLMs), which is the model’s ability to check its reasoning visually. By using visual attention-based rewards, it makes the model’s grounding more explicit. For XAI, this contributes to more interpretable models by revealing what visual information the model focuses on during reasoning. This transparency helps users understand the VLM’s decision process, fostering trust and enabling more effective human-AI collaboration by providing clearer insights into the model’s ‘thought’ process.
💡 Summary 📄 Full paper
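A visual attention-based reward can be pictured very simply: reasoning is rewarded for putting attention mass on the image region that actually supports the answer. The attention map and region mask below are fabricated for illustration; the paper's reward design is more involved.

```python
# Toy attention-based reward: how much of the model's attention over image
# patches lands on the region containing the queried object?

attention = [          # attention over a 3x3 grid of image patches
    [0.02, 0.03, 0.05],
    [0.10, 0.40, 0.20],
    [0.05, 0.10, 0.05],
]
relevant = {(1, 1), (1, 2)}   # patches containing the evidence

reward = sum(attention[r][c] for r, c in relevant)
print(f"attention reward: {reward:.2f}")   # 0.60 of the mass on the evidence
```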