2025-08-08
AI for Software Development
Training Long-Context, Multi-Turn Software Engineering Agents with Reinforcement Learning
Relevance: This paper demonstrates the successful application of Reinforcement Learning (RL) to training LLM-based agents on real-world software engineering (SWE) tasks. It charts a viable path toward more capable autonomous agents for complex, multi-turn SWE problems: rather than stopping at single-turn code generation, the agent performs full problem-solving within a development environment, directly addressing how AI can assist and automate software development.
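A minimal sketch of what multi-turn RL data collection looks like in such a setting, assuming hypothetical `env` and `policy` stand-ins; this illustrates the general recipe, not the paper's implementation:

```python
# Multi-turn trajectory collection for RL training of a SWE agent.
# `env` and `policy` are hypothetical placeholders, not the paper's interfaces.
from dataclasses import dataclass, field

@dataclass
class Turn:
    observation: str  # e.g., shell output, test results, file contents
    action: str       # e.g., an edit command or shell invocation

@dataclass
class Trajectory:
    turns: list = field(default_factory=list)
    reward: float = 0.0  # assigned once, from the final test outcome

def collect_trajectory(env, policy, max_turns=50):
    """Roll out one multi-turn episode in a stateful dev environment."""
    traj = Trajectory()
    obs = env.reset()  # fresh repo checkout plus the issue description
    for _ in range(max_turns):
        action = policy(traj.turns, obs)  # conditions on the full history
        obs, done = env.step(action)      # rich, stateful feedback
        traj.turns.append(Turn(obs, action))
        if done:
            break
    traj.reward = env.run_tests()  # e.g., fraction of tests passing
    return traj  # fed to a policy-gradient update over all turns
```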
💡 Summary 🔗 Full paper
LaTCoder: Converting Webpage Design to Code with Layout-as-Thought
Relevance: LaTCoder proposes a novel approach to enhance layout preservation in webpage design-to-code conversion using Multimodal Large Language Models (MLLMs) and a Layout-as-Thought (LaT) strategy. This directly relates to AI for software development by automating a critical front-end UI development task, bridging the gap between visual design and functional implementation. Its focus on accurate layout generation is crucial for developer efficiency and the quality of the generated software artifacts.
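To make the Layout-as-Thought idea concrete, here is a hedged sketch of a block-wise design-to-code pipeline; `segment_layout` and `mllm` are hypothetical placeholders, and the absolute-positioning assembly is an illustrative simplification rather than LaTCoder's actual procedure:

```python
# Segment the design into layout blocks, generate HTML/CSS per block with an
# MLLM, then reassemble the pieces in a positioned container so the original
# layout is preserved.
def design_to_code(screenshot, segment_layout, mllm):
    blocks = segment_layout(screenshot)  # [( (x, y, w, h), cropped_image ), ...]
    pieces = []
    for (x, y, w, h), crop in blocks:
        html = mllm(f"Generate HTML/CSS for this UI block ({w}x{h}px).", crop)
        # Pin each block to its original coordinates to preserve layout.
        pieces.append(
            f'<div style="position:absolute;left:{x}px;top:{y}px;'
            f'width:{w}px;height:{h}px">{html}</div>'
        )
    return '<div style="position:relative">' + "\n".join(pieces) + "</div>"
```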
💡 Summary 🔗 Full paper
EVOC2RUST: A Skeleton-guided Framework for Project-Level C-to-Rust Translation
Relevance: EVOC2RUST introduces an automated framework for converting entire C projects to Rust, addressing the demand for translating legacy codebases in safety-critical systems. By combining LLMs with static analysis and an evolutionary augmentation strategy, it improves syntactic and semantic accuracy as well as code safety. This directly contributes to AI for software development by automating complex code refactoring and migration at the project level, a significant challenge for human developers.
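A rough sketch of the skeleton-guided pattern, assuming hypothetical `c_functions`, `llm`, and `cargo_check` helpers; EVOC2RUST's real pipeline (with static analysis and evolutionary augmentation) is more elaborate:

```python
# Keep the project's structure as compilable Rust stubs, then fill one
# function at a time, gating each fill on the compiler.
def translate_project(c_functions, skeleton, llm, cargo_check, retries=3):
    rust = dict(skeleton)  # path -> Rust source with todo!() stubs
    for fn in c_functions:
        feedback = ""
        for _ in range(retries):
            body = llm(
                "Translate this C function into safe Rust matching the "
                f"stub signature.\n{fn.c_source}\n{feedback}"
            )
            candidate = rust[fn.path].replace(fn.stub, body)
            ok, errors = cargo_check(candidate)  # compiler as the gatekeeper
            if ok:
                rust[fn.path] = candidate
                break
            feedback = f"Previous attempt failed to compile:\n{errors}"
    return rust
```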
💡 Summary 🔗 Full paper
AI Agents
HarmonyGuard: Toward Safety and Utility in Web Agents via Adaptive Policy Enhancement and Dual-Objective Optimization
Relevance: This paper addresses the critical challenge of balancing task performance with emerging risks for LLM-enabled web agents operating in open environments. It proposes HarmonyGuard, a multi-agent framework that uses adaptive policy enhancement and dual-objective optimization to jointly improve safety and utility. This directly contributes to AI agent research by focusing on crucial aspects like safety and alignment, which are paramount for robust and trustworthy agent deployment in real-world human-interactive scenarios.
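As a hedged illustration of dual-objective selection (not HarmonyGuard's actual modules), candidate web actions can be filtered through a safety threshold before ranking by utility:

```python
# A policy-checker scores candidate actions for safety, a utility scorer for
# task progress; unsafe actions are filtered out before ranking. All names
# here are hypothetical stand-ins.
def select_action(candidates, safety_score, utility_score, tau=0.8):
    """Prefer the most useful action among those deemed safe."""
    scored = [(a, safety_score(a), utility_score(a)) for a in candidates]
    safe = [(a, s, u) for a, s, u in scored if s >= tau]
    if not safe:      # no candidate clears the safety bar:
        return None   # refuse rather than act unsafely
    return max(safe, key=lambda t: t[2])[0]
```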
💡 Summary 🔗 Full paper
Efficient Agents: Building Effective Agents While Reducing Cost
Relevance: This work presents the first systematic study of the efficiency-effectiveness trade-off in modern LLM-driven agent systems, addressing the critical need for cost-effective designs. It investigates how much complexity agentic tasks require, when additional modules yield diminishing returns, and how to gain efficiency. This research is highly relevant to AI Agents as it provides actionable insights for designing sustainable, high-performing, and accessible AI-driven solutions, directly impacting the practical deployment and scalability of agents for human users.
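One concrete metric such a study needs is cost normalized by effectiveness; the sketch below computes cost per solved task from per-run token counts, with illustrative (not paper-sourced) prices:

```python
# Per-run token cost versus task success, summarized as cost per solved task.
# The pricing numbers and run format are illustrative assumptions.
def cost_per_solved_task(runs, usd_per_1k_in=0.005, usd_per_1k_out=0.015):
    """runs: iterable of (input_tokens, output_tokens, solved: bool)."""
    total_cost, solved = 0.0, 0
    for tokens_in, tokens_out, ok in runs:
        total_cost += tokens_in / 1000 * usd_per_1k_in
        total_cost += tokens_out / 1000 * usd_per_1k_out
        solved += ok
    return float("inf") if solved == 0 else total_cost / solved

# e.g. compare a lean single-agent baseline against a heavier modular stack:
baseline = cost_per_solved_task([(12_000, 900, True), (15_000, 1_100, False)])
```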
💡 Summary 🔗 Full paper
SEAgent: Self-Evolving Computer Use Agent with Autonomous Learning from Experience
Relevance: SEAgent proposes an agentic self-evolving framework enabling Computer Use Agents (CUAs) to autonomously learn and master novel software environments through experiential learning. It designs a World State Model and Curriculum Generator for iterative trial-and-error and task progression. This paper directly advances AI Agents by enabling them to adapt to new digital tools and environments independently, reducing reliance on human-labeled data and paving the way for more generalist and continuously evolving human-computer interaction agents.
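A schematic of the self-evolution loop described above, with `world_state_model`, `curriculum`, and `agent` as hypothetical stand-ins for SEAgent's components:

```python
# Attempt tasks, let a judge model label outcomes from environment states,
# and let a curriculum generator propose progressively harder tasks.
def self_evolve(agent, env, world_state_model, curriculum, iterations=10):
    experience = []
    tasks = curriculum.initial_tasks()
    for _ in range(iterations):
        for task in tasks:
            trajectory = agent.attempt(env, task)          # trial and error
            success = world_state_model.judge(trajectory)  # outcome labeling
            experience.append((task, trajectory, success))
        agent.update(experience)                   # e.g., RL on judged runs
        tasks = curriculum.next_tasks(experience)  # progressively harder
    return agent
```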
💡 Summary 🔗 Full paper
LLM Evaluation Methods
MedBLINK: Probing Basic Perception in Multimodal Language Models for Medicine
Relevance: MedBLINK introduces a benchmark designed to probe Multimodal Language Models (MLMs) for basic perceptual abilities in medicine, crucial for clinical decision support. The paper explicitly states that clinicians are selective in adopting AI tools, and errors on simple tasks hinder adoption. This directly addresses LLM evaluation from an HCI perspective by focusing on foundational accuracy, which is critical for user trust, satisfaction, and the ultimate usability of AI tools in high-stakes domains like healthcare.
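For illustration, a perception probe of this kind is typically scored as exact-match accuracy on multiple-choice items per task category; the item format and `mlm` call below are assumptions, not MedBLINK's API:

```python
# Score a multimodal model on multiple-choice perceptual items, reporting
# accuracy per task category.
from collections import defaultdict

def score_perception_probe(items, mlm):
    """items: iterable of (image, question, choices, answer, category)."""
    correct, total = defaultdict(int), defaultdict(int)
    for image, question, choices, answer, category in items:
        prompt = question + "\nChoices: " + ", ".join(choices)
        prediction = mlm(image, prompt).strip()
        correct[category] += (prediction == answer)
        total[category] += 1
    return {c: correct[c] / total[c] for c in total}
```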
💡 Summary 🔗 Full paper
Data and AI governance: Promoting equity, ethics, and fairness in large language models
Relevance: This paper covers approaches to systematically govern, assess, and quantify bias across the complete life cycle of machine learning models, focusing on Large Language Models (LLMs). It discusses a data and AI governance framework to address Bias, Ethics, Fairness, and Factuality. From an HCI perspective, this is highly relevant as it provides methods for identifying and mitigating biases, ensuring fairness and inclusivity, which are essential for building trustworthy AI systems and promoting user adoption and societal alignment.
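As one hedged example of quantifying bias in such a governance pipeline, the sketch below computes a demographic parity gap over model outcomes; this generic metric is an illustrative choice, not the framework's specific measure:

```python
# Compare positive-outcome rates across groups; the gap between the highest
# and lowest rates is a simple disparity signal to track over a model's
# life cycle.
from collections import defaultdict

def demographic_parity_gap(records):
    """records: iterable of (group_label, positive_outcome: bool)."""
    pos, n = defaultdict(int), defaultdict(int)
    for group, outcome in records:
        pos[group] += outcome
        n[group] += 1
    rates = {g: pos[g] / n[g] for g in n}
    return max(rates.values()) - min(rates.values()), rates

gap, rates = demographic_parity_gap(
    [("A", True), ("A", True), ("A", False), ("B", True), ("B", False)]
)
```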
💡 Summary 🔗 Full paper
FACTORY: A Challenging Human-Verified Prompt Set for Long-Form Factuality
Relevance: FACTORY introduces a large-scale, human-verified prompt set for evaluating the long-form factuality of language models. It highlights that existing benchmarks often lack human verification, leading to quality issues. This paper is crucial for LLM evaluation methods, particularly from an HCI viewpoint, because it emphasizes human involvement in verification and focuses on factuality, which directly impacts the reliability and trustworthiness of LLM outputs, reducing cognitive load for users who would otherwise need to cross-verify information.
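A sketch of the standard long-form factuality recipe a prompt set like this plugs into: decompose a response into atomic claims and score the fraction that verification supports. `extract_claims` and `verify` are hypothetical stand-ins; the paper contributes the human-verified prompts, not this scoring code:

```python
# Claim-level factuality scoring for a long-form response.
def factuality_score(response, extract_claims, verify):
    claims = extract_claims(response)  # atomic factual statements
    if not claims:
        return 1.0                     # vacuously factual
    supported = sum(1 for c in claims if verify(c))  # human/retrieval check
    return supported / len(claims)     # fraction of supported claims
```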
💡 Summary 🔗 Full paper
Reinforcement Learning
Training Long-Context, Multi-Turn Software Engineering Agents with Reinforcement Learning
Relevance: This paper highlights the successful application of Reinforcement Learning (RL) to train LLM-based agents for complex, multi-turn software engineering tasks. Unlike prior RL research focused on single-turn problems, this work demonstrates RL's utility in environments providing rich, stateful feedback. From an HCI perspective, this advances how humans can effectively build and interact with agents that learn optimal, sequential behaviors in complex, dynamic software environments, improving collaboration and problem-solving capabilities.
💡 Summary 🔗 Full paper
RL-PLUS: Countering Capability Boundary Collapse of LLMs in Reinforcement Learning with Hybrid-policy Optimization
Relevance: RL-PLUS proposes a novel hybrid-policy optimization approach for Large Language Models (LLMs) in Reinforcement Learning with Verifiable Reward (RLVR), aiming to surpass the inherent capability boundaries of base models. It addresses issues like distributional mismatch and guides models towards high-value, unexplored reasoning paths. This research is highly relevant to RL as it introduces methods to make RL training for LLMs more effective and robust, enabling agents to learn more complex and generalized behaviors, which in turn facilitates better human-agent collaboration.
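A schematic of what a hybrid-policy objective can look like, mixing an on-policy term on self-generated rollouts with an importance-weighted off-policy term on external reasoning traces; the weighting and clipping below are illustrative assumptions, not RL-PLUS's published loss:

```python
import numpy as np

def hybrid_policy_loss(logp_on, adv_on, logp_off, logp_behavior, adv_off,
                       beta=0.5, clip=5.0):
    """All inputs are per-token numpy arrays; in a real autograd
    implementation, gradients would flow through the log-probs."""
    # On-policy REINFORCE-style term on the model's own rollouts, with
    # advantages derived from verifiable rewards.
    on_term = -(logp_on * adv_on).mean()
    # Off-policy surrogate on external traces: importance ratio between the
    # current policy and the behavior policy, clipped for stability.
    ratio = np.clip(np.exp(logp_off - logp_behavior), 0.0, clip)
    off_term = -(ratio * adv_off).mean()
    return on_term + beta * off_term
```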
💡 Summary 🔗 Full paper
Agent Lightning: Train ANY AI Agents with Reinforcement Learning
Relevance: Agent Lightning presents a flexible and extensible framework for Reinforcement Learning (RL)-based training of Large Language Models (LLMs) for any AI agent. It achieves complete decoupling between agent execution and training, enabling seamless integration with diverse existing agent frameworks. This paper directly contributes to RL research by simplifying and standardizing the process of applying RL to agents, which can lead to more intuitive human guidance of agent learning and better interpretation of agent behaviors as they are trained under a unified paradigm.
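The decoupling idea can be sketched as a narrow transition-logging boundary that any agent loop writes to and a trainer drains independently; the class and method names below are illustrative, not Agent Lightning's actual API:

```python
# Framework-agnostic boundary between agent execution and RL training:
# neither side needs to know the other's internals.
import queue

class TransitionSink:
    def __init__(self):
        self._q = queue.Queue()

    def emit(self, prompt, completion, reward, done):
        # Called from inside any agent loop (custom, LangChain-style, etc.).
        self._q.put({"prompt": prompt, "completion": completion,
                     "reward": reward, "done": done})

    def drain(self):
        # Called by the trainer to assemble the next update batch.
        batch = []
        while not self._q.empty():
            batch.append(self._q.get())
        return batch
```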
💡 Summary 🔗 Full paper
Explainable AI
Reasoning Language Models for Root Cause Analysis in 5G Wireless Networks
Relevance: This paper proposes a lightweight framework leveraging Large Language Models (LLMs) for Root Cause Analysis (RCA) in mobile networks, focusing on interpretability, domain expertise, and causal reasoning. It introduces a two-stage training methodology to generate structured, multi-step diagnostic explanations. This research directly addresses Explainable AI (XAI) by providing methods for LLMs to not only perform complex analytical tasks but also to produce transparent and understandable reasoning paths, which is crucial for building trust and enabling human experts to validate AI decisions.
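To illustrate what structured, multi-step diagnostic explanations can mean in practice, here is a hedged sketch of a typed diagnosis record and a parser for marker-delimited model output; the schema and markers are assumptions, not the paper's format:

```python
# A typed record the model is trained to emit, auditable step by step by a
# human expert.
from dataclasses import dataclass

@dataclass
class Diagnosis:
    observations: list[str]     # symptoms extracted from KPIs/logs
    reasoning_steps: list[str]  # ordered causal chain toward the fault
    root_cause: str             # e.g., "misconfigured handover threshold"

def parse_diagnosis(model_output: str) -> Diagnosis:
    """Parse a response using OBSERVATION:/STEP:/ROOT CAUSE: line markers."""
    obs, steps, cause = [], [], ""
    for line in model_output.splitlines():
        if line.startswith("OBSERVATION:"):
            obs.append(line.removeprefix("OBSERVATION:").strip())
        elif line.startswith("STEP:"):
            steps.append(line.removeprefix("STEP:").strip())
        elif line.startswith("ROOT CAUSE:"):
            cause = line.removeprefix("ROOT CAUSE:").strip()
    return Diagnosis(obs, steps, cause)
```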
💡 Summary 🔗 Full paper
CoTox: Chain-of-Thought-Based Molecular Toxicity Reasoning and Prediction
Relevance: CoTox proposes a framework that integrates Large Language Models (LLMs) with Chain-of-Thought (CoT) reasoning for multi-toxicity prediction, combining chemical structure data, biological pathways, and gene ontology. The key contribution is generating interpretable toxicity predictions through step-by-step reasoning. This work significantly contributes to Explainable AI (XAI) by making complex scientific predictions transparent and justifiable, allowing domain experts to understand the underlying rationale, which is vital for trust, validation, and adoption in fields like pharmaceutical development.
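A hedged sketch of assembling a chain-of-thought toxicity prompt from the three evidence sources the paper combines; the input fields and `llm` call are hypothetical placeholders:

```python
# Combine chemical structure, pathway, and gene-ontology evidence into one
# prompt, then separate the interpretable reasoning from the verdicts.
def predict_toxicity(compound, llm):
    prompt = (
        f"Compound SMILES: {compound['smiles']}\n"
        f"Perturbed biological pathways: {', '.join(compound['pathways'])}\n"
        f"Associated GO terms: {', '.join(compound['go_terms'])}\n"
        "Reason step by step about likely toxicity mechanisms, then answer "
        "one line per endpoint as '<endpoint>: toxic|non-toxic'."
    )
    response = llm(prompt)
    reasoning, _, verdict_block = response.rpartition("\n\n")
    verdicts = dict(
        line.split(": ", 1)
        for line in verdict_block.splitlines() if ": " in line
    )
    return reasoning, verdicts  # the reasoning is the interpretable part
```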
💡 Summary 🔗 Full paper
AttnTrace: Attention-based Context Traceback for Long-Context LLMs
Relevance: AttnTrace proposes a new context traceback method based on LLM attention weights to identify which parts of the input context contribute most to a generated response. This directly improves the interpretability and trustworthiness of LLM outputs by showing the model's focus, helping users understand decision boundaries and potentially detect prompt injections. This research is central to Explainable AI (XAI), providing a practical tool for increasing the transparency of long-context LLMs, which is crucial for their reliable and responsible deployment.
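The core idea can be sketched as aggregating attention mass from generated tokens back onto context spans and ranking the spans; a real implementation would read attentions from selected layers/heads of the model, and the simple averaging below is an illustrative simplification of AttnTrace's method:

```python
import numpy as np

def trace_context(attn, segments):
    """attn: [num_generated_tokens, num_context_tokens] attention weights.
    segments: list of (start, end) token spans covering the context."""
    scores = []
    for start, end in segments:
        # Mean attention from all generated tokens onto this span.
        scores.append(attn[:, start:end].mean())
    order = np.argsort(scores)[::-1]  # most-attended spans first
    return [(segments[i], float(scores[i])) for i in order]

# e.g. with normalized random weights over two context chunks:
dummy = np.random.rand(8, 100)
ranked = trace_context(dummy / dummy.sum(axis=1, keepdims=True),
                       [(0, 50), (50, 100)])
```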
💡 Summary 🔗 Full paper