2025-02-14
AI for Software Development
Teaching Language Models to Critique via Reinforcement Learning
Relevance: This paper uses Reinforcement Learning (RL) to train LLMs to act as critics, generating actionable feedback for code refinement. From an HCI perspective, this directly improves the quality and usability of AI assistance in software development. High-quality, targeted critiques mitigate compounding errors and reduce the cognitive load on developers, fostering better human-AI collaboration in debugging and refinement tasks. The resulting system acts as an enhanced AI assistant that provides not just a solution, but effective guidance on how to improve it, which is crucial for developer trust and learning.
💡 Summary 📄 Full paper
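To make the mechanism concrete, here is a minimal sketch of an execution-grounded critique reward: the critique is scored by whether the refinement it induces passes unit tests. The helper callables stand in for LLM calls and are illustrative assumptions, not the paper's implementation.

```python
# Sketch: reward a critique by the test-pass rate of the refinement it induces.
# `critic_fn` and `refiner_fn` are stand-ins for LLM calls (assumptions).
from typing import Callable, List

def critique_reward(
    code: str,
    tests: List[Callable[[str], bool]],
    critic_fn: Callable[[str], str],          # code -> critique text
    refiner_fn: Callable[[str, str], str],    # (code, critique) -> revised code
) -> float:
    """Scalar RL reward: fraction of unit tests the refined code passes."""
    critique = critic_fn(code)
    revised = refiner_fn(code, critique)
    return sum(t(revised) for t in tests) / len(tests)

# Toy usage with stand-in functions:
reward = critique_reward(
    "def add(a, b): return a - b",
    tests=[lambda src: "a + b" in src],
    critic_fn=lambda c: "The operator should be + rather than -.",
    refiner_fn=lambda c, k: c.replace("a - b", "a + b"),
)
print(reward)  # 1.0
```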
CodeI/O: Condensing Reasoning Patterns via Code Input-Output Prediction
Relevance: CodeI/O enhances LLM reasoning by training models to produce structured Chain-of-Thought (CoT) rationales derived from code logic. For developers using AI assistants for tasks like code completion or bug fixing, clear and logically structured reasoning is essential for verification and trust. By improving the quality and procedural rigor of the AI’s internal reasoning, this method directly enhances the interpretability and trustworthiness of the output, making the AI assistant more effective in collaborative software development workflows.
💡 Summary 📄 Full paper
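A minimal sketch of the input-output prediction setup: turn a (function, input) pair into a training sample whose target is the executed output, with the prompt asking for step-by-step reasoning first. The prompt wording is an assumption, not the paper's exact template.

```python
# Sketch: build a CodeI/O-style sample that supervises output prediction.
import inspect

def make_io_sample(fn, example_input):
    source = inspect.getsource(fn)
    prompt = (
        f"Given the function below and the input {example_input!r}, "
        "reason step by step about what the code does, then predict the output.\n\n"
        f"{source}"
    )
    target = repr(fn(example_input))  # ground-truth output for supervision
    return {"prompt": prompt, "target": target}

def running_max(xs):
    best, out = float("-inf"), []
    for x in xs:
        best = max(best, x)
        out.append(best)
    return out

sample = make_io_sample(running_max, [3, 1, 4, 1, 5])
print(sample["target"])  # [3, 3, 4, 4, 5]
```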
AI Agents
WorldGUI: Dynamic Testing for Comprehensive Desktop GUI Automation
Relevance: This paper addresses the challenge of building robust AI agents capable of operating in complex, dynamic desktop environments (GUIs). This is paramount for HCI, as agents must reliably navigate unstructured digital spaces designed for human interaction. The proposed critique mechanism and dynamic testing benchmark (WorldGUI) are necessary steps toward creating controllable and reliable agents that humans can trust to execute complex, multi-step goals across various software applications, simulating real-world user scenarios.
💡 Summary 📄 Full paper
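A minimal sketch of a pre-action critique loop of the kind the paper motivates: a critic reviews each proposed action against the current state, and the agent replans on rejection. All names and the toy environment are illustrative assumptions, not WorldGUI's API.

```python
# Sketch: critique-before-act loop for a GUI agent (illustrative only).
def run_episode(propose, review, replan, step, state, max_steps=20):
    for _ in range(max_steps):
        action = propose(state)
        verdict = review(state, action)        # "ok" or a textual objection
        if verdict != "ok":
            action = replan(state, verdict)    # revise the plan on rejection
        state, done = step(state, action)
        if done:
            return True
    return False

# Toy usage: reach state 3; the critic rejects oversized jumps.
ok = run_episode(
    propose=lambda s: 2,                               # naive plan: jump by 2
    review=lambda s, a: "ok" if a == 1 else "too big",
    replan=lambda s, v: 1,                             # fall back to 1-step
    step=lambda s, a: (s + a, s + a >= 3),
    state=0,
)
print(ok)  # True
```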
Hephaestus: Improving Fundamental Agent Capabilities of Large Language Models through Continual Pre-Training
Relevance: Hephaestus focuses on foundational capabilities for LLM agents, specifically API function calling, reasoning, and planning, through specialized continual pre-training. These skills are essential for effective human-agent collaboration. Robust foundational skills ensure the agent can reliably understand complex user goals, select appropriate tools, and adapt to environmental feedback, leading to more predictable, efficient, and useful interactions when deployed in user-facing roles.
💡 Summary 📄 Full paper
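A minimal sketch of the API function-calling pattern such pre-training targets: the model emits a structured call, and a dispatcher validates and executes it against a tool registry. The JSON schema and tool names here are assumptions for illustration, not the paper's format.

```python
# Sketch: parse a model-emitted tool call and dispatch it to a registry.
import json

TOOLS = {
    "get_weather": lambda city: f"Sunny in {city}",
    "add": lambda a, b: a + b,
}

def dispatch(model_output: str):
    """Run a JSON tool call of the form {"name": ..., "arguments": {...}}."""
    call = json.loads(model_output)
    fn = TOOLS.get(call["name"])
    if fn is None:
        return {"error": f"unknown tool {call['name']!r}"}
    return {"result": fn(**call["arguments"])}

print(dispatch('{"name": "add", "arguments": {"a": 2, "b": 3}}'))
# {'result': 5}
```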
Skill Expansion and Composition in Parameter Space
Relevance: This work introduces a framework (PSEC) for autonomous agents to iteratively learn new skills and compose them efficiently using LoRA modules. This addresses a major challenge in agent design: continuous learning and adaptation to new tasks defined by human users. HCI researchers can leverage this paradigm to design interfaces that allow users to teach new “skill primitives” or prompt the composition of existing skills to solve novel, complex, and multi-objective tasks, thereby improving agent flexibility and longevity in dynamic settings.
💡 Summary 📄 Full paper
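A minimal numpy sketch of parameter-space skill composition: each skill is a low-rank LoRA delta (B @ A), and a weighted sum of deltas is merged into the frozen base weight. The fixed mixing weights are an illustrative stand-in for PSEC's learned composition.

```python
# Sketch: compose two LoRA "skills" in parameter space.
import numpy as np

rng = np.random.default_rng(0)
d, r = 8, 2                                  # hidden size, LoRA rank
W0 = rng.normal(size=(d, d))                 # frozen base weight

# Two skills, each stored as a rank-r adapter (B: d x r, A: r x d).
skills = [(rng.normal(size=(d, r)), rng.normal(size=(r, d))) for _ in range(2)]
alpha = np.array([0.7, 0.3])                 # composition weights (assumed fixed)

delta = sum(a * (B @ A) for a, (B, A) in zip(alpha, skills))
W_composed = W0 + delta                      # effective weight for the new task
print(W_composed.shape)  # (8, 8)
```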
LLM Evaluation Methods
Expect the Unexpected: FailSafe Long Context QA for Finance
Relevance: This paper introduces FailSafeQA, a benchmark designed to test LLM robustness and compliance against simulated “human-interface interactions” like query and context failures in high-stakes finance. This evaluation directly addresses trustworthiness and reliability concerns critical to HCI. By emphasizing the balance between robust answering and the ability to refrain from hallucinating (compliance), the benchmark pushes models toward safer, more reliable behavior when handling complex, ambiguous, or degraded inputs from real users.
💡 Summary 📄 Full paper
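A minimal sketch of the robustness-versus-compliance trade-off: a model should answer when the context supports the query and refuse when it does not. The scoring and the stub model are assumptions, not FailSafeQA's metric definitions.

```python
# Sketch: score answering on supported queries vs. refusing on unsupported ones.
def evaluate(model, cases):
    robust = compliant = 0
    for context, query, answerable in cases:
        reply = model(context, query)
        if answerable:
            robust += reply != "REFUSE"       # answered a supported query
        else:
            compliant += reply == "REFUSE"    # refrained instead of hallucinating
    n_ans = sum(c[2] for c in cases)
    return robust / n_ans, compliant / (len(cases) - n_ans)

stub = lambda ctx, q: "REFUSE" if not ctx else "4%"  # toy stand-in model
cases = [("The fund returned 4%.", "What was the return?", True),
         ("", "What was the return?", False)]
print(evaluate(stub, cases))  # (1.0, 1.0)
```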
Forget What You Know about LLMs Evaluations - LLMs are Like a Chameleon
Relevance: The Chameleon Benchmark Overfit Detector (C-BOD) is a meta-evaluation framework that systematically perturbs prompts to detect model overfitting to superficial dataset cues. This method forces evaluation to move beyond idealized leaderboard scores and prioritize robustness and generalization. From an HCI perspective, robustness to minor input variations is essential for a reliable and predictable user experience, ensuring the model performs consistently outside of narrow, academic testing environments.
💡 Summary 📄 Full paper
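A minimal sketch of the perturb-and-compare idea: rephrase each prompt in a meaning-preserving way and measure the accuracy drop, where a large gap suggests overfitting to surface cues. The toy perturbation and brittle lookup "model" are assumptions; the paper uses systematic textual perturbations and significance testing.

```python
# Sketch: detect phrasing overfit via the accuracy gap under rephrasing.
def accuracy(model, items):
    return sum(model(q) == a for q, a in items) / len(items)

def overfit_gap(model, items, perturb):
    original = accuracy(model, items)
    perturbed = accuracy(model, [(perturb(q), a) for q, a in items])
    return original - perturbed              # large positive gap = red flag

items = [("What is 2+2?", "4"), ("What is 3+3?", "6")]
lookup = {"What is 2+2?": "4", "What is 3+3?": "6"}  # brittle memorizer
model = lambda q: lookup.get(q, "?")
print(overfit_gap(model, items, perturb=lambda q: q.replace("What is", "Compute")))
# 1.0 -> the "model" only works on the memorized phrasing
```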
Auditing Prompt Caching in Language Model APIs
Relevance: This paper develops statistical audits to detect prompt caching in commercial LLM APIs, revealing potential side-channel timing attacks that leak user prompt data. This evaluation is critical from an HCI/Ethics standpoint, as it addresses transparency, privacy, and accountability in deployed systems. Users must have confidence that their inputs are handled securely and confidentially. This work provides a framework for auditing black-box commercial models for non-compliance with expected privacy standards, impacting user trust significantly.
💡 Summary 📄 Full paper
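A minimal sketch of such a timing audit: compare response latencies for repeated (potentially cached) prompts against fresh prompts with a one-sided permutation test. The latencies below are simulated; a real audit would measure a live API.

```python
# Sketch: permutation test for "repeated prompts are answered faster".
import numpy as np

rng = np.random.default_rng(1)
cached = rng.normal(0.12, 0.02, 50)   # repeated prompt: faster if cache hits
fresh = rng.normal(0.20, 0.02, 50)    # first-seen prompts

def perm_test(a, b, n=10_000):
    """P(mean difference as small as observed under the no-caching null)."""
    observed = a.mean() - b.mean()
    pooled = np.concatenate([a, b])
    stats = []
    for _ in range(n):
        rng.shuffle(pooled)
        stats.append(pooled[:len(a)].mean() - pooled[len(a):].mean())
    return float(np.mean(np.array(stats) <= observed))

p = perm_test(cached, fresh)
print(f"p = {p:.4f}")  # tiny p-value -> timing leaks whether a prompt was cached
```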
Reinforcement Learning
Learning Conformal Abstention Policies for Adaptive Risk Management in Large Language and Vision-Language Models
Relevance: This work uses Reinforcement Learning (RL) to dynamically optimize Conformal Prediction (CP) thresholds for uncertainty quantification (UQ). This is a vital application of RL for improving trustworthy AI. By enabling models to selectively abstain when uncertainty is high, the system provides reliable statistical coverage guarantees, which is crucial for safety-critical HCI applications where human oversight must be automatically triggered during high-risk or low-confidence decisions.
💡 Summary 📄 Full paper
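A minimal sketch of the underlying abstention mechanism, using textbook split conformal prediction: calibrate a score threshold at miscoverage level alpha, then abstain whenever the nonconformity score exceeds it. The paper's contribution is adapting this policy with RL; the static quantile below is only the standard baseline, with simulated scores.

```python
# Sketch: split conformal calibration of an abstention threshold.
import numpy as np

rng = np.random.default_rng(2)
alpha = 0.1                                   # target miscoverage rate

# Nonconformity scores, e.g. 1 - softmax probability of the true label.
cal_scores = rng.uniform(0, 1, 500)           # calibration split (simulated)
n = len(cal_scores)
q = np.quantile(cal_scores, np.ceil((n + 1) * (1 - alpha)) / n)

def predict_or_abstain(score: float) -> str:
    """Abstain (trigger human oversight) when uncertainty exceeds the threshold."""
    return "ABSTAIN" if score > q else "PREDICT"

print(round(float(q), 3), predict_or_abstain(0.95), predict_or_abstain(0.3))
```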
DPO-Shift: Shifting the Distribution of Direct Preference Optimization
Relevance: DPO-Shift addresses the likelihood displacement issue in Direct Preference Optimization (DPO), a core technique in Reinforcement Learning from Human Feedback (RLHF) used for model alignment. Improving the stability and efficiency of DPO is fundamental for training LLMs that reliably adhere to complex human preferences and ethical values. This foundational RL work directly impacts the quality and trustworthiness of human-aligned agents and LLMs used in user-facing applications.
💡 Summary 📄 Full paper
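A minimal PyTorch sketch of the DPO objective, with a scalar `f_lambda` scaling the rejected-response term as a simplified stand-in for DPO-Shift's parameter function; setting `f_lambda = 1.0` recovers vanilla DPO. Sequence-level (token-summed) log-probabilities are assumed.

```python
# Sketch: DPO loss with a shift factor on the rejected-response log-ratio.
import torch
import torch.nn.functional as F

def dpo_shift_loss(logp_w, logp_l, ref_logp_w, ref_logp_l,
                   beta: float = 0.1, f_lambda: float = 0.95):
    ratio_w = logp_w - ref_logp_w            # chosen log-ratio vs. reference
    ratio_l = logp_l - ref_logp_l            # rejected log-ratio vs. reference
    return -F.logsigmoid(beta * (ratio_w - f_lambda * ratio_l)).mean()

# Toy usage with made-up log-probs for a batch of 2 preference pairs:
loss = dpo_shift_loss(
    torch.tensor([-12.0, -9.5]), torch.tensor([-14.0, -11.0]),
    torch.tensor([-12.5, -10.0]), torch.tensor([-13.0, -10.5]),
)
print(loss.item())
```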
Competitive Programming with Large Reasoning Models
Relevance: This paper demonstrates that scaling general-purpose Reinforcement Learning (RL) significantly boosts performance on complex human-level reasoning tasks, outperforming specialized, hand-crafted domain strategies. This finding has implications for designing RL environments and policies for human-agent collaboration. It suggests that focusing on robust, generalizable RL scaling, rather than narrow domain engineering, is the path toward creating agents capable of intuitive interaction and high performance across diverse user requests.
💡 Summary 📄 Full paper
Explainable AI
Sparse Autoencoders for Scientifically Rigorous Interpretation of Vision Models
Relevance: This work uses Sparse Autoencoders (SAEs) to discover human-interpretable visual features and enables controlled interventions to test causal hypotheses about model behavior. This advances XAI from passive visualization (e.g., attention maps) towards scientific rigor and causal understanding. Providing causally validated explanations is essential for building user trust and ensuring that explanations are actionable and relevant for human decision-making and model debugging.
💡 Summary 📄 Full paper
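A minimal PyTorch sketch of a sparse autoencoder over frozen vision-model activations: an overcomplete ReLU encoder trained with reconstruction loss plus an L1 sparsity penalty, so individual dictionary features tend to become interpretable. Dimensions and the penalty weight are illustrative, not the paper's configuration.

```python
# Sketch: sparse autoencoder on (stand-in) vision-model activations.
import torch
import torch.nn as nn

class SparseAutoencoder(nn.Module):
    def __init__(self, d_act: int = 768, d_dict: int = 4096):
        super().__init__()
        self.encoder = nn.Linear(d_act, d_dict)   # overcomplete dictionary
        self.decoder = nn.Linear(d_dict, d_act)

    def forward(self, x):
        z = torch.relu(self.encoder(x))           # sparse feature activations
        return self.decoder(z), z

sae = SparseAutoencoder()
acts = torch.randn(32, 768)                       # stand-in for ViT activations
recon, z = sae(acts)
loss = ((recon - acts) ** 2).mean() + 1e-3 * z.abs().mean()  # recon + L1
loss.backward()
print(loss.item())
```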
LLM Pretraining with Continuous Concepts
Relevance: The proposed Continuous Concept Mixing (CoCoMix) framework enhances LLM interpretability and steerability by allowing direct inspection and modification of continuous concepts during the model’s internal reasoning process. This offers a high degree of transparency into the “black box.” For HCI, this provides a transparent mechanism and potential interface for users or developers to guide the AI’s internal logic, making the model’s behavior more predictable and controllable than traditional post-hoc XAI methods.
💡 Summary 📄 Full paper
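A minimal sketch of the concept-mixing idea: predict a sparse continuous concept vector from a hidden state and mix a projection of it back into the residual stream, leaving an inspectable, editable hook for steering. Shapes, the top-k sparsification, and the mixing projection are assumptions made for illustration, not CoCoMix's exact architecture.

```python
# Sketch: mix a sparse, inspectable concept vector back into hidden states.
import torch
import torch.nn as nn

d_model, d_concept, k = 512, 2048, 16

concept_head = nn.Linear(d_model, d_concept)   # hidden state -> concept scores
mixer = nn.Linear(d_concept, d_model)          # concepts -> hidden-state delta

def mix_concepts(h: torch.Tensor) -> torch.Tensor:
    scores = concept_head(h)
    topk = torch.topk(scores, k, dim=-1)       # keep the k strongest concepts
    sparse = torch.zeros_like(scores).scatter(-1, topk.indices, topk.values)
    # `sparse` is the inspectable/editable concept vector -- the steering hook.
    return h + mixer(sparse)

h = torch.randn(4, 10, d_model)                # (batch, seq, hidden)
print(mix_concepts(h).shape)                   # torch.Size([4, 10, 512])
```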