AI Papers Reader

Personalized digests of latest AI research


The "Silent Janitor" Problem: Why AI Coding Agents Struggle with Software Observability

As AI coding agents like Devin and GitHub Copilot transition from simple autocomplete tools to independent “teammates” capable of submitting entire pull requests, a new concern is emerging among software engineers: observability. While these agents are increasingly good at writing code that works, a new study suggests they are surprisingly poor at writing code that can be debugged.

The research, titled “Do AI Coding Agents Log Like Humans?”, conducted by a team at Queen’s University and ETS Montréal, performed an empirical post-mortem on 4,550 agent-generated pull requests across 81 major open-source repositories. The verdict? AI agents are failing a critical “non-functional” test of software engineering: logging.

The Airplane Black Box of Code

In software development, logging acts as an airplane's black box. When a system crashes at 3:00 AM, logs tell the engineer what the program was doing right before the failure. Without them, engineers are flying blind, and fixing errors becomes a guessing game.

The study found that AI agents are fundamentally less proactive about logging than humans. In 58.4% of the studied repositories, humans modified or added logs more frequently than their AI counterparts. While agents are proficient at "reactive" logging, such as adding an error message inside a try/catch block when a piece of code explicitly fails, they often skip the "proactive" logging that describes a system's healthy state (INFO-level logs).

For example, a human developer might insert a log saying, “User session validated for ID 123,” to track a successful workflow. An AI agent, however, is more likely to stay silent unless something breaks, leaving future maintainers in the dark about the system’s normal behavior.
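The distinction can be illustrated with a small Python sketch. The session store, `validate_session`, and the user IDs here are hypothetical, invented for illustration rather than taken from the study:

```python
import logging

logging.basicConfig(level=logging.INFO)
log = logging.getLogger("session")

# Hypothetical in-memory session store, used only for illustration.
_SESSIONS = {123: "active"}

def validate_session(user_id: int) -> bool:
    try:
        state = _SESSIONS[user_id]
    except KeyError:
        # "Reactive" logging: the style the study found agents handle
        # well -- an error message emitted only when the code fails.
        log.error("Session lookup failed for user %s", user_id)
        return False
    # "Proactive" logging: the INFO-level record of the healthy path
    # that the study found agents tend to omit.
    log.info("User session validated for ID %s (state=%s)", user_id, state)
    return True
```

An agent-written version of this function would, per the study's findings, typically include the `log.error` line but not the `log.info` line.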

The “Silent Janitor” Burden

Perhaps the most striking finding is the emergence of what the researchers call “silent janitors.” When AI agents submit code with poor or missing logs, human reviewers are quietly cleaning up the mess. The study found that humans perform 72.5% of all post-generation log repairs.

Interestingly, humans rarely complain about these missing logs in code review comments. Instead, they simply fix the logging issues themselves in subsequent commits. This creates a “hidden maintenance tax”—while the AI agent appears to be saving time by generating code quickly, humans are losing time “sweeping the floors” to ensure the code remains maintainable and observable.

A Failure to Follow Instructions

One might assume that simply telling an AI agent to “be more descriptive with logs” would solve the problem. The data says otherwise. Explicit logging instructions were found in only 4.7% of the tasks assigned to agents, and even when developers provided specific, constructive requests, the agents ignored them 67% of the time.

In one instance, even when agents were given "strong" instructions, specifying exactly which files to log and which frameworks to use, they complied only 27.3% of the time. This suggests a "compliance gap": the underlying Large Language Models (LLMs) prioritize passing functional tests (making the code run) over adhering to complex, nuanced instructions about system health.

Moving Toward Guardrails

The study concludes that relying on natural language prompts to improve AI logging is a losing battle. Instead, the researchers argue for "deterministic guardrails": integrating automated tools, such as linters or CI/CD checks, that automatically block an AI agent from submitting code that does not meet specific logging density or placement standards.
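The paper prescribes no particular tool, but as a rough sketch of what such a guardrail could look like, the hypothetical checker below uses Python's standard `ast` module to fail a CI job when a file's logging-call density falls below a threshold. The `LOG_FUNCS` set, the 1% ratio, and the `check_density` name are all illustrative choices, not the study's:

```python
import ast
import sys

# Method names treated as logging calls (log.info(...), logging.error(...), etc.).
LOG_FUNCS = {"debug", "info", "warning", "error", "exception", "critical"}

def log_call_count(tree: ast.AST) -> int:
    """Count attribute calls whose method name looks like a logging call."""
    count = 0
    for node in ast.walk(tree):
        if (isinstance(node, ast.Call)
                and isinstance(node.func, ast.Attribute)
                and node.func.attr in LOG_FUNCS):
            count += 1
    return count

def check_density(source: str, min_ratio: float = 0.01) -> bool:
    """Pass only if logging calls per source line meet min_ratio (illustrative threshold)."""
    tree = ast.parse(source)
    lines = max(len(source.splitlines()), 1)
    return log_call_count(tree) / lines >= min_ratio

if __name__ == "__main__" and len(sys.argv) > 1:
    # Exit non-zero so a CI pipeline can block the pull request.
    with open(sys.argv[1]) as f:
        sys.exit(0 if check_density(f.read()) else 1)
```

Wired into a pre-merge pipeline, a check like this rejects an under-logged submission deterministically, regardless of how the agent interpreted its prompt.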

As we move toward a world where AI writes the bulk of our software, the message from the research is clear: speed is useless if the resulting system is a black box. Until agents learn to “log like humans,” the burden of keeping the lights on will remain firmly on human shoulders.