AI Papers Reader

Personalized digests of latest AI research


Mapping the Path to Smarter AI Terminal Agents

In the push to create “agentic” AI—models that don’t just talk but actually do—the computer terminal remains the ultimate frontier. The command-line interface (CLI) is the universal toolkit of computing, allowing a user to do everything from deploying a website to analyzing massive datasets. However, teaching artificial intelligence to navigate this text-based world is notoriously difficult.

Current AI “terminal agents” often struggle with long, complex tasks because they lack high-quality training data. Manually writing thousands of terminal workflows is too slow, and random automated generation often produces repetitive or nonsensical tasks. To solve this, researchers from Tencent’s Hunyuan Team have unveiled SkillSynth, a framework that uses a “skill graph” to automatically generate diverse, complex, and executable terminal tasks at scale.

The Problem: Scarcity of the “Messy Middle”

Training a terminal agent is like training a chef. It’s easy to find a recipe for boiling an egg (a single command), but it’s much harder to find millions of examples of a chef navigating a busy kitchen, improvising when a stove breaks, and managing five dishes at once. Existing training sets for AI agents are either too simple or too narrow, often focusing only on basic software bug fixes.

The Innovation: The Skill Graph

SkillSynth changes the game by building a massive, interconnected map of “skills” and “scenarios.”

To understand this, imagine a graph where every node is a scenario (e.g., “A folder containing a raw video file”) and every edge connecting those nodes is a skill (e.g., “Running a command-line video analyzer”).

The researchers used Large Language Models (LLMs) to scan thousands of real-world scripts from GitHub and other repositories. They identified what needs to be true before a command is run (the pre-condition) and what happens after (the post-condition). By linking these, they created a “skill graph” containing over 82,000 scenarios and 57,000 skills.
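The graph structure described above can be sketched in a few lines. This is a minimal illustration, not the paper's actual data model: the scenarios, commands, and class names below are invented for the example, and a skill is represented simply as an edge whose pre-condition is the source scenario and whose post-condition is the target.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Scenario:
    """A node: a described machine state, e.g. 'folder with a raw video file'."""
    description: str

@dataclass(frozen=True)
class Skill:
    """An edge: a command whose pre-condition is the source scenario
    and whose post-condition is the resulting scenario."""
    command: str
    pre: Scenario   # what must be true before the command runs
    post: Scenario  # what is true after it succeeds

class SkillGraph:
    def __init__(self):
        # adjacency list: scenario -> skills applicable from that state
        self.edges: dict[Scenario, list[Skill]] = {}

    def add_skill(self, skill: Skill) -> None:
        self.edges.setdefault(skill.pre, []).append(skill)

    def applicable(self, scenario: Scenario) -> list[Skill]:
        return self.edges.get(scenario, [])

# Toy entries (illustrative, not drawn from the paper's 82,000-node graph)
raw = Scenario("directory containing a raw .mp4 video")
frames = Scenario("directory containing extracted .png frames")
gif = Scenario("directory containing an animated .gif")

g = SkillGraph()
g.add_skill(Skill("ffmpeg -i in.mp4 frames/%04d.png", raw, frames))
g.add_skill(Skill("ffmpeg -i frames/%04d.png out.gif", frames, gif))
```

Because each skill's post-condition is itself a scenario node, skills chain naturally: any walk through the graph is a sequence of commands where each step leaves the machine in the state the next step expects.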

How it Works: From Map to Mission

SkillSynth generates new training tasks in a three-step process:

  1. Path Sampling: The system picks a “route” through the graph. For example, it might start with a raw video, move to a skill that extracts frames, and then to a skill that packages those frames into a GIF.
  2. The Synthesis Harness: Once a path is chosen, a multi-agent system takes over. A “Planner” agent creates a strategy, and a “Constructor” agent builds the actual environment—the folders, the buggy code, and the specific files needed for the task.
  3. Dual Verification: To ensure the task isn’t “hallucinated” or impossible, a “Verifier” agent actually runs the intended solution. If the solution fails, the task is sent back for repairs or discarded.
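The three steps above can be sketched as a small pipeline. This is a conceptual sketch under assumed interfaces, not the paper's implementation: the toy graph, the `construct` and `verify` callables (stand-ins for the Planner/Constructor and Verifier agents), and the repair budget are all hypothetical.

```python
import random

# Toy skill graph: scenario -> [(skill_command, resulting_scenario)]
# (illustrative stand-ins, not the paper's actual graph)
GRAPH = {
    "raw video": [("extract frames", "frame directory")],
    "frame directory": [("package frames into GIF", "animated gif")],
}

def sample_path(graph, start, max_len=3, rng=random):
    """Step 1: random-walk a route through the skill graph."""
    path, current = [], start
    for _ in range(max_len):
        options = graph.get(current)
        if not options:
            break  # dead end: no skill applies to this scenario
        skill, current = rng.choice(options)
        path.append(skill)
    return path

def synthesize_task(path, construct, verify, max_repairs=2):
    """Steps 2-3: the Constructor builds the environment for the path;
    the Verifier executes the reference solution. Failed tasks are
    rebuilt up to max_repairs times, then discarded."""
    for _ in range(max_repairs + 1):
        task = construct(path)   # Planner + Constructor agents
        if verify(task):         # Verifier actually runs the solution
            return task          # verified: keep the task
    return None                  # unfixable or impossible: discard
```

The key design point is the final gate: a task only enters the training set if its intended solution was executed end-to-end, which is what keeps hallucinated or impossible tasks out of the data.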

Results and Impact

In a single automated run, SkillSynth produced 3,560 verified, high-quality task instances. These tasks are significantly more difficult than previous datasets: even top-tier models like Claude Opus struggled to solve many of them, and successful solutions required an average of 37 steps per task.

More importantly, the data works. When the researchers fine-tuned models like Qwen3 using SkillSynth’s trajectories, the models showed dramatic improvements on “Terminal-Bench,” a standard industry test. The researchers noted that these synthetic tasks have already been used to train Tencent’s Hy3 Preview model, enhancing its ability to handle real-world terminal workflows.

By turning the vast, messy world of the command line into a structured map, SkillSynth provides a scalable way to move AI agents beyond simple chat and into the realm of autonomous problem-solving.