AI Agents Learn from Synthetic Task Generation
March 25, 2026 · 3 min read
Training AI agents to interact with computer interfaces has long been bottlenecked by the need for human-created training data. Researchers have developed a new pipeline that could dramatically accelerate how these agents learn to navigate mobile apps and desktop software. The approach addresses a fundamental challenge in scaling post-training for multimodal large language models (MLLMs) designed for interactive applications.
AutoPlay, the newly presented pipeline, operates through two distinct phases that work together to create diverse training tasks. First, an exploration phase uses an MLLM explorer agent to systematically uncover novel environment states and functionalities within applications. This exploration generates detailed trajectories that capture possible interactions and current state information across different software environments.
In the second phase, a task generator leverages these exploration trajectories along with task guideline prompts to synthesize environment-grounded tasks. This process creates tasks that are diverse, executable, and verifiable without requiring human annotation. The system specifically targets the creation of downstream agentic task datasets that previous approaches struggled to produce at scale.
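The two-phase design described above can be sketched in a few lines of code. This is a minimal illustration under assumed interfaces: the names `explore`, `generate_tasks`, and the trajectory structure are hypothetical stand-ins, not the paper's actual API, and the MLLM calls are stubbed out.

```python
from dataclasses import dataclass, field

@dataclass
class Step:
    action: str        # e.g. "tap(settings_icon)"
    observation: str   # summary of the resulting UI state

@dataclass
class Trajectory:
    app: str
    steps: list[Step] = field(default_factory=list)

def explore(app: str, num_steps: int) -> Trajectory:
    """Phase 1: an MLLM explorer agent probes the app for novel
    states and functionalities. Stubbed with a fixed sequence here;
    a real explorer would query an MLLM for each next action."""
    traj = Trajectory(app)
    for i in range(num_steps):
        traj.steps.append(Step(f"action_{i}", f"state_{i}"))
    return traj

def generate_tasks(traj: Trajectory, guideline: str) -> list[dict]:
    """Phase 2: a task generator combines exploration trajectories
    with a guideline prompt to synthesize grounded tasks (stubbed).
    Each visited state doubles as a verifiable goal, which is what
    makes the tasks checkable without human annotation."""
    return [
        {
            "app": traj.app,
            "instruction": f"{guideline}: reach {step.observation}",
            "goal_state": step.observation,
        }
        for step in traj.steps
    ]

trajectory = explore("settings_app", num_steps=3)
tasks = generate_tasks(trajectory, "Navigate the UI")
print(len(tasks))  # 3
```

The key property the sketch highlights is that every generated task is grounded in a state the explorer actually reached, so executability and verifiability come for free.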
The researchers demonstrated AutoPlay's effectiveness by generating 20,000 tasks across 20 Android applications and 10,000 tasks across 13 Ubuntu applications. These synthetic tasks enabled training of mobile-use and computer-use agents that showed significant performance improvements. The MLLM-based UI agents trained on this data achieved success rate improvements of up to 20.0 points on mobile-use scenarios and 10.9 points on computer-use scenarios.
Beyond supervised training, the AutoPlay-generated tasks combined with MLLM verifier-based rewards enabled scaling reinforcement learning for UI agents. This approach yielded an additional 5.7-point gain in performance, showing the versatility of the synthetic task generation approach. The system's ability to create verifiable tasks without human intervention represents a substantial advancement in agent training methodology.
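A verifier-based reward of this kind reduces to a simple scoring function that an RL trainer can call at the end of each episode. The sketch below is an assumed, simplified form: `verifier_reward` and the string-matching `judge` stub are hypothetical placeholders for an actual MLLM judgment call.

```python
def verifier_reward(task_goal: str, final_observation: str,
                    judge=None) -> float:
    """Binary episode reward: 1.0 if the verifier judges that the
    agent's final state satisfies the task goal, else 0.0.
    `judge` stands in for an MLLM verifier; the default here is a
    trivial substring check used purely for illustration."""
    if judge is None:
        judge = lambda goal, obs: goal in obs
    return 1.0 if judge(task_goal, final_observation) else 0.0

# An RL loop would assign this reward per rollout, e.g.:
reward = verifier_reward("wifi_enabled",
                         "screen shows wifi_enabled toggle on")
print(reward)  # 1.0
```

Because the reward depends only on the task's goal specification and the observed end state, no human labeling is needed to scale the RL stage.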
The research establishes AutoPlay as a scalable approach for post-training capable MLLM agents while reducing reliance on human annotation. By explicitly exploring interactive environments to discover possible interactions, the pipeline addresses the coverage limitations of previous methods that relied on prompting MLLMs with limited downstream environment information. This systematic exploration enables more comprehensive task generation.
While the paper demonstrates impressive results across multiple application domains, the approach operates within specific constraints. The pipeline focuses on UI-based interactions in controlled software environments rather than open-world scenarios. Additionally, the quality of generated tasks depends on the thoroughness of the exploration phase and the capabilities of the underlying MLLM.
The research represents a significant step toward scalable agent training, but practical implementation would require adaptation to different interface types and interaction modalities. The current validation focuses on Android and Ubuntu applications, leaving other platforms and more complex multi-modal environments as potential areas for future work.