
New AI Model Boosts Computer Agent Performance

March 23, 2026 · 3 min read


A new artificial intelligence model promises to make computer-using agents more efficient and capable. Holotron-12B, developed through collaboration between H Company and NVIDIA, represents a shift from models designed for static tasks to those optimized for dynamic, interactive environments. This advancement could accelerate the development of AI systems that can navigate software interfaces, complete complex workflows, and assist users in real-time applications.

Holotron-12B serves as a policy model for computer-use agents, meaning it's designed to perceive, decide, and act within interactive digital environments. Unlike most multimodal models that focus on static vision or instruction-following, this model targets the specific demands of agents that must operate efficiently in production settings. The model was created to handle long contexts with multiple images while scaling effectively in deployment scenarios.
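The perceive-decide-act cycle described above can be sketched as a minimal agent loop. This is an illustrative skeleton, not H Company's actual agent stack: the function names, the `Observation` fields, and the stubbed action logic are all hypothetical stand-ins for what a policy model like Holotron-12B would drive in production.

```python
from dataclasses import dataclass

@dataclass
class Observation:
    screenshot: bytes  # raw screen pixels the policy model would perceive
    url: str

def perceive(step: int) -> Observation:
    # Stub: a real agent would capture a browser or desktop screenshot here.
    return Observation(screenshot=b"", url=f"https://example.com/page{step}")

def decide(obs: Observation, goal: str) -> str:
    # Stub: the policy model would map (interaction history, goal) to a
    # UI-level action such as click/type/scroll. Here we stop after two clicks.
    return "done" if obs.url.endswith("page2") else "click"

def act(action: str) -> None:
    pass  # would dispatch the chosen action back into the environment

def run_agent(goal: str, max_steps: int = 10) -> list[str]:
    """Minimal perceive-decide-act loop; returns the trace of actions taken."""
    trace = []
    for step in range(max_steps):
        obs = perceive(step)
        action = decide(obs, goal)
        trace.append(action)
        if action == "done":
            break
        act(action)
    return trace
```

The loop's interaction history is exactly what makes context length and multi-image handling the binding constraints for this class of model.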

The model's architecture combines a hybrid State-Space Model (SSM) with attention mechanisms, building on NVIDIA's Nemotron foundation. This design avoids the quadratic computation cost of traditional transformer attention, particularly benefiting workloads with lengthy interaction histories and multiple images. The SSM component dramatically reduces memory requirements by storing only a constant state per layer rather than maintaining key-value caches that grow with sequence length.
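The memory argument can be made concrete with back-of-the-envelope arithmetic. The layer counts, head dimensions, and state sizes below are illustrative assumptions, not Holotron-12B's actual configuration; the point is the scaling behavior, not the absolute numbers.

```python
def kv_cache_bytes(seq_len, n_layers, n_kv_heads, head_dim, dtype_bytes=2):
    # Transformer attention: keys AND values cached for every past token,
    # so memory grows linearly with sequence length.
    return 2 * seq_len * n_layers * n_kv_heads * head_dim * dtype_bytes

def ssm_state_bytes(n_layers, state_dim, channels, dtype_bytes=2):
    # SSM layers: one fixed-size recurrent state per layer,
    # independent of how long the interaction history is.
    return n_layers * state_dim * channels * dtype_bytes

# Hypothetical 12B-class config, bf16 (2 bytes) activations:
for seq_len in (4_096, 65_536):
    kv = kv_cache_bytes(seq_len, n_layers=40, n_kv_heads=8, head_dim=128)
    print(f"{seq_len:>6} tokens: KV cache ~ {kv / 2**20:.0f} MiB")

ssm = ssm_state_bytes(n_layers=40, state_dim=128, channels=4_096)
print(f"SSM state (any sequence length) ~ {ssm / 2**20:.0f} MiB")
```

Under these assumptions the KV cache grows from hundreds of MiB to tens of GiB as histories lengthen, while the SSM state stays constant, which is why the hybrid design frees VRAM for serving more concurrent requests.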

Training occurred in two distinct phases, beginning with the open NVIDIA Nemotron-Nano-12B-v2-VL-BF16 multimodal base model. Researchers then performed supervised fine-tuning on H Company's proprietary data mixture focused on localization and navigation tasks. This training emphasized screen understanding, grounding, and UI-level interactions, with the final checkpoint trained on approximately 14 billion tokens of data.
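H Company's data mixture is proprietary, but the idea of fine-tuning on a weighted blend of task families can be sketched generically. The two source names mirror the task types mentioned above; the weights and sampler are purely hypothetical.

```python
import random

def sample_mixture(sources, weights, n, seed=0):
    """Draw a training stream by sampling named data sources
    according to fixed mixture weights (a common SFT recipe)."""
    rng = random.Random(seed)  # seeded for reproducible mixtures
    return rng.choices(sources, weights=weights, k=n)

# Hypothetical 60/40 blend of the two task families:
stream = sample_mixture(["localization", "navigation"], weights=[0.6, 0.4], n=1000)
```

In practice the mixture weights are a key tuning knob: over-weighting navigation traces at the expense of grounding data (or vice versa) shifts which benchmarks the final checkpoint favors.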

Performance testing revealed significant improvements across multiple metrics. On the WebVoyager Benchmark, which simulates real-world multimodal agent workloads, Holotron-12B achieved a score of 80.5, more than doubling the base Nemotron model's performance of 35.1. This result also exceeded the performance of the previous Holo2-8B model, demonstrating the new model's effectiveness in agentic settings.

Inference efficiency showed particularly dramatic gains. Running on a single H100 GPU with vLLM optimizations, Holotron-12B achieved substantially higher throughput than Holo2-8B. In controlled experiments, total token throughput reached 8.9k tokens per second at a maximum concurrency of 100 benchmark workers, while Holo2-8B plateaued at 5.1k tokens per second, a roughly 1.7x advantage at saturation. This scaling efficiency stems from better VRAM utilization and a smaller memory footprint.
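The reported saturation numbers translate directly into per-model and per-worker figures. The computation below uses only the throughputs stated above; the worker count of 100 is the benchmark's maximum concurrency.

```python
def speedup(new_tps: float, baseline_tps: float) -> float:
    """Ratio of token throughputs between two serving configurations."""
    return new_tps / baseline_tps

# Saturation throughput at 100 concurrent benchmark workers (single H100, vLLM):
holotron_tps = 8_900  # tokens/s, Holotron-12B
holo2_tps = 5_100     # tokens/s, Holo2-8B baseline

ratio = speedup(holotron_tps, holo2_tps)       # ~1.75x
per_worker = holotron_tps / 100                # ~89 tokens/s per worker
print(f"speedup ~ {ratio:.2f}x, per-worker ~ {per_worker:.0f} tok/s")
```

Per-worker budget like this is what determines whether a throughput-bound workload such as online RL or large-scale annotation stays latency-acceptable as concurrency rises.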

The model also showed substantial improvements on localization and grounding benchmarks, including OS-World-G, GroundUI, and WebClick. These results indicate Holotron-12B's enhanced ability to understand screen elements and interact with user interfaces precisely. The combination of strong benchmark performance and high throughput makes the model suitable for throughput-bound applications like data generation, annotation, and online reinforcement learning.

Researchers acknowledge that future improvements could focus on higher-resolution vision training. The model demonstrates that the NVIDIA Nemotron VL architecture provides a strong foundation for real-world multimodal agents when paired with appropriate training data and infrastructure. Holotron-12B is now available on Hugging Face under an NVIDIA Open Model License for further development and deployment.

Looking forward, the development team is preparing to post-train the next generation of models based on the newly announced Nemotron 3 Omni architecture. This evolution aims to deliver greater reasoning capabilities and multimodal precision while maintaining the high-throughput, low-latency performance required for commercial-scale autonomous computer use deployments.