AMD and DigitalOcean Double AI Inference Performance
March 23, 2026 · 3 min read
As AI models grow larger and more complex, deploying them efficiently at scale has become critical for companies relying on real-time inference. Character.ai, an AI entertainment platform with around 20 million users, faced this exact issue, needing to optimize GPU performance and reduce inference costs for low-latency applications. It partnered with DigitalOcean and AMD to tackle these demands, aiming to increase production inference throughput without compromising responsiveness. The collaboration focused on the Qwen3-235B Instruct FP8 model, a Mixture-of-Experts architecture, which was migrated from generic GPU setups to AMD Instinct MI325X platforms on DigitalOcean.
The headline result was a doubling of production inference throughput: a 2x improvement in requests per second under strict latency and concurrency constraints, achieved while holding p90 first-token latency and time-per-output-token targets for a 5600/140 input/output sequence-length workload. Compared to non-optimized deployments on other providers, the optimized configuration delivered a 91% increase in throughput, directly lowering cost-per-token and total cost of ownership. The success led to a multi-year, eight-figure annual agreement between Character.ai and DigitalOcean for GPU infrastructure.
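The link between throughput and cost-per-token is simple arithmetic: at a fixed node price, doubling tokens served per second halves the cost of each token. A minimal sketch, using a hypothetical GPU-hour price and illustrative throughput numbers (only the 2x ratio comes from the article):

```python
def cost_per_million_tokens(gpu_hour_cost: float, tokens_per_second: float) -> float:
    """Cost to generate one million tokens on a node at a given sustained throughput."""
    tokens_per_hour = tokens_per_second * 3600
    return gpu_hour_cost * 1_000_000 / tokens_per_hour

# Hypothetical node price and throughputs, for illustration only.
baseline = cost_per_million_tokens(gpu_hour_cost=25.0, tokens_per_second=2000)
optimized = cost_per_million_tokens(gpu_hour_cost=25.0, tokens_per_second=4000)  # 2x throughput

# Doubling throughput at the same node price halves cost-per-token.
print(baseline, optimized)
```

The same relation is why the reported 91% throughput gain translates almost directly into total-cost-of-ownership savings.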
The methodology involved deep technical collaboration across the three teams, implementing platform-level optimizations tailored to AMD hardware. They used vLLM with ROCm support for model serving, resolving initial compatibility issues such as memory access faults through upstream fixes. Critical optimizations included FP8 execution for both model weights and the KV cache, which cut VRAM usage by 50% and leveraged AMD's native FP8 support; expert parallelism to distribute the model's 128 experts efficiently across GPUs; and custom CUDA graph compilation settings to avoid crashes and improve performance. The team also combined tensor parallelism and data parallelism, experimenting with different configurations to maximize throughput.
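The 50% KV-cache saving follows directly from element width: FP8 stores one byte per value where FP16 stores two. A back-of-the-envelope sketch using the standard KV-cache sizing formula — the layer count, KV-head count, and head dimension below are illustrative values for a Qwen3-scale GQA model, not figures from the article:

```python
def kv_cache_bytes_per_token(num_layers: int, num_kv_heads: int,
                             head_dim: int, bytes_per_elem: int) -> int:
    """Per-token KV cache size: K and V each store num_kv_heads * head_dim values per layer."""
    return 2 * num_layers * num_kv_heads * head_dim * bytes_per_elem

# Illustrative architecture numbers (assumed, not from the article).
fp16_bytes = kv_cache_bytes_per_token(num_layers=94, num_kv_heads=4, head_dim=128, bytes_per_elem=2)
fp8_bytes = kv_cache_bytes_per_token(num_layers=94, num_kv_heads=4, head_dim=128, bytes_per_elem=1)

print(fp8_bytes / fp16_bytes)  # 0.5 — FP8 halves KV cache memory per token
```

Halving per-token cache memory doubles the number of cached tokens a node can hold, which is what allows higher concurrency at the same VRAM budget.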
A pivotal change was shifting from a DP1/TP8/EP8 setup to a DP2/TP4/EP4 configuration on a single 8-GPU server. This consolidated model weights onto fewer GPUs, increasing per-GPU memory pressure but allowing two independent inference groups to run concurrently. Despite each GPU handling more computational load, the reduction in communication hops and topology-aware allocation via Kubernetes device plugins kept latency within targets. At a concurrency of 64, this setup delivered 2x the queries per second of the original TP8 configuration and a 45% improvement over the optimized TP8 baseline.
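The topology arithmetic behind this trade-off can be sketched in a few lines. The 8-GPU server and 128-expert count come from the article; the class itself is an illustrative model, not vLLM's configuration API:

```python
from dataclasses import dataclass

@dataclass
class ParallelConfig:
    dp: int  # data-parallel replicas (independent inference groups)
    tp: int  # tensor-parallel GPUs per replica
    ep: int  # expert-parallel shards within a replica

    def gpus_needed(self) -> int:
        # Each data-parallel replica occupies tp GPUs.
        return self.dp * self.tp

    def experts_per_gpu(self, total_experts: int) -> int:
        # Experts are sharded across the ep GPUs of each replica.
        return total_experts // self.ep

old = ParallelConfig(dp=1, tp=8, ep=8)
new = ParallelConfig(dp=2, tp=4, ep=4)

# Both layouts fit the same 8-GPU server...
print(old.gpus_needed(), new.gpus_needed())        # 8 8
# ...but the new one holds twice as many experts per GPU (higher memory
# pressure) in exchange for two concurrent serving groups and fewer
# cross-GPU communication hops per request.
print(old.experts_per_gpu(128), new.experts_per_gpu(128))  # 16 32
```

The sketch makes the trade explicit: the same hardware, but each GPU in the DP2 layout carries double the expert weights so that two replicas can serve requests in parallel.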
Analysis shows that the DP2/TP4/EP4 configuration not only doubled throughput but also maintained p90 latency requirements, enabling high request density per node. DigitalOcean Kubernetes facilitated this with managed GPU drivers, device plugins, and NFS caching for model weights, reducing loading times by 10-15%. The optimizations underscore the importance of hardware-software co-design, where alignment between GPU interconnects, memory bandwidth, and model serving software drives efficiency. This approach allowed Character.ai to scale inference predictably across multiple servers without increasing operational burden.
More broadly, the case study highlights foundational shifts needed for production-grade AI infrastructure, emphasizing multi-dimensional optimization of cost, latency, throughput, and concurrency. It demonstrates that deploying large-scale models requires a full-stack reevaluation beyond traditional web services, with granular observability to identify bottlenecks. The collaboration between Character.ai, AMD, and DigitalOcean shows how strategic architectural choices can improve performance while reducing costs, a lesson relevant for any organization scaling AI inference workloads.
Limitations of the study are acknowledged in the source material, noting that the 2x throughput improvement is based on internal testing under specific conditions with the Qwen3-235B Instruct FP8 model. Performance may vary with different models, prompt complexity, hardware availability, and network conditions. The optimizations, such as topology-aware allocation and custom parallelization strategies, are tailored to this workload and may not generalize to all deployments without similar managed support from DigitalOcean Kubernetes.