Slonk Bridges HPC and Kubernetes for AI Research

March 23, 2026 · 3 min read

In the competitive landscape of artificial intelligence development, infrastructure decisions can make or break research productivity. Character.ai has revealed how they solved one of the most persistent challenges in machine learning infrastructure: reconciling the traditional high-performance computing environment researchers prefer with the modern cloud orchestration that operations teams require. Their solution, called Slonk (SLURM on Kubernetes), represents a pragmatic approach to a problem that has plagued many organizations scaling their AI training capabilities.

At its core, Slonk addresses a fundamental tension in research organizations. Researchers typically want SLURM, the established scheduler used in supercomputing environments that provides fair queues, gang scheduling, and familiar workflows. Meanwhile, infrastructure teams need Kubernetes for its orchestration capabilities, health checks, autoscaling, and operational stability. Character.ai faced this exact dilemma when scaling their training infrastructure, with researchers demanding simplicity and speed while operations required efficient GPU sharing and system reliability.

The technical approach is elegantly straightforward. Slonk treats SLURM nodes as long-running Kubernetes pods, creating three StatefulSets for controller, worker, and login functions. Each SLURM node maps directly to a pod, with controller pods managing scheduling, worker pods handling computation, and login pods providing SSH access and a familiar research environment. This architecture allows other workloads to coexist on the same physical machines while maintaining the traditional supercomputing cluster experience researchers expect.
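The node-to-pod mapping can be sketched in a few lines of Python. This is an illustration only: the StatefulSet names (`slonk-controller`, `slonk-worker`, `slonk-login`), the replica counts, and the `nodeNNN` naming scheme are assumptions, not Slonk's actual identifiers. The one accurate Kubernetes detail it leans on is that StatefulSet pods get stable names of the form `<statefulset-name>-<ordinal>`.

```python
# Hypothetical sketch of the architecture above: three StatefulSets,
# with each worker pod doubling as one SLURM compute node.
# All names and counts are invented for illustration.

ROLES = {"controller": 1, "worker": 4, "login": 1}  # replicas per StatefulSet

def pod_names(role: str, replicas: int) -> list[str]:
    # StatefulSet pods are named <name>-<ordinal>, giving each SLURM
    # node a stable pod identity across restarts.
    return [f"slonk-{role}-{i}" for i in range(replicas)]

def slurm_node_map(worker_replicas: int) -> dict[str, str]:
    # One SLURM node per worker pod; "nodeNNN" is an assumed scheme.
    return {f"node{i:03d}": pod
            for i, pod in enumerate(pod_names("worker", worker_replicas))}

print(slurm_node_map(ROLES["worker"]))
```

The stable pod identity is what makes the mapping workable: a restarted `slonk-worker-2` rejoins the cluster as the same SLURM node it was before.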

Researchers continue working with their established workflows: SSH to a login node, edit code on shared NFS home directories, submit jobs, and monitor logs. Behind the scenes, Slonk's controller schedules and allocates resources, with jobs returning to the same shared volumes. For specialized hardware like TPUs and slice-based systems, the system leverages SLURM's network topology awareness to ensure allocations are co-located, enabling jobs to start in seconds rather than minutes through pre-staged cluster capacity.
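The co-location requirement can be illustrated with a toy Python version of topology-aware placement: given nodes labeled by the slice (or switch) they sit on, pick an allocation entirely within one slice, as SLURM's topology plugin would. The node inventory and slice labels here are invented for the example and say nothing about Slonk's actual topology configuration.

```python
from collections import defaultdict

# Toy illustration of topology-aware allocation: choose `count` nodes
# that all share one slice/switch. Inventory is invented for the example.
NODES = {
    "node000": "slice-a", "node001": "slice-a", "node002": "slice-a",
    "node003": "slice-b", "node004": "slice-b",
}

def colocated(nodes: dict[str, str], count: int) -> list[str]:
    # Group free nodes by topology label, then take the first group
    # large enough to satisfy the request in full.
    by_slice: dict[str, list[str]] = defaultdict(list)
    for node, slc in nodes.items():
        by_slice[slc].append(node)
    for members in by_slice.values():
        if len(members) >= count:
            return sorted(members)[:count]
    raise RuntimeError(f"no slice has {count} free nodes")

print(colocated(NODES, 3))
```

A real scheduler also weighs queue priority and fragmentation, but the core constraint (never split a TPU job across slices) reduces to exactly this kind of grouping.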

The system incorporates robust failure recovery mechanisms. When a researcher or automated system marks a SLURM node as faulty, Slonk automatically drains the corresponding Kubernetes node and restarts its virtual machine at the cloud provider level. Nodes that repeatedly fail health checks are excluded from the SLURM pool to maintain job stability, while an observability system tracks all faulty nodes for investigation and long-term reliability improvements.
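The recovery loop above can be sketched as a small state machine. The failure threshold, method names, and log strings here are stand-ins: in reality the drain would be a Kubernetes node drain and the restart a cloud-provider VM API call, neither of which this sketch performs.

```python
# Sketch of drain -> VM restart -> exclude-after-repeated-failures.
# Threshold and actions are assumptions, not Slonk's implementation.

FAILURE_THRESHOLD = 3  # assumed: failures allowed before exclusion

class NodeDoctor:
    def __init__(self) -> None:
        self.failures: dict[str, int] = {}
        self.excluded: set[str] = set()
        self.log: list[str] = []  # stands in for the observability system

    def mark_faulty(self, node: str) -> None:
        self.failures[node] = self.failures.get(node, 0) + 1
        self.log.append(f"drain {node}")       # drain the Kubernetes node
        self.log.append(f"restart-vm {node}")  # restart VM at the provider
        if self.failures[node] >= FAILURE_THRESHOLD:
            # Repeated failures: pull the node from the SLURM pool so it
            # stops receiving jobs until a human investigates.
            self.excluded.add(node)

doctor = NodeDoctor()
for _ in range(3):
    doctor.mark_faulty("node007")
print(doctor.excluded)
```

Keeping the log of every drain and restart mirrors the article's point that faulty-node history feeds long-term reliability work, not just immediate recovery.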

Slonk provides consistent cluster management across different cloud environments. While managed SLURM setups often vary by operating system, drivers, or monitoring tools, Slonk delivers a uniform environment with the same CUDA stack and observability everywhere. This consistency allows GPU resources to shift dynamically between training and inference workloads simply by adjusting StatefulSet replicas, with Kubernetes enabling production workloads to preempt training when necessary.
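The replica-shifting idea can be modeled as a capacity-conserving transfer between two pools. The pool names and node counts below are illustrative; in practice this corresponds to resizing the two StatefulSets (for example, via two `kubectl scale` operations), which this sketch does not actually do.

```python
# Sketch of moving GPU capacity between training and inference by
# resizing StatefulSet replica counts. Names/numbers are illustrative.

TOTAL_NODES = 16  # assumed fixed pool of GPU machines

def shift(replicas: dict[str, int], src: str, dst: str, n: int) -> dict[str, int]:
    # Shrink one StatefulSet and grow the other by the same amount.
    if replicas[src] < n:
        raise ValueError("not enough replicas to shift")
    out = dict(replicas)
    out[src] -= n
    out[dst] += n
    assert sum(out.values()) == TOTAL_NODES  # capacity is conserved
    return out

pools = {"training-workers": 12, "inference-workers": 4}
print(shift(pools, "training-workers", "inference-workers", 4))
```

The invariant in the `assert` is the whole point: no hardware moves, only which scheduler (SLURM or the serving stack) owns each machine, with Kubernetes preemption handling the case where production demand spikes mid-shift.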

The result is a hybrid system that maintains the simplicity researchers need while providing the operational benefits infrastructure teams require. Researchers submit jobs using familiar SLURM commands while Kubernetes quietly handles node restarts, container health monitoring, and logging. SLURM manages job scheduling and quotas, while Kubernetes ensures operational stability, creating a system that remains simple, reliable, and flexible under varying workloads.

Character.ai emphasizes that Slonk represents a reference implementation rather than a fully supported open-source project. They encourage other organizations to fork, build upon, and adapt the architecture to their specific environments. The company notes they're actively hiring machine learning infrastructure engineers interested in the intersection of high-performance computing and cloud technologies, reflecting their belief that the best infrastructure is the kind researchers never have to think about.