Character.ai Reveals Hidden AI Training Tricks
March 23, 2026 · 3 min read
Character.ai, known for its conversational AI systems, has publicly detailed for the first time several optimization techniques developed during its early large-scale pretraining efforts. These techniques, created when the company was building its own foundation models, address fundamental challenges in training massive AI systems efficiently. While Character.ai has since shifted to working with open-source models, these innovations remain relevant for anyone training large neural networks, particularly in resource-constrained environments.
The techniques include Squinch, a 6-bit gradient compression algorithm invented by cofounder Noam Shazeer, which dramatically reduces communication bandwidth between computing nodes during distributed training. At the time of development, Character.ai's largest pretraining cluster operated with only one-quarter of the bandwidth of state-of-the-art systems. Squinch enabled efficient training under these constraints by compressing gradient information while maintaining the same model accuracy as training with higher-precision bfloat16 gradients.
Squinch works by block-wise quantizing gradients to 6 bits per element, with each block encoding eight gradient values into a compact 48-bit representation. Unlike general quantization schemes, Squinch's dynamic range is specifically tuned to transformer gradients, which tend to fall within a well-regularized distribution. This specialization allowed the algorithm to achieve lower communication costs with negligible loss in training fidelity when used on properly regularized transformer models.
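Character.ai has not published Squinch's implementation, and its transformer-tuned dynamic range is not public. A minimal sketch of the general idea — block-wise 6-bit quantization, where each block of eight values would pack into 48 bits — might look like this (the function names and the symmetric linear code are illustrative assumptions, and the final bit-packing step is omitted for clarity):

```python
import numpy as np

def quantize_block(block, levels=64):
    """Quantize a block of 8 gradient values to 6 bits each.
    Packing the eight 6-bit codes would yield the 48-bit block
    representation described in the article."""
    scale = float(np.max(np.abs(block))) or 1.0
    # Map [-scale, scale] linearly onto integer codes 0..63.
    codes = np.round((block / scale + 1.0) * (levels - 1) / 2).astype(np.uint8)
    return scale, codes

def dequantize_block(scale, codes, levels=64):
    """Recover approximate gradient values from 6-bit codes."""
    return (codes.astype(np.float32) * 2 / (levels - 1) - 1.0) * scale

grads = np.array([0.12, -0.05, 0.31, 0.0, -0.27, 0.08, 0.19, -0.14],
                 dtype=np.float32)
scale, codes = quantize_block(grads)
recovered = dequantize_block(scale, codes)
# Per-element reconstruction error is bounded by scale / 63.
```

In this linear scheme the worst-case error per element is half a quantization step, i.e. `scale / 63`; Squinch's reported advantage comes from tuning the code's dynamic range to the distribution of transformer gradients rather than using a generic mapping like this one.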
Another key technique, Attention Z-Reg, is a regularization term applied to attention logits to keep their numerical range well-behaved during training. It shifts logits so that the summed activation remains close to zero, allowing optimization to use the high-precision range of the bfloat16 representation. This matters because the numeric resolution of bfloat16 decreases at large magnitudes: the floating-point steps between 40 and 41 are far larger than those between 0 and 1, making precise optimization difficult at large values.
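Character.ai did not release the exact form of Attention Z-Reg. One common way to implement this kind of regularizer — assumed here, not confirmed by the source — is a "z-loss": penalize the squared log-partition (log-sum-exp) of each query's attention logits, which pushes the logits toward the small-magnitude region where bfloat16 is most precise:

```python
import numpy as np

def attention_z_loss(logits, coeff=1e-4):
    """Hypothetical z-style regularizer on attention logits.
    Penalizes the squared log-partition per query row, driving
    logit magnitudes toward zero without changing the softmax's
    relative preferences much (coeff is small)."""
    # Numerically stable log-sum-exp over the key dimension.
    m = logits.max(axis=-1, keepdims=True)
    z = (m + np.log(np.exp(logits - m).sum(axis=-1, keepdims=True))).squeeze(-1)
    return coeff * np.mean(z ** 2)

# Uniform zero logits give z = log(num_keys) per row.
loss = attention_z_loss(np.zeros((2, 4)))
```

Because the penalty is on the partition function rather than on individual logits, it discourages the whole row from drifting to large magnitudes while leaving the differences between logits (which determine the attention weights) essentially free.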
Dynamic Clamping addresses quantization errors during training by preventing small activation values from collapsing to zero. In quantization-aware training, when activation values get extremely small, most values get quantized to the same number, harming training stability and accuracy. Dynamic Clamping solves this by calculating clamping limits based on the standard deviation of input values rather than using constant limits, greatly reducing quantization errors and improving training stability.
The visibility mask technique provides a compact way to represent relationships between different parts of input data during attention computation. It uses two tensors—start and limit—that describe, for each token, the valid attention range during both training and inference. This allows efficient handling of complex data structures such as multiple independent documents packed into one sequence, tree-structured documents, or inference-time beam search with empty slots in paged attention.
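A minimal sketch of how per-token start/limit ranges could expand into a boolean attention mask (the expansion function and the half-open range convention are assumptions for illustration):

```python
import numpy as np

def visibility_mask(start, limit):
    """Expand per-token [start, limit) ranges into a boolean mask:
    token i may attend to key position j iff start[i] <= j < limit[i].
    Storing only the two range tensors is O(n) instead of the O(n^2)
    dense mask."""
    n = len(start)
    j = np.arange(n)
    return (j >= np.asarray(start)[:, None]) & (j < np.asarray(limit)[:, None])

# Two independent documents (lengths 3 and 2) packed into one
# sequence, causal within each document:
start = [0, 0, 0, 3, 3]
limit = [1, 2, 3, 4, 5]
mask = visibility_mask(start, limit)
```

Here the second document's tokens (positions 3–4) can never see the first document, and each token sees only itself and earlier tokens within its own document — the packed-documents case the article mentions. Tree-structured documents or paged-attention slots would just use different start/limit values.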
For model distillation, Character.ai developed a subsampling technique that reduces storage requirements while maintaining fidelity to teacher models. When storing teacher model outputs for offline distillation, the large vocabulary size makes storage expensive. Their method randomly samples a subset of the vocabulary using Gumbel top-k sampling while preserving the expected values of the soft targets, substantially cutting storage and bandwidth costs for offline distillation runs.
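The Gumbel top-k trick itself is standard: perturbing logits with independent Gumbel noise and keeping the k largest draws k distinct tokens without replacement, with probability proportional to the teacher distribution. A sketch under that assumption (the storage format and function name are illustrative; the importance-weighting needed to keep the soft targets unbiased is omitted):

```python
import numpy as np

def gumbel_topk_subsample(logits, k, rng):
    """Sample k distinct vocabulary entries via the Gumbel top-k
    trick and return their indices plus the teacher logits to store,
    instead of the full vocabulary-sized distribution."""
    # Standard Gumbel noise: -log(-log(U)), U ~ Uniform(0, 1).
    gumbel = -np.log(-np.log(rng.uniform(size=logits.shape)))
    keep = np.argpartition(logits + gumbel, -k)[-k:]
    return keep, logits[keep]

rng = np.random.default_rng(0)
vocab_logits = rng.normal(size=32000)   # stand-in for one position's teacher logits
idx, kept = gumbel_topk_subsample(vocab_logits, k=128, rng=rng)
# Store 128 (index, logit) pairs instead of 32,000 logits per position.
```

For a 32k-entry vocabulary and k=128, this cuts per-position storage by roughly 250x before any further compression, which is where the bandwidth savings the article cites would come from.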
These techniques evolved through practical challenges in scaling conversational model pretraining, with each optimization reflecting Character.ai's engineering philosophy that small, precise improvements compound into major efficiency gains at scale. While the company no longer does large-scale pretraining, it is applying these optimization capabilities to post-training reinforcement learning efforts on open-source models. The techniques demonstrate how targeted engineering solutions can overcome specific bottlenecks in AI training pipelines.