Character.ai's Kaiju LLM Family Reveals Radical Efficiency Architecture for Conversational AI
November 13, 2025 · 2 min read
Character.ai has pulled back the curtain on Kaiju, its proprietary family of large language models built specifically for high-performance conversational AI. The revelation comes as the company shifts toward open-source foundation models, showcasing years of internal research that prioritizes inference efficiency and engagement over traditional benchmark performance.
The Kaiju models represent a significant departure from conventional LLM design philosophy. Available in three sizes—Small (13B parameters), Medium (34B), and Large (110B)—these dense transformer architectures incorporate multiple cutting-edge optimization techniques including int8 quantization, multi-query attention, sliding-window attention, and cross-layer cache sharing. This architectural approach reflects Character.ai's focus on real-world deployment rather than academic metrics.
At the core of Kaiju's efficiency gains is multi-query attention (MQA), which shares a single key-value head across all query heads and dramatically shrinks the per-token KV cache. MQA is known to cost a few points on knowledge benchmarks such as MMLU, but Character.ai's engineering team determined the inference efficiency benefits outweighed the modest quality impact for conversational workloads. The models also implement sliding-window attention with a 1024-token window, interleaved with global attention layers in a 5:1 ratio to maintain long-context capabilities.
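The cache arithmetic is easy to make concrete. The sketch below compares standard multi-head attention against MQA with the article's 5:1 local-to-global layer ratio; the layer count, head count, and head dimension are illustrative assumptions, not published Kaiju configurations:

```python
def kv_cache_elems(ctx_len, n_layers, n_kv_heads, head_dim, window=None):
    """Elements held in the KV cache: a K and a V tensor for every layer.
    Sliding-window layers cap their cached length at `window` tokens."""
    cached_len = min(ctx_len, window) if window else ctx_len
    return 2 * n_layers * n_kv_heads * head_dim * cached_len

CTX, DIM = 8192, 128  # assumed context length and head dimension

# Standard multi-head attention: one KV head per query head (assume 32 heads).
mha = kv_cache_elems(CTX, n_layers=48, n_kv_heads=32, head_dim=DIM)

# MQA with a 5:1 local-to-global split: 40 sliding-window layers cache
# only 1024 tokens each, while 8 global layers keep the full context.
mqa = (kv_cache_elems(CTX, n_layers=40, n_kv_heads=1, head_dim=DIM, window=1024)
       + kv_cache_elems(CTX, n_layers=8, n_kv_heads=1, head_dim=DIM))

print(f"reduction: {mha / mqa:.0f}x")  # well over 100x under these assumptions
```

Under these toy numbers, sharing a single KV head and windowing most layers together shrink the cache by two orders of magnitude, which is why MQA's small benchmark cost was deemed acceptable.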
The focus on efficiency extends to training infrastructure: Kaiju models were trained entirely on H100 GPUs in Google Cloud Platform clusters using advanced model parallelism. Character.ai employed quantization-aware training, allowing the models to maintain bf16-level accuracy while training 20-30% faster, and developed novel techniques such as Squinch, a 6-bit gradient compression scheme, and virtual scalars to stabilize int8 training.
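Character.ai has not published the details of Squinch or its virtual scalars, but the core mechanism of quantization-aware training can be sketched generically: the forward pass sees only int8-representable weight values, while the optimizer keeps updating a high-precision copy. A minimal NumPy illustration, with made-up weights and a simple symmetric scaling scheme that is an assumption rather than Kaiju's actual recipe:

```python
import numpy as np

def fake_quant_int8(w):
    """Symmetric int8 'fake quantization': quantize to the int8 grid and
    immediately dequantize, so downstream computation sees int8-representable
    values while the master weights remain in high precision."""
    scale = np.abs(w).max() / 127.0       # map the largest weight to +/-127
    q = np.clip(np.round(w / scale), -127, 127)
    return q * scale

w = np.array([0.5, -1.27, 0.003])
wq = fake_quant_int8(w)                   # ~[0.5, -1.27, 0.0]
```

Values that fall below the quantization step (here, 0.003) collapse to zero, which is exactly the kind of precision loss the training loop learns to tolerate, so the exported int8 model matches the bf16 training accuracy.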
Safety and alignment receive significant attention in Kaiju's design. The models undergo a multi-phase safety process and include an optional classifier head that outputs token-level safety metrics. Character.ai implements classifier-guided beam search, using these safety assessments to influence token sampling during inference.
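The exact guidance scheme is not public, but the idea of folding token-level safety scores into decoding can be sketched as a simple log-linear reranking of candidate tokens before beams are extended. The `alpha` weight and all scores below are illustrative assumptions:

```python
def guided_scores(lm_logprobs, safety_logprobs, alpha=1.0):
    """Combine language-model log-probs with per-token safety log-probs,
    penalizing candidates the safety head flags before beam expansion."""
    return {tok: lp + alpha * safety_logprobs.get(tok, 0.0)
            for tok, lp in lm_logprobs.items()}

# The LM slightly prefers token "a", but the safety head flags it;
# guidance flips the choice to "b".
lm = {"a": -0.1, "b": -0.5}
safety = {"a": -2.0, "b": 0.0}
scores = guided_scores(lm, safety)
best = max(scores, key=scores.get)  # "b"
```

Setting `alpha=0` recovers ordinary likelihood-driven decoding, so the weight acts as a tunable dial between fluency and safety.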
Character.ai's technical approach demonstrates that production performance requirements can drive fundamental architectural choices. The combination of int8 QAT, MQA, and KV cache sharing collectively reduces inference memory and computational costs by orders of magnitude, enabling the large-scale deployment necessary for Character.ai's conversational platform.
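As a back-of-the-envelope check on that claim, the individual reductions multiply. The head count, cache-sharing group size, and byte widths below are illustrative assumptions, not published Kaiju figures:

```python
def kv_reduction_factor(n_query_heads=32, share_group=3,
                        bytes_baseline=2, bytes_quantized=1):
    """Multiplicative KV-cache reduction from stacking three techniques:
    MQA (one KV head instead of one per query head), cross-layer cache
    sharing (`share_group` adjacent layers reuse one cache), and int8
    storage (half the bytes of a bf16 baseline)."""
    return n_query_heads * share_group * (bytes_baseline / bytes_quantized)

print(kv_reduction_factor())  # 32 * 3 * 2 = 192.0
```

Even with these modest assumptions the stack yields a roughly 200x smaller cache, consistent with the "orders of magnitude" framing, before sliding-window attention shrinks it further.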
As Character.ai transitions toward open-source LLMs, the Kaiju research provides valuable insights into optimizing models for specific use cases rather than general benchmarks. The company continues to recruit engineers and researchers focused on advancing large-scale, human-centered machine learning systems.