OpenAI Safety Models Now Let You Define Custom Rules

April 20, 20262 min read

TL;DR

The gpt-oss-safeguard models use reasoning to apply your own policies, giving more flexibility at a higher compute cost.

OpenAI has released a research preview of gpt-oss-safeguard, open-weight reasoning models designed for safety classification tasks. Available in two sizes—120 billion and 20 billion parameters—these models are fine-tuned versions of the company's existing gpt-oss open models and can be downloaded from Hugging Face under the Apache 2.0 license. This move allows developers to freely use, modify, and deploy the technology without restrictions.

The models employ reasoning to directly interpret developer-provided policies during inference, classifying user messages, completions, and chats based on custom needs. Developers can review the chain-of-thought reasoning to understand how decisions are made, and since policies are applied at runtime rather than baked into training, it's easier to iterate and refine them. This approach, initially developed for internal use, offers greater flexibility than traditional classifiers that infer rules from large labeled datasets.

For example, a gaming forum could use gpt-oss-safeguard to flag posts about cheating, while a review site might screen for fake content. The model takes both a policy and the content to classify, outputting a conclusion with reasoning, which developers can integrate into their safety pipelines. OpenAI notes this reasoning-based method excels in scenarios requiring nuanced policy application, though it involves higher computational demands.

The release follows OpenAI's internal tool called Safety Reasoner, which enables dynamic policy updates in production faster than retraining classifiers. In recent launches, safety reasoning has consumed up to 16% of total compute, highlighting its role in iterative deployment. For systems like GPT-5 and image generators, it provides real-time evaluations to block unsafe outputs, forming part of a multi-layered defense strategy.

Evaluations show gpt-oss-safeguard models outperformed gpt-5-thinking and other open models on multi-policy accuracy tasks, a surprising result given their smaller size. On external benchmarks like ToxicChat, they held their own but were slightly edged out by internal tools. Industry analysts suggest this performance makes them suitable for tasks where customizability trumps raw speed.

However, the models have limitations: dedicated classifiers trained on vast datasets can achieve higher accuracy for complex risks, and the reasoning process is resource-intensive, posing scaling challenges. OpenAI mitigates this internally by using faster classifiers for initial screening and asynchronous processing to maintain low latency while ensuring safety.

Built with community input from partners like ROOST, SafetyKit, and Discord, gpt-oss-safeguard represents OpenAI's first open safety models. ROOST's CTO emphasized the 'bring your own policies' design, enabling organizations to innovate freely. A new model community launched today aims to foster collaboration on open AI safety tools, with further iterations planned based on feedback.