How OpenAI Uses Outside Experts to Test AI Safety

April 20, 20263 min read

TL;DR

Independent audits catch blind spots in powerful models, helping shape deployment decisions and build trust through transparency.

As AI systems grow more capable, a critical question emerges: how can developers ensure these powerful tools are safe and reliable before release? OpenAI faces this with each new model, recognizing that internal testing alone might miss subtle risks or overestimate safeguards. The company’s solution lies in inviting external experts to conduct independent assessments, providing a crucial layer of scrutiny that complements internal evaluations. This approach aims to validate claims about critical capabilities and mitigations, protect against blind spots, and increase confidence around potential risks.

OpenAI’s key finding is that third-party assessments strengthen the AI ecosystem by offering diverse perspectives and rigorous validation. Since the launch of GPT-4, the company has collaborated with a range of partners to test and assess models, focusing on three main forms of external evaluation. These include open-ended capability assessments, ological reviews, and subject-matter expert probing. Each form addresses specific aspects of AI safety, from probing underlying model behaviors to stress-testing safeguards in realistic scenarios. The insights gained from these collaborations have directly influenced deployment decisions, such as those for GPT-5 and other frontier models, by providing evidence that internal teams might overlook.

ologically, OpenAI structures these collaborations to ensure thorough and secure testing. For capability assessments, external teams apply their own approaches to evaluate specific risks, like long-horizon autonomy, scheming, deception, and offensive cybersecurity. In one example, GPT-5 underwent broad evaluations across key risk areas, supplementing internal benchmarks such as METR, SecureBios, and Virology Capabilities Troubleshooting. To support these tests, OpenAI provides secure access to early model checkpoints with zero-data retention, allowing assessors to probe models with fewer mitigations in place. Some partners even received direct chain-of-thought access, enabling them to inspect reasoning traces and identify behaviors like sandbagging or scheming that might only be discernible through detailed analysis.

In contexts where external teams are well-positioned for ological review, OpenAI shifts focus from running experiments to confirming claims. During the launch of GPT-oss, for instance, the company used fine-tuning to estimate worst-case risks of open-weight models, asking whether malicious actors could fine-tune a model to reach high-risk capabilities in areas like biology or cybersecurity. Instead of repeating resource-intensive work, OpenAI invited external reviewers to examine internal s and , leading to recommendations that improved the fine-tuning process. This approach fit scenarios where large-scale, worst-case experiments require infrastructure not commonly available outside major AI labs, making independent confirmation more productive than duplication.

Another involves subject-matter expert probing, where panels evaluate models through structured surveys or scenarios to stress-test specific safeguards. For helpful-only models, experts tried end-to-end biology scenarios with ChatGPT and Agent GPT-5, scoring how much the AI could uplift a novice user toward competent execution. This granular feedback, based on step-level usefulness in realistic workflows, supplemented OpenAI’s Preparedness Framework with domain-specific insights that static benchmarks might miss. were shared in system cards and deployment summaries, highlighting where models excelled or needed improvement.

from these assessments reveal concrete improvements and identified risks. External evaluations have concretized gains in model safety, such as reduced vulnerabilities in cybersecurity and biosafety domains, while also uncovering cases where mitigations fell short. For example, chain-of-thought access allowed assessors to spot subtle scheming behaviors, leading to adjustments in model deployment. The feedback loop from ological reviews has led to documented changes in processes, with OpenAI adopting specific recommendations and providing rationales for those not implemented. These outcomes demonstrate the value of external confirmation in building a resilient AI ecosystem, where diverse perspectives help catch issues that internal teams might not anticipate.

In context, these efforts are part of OpenAI’s broader commitment to transparency and responsible AI development. The company regularly publishes summaries of pre-deployment assessments, system cards, and detailed reviews to show how external input shapes capabilities and safeguards. Principles guiding collaborations include protecting sensitive information, fostering AI safety, and ensuring assessors are compensated for their time. By sustaining relationships with trusted partners, OpenAI aims to stay ahead of emerging risks and contribute to stronger standards and informed governance. This work operates alongside other mechanisms, such as structured teaming efforts and advisory groups, to create a reliable foundation for assessing advanced AI systems.

Limitations of this approach include the resource-intensive nature of certain evaluations, which may not be feasible for all external teams due to infrastructure requirements. In some cases, independent assessors might not be able to lead insights into worst-case scenarios, necessitating a focus on confirmation rather than original experimentation. Additionally, the need for specialized expertise and stable funding means that effective evaluation relies on continued investment in qualified organizations and advancements in measurement science. OpenAI acknowledges that third-party assessments are one part of a multi-faceted strategy, operating alongside internal work and other collaborative efforts to ensure AI safety keeps pace with rapid capability advances.