Microsoft Launches Three AI Models, Taking Aim at OpenAI and Google

April 03, 2026 · 3 min read

Microsoft fired its most significant shot yet in the foundational AI model race on Wednesday, announcing three new models developed entirely in-house by its Microsoft AI research division. The release of MAI-Transcribe-1, MAI-Voice-1, and MAI-Image-2 represents a decisive shift for a company that has, until now, been best known in the AI space as OpenAI's largest investor and cloud partner rather than a frontline model builder.

The flagship release is MAI-Transcribe-1, a speech-to-text model that Microsoft says achieves the lowest average Word Error Rate — 3.8 percent — on the widely used FLEURS multilingual benchmark across the top 25 languages by Microsoft product usage. According to the company's internal testing, the model ranks first in 11 core languages and outperforms OpenAI's Whisper-large-v3 on the remaining 14. It also beats Google's Gemini 3.1 Flash on 11 of those 14 languages. Batch transcription runs 2.5 times faster than Microsoft's own Azure Fast offering, and pricing starts at $0.36 per hour — undercutting established competitors.

MAI-Voice-1, the company's new text-to-speech model, can generate 60 seconds of natural-sounding audio in a single second, with what Microsoft describes as emotional range, nuanced expression, and speaker identity preservation across long-form content. The model supports custom voice creation from just a few seconds of reference audio through Microsoft Foundry, priced at $22 per million characters. The combination of speed, quality, and low-friction voice cloning positions it as a direct challenger to offerings from ElevenLabs and Google.

Rounding out the trio is MAI-Image-2, which debuted as a top-three model family on the Arena.ai leaderboard and delivers at least twice the generation speed of its predecessor. Rob Reilly, Global Chief Creative Officer at advertising giant WPP, called the model "a genuine game-changer" that "responds to creative direction and generates campaign-ready images." The model is already deployed across Microsoft's consumer products, including Copilot, Bing, and PowerPoint, with pricing set at $5 per million tokens for text input and $33 per million tokens for image output.

All three models are available immediately through Microsoft Foundry, the company's enterprise AI platform, as well as a new MAI Playground currently limited to U.S. users. The aggressive pricing across the board signals that Microsoft intends to compete not just on capability but on cost — a strategy that could pressure margins across the industry and accelerate enterprise adoption.

The timing is notable. The MAI division, led by CEO Mustafa Suleyman, formed its Superintelligence research team just five months ago, in November 2025. The rapid pace from team formation to shipping production-grade models suggests Microsoft has been building foundational model expertise for longer than the organizational timeline implies, likely drawing on years of research investment and infrastructure development across Azure and its broader AI research labs.

The strategic implications extend well beyond benchmark numbers. For years, Microsoft's AI narrative centered on its multibillion-dollar partnership with OpenAI, which gave the Redmond giant exclusive cloud hosting rights and deep integration into products like Copilot. By launching competitive — and in some cases superior — in-house models, Microsoft is hedging against its dependence on a single external partner while building optionality into its AI stack. For OpenAI and Google, the message is unmistakable: Microsoft now intends to compete at every layer of the AI value chain, from infrastructure to foundational models to end-user applications.