Microsoft releases three AI models for transcription, speech, and image generation
Microsoft has announced three new artificial intelligence models under its Microsoft AI (MAI) family, strengthening its push into multimodal AI capabilities for developers. The newly launched models are MAI-Transcribe-1, MAI-Voice-1, and MAI-Image-2, which handle transcription, speech generation, and image creation, respectively.
These models are now available through Microsoft Foundry and the MAI Playground. Foundry, previously known as Azure AI Studio, is the company’s unified platform for building, customizing, and scaling generative AI applications and agents. The MAI Playground serves as a public testing environment where users can experiment with features and provide feedback.
According to Microsoft AI CEO Mustafa Suleyman, the models were developed with a strong focus on safety and responsible AI practices. He said they have undergone rigorous testing and “red-teaming,” and that Foundry includes built-in guardrails, governance tools, and enterprise-grade controls to support secure and compliant deployment at scale.
MAI-Transcribe-1 is a speech-to-text model that supports transcription in 25 widely used languages, including Hindi. Microsoft claims the model achieves lower word error rates than competing systems such as Gemini and GPT-Transcribe. It can process batch transcriptions up to 2.5 times faster than the company’s earlier Azure Fast offering, with pricing starting at $0.36 per hour of audio.
MAI-Voice-1 enables developers to create custom voices using just a few seconds of input audio. The model is capable of generating up to 60 seconds of audio in a single second, making it suitable for real-time applications. Pricing begins at $22 per one million characters.
Meanwhile, MAI-Image-2, Microsoft’s latest image generation model, is now widely available after initially being introduced in the MAI Playground. The company says it delivers at least twice the generation speed of earlier versions while maintaining output quality. Pricing starts at $5 per one million text tokens and $33 per one million image tokens.
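For developers budgeting against these list prices, the arithmetic can be sketched in a few lines of Python. This is a rough illustration only: the function and constant names are made up for this example, the figures are the starting prices quoted above, and actual billing (tiers, rounding, minimum charges) may differ.

```python
# Rough cost estimator based on the starting prices quoted in this article.
# All names here are hypothetical; real billing rules may differ.

TRANSCRIBE_USD_PER_AUDIO_HOUR = 0.36        # MAI-Transcribe-1
VOICE_USD_PER_MILLION_CHARS = 22.0          # MAI-Voice-1
IMAGE_USD_PER_MILLION_TEXT_TOKENS = 5.0     # MAI-Image-2, text tokens
IMAGE_USD_PER_MILLION_IMAGE_TOKENS = 33.0   # MAI-Image-2, image tokens
VOICE_AUDIO_SECONDS_PER_SECOND = 60.0       # claimed peak generation rate


def transcription_cost(audio_hours: float) -> float:
    """Estimated USD cost of transcribing the given hours of audio."""
    return audio_hours * TRANSCRIBE_USD_PER_AUDIO_HOUR


def voice_cost(characters: int) -> float:
    """Estimated USD cost of synthesizing the given number of characters."""
    return characters / 1_000_000 * VOICE_USD_PER_MILLION_CHARS


def image_cost(text_tokens: int, image_tokens: int) -> float:
    """Estimated USD cost for the given text and image token counts."""
    return (
        text_tokens / 1_000_000 * IMAGE_USD_PER_MILLION_TEXT_TOKENS
        + image_tokens / 1_000_000 * IMAGE_USD_PER_MILLION_IMAGE_TOKENS
    )


def voice_best_case_seconds(audio_seconds: float) -> float:
    """Best-case generation time if the claimed peak rate is sustained."""
    return audio_seconds / VOICE_AUDIO_SECONDS_PER_SECOND


print(f"100 hours of transcription: ${transcription_cost(100):.2f}")
print(f"500,000 characters of speech: ${voice_cost(500_000):.2f}")
print(f"2M text + 1M image tokens: ${image_cost(2_000_000, 1_000_000):.2f}")
print(f"10 minutes of audio: {voice_best_case_seconds(600):.1f} s to generate")
```

At these rates, 100 hours of transcription comes to about $36, half a million characters of synthesized speech to $11, and a workload of two million text tokens plus one million image tokens to $43. The generation-time figure is a lower bound, since "up to 60 seconds of audio in a single second" describes peak throughput rather than a guaranteed rate.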
Microsoft is also integrating these models into its existing ecosystem, including products like Copilot, Bing, and PowerPoint. With enterprise adoption already underway, the launch highlights Microsoft’s continued investment in expanding AI capabilities across multiple formats, including text, voice, and images.