Mistral's open-weight text-to-speech model that matches or beats ElevenLabs on naturalness at a fraction of the cost
Voxtral TTS is Mistral AI's first text-to-speech model, released March 26, 2026. It is a 4B-parameter open-weight model that generates human-quality speech from text in nine languages. The architecture splits into three components: a 3.4B transformer decoder backbone, a 390M acoustic transformer, and a 300M neural audio codec.

What makes Voxtral stand out is the combination of quality and accessibility. In human evaluations it produces more natural-sounding speech than ElevenLabs Flash v2.5 and matches ElevenLabs v3, while the weights are freely available on Hugging Face. You can run it on a consumer GPU or laptop, or use Mistral's API at $0.016 per 1,000 characters.

Voice cloning requires just 3 seconds of reference audio. The model captures accent, inflection, intonation, and even speech disfluencies from that short sample, then applies them in any of the nine supported languages. Zero-shot cross-lingual adaptation means you can clone an English voice and have it speak fluent French with the original speaker's characteristics preserved.

Latency is 70 ms for typical inputs (500 characters generating a 10-second clip), with a real-time factor of approximately 9.7x. The model natively generates up to 2 minutes of audio per request, and the API handles arbitrarily long content through smart interleaving.

The open-weight release under CC BY-NC 4.0 means researchers and hobbyists can run the model locally, fine-tune it, and integrate it into non-commercial projects. For commercial use, the API is the intended path; its price of $0.016 per 1,000 characters is roughly 10x lower than ElevenLabs' standard pricing tier.

For developers building voice applications, accessibility tools, or content creation pipelines, Voxtral is the first serious open-weight alternative to proprietary TTS services, and at this quality its cost is hard to match in the current TTS market.
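
The pricing, throughput, and parameter figures quoted above can be turned into quick estimates. The following is a minimal sketch that uses only those numbers; the function names are hypothetical, it does not call any Mistral API, and it assumes the real-time factor means seconds of audio produced per second of generation time.

```python
# Back-of-the-envelope estimates from the figures quoted in this article.
# All helper names are illustrative; nothing here talks to a real service.

API_PRICE_PER_1K_CHARS_USD = 0.016  # API price quoted in the article
REAL_TIME_FACTOR = 9.7              # approximate; assumed to be audio seconds per compute second

# Component sizes as quoted in the article (parameters).
COMPONENT_PARAMS = {
    "transformer decoder backbone": 3.4e9,
    "acoustic transformer": 390e6,
    "neural audio codec": 300e6,
}


def api_cost_usd(num_chars: int) -> float:
    """Estimated API cost in USD for synthesizing `num_chars` characters."""
    return num_chars / 1000 * API_PRICE_PER_1K_CHARS_USD


def generation_seconds(audio_seconds: float, rtf: float = REAL_TIME_FACTOR) -> float:
    """Estimated wall-clock time to generate a clip, under the RTF assumption above."""
    return audio_seconds / rtf


def total_params() -> float:
    """Sum of the three component sizes."""
    return sum(COMPONENT_PARAMS.values())


if __name__ == "__main__":
    print(f"500 characters: ${api_cost_usd(500):.4f}")
    print(f"1,000,000 characters: ${api_cost_usd(1_000_000):.2f}")
    print(f"10 s clip: about {generation_seconds(10):.2f} s of generation")
    print(f"Total parameters: {total_params() / 1e9:.2f}B")
```

The component sizes sum to slightly over 4B, which is consistent with the rounded "4B" headline figure. The 70 ms latency is not modeled here, since the article does not say whether it measures time to first audio or some other interval.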