VibeVoice
Open-source voice AI that generates 90-minute multi-speaker podcasts from text
About
VibeVoice is Microsoft's open-source family of voice AI models. Its headline capability is generating up to 90 minutes of natural, multi-speaker conversational audio from plain text, with up to four distinct speakers, natural turn-taking, and expressive intonation, all running locally on consumer hardware.
The architecture is unusual for a TTS system. VibeVoice uses continuous speech tokenizers operating at an ultra-low 7.5 Hz frame rate, so each second of audio takes far fewer tokens than it would with a higher-rate tokenizer. A next-token diffusion framework combines an LLM backbone (based on Qwen2.5 1.5B) with a diffusion head that generates acoustic detail. The result sounds remarkably natural for a 1.5B-parameter model.
Three model variants cover different use cases. VibeVoice-TTS (1.5B) handles long-form multi-speaker synthesis up to 90 minutes. VibeVoice-Realtime (0.5B) delivers streaming TTS with approximately 300 ms first-audible latency and 20+ voices across 9+ languages. VibeVoice-ASR (7B) does the reverse: it transcribes up to 60 minutes of audio in a single pass, with speaker diarization, timestamps, and hotword support, across 50+ languages.
The model weights are available on Hugging Face under the MIT license. However, Microsoft explicitly recommends research and development use only, and commercial deployment requires additional validation. The TTS inference code is currently disabled as a responsible use measure, though the Realtime and ASR variants have full inference available via Gradio playgrounds and Colab notebooks.
VibeVoice competes with ElevenLabs and Voxtral TTS, with one critical difference: it is fully open-source and runs offline. No API costs. No per-character billing. No data leaving your machine. For researchers, indie developers, and teams prototyping podcast or audiobook workflows, that removes per-use fees from the cost equation, leaving only the cost of your own hardware.
Since its release, VibeVoice has accumulated over 27,000 GitHub stars. The ASR variant is seeing rapid community adoption, with the Vibing voice input method built directly on top of it. Integration with Hugging Face Transformers and vLLM further lowers the barrier to getting started.
Related: See our coverage of best AI text-to-speech tools and the VibeVoice repository for setup instructions.
Key Features
- 90-minute multi-speaker conversational audio generation with up to 4 distinct speakers
- Ultra-low 7.5 Hz frame rate for efficient speech tokenization
- Realtime variant with ~300ms first-audible latency for streaming applications
- ASR model transcribes 60 minutes of audio in a single pass with speaker diarization
- 50+ language support for speech recognition, 9+ for realtime TTS
- Runs offline on consumer hardware — no API costs or data leaving your machine
- Hugging Face Transformers and vLLM integration for optimized inference
- Hotword customization for domain-specific transcription accuracy
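To make the 7.5 Hz figure concrete, the short sketch below works out how many tokenizer frames a given duration needs. It is illustrative arithmetic only, not VibeVoice code: the function name `frames_for` is made up for this example, and the 50 Hz comparison rate is a hypothetical value chosen for contrast, not a measurement of any particular system.

```python
# Illustrative arithmetic only; not part of the VibeVoice codebase.
FRAME_RATE_HZ = 7.5        # tokenizer frame rate quoted in this listing
MAX_TTS_MINUTES = 90       # longest single generation quoted in this listing
HYPOTHETICAL_RATE_HZ = 50  # assumed comparison rate, chosen only for contrast


def frames_for(minutes: float, frame_rate_hz: float = FRAME_RATE_HZ) -> int:
    """Return the number of tokenizer frames needed to cover `minutes` of audio."""
    return round(minutes * 60 * frame_rate_hz)


if __name__ == "__main__":
    low = frames_for(MAX_TTS_MINUTES)
    high = frames_for(MAX_TTS_MINUTES, HYPOTHETICAL_RATE_HZ)
    print(f"90 min at 7.5 Hz: {low} frames")
    print(f"90 min at a hypothetical 50 Hz: {high} frames")
```

At 7.5 Hz, a full 90-minute generation comes to 40,500 frames, which is why a low frame rate matters so much for long-form audio: the sequence the LLM backbone has to handle grows linearly with both duration and frame rate.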
Use Cases
- Generating podcast episodes from text scripts without recording equipment
- Creating audiobook narration with multiple character voices
- Building real-time voice interfaces for applications and chatbots
- Transcribing long meetings with speaker identification and timestamps
- E-learning content narration with natural conversational delivery
- Prototyping voice AI products without API subscription costs
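For the podcast and audiobook use cases, the input is a text script with one line per speaker turn. The sketch below shows one way to assemble such a script and enforce the four-speaker limit before handing it to a model. The `Speaker N:` label format and the `build_script` helper are assumptions made for illustration; check the VibeVoice repository for the exact script format it expects.

```python
from typing import Iterable, Tuple

MAX_SPEAKERS = 4  # speaker limit quoted in this listing


def build_script(turns: Iterable[Tuple[int, str]]) -> str:
    """Render (speaker_number, text) turns as 'Speaker N: text' lines.

    The 'Speaker N:' label format is an assumption for illustration,
    not a confirmed VibeVoice input specification.
    """
    lines = []
    speakers = set()
    for speaker, text in turns:
        if speaker < 1:
            raise ValueError(f"speaker numbers start at 1, got {speaker}")
        # Collapse runs of whitespace and skip empty turns.
        cleaned = " ".join(text.split())
        if not cleaned:
            continue
        speakers.add(speaker)
        if len(speakers) > MAX_SPEAKERS:
            raise ValueError(f"script uses more than {MAX_SPEAKERS} speakers")
        lines.append(f"Speaker {speaker}: {cleaned}")
    return "\n".join(lines)


if __name__ == "__main__":
    script = build_script([
        (1, "Welcome back to the show."),
        (2, "Thanks for having me."),
        (1, "Let's get started."),
    ])
    print(script)
```

Validating the speaker count up front is cheaper than discovering the problem partway through a long generation run.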
Pros
- Completely free and open-source under MIT license — no per-character billing
- 90-minute generation far exceeds most TTS tools' duration limits
- Three specialized variants (TTS, Realtime, ASR) cover the full speech pipeline
- Runs locally with no data leaving your machine — strong privacy story
- 27K+ GitHub stars and active community adoption signal a healthy ecosystem for research and prototyping
Cons
- TTS inference code currently disabled by Microsoft as a responsible use measure
- Explicitly not recommended for commercial deployment without additional validation
- The 1.5B model needs a capable GPU and is not practical on low-end laptops
- English and Chinese are primary languages; other language quality varies
- No hosted API — you must self-host and manage infrastructure
Details
- Category: audio
- Pricing: Free (Open Source, M