VibeVoice

Open-source voice AI that generates 90-minute multi-speaker podcasts from text

audio | Free (Open Source, M | text-to-speech | voice AI | open source | Microsoft | podcast generation | speech recognition | TTS

About

VibeVoice is Microsoft's open-source family of voice AI models for text-to-speech and speech recognition. The headline capability is generating up to 90 minutes of natural, multi-speaker conversational audio from plain text, with up to four distinct speakers, natural turn-taking, and expressive intonation, all running locally on consumer hardware.

The architecture is unusual. VibeVoice uses continuous speech tokenizers operating at a very low frame rate of 7.5 Hz, which makes tokenization far more efficient than in competing TTS systems. A next-token diffusion framework combines an LLM backbone (based on Qwen2.5 1.5B) with a diffusion head that generates acoustic detail. The result sounds remarkably natural for a 1.5B-parameter model.

Three model variants cover different use cases. VibeVoice-TTS (1.5B) handles long-form multi-speaker synthesis up to 90 minutes. VibeVoice-Realtime (0.5B) delivers streaming TTS with approximately 300 ms first-audible latency and 20+ voices across 9+ languages. VibeVoice-ASR (7B) does the reverse: it transcribes up to 60 minutes of audio in a single pass, with speaker diarization, timestamps, and hotword support, across 50+ languages.

The model weights are available on Hugging Face under the MIT license. However, Microsoft explicitly recommends research and development use only; commercial deployment requires additional validation. The TTS inference code is currently disabled as a responsible-use measure, though the Realtime and ASR variants have full inference available via Gradio playgrounds and Colab notebooks.

VibeVoice competes with ElevenLabs and Voxtral TTS, with one critical difference: it is fully open-source and runs offline, so there are no API costs, no per-character billing, and no data leaving your machine. For researchers, indie developers, and teams prototyping podcast or audiobook workflows, this changes the cost equation substantially.

Since its release, VibeVoice has accumulated over 27,000 GitHub stars. The ASR variant is seeing rapid community adoption, with the Vibing voice input method built directly on top of it. Integration with Hugging Face Transformers and vLLM further lowers the barrier to getting started.

Related: See our coverage of best AI text-to-speech tools and the VibeVoice repository for setup instructions.
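
To get a feel for why a 7.5 Hz frame rate matters for long-form audio, the arithmetic can be sketched in a few lines of Python. This is only a back-of-the-envelope illustration: the 50 Hz baseline is a hypothetical comparison value chosen for this example, not a figure from VibeVoice or any specific competing system, and `frames_for` is an illustrative helper, not part of any VibeVoice API.

```python
# Back-of-the-envelope frame counts for long-form audio.
VIBEVOICE_FRAME_RATE_HZ = 7.5   # frame rate quoted in the description
BASELINE_FRAME_RATE_HZ = 50.0   # ASSUMPTION: hypothetical higher-rate tokenizer, for comparison only


def frames_for(minutes: float, frame_rate_hz: float) -> int:
    """Number of tokenizer frames needed to cover `minutes` of audio."""
    return round(minutes * 60 * frame_rate_hz)


if __name__ == "__main__":
    for minutes in (1, 60, 90):
        low = frames_for(minutes, VIBEVOICE_FRAME_RATE_HZ)
        high = frames_for(minutes, BASELINE_FRAME_RATE_HZ)
        print(f"{minutes:>3} min: {low:>7,} frames at 7.5 Hz vs {high:>7,} frames at 50 Hz")
```

Under these assumptions, a 90-minute episode needs 40,500 frames at 7.5 Hz, which is the kind of sequence length an LLM backbone can plausibly handle in one context.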

Key Features

90-minute multi-speaker conversational audio generation with up to 4 distinct speakers
Ultra-low 7.5 Hz frame rate for efficient speech tokenization
Realtime variant with ~300ms first-audible latency for streaming applications
ASR model transcribes 60 minutes of audio in a single pass with speaker diarization
50+ language support for speech recognition, 9+ for realtime TTS
Runs offline on consumer hardware — no API costs or data leaving your machine
Hugging Face Transformers and vLLM integration for optimized inference
Hotword customization for domain-specific transcription accuracy
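
For recordings longer than the 60-minute single-pass limit, one simple option is to split the audio into equal segments before transcription. The sketch below only plans the segment boundaries; `plan_passes` is a hypothetical helper written for this example, not part of VibeVoice, and it does not load audio or call any model.

```python
import math

MAX_SINGLE_PASS_MINUTES = 60  # single-pass limit quoted in the feature list


def plan_passes(total_minutes: float,
                limit: float = MAX_SINGLE_PASS_MINUTES) -> list[tuple[float, float]]:
    """Split a recording into equal (start, end) minute ranges, each no longer than `limit`."""
    if total_minutes <= 0:
        raise ValueError("total_minutes must be positive")
    n_passes = math.ceil(total_minutes / limit)
    step = total_minutes / n_passes
    return [(i * step, (i + 1) * step) for i in range(n_passes)]


if __name__ == "__main__":
    # A 150-minute meeting needs three passes of 50 minutes each.
    for start, end in plan_passes(150):
        print(f"transcribe minutes {start:.0f} to {end:.0f}")
```

Note that naive equal splits can cut through a sentence, and speaker labels from separate passes are not guaranteed to line up, so a real pipeline would likely split at pauses and reconcile speakers afterwards.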

Use Cases

1. Generating podcast episodes from text scripts without recording equipment
2. Creating audiobook narration with multiple character voices
3. Building real-time voice interfaces for applications and chatbots
4. Transcribing long meetings with speaker identification and timestamps
5. E-learning content narration with natural conversational delivery
6. Prototyping voice AI products without API subscription costs
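
For the podcast and audiobook cases, the input is a multi-speaker text script. A minimal sketch of preparing one might look like the following. The `Speaker N: text` line format is an assumption made for illustration (check the VibeVoice repository for the exact format it expects), and `format_script` is a hypothetical helper; only the four-speaker limit comes from the feature list.

```python
MAX_SPEAKERS = 4  # speaker limit stated in the feature list


def format_script(turns: list[tuple[str, str]]) -> str:
    """Render (speaker_name, text) turns as 'Speaker N: text' lines.

    ASSUMPTION: the 'Speaker N:' label format is illustrative, not a confirmed input spec.
    Speakers are numbered in order of first appearance.
    """
    speaker_ids: dict[str, int] = {}
    lines = []
    for name, text in turns:
        text = text.strip()
        if not text:
            continue  # skip empty turns
        if name not in speaker_ids:
            if len(speaker_ids) == MAX_SPEAKERS:
                raise ValueError(f"more than {MAX_SPEAKERS} speakers: {name!r}")
            speaker_ids[name] = len(speaker_ids) + 1
        lines.append(f"Speaker {speaker_ids[name]}: {text}")
    return "\n".join(lines)


if __name__ == "__main__":
    print(format_script([
        ("Ana", "Welcome back to the show."),
        ("Ben", "Thanks for having me."),
        ("Ana", "Let's get started."),
    ]))
```

Validating the speaker count up front avoids discovering the limit only after a long generation run has started.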

Pros

Completely free and open-source under MIT license — no per-character billing
90-minute generation far exceeds most TTS tools' duration limits
Three specialized variants (TTS, Realtime, ASR) cover the full speech pipeline
Runs locally with no data leaving your machine — strong privacy story
27K+ GitHub stars and active community adoption suggest a healthy ecosystem for research use

Cons

TTS inference code currently disabled by Microsoft as a responsible use measure
Explicitly not recommended for commercial deployment without additional validation
1.5B model requires a reasonably capable GPU, so it is not practical on low-end laptops
English and Chinese are primary languages; other language quality varies
No hosted API — you must self-host and manage infrastructure

Details

Category: audio
Pricing: Free (Open Source, M
