Mistral: Voxtral TTS, Forge, Leanstral, & what's next for Mistral 4 — w/ Pavan Kumar Reddy & Guillaume Lample

March 30, 2026

AI Summary

Mistral researchers Pavan Kumar Reddy and Guillaume Lample discuss the release of Voxtral TTS, their first speech generation model, alongside updates on Forge for custom deployments, specialized models like Mistral Small and Leanstral, and ongoing work in audio and reasoning.

Voxtral TTS: Core Architecture and Efficiency

Voxtral TTS is a 3.4B parameter model based on Ministral, supporting speech generation in nine languages. It matches top TTS models in quality while using a fraction of the compute cost, enabling fast inference suitable for real-time voice agents. The architecture uses an in-house neural audio codec that converts 80ms audio frames into latents at 12.5 Hz: one semantic token and multiple acoustic tokens per frame.
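The frame-rate figure follows directly from the frame length: 80 ms per frame gives 1000 / 80 = 12.5 frames per second. The sketch below works through that arithmetic. The function names are illustrative, and `acoustic_per_frame` is a hypothetical parameter, since the summary only says "multiple" acoustic tokens per frame.

```python
import math

FRAME_MS = 80  # frame duration stated in the summary


def frames_per_second(frame_ms: int = FRAME_MS) -> float:
    """Latent frame rate implied by a given frame duration in milliseconds."""
    return 1000 / frame_ms


def num_frames(duration_s: float, frame_ms: int = FRAME_MS) -> int:
    """Number of frames needed to cover duration_s seconds, rounded up."""
    return math.ceil(duration_s * 1000 / frame_ms)


def tokens_per_second(acoustic_per_frame: int, frame_ms: int = FRAME_MS) -> float:
    """Token rate for one semantic token plus acoustic_per_frame acoustic tokens
    per frame. acoustic_per_frame is an assumed value, not a published figure."""
    return frames_per_second(frame_ms) * (1 + acoustic_per_frame)


if __name__ == "__main__":
    print(frames_per_second())   # frame rate for 80 ms frames
    print(num_frames(10))        # frames covering 10 seconds of audio
    print(tokens_per_second(8))  # token rate under an assumed 8 acoustic tokens
```

A low frame rate like this keeps the autoregressive sequence short, which is one plausible reason a codec at this rate suits low-latency voice agents.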

What you'll learn

1 (00:05) **Intro and Voxtral TTS Announcement** - Hosts introduce guests from Mistral and announce Voxtral TTS, a 3B TTS model supporting 9 languages
2 (00:32) **Previous Audio Models Overview** - Recap of Mistral's ASR models like Voxtral (July) and updates (multilingual, real-time transcription)
3 (01:39) **Voxtral Architecture Deep Dive** - Novel in-house auto-regressive flow matching with neural audio codec (semantic + acoustic tokens at 12.5 Hz)
4 (04:50) **Output Generation Techniques** - Contrasts parallel prediction, depth-wise transformers vs flow matching for multi-token audio frames
5 (06:20) **Flow Matching Novelty in Audio** - First major audio application borrowed from vision; models high audio entropy (inflections, disfluencies)
6 (08:40) **Research Prioritization Strategy** - Step-by-step: ASR → real-time → TTS → full-duplex unification
7 (13:40) **Efficiency and Model Design Philosophy** - Small specialized models (e.g., 3B Voxtral, Pixtral OCR) vs large generalists for cost/latency

Show Notes

Mistral has been on an absolute tear: with frequent successful model launches, it is easy to forget that they raised the largest European AI round in history last year. We were long overdue for a Mistral episode, and we were very fortunate to work with Sophia and Howard to catch up with Pavan (Voxtral lead) and Guillaume (Chief Scientist, Co-founder) on the occasion of this week’s Voxtral TTS launch.

Mistral can’t directly say it, but the benchmarks do imply that this is basically an open-weights, ElevenLabs-level TTS model. (Technically, it is a 4B, Ministral-based, multilingual, low-latency, open-weights TTS model that has a 68.4% win rate vs ElevenLabs Flash v2.5.) The contributions are not just in the open weights but also in open research: we also spend a decent amount of the pod talking about their architecture, which combines auto-regressive generation of semantic speech tokens with flow matching for acoustic tokens (typically only applied in the image generation space, as seen in the Flow Matching NeurIPS workshop from the principal authors that we reference in the pod).
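For readers new to flow matching, here is a toy sketch of the sampling side of the technique in general, not of Voxtral TTS itself. It starts from noise and integrates a velocity field with Euler steps from t = 0 to t = 1. Everything here is an assumption for illustration: the function names are made up, and the closed-form velocity toward a single known `target` stands in for what a trained network would predict from context.

```python
import random


def velocity(x, t, target):
    """Velocity of the straight-line path x_t = (1 - t) * x0 + t * target.

    Along that path the velocity at (x, t) is (target - x) / (1 - t).
    A real model would replace this closed form with a learned network.
    """
    return [(g - xi) / (1.0 - t) for xi, g in zip(x, target)]


def euler_sample(x0, target, steps=8):
    """Integrate dx/dt = velocity(x, t) from t = 0 to t = 1 with Euler steps."""
    x = list(x0)
    dt = 1.0 / steps
    for k in range(steps):
        t = k * dt  # t stays strictly below 1, so the division is safe
        v = velocity(x, t, target)
        x = [xi + dt * vi for xi, vi in zip(x, v)]
    return x


if __name__ == "__main__":
    rng = random.Random(0)
    noise = [rng.gauss(0.0, 1.0) for _ in range(4)]  # starting point: Gaussian noise
    target = [0.5, -0.25, 1.0, 0.0]                  # stand-in for acoustic latents
    print(euler_sample(noise, target))
```

With this degenerate single-target field, the Euler steps land on `target` up to floating-point error. A learned field would instead carry noise to varied samples from the data distribution, which is how the approach can capture the range of inflections discussed in the episode.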

You can catch up on the paper here, and the full episode is live on YouTube!

Timestamps

00:00 Welcome and Guests
00:22 Announcing Voxtral TTS
01:41 Architecture and Codec
02:53 Understanding vs Generation
05:39 Flow Matching for Audio
07:27 Real Time Voice Agents
13:40 Efficiency and Model Strategy
14:53 Voice Agents Vision
17:56 Enterprise Deployment and Privacy
23:39 Fine Tuning and Personalization
25:22 Enterprise Voice Personalization
26:09 Long-Form Speech Models
26:58 Real-Time Encoder Advances
27:45 Scaling Context for TTS
28:53 What Makes Small Models
30:37 Merging Modalities Tradeoffs
33:05 Open Source Mission
35:51 Lean and Formal Proofs
38:40 Reasoning Transfer and Agents
40:25 Next Frontiers in Training
42:20 Hiring and AI for Science
44:19 Forward Deployed Engineering
46:22 Customer Feedback Loop
48:29 Wrap Up and Thanks

Transcript

swyx: Okay, welcome to Latent Space. We’re here in the studio with our guest co-host Vibhu. Welcome. Thanks. Excited for this one as well as Guillaume and Pavan from Mistral. Welcome. Excited to be here.

Guillaume: Thank you.

swyx: Pavan, you are leading audio research at Mistral, and Guillaume, you’re chief scientist.