Dia TTS
Nari Labs' expressive text-to-speech model with audio conditioning for emotion control, zero-shot voice variety, voice cloning from reference audio, and multi-speaker dialogue via [S1]/[S2] tags. A 1.6B-parameter model priced at $0.04 per 1,000 characters.
Model Specs
- Released: Mar 2025
- Voices: 1
- Max characters: 5K
- Modalities: text → audio
About this model
Dia TTS is Nari Labs' expressive text-to-speech model — a 1.6-billion-parameter system designed for emotional range, voice variety, and multi-speaker dialogue rather than just clean narration. The model produces a new synthetic voice with each run by default (zero-shot voice variety) but also supports voice cloning when you provide a reference audio file. Audio conditioning enables emotion control, and the model naturally produces nonverbals like laughter, sighs, and throat clearing — features that make output feel more human and less robotic.
A distinctive capability is multi-speaker dialogue: tag your text with `[S1]` and `[S2]` to generate a conversation between two speakers in a single output. Combined with notation like `(whispers)`, `(excited)`, or `(chuckles)` for emotional direction, Dia TTS is purpose-built for narrative content, character work, and dialogue-driven audio. Pricing on fal.ai is $0.04 per 1,000 characters (pay-per-use, no subscription), sitting in the middle of Renas AI's TTS pricing tier, between Kokoro's $0.02 and ElevenLabs' $0.10.
Reach for Dia TTS when (a) emotional expressiveness matters more than the cheapest cost, (b) you need voice cloning from a reference sample, (c) your content is dialogue-driven (multi-speaker conversations, character interactions), or (d) natural nonverbals (laughter, breath sounds) add value to your audio. For pure narration at the lowest cost, choose Kokoro; for 70+ languages and 20 named voices, ElevenLabs v3; for inline emotion tags on a Llama-based architecture, Orpheus.
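At $0.04 per 1,000 characters, cost scales linearly with script length. A minimal sketch of the arithmetic (the rate is taken from the pricing above; the helper itself is illustrative):

```python
def estimate_dia_cost(text: str, rate_per_1k_chars: float = 0.04) -> float:
    """Estimate Dia TTS cost for a script at $0.04 per 1,000 characters."""
    return len(text) / 1000 * rate_per_1k_chars

# A 5,000-character script (the documented max per request):
print(f"${estimate_dia_cost('x' * 5000):.2f}")  # → $0.20
```

A full 5K-character request therefore costs about twenty cents; a typical 60-second narration script (roughly 800–1,000 characters) lands around $0.03–$0.04.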
Key Strengths
Emotion control via audio conditioning
Dia TTS generates emotionally expressive speech — adjust tone, intensity, and feeling through prompt notation. Use `(whispers)`, `(excited)`, `(chuckles)` and similar tags inline to direct emotional delivery.
Zero-shot voice variety
Each generation produces a new synthetic voice by default — useful when voice diversity matters across many generations or when you want to avoid the same voice across content. For voice consistency, supply a reference audio for cloning.
Voice cloning with reference audio
Provide a reference audio file and Dia TTS clones the voice for subsequent generations. Useful for branded narrators, character continuity across content, or replicating a specific voice characteristic.
Multi-speaker dialogue ([S1]/[S2] tags)
Tag your text with `[S1]` and `[S2]` to generate a two-speaker dialogue in a single output — useful for podcast scripts, character conversations, interview-style content, and narrative dialogue.
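For longer scripts it can help to assemble the tagged text programmatically. A hypothetical helper (the `[S1]`/`[S2]` tag format comes from Dia's documentation; the function itself is illustrative):

```python
def build_dialogue(turns: list[tuple[int, str]]) -> str:
    """Format (speaker, line) pairs into Dia's [S1]/[S2] dialogue notation.

    Dia supports two speakers, so speaker must be 1 or 2.
    """
    for speaker, _ in turns:
        if speaker not in (1, 2):
            raise ValueError("Dia dialogue supports speakers 1 and 2 only")
    return " ".join(f"[S{speaker}] {line}" for speaker, line in turns)

script = build_dialogue([
    (1, "What time is it?"),
    (2, "Almost three. (sighs) We should hurry."),
])
print(script)
# → [S1] What time is it? [S2] Almost three. (sighs) We should hurry.
```

Inline emotion notation like `(sighs)` travels with each turn, so emotional direction and speaker identity can be scripted together.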
Natural nonverbals (laughter, throat clearing)
Dia generates organic human sounds — laughter, sighs, throat clearing, breath. Makes output feel more conversational and less robotic compared to pure-narration TTS models.
1.6B parameter expressive architecture
Substantially larger than Kokoro (82M) — the parameter count translates to expressive nuance and emotional range that smaller models can't match.
Voice synthesis capabilities
Available voices, languages, and expressive controls.
How it compares
Dia TTS occupies the expressive mid-tier. Compare on emotional capability, voice cloning, and language coverage.
Pros
- Emotion control via audio conditioning
- Voice cloning with reference audio
- Multi-speaker dialogue with [S1]/[S2] tags
- Natural nonverbals (laughter, throat clearing, sighs)
- 1.6B parameter expressive architecture
- Mid-tier pricing ($0.04/1K chars) — cheaper than ElevenLabs
- Pay-per-use, no subscription required
Things to consider
- Specific language list not documented on the fal.ai page
- Zero-shot voice variety means voice changes each run unless cloning
- 5K character cap per request — long-form scripts must be split into segments
- Less mature than ElevenLabs ecosystem (smaller voice library, less documentation)
- Emotional notation requires learning Dia-specific syntax
Best use cases
Narrative content and storytelling
Audio dramas, story-driven podcasts, character-led narration. Emotion control + nonverbals + multi-speaker dialogue together fit narrative use cases that require human-feeling delivery.
Dialogue-driven podcasts and audio
Two-speaker conversations, interview-style content, character interactions. The [S1]/[S2] tag system generates dialogue in one pass instead of stitching separate generations.
Voice cloning for branded narration
Provide a reference audio file (your brand's narrator voice, a recurring character) and Dia clones it for subsequent generations. Useful for content workflows that need voice consistency across many pieces.
Audiobook character voices
Different voices for different characters in an audiobook narrative. Combine zero-shot voice variety + multi-speaker tags + emotional notation for chapters with dialogue.
Educational content with emotional emphasis
Language learning content where emotion matters (frustration, excitement, surprise), tutorial narration that emphasizes key points, course material with engaging delivery.
Voice memo to natural narration
Provide a reference voice memo, generate clean narration in that voice for content. Useful for solo creators who want their own voice cloned without recording polished takes for every piece.
How to use it on Renas AI
Step 1
Open the AI Voice tool in TTS mode
Navigate to AI Voice in the Renas dashboard, then switch to Text-to-Speech mode. Pick Dia TTS from the model selector — it's marked as the expressive Nari Labs variant.
Step 2
Pick voice strategy
Default behavior generates a new synthetic voice each run (zero-shot variety). For voice cloning, attach a reference audio file. For multi-speaker dialogue, plan your [S1] and [S2] tag usage in the script.
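Under the hood, Renas routes Dia through fal.ai, which is typically invoked with the `fal_client` Python package. The endpoint id and parameter names below are assumptions (verify against the fal.ai model page); the sketch only shows how a cloning-vs-zero-shot request might be assembled:

```python
from typing import Optional

# Hypothetical endpoint id and field names -- check the fal.ai model page.
DIA_ENDPOINT = "fal-ai/dia-tts"

def build_dia_request(text: str, ref_audio_url: Optional[str] = None) -> dict:
    """Assemble request arguments: omit ref_audio_url for zero-shot voice
    variety, or supply it to clone the reference voice."""
    args = {"text": text}
    if ref_audio_url:
        args["ref_audio_url"] = ref_audio_url  # assumed parameter name
    return args

# Actual call (requires fal_client and credentials; shown for context only):
# result = fal_client.subscribe(DIA_ENDPOINT, arguments=build_dia_request(
#     "[S1] Welcome back. [S2] (excited) Let's get started!",
#     ref_audio_url="https://example.com/narrator.wav",
# ))
```

The key design point: voice strategy is a request-level choice, so the same script can be rendered zero-shot or cloned just by toggling one argument.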
Step 3
Write expressive script
Include emotion notation inline: `(whispers) That's the secret. (excited) Did you see the result?`. For dialogue, tag speakers: `[S1] What time is it? [S2] Almost three.` Natural nonverbals like (chuckles) or (sighs) add organic feel.
Step 4
Generate, review, refine
Audio output goes to your asset library. Iterate on emotional notation if delivery isn't quite right — Dia responds to specific direction more reliably than generic prompts. For long scripts, break into segments to keep the per-character cost predictable.
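Splitting a long script into segments, as Step 4 suggests, is best done at speaker-tag boundaries so no [S1]/[S2] turn is cut mid-sentence. A sketch, assuming the 5K-character per-request cap from the specs above:

```python
import re

MAX_CHARS = 5000  # per-request character cap from the model specs

def split_script(script: str, max_chars: int = MAX_CHARS) -> list[str]:
    """Split a tagged script into chunks under max_chars, keeping each
    [S1]/[S2] turn intact. Raises if a single turn exceeds the cap."""
    # Split just before each speaker tag; the tag stays with its turn.
    turns = [t.strip() for t in re.split(r"(?=\[S[12]\])", script) if t.strip()]
    chunks, current = [], ""
    for turn in turns:
        if len(turn) > max_chars:
            raise ValueError("single turn exceeds the per-request limit")
        if current and len(current) + 1 + len(turn) > max_chars:
            chunks.append(current)
            current = turn
        else:
            current = f"{current} {turn}".strip()
    if current:
        chunks.append(current)
    return chunks
```

Each chunk is then a separate generation, which also keeps the per-request cost predictable.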
Pricing on Renas AI
Pay-as-you-go credits, no API keys, no rate limits.
Expressive AI voice with emotion + cloning
Use Dia TTS with your Renas AI subscription credits — no API key, no setup, no per-seat fees.
Try Dia TTS