Gemini 3.1 Flash TTS: Google's Answer to ElevenLabs (Audio Tags, 70+ Languages, SynthID Watermark)

April 18, 2026 · 7 min read · By T.W. Ghost
Tags: Gemini, TTS, Google AI, ElevenLabs, OpenAI TTS, Voice AI, Audio Tags, SynthID

The News

On April 15, 2026, Google launched Gemini 3.1 Flash TTS in preview across the Gemini API, Google AI Studio, and Vertex AI. Consumers get it via Google Vids through Workspace.

It's the first text-to-speech model from Google that's positioned directly against ElevenLabs and OpenAI TTS as a production-ready voice generation tool, not just a research demo.

Three things make this launch worth paying attention to:

  • Audio tags for natural-language voice direction embedded in the text prompt
  • 70+ languages with native multi-speaker dialogue
  • SynthID watermarking baked into every output

The Artificial Analysis TTS leaderboard scores it at Elo 1,211, placing it in the "ideal blend of high quality + low cost" quadrant. That's Google's positioning language, but it lines up with what early testers are saying.
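
To put an Elo figure like that in perspective, here is a minimal sketch using the standard Elo expected-score formula. It assumes the leaderboard's ratings follow the usual 400-point logistic scale, which the launch notes don't confirm, and the 1,200 rival rating is a hypothetical value chosen only for illustration.

```python
def elo_expected_score(rating_a: float, rating_b: float) -> float:
    """Standard Elo expected score of A against B, between 0 and 1."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))


# 1,211 is the score quoted above; 1,200 is a hypothetical rival rating
# used only for illustration (assumption, not a leaderboard figure).
p = elo_expected_score(1211, 1200)
print(f"Expected preference rate: {p:.1%}")  # prints "Expected preference rate: 51.6%"
```

Under that assumption, an 11-point lead means listeners would pick the higher-rated model only slightly more often than a coin flip, so small Elo gaps near the top of the table are closer to a tie than a win.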


What's Actually New: Audio Tags

This is the feature that changes the workflow.

Most TTS systems (ElevenLabs, OpenAI TTS, Azure Speech) give you voice selection plus numeric sliders for speed, pitch, and style. You configure the voice once and it applies uniformly to the entire output.

Gemini 3.1 Flash TTS lets you embed stage directions inline using natural-language audio tags. Think of it as writing a script with director notes:

[whisper excitedly] Can you believe this just happened? [normal] The data came in overnight and the results are clear.

The model interprets the bracketed directions and adjusts delivery mid-sentence. You can specify:

  • Vocal style (whisper, excited, calm, sarcastic)
  • Pacing (slow, rushed)
  • Accent variations
  • Mid-sentence expression pivots
  • Scene direction for multi-speaker dialogues
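
If you generate scripts programmatically, the tagged prompt can be assembled from (direction, text) pairs. This is a rough sketch of the string handling only: `build_tagged_script` is a hypothetical helper, and the tag wording is borrowed from the example above rather than from a confirmed tag vocabulary.

```python
from typing import Iterable, Optional, Tuple

# (direction or None, spoken text)
Segment = Tuple[Optional[str], str]


def build_tagged_script(segments: Iterable[Segment]) -> str:
    """Join segments into one prompt, prefixing each with a [direction] tag."""
    parts = []
    for direction, text in segments:
        text = text.strip()
        if not text:
            continue  # skip empty text rather than emit a dangling tag
        if direction:
            parts.append(f"[{direction.strip()}] {text}")
        else:
            parts.append(text)  # no tag: leave delivery to the model
    return " ".join(parts)


script = build_tagged_script([
    ("whisper excitedly", "Can you believe this just happened?"),
    ("normal", "The data came in overnight and the results are clear."),
])
print(script)
```

Keeping directions as data rather than hand-typed brackets makes it easier to swap a delivery style across a whole script and re-run it.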

For anyone producing narrated content (tutorials, audiobooks, podcasts, video voiceovers), this is closer to how voice actors actually work. The alternative on ElevenLabs is regenerating segments with different settings and splicing them together.


How It Compares

vs ElevenLabs

| Factor | Gemini 3.1 Flash TTS | ElevenLabs |
|---|---|---|
| Quality (Artificial Analysis Elo) | 1,211 | 1,200s typical top tier |
| Languages | 70+ | 29+ |
| Voice cloning | Not detailed in launch notes | Yes, instant + professional tiers |
| Multi-speaker dialogue | Native | Supported via Studio |
| Control mechanism | Inline audio tags | Settings sliders + tags |
| Watermark | SynthID mandatory | Optional |
| Pricing | Not yet published | $5-$330/mo tiered |
| API access | Yes (Gemini API + Vertex) | Yes |

vs OpenAI TTS

| Factor | Gemini 3.1 Flash TTS | OpenAI TTS-1-HD |
|---|---|---|
| Voice count | Not specified | 6 voices (alloy, echo, fable, onyx, nova, shimmer) |
| Inline direction | Yes (audio tags) | No |
| Multi-speaker | Native | Not supported |
| Language count | 70+ | Multilingual but English-dominant |
| Watermark | SynthID | None |
| Integration | Gemini API | OpenAI Python/Node SDK |

The short version: if you want native multi-speaker dialogue with inline direction, Gemini is the only one of the three that offers it. If you want voice cloning, ElevenLabs remains ahead. If you want simplicity, OpenAI TTS stays the lightest integration.
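
For a sense of what a multi-speaker script with per-turn direction might look like as input, here is a small sketch. The `Speaker: [direction] text` line layout is an assumption made for illustration, not a format taken from the launch notes, and `Turn` and `render_dialogue` are hypothetical names.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Turn:
    speaker: str
    text: str
    direction: Optional[str] = None  # e.g. "calm", "sarcastic"


def render_dialogue(turns: List[Turn]) -> str:
    """Render turns as 'Speaker: [direction] text' lines, one per turn."""
    lines = []
    for turn in turns:
        tag = f"[{turn.direction}] " if turn.direction else ""
        lines.append(f"{turn.speaker}: {tag}{turn.text}")
    return "\n".join(lines)


dialogue = render_dialogue([
    Turn("Host", "Welcome back to the show.", "excited"),
    Turn("Guest", "Thanks for having me.", "calm"),
    Turn("Host", "Let's get straight into it."),
])
print(dialogue)
```

The point of the structure is that the whole conversation travels as one script, instead of one request per speaker followed by manual splicing.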


The SynthID Watermark: Double-Edged

Every audio output from Gemini 3.1 Flash TTS includes a SynthID watermark baked into the waveform. Google describes it as "imperceptible" and says it is designed to survive compression, editing, and re-encoding.

Why it matters positively:

  • Professional credibility for AI-generated content
  • Platform compatibility as YouTube, TikTok, and others roll out SynthID detection
  • Legal defensibility (clear AI provenance)
  • Anti-deepfake posture

Why it might matter negatively:

  • Some production pipelines flag watermarked audio for "authenticity concerns"
  • Clients who want "undetectable" AI voiceover (ethically questionable use case) will avoid it
  • Unknown whether advanced audio processing strips the watermark (Google says no, but it's new)

Our take: for legitimate content production, SynthID is a feature, not a bug. For grey-market uses, Gemini won't be the tool of choice.


Pricing Reality

Google has not published per-token pricing for Gemini 3.1 Flash TTS at launch. The Flash line historically sits at the cheapest tier of Gemini models, so we expect pricing well below ElevenLabs' scale tiers.

For reference:

  • ElevenLabs Creator: $22/mo for 100K characters
  • ElevenLabs Pro: $99/mo for 500K characters
  • OpenAI TTS-1-HD: $30 per 1M characters
  • Previous Gemini Flash TTS (now deprecated): $0.0075 per 1K characters

If Gemini 3.1 Flash TTS lands near Flash 2.0 pricing, it could undercut ElevenLabs by 5-10x for bulk voiceover production.
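
To compare those reference prices on one scale, the sketch below normalizes each to dollars per 1,000 characters and divides by the deprecated Gemini Flash TTS rate. It uses only the figures listed above and assumes subscription quotas are used in full. It says nothing about what Gemini 3.1 Flash TTS will actually cost.

```python
# Reference prices from the list above, normalized to USD per 1,000 characters.
plans = {
    "ElevenLabs Creator": 22.00 / 100,    # $22 for 100K characters
    "ElevenLabs Pro": 99.00 / 500,        # $99 for 500K characters
    "OpenAI TTS-1-HD": 30.00 / 1000,      # $30 per 1M characters
    "Previous Gemini Flash TTS": 0.0075,  # quoted directly per 1K characters
}

# Baseline is the deprecated model's rate, used as a stand-in only.
baseline = plans["Previous Gemini Flash TTS"]


def ratio_to_baseline(price_per_1k: float) -> float:
    """How many times more expensive a plan is than the baseline rate."""
    return price_per_1k / baseline


for name, price in plans.items():
    print(f"{name:<26} ${price:.4f}/1K  {ratio_to_baseline(price):5.1f}x")
```

On these figures the gap works out to roughly 26x to 29x against the two ElevenLabs plans and about 4x against OpenAI TTS-1-HD, so the 5-10x estimate still holds if the new model costs a few times more than its predecessor.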

We'll update this post once official pricing is published.


Who Should Try It First

Video content creators producing regular narration across multiple styles. The audio tag system lets one voice cover "excited tutorial intro" and "calm technical explanation" without regenerating takes.

Podcast producers running multi-speaker shows where native dialogue support saves splicing time.

Global content teams serving 10+ languages. 70+ language coverage leapfrogs most competitors.

Developers building TTS into products where SynthID compliance aligns with Play Store / App Store policies around AI disclosure.

Anyone on ElevenLabs Pro or Enterprise who's paying $99+/mo for character quota. Worth running the same script through Gemini and comparing output before the next renewal.


Who Should Stick With Their Current TTS

Voice cloning workflows — ElevenLabs still leads, and instant voice cloning is unmatched.

Simple one-voice narration — OpenAI TTS-1-HD is already good enough and trivially easier to integrate.

Workflows where SynthID watermarking is disallowed — legal restrictions vary by jurisdiction and platform.


The Bigger Picture

Google has spent two years being second-best in commercial TTS behind ElevenLabs and tied with OpenAI. Gemini 3.1 Flash TTS is the first model where Google leads on three factors simultaneously: language count, inline control, and native multi-speaker support.

The ElevenLabs moat isn't broken yet. Voice cloning, brand voice libraries, and conversational AI integrations still differentiate ElevenLabs for enterprise buyers. But the middle tier (individual creators, content agencies, small media companies paying $22-99/mo) now has a legitimate alternative.

We expect 2026 to be the year Gemini TTS closes the quality gap and forces ElevenLabs to compete on price, not just features.


Get Started

Gemini 3.1 Flash TTS is in preview. You can reach it through:

  • Developers: Google AI Studio (free tier) or Gemini API
  • Enterprises: Vertex AI
  • Consumers: Google Vids (Workspace)
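
Once you have audio back from the API, you'll likely want it in a playable file. The sketch below wraps raw PCM samples in a WAV container using only Python's standard library. The 24 kHz, 16-bit mono defaults are assumptions, not figures from the launch notes, so check the API reference for the actual output format. `pcm_to_wav_bytes` is a hypothetical helper and the silent buffer stands in for real API output.

```python
import io
import wave


def pcm_to_wav_bytes(pcm: bytes, sample_rate: int = 24000,
                     channels: int = 1, sample_width: int = 2) -> bytes:
    """Wrap raw PCM samples in a WAV container and return the file bytes."""
    buffer = io.BytesIO()
    with wave.open(buffer, "wb") as wav:
        wav.setnchannels(channels)
        wav.setsampwidth(sample_width)  # bytes per sample; 2 = 16-bit
        wav.setframerate(sample_rate)
        wav.writeframes(pcm)
    return buffer.getvalue()


# Stand-in for API output: one second of silence at the assumed format.
fake_pcm = b"\x00\x00" * 24000
wav_bytes = pcm_to_wav_bytes(fake_pcm)

# Write wav_bytes to disk with open("out.wav", "wb") when you have real audio.
print(f"WAV payload: {len(wav_bytes)} bytes")
```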
