Live transcription in Mandarin, Spanish, and Japanese: what 2026 AI actually delivers
A field report on live transcription quality across 19 languages using Gemini 2.5 Flash. What works, what still breaks, and where to deploy it.
We've been running live transcription on TA pilot sessions for 18 months, using Gemini 2.5 Flash via OpenRouter. Here's what the quality actually looks like across languages in 2026.
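Concretely, each audio chunk goes out as a chat completion request. A minimal sketch (the model slug, prompt, and `input_audio` message shape follow the OpenAI-compatible convention OpenRouter exposes; verify the exact audio-input schema against their current docs before relying on it):

```python
import base64
import json
import urllib.request

OPENROUTER_URL = "https://openrouter.ai/api/v1/chat/completions"

def build_transcription_request(audio_bytes: bytes, language_hint: str) -> dict:
    """Build an OpenAI-style chat payload asking Gemini 2.5 Flash to
    transcribe one audio chunk. The language hint noticeably improves
    accuracy (see "Where to deploy it" below)."""
    return {
        "model": "google/gemini-2.5-flash",
        "messages": [{
            "role": "user",
            "content": [
                {"type": "text",
                 "text": f"Transcribe this audio verbatim. Language: {language_hint}."},
                {"type": "input_audio",
                 "input_audio": {
                     "data": base64.b64encode(audio_bytes).decode("ascii"),
                     "format": "wav",
                 }},
            ],
        }],
    }

def transcribe(audio_bytes: bytes, language_hint: str, api_key: str) -> str:
    """Send one chunk to OpenRouter and return the transcript text."""
    req = urllib.request.Request(
        OPENROUTER_URL,
        data=json.dumps(build_transcription_request(audio_bytes, language_hint)).encode(),
        headers={"Authorization": f"Bearer {api_key}",
                 "Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        body = json.load(resp)
    return body["choices"][0]["message"]["content"]
```

In production you'd stream short overlapping chunks rather than one-shot requests, but the request shape is the same.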
Quality tiers
Tier 1: near-perfect
- English (native speakers)
- Spanish
- French
- German
- Italian
- Portuguese
These transcribe at 95%+ accuracy even with fast speech, technical vocabulary, and moderate accents. The remaining 5% is usually proper nouns (people's names, product names) that could be fixed with a glossary.
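That glossary fix is cheap to automate as a post-pass over the transcript. A sketch (the glossary entries here are illustrative, not from our real data):

```python
import re

def apply_glossary(transcript: str, glossary: dict[str, str]) -> str:
    """Replace common mishearings of proper nouns with canonical spellings.
    Keys are mis-transcribed forms, values are correct names. Longer keys
    run first so overlapping entries don't clobber each other."""
    for wrong in sorted(glossary, key=len, reverse=True):
        transcript = re.sub(re.escape(wrong), glossary[wrong], transcript,
                            flags=re.IGNORECASE)
    return transcript

# Illustrative entries: mishearing -> canonical spelling.
glossary = {"jemini": "Gemini", "open rooter": "OpenRouter"}
print(apply_glossary("We run jemini via open rooter.", glossary))
# prints: We run Gemini via OpenRouter.
```

A per-session glossary (speaker names, product names) recovers most of that last 5% without any model changes.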
Tier 2: very good
- Mandarin (both Traditional and Simplified script output)
- Japanese
- Korean
- English (non-native speakers with moderate accents)
These transcribe at 90%+ accuracy. The typical errors are:
- Mandarin: wrong character with the same pronunciation, especially in technical terms
- Japanese: missed particle distinctions (は vs が in rapid speech)
- Korean: 어 vs 오 confusion in fast speech
All correctable by a human reviewer, all pass "close enough to follow" for live captioning.
Tier 3: good
- Russian
- Polish
- Dutch
- Turkish
- Ukrainian
85%+ accuracy. Fine for casual talks, not ready for legal or medical transcription.
Tier 4: passable
- Arabic
- Hindi
- Thai
- Vietnamese
- Indonesian
70-85% accuracy depending on dialect. Usable for accessibility (better than no captions) but noticeable errors. Improving quarter-over-quarter.
Latency
End-to-end latency for any of these is around 1.2-2 seconds from speech to caption-on-screen. That's fast enough for live captioning but not quite real-time simultaneous interpretation.
Where to deploy it
High-confidence deployments
- Business meetings in Tier 1-2 languages
- University lectures
- Webinars
- Conference talks
- Podcast recordings (post-processing with a human edit)
Caution zones
- Legal proceedings (always use human interpreter)
- Medical consultations (always use human interpreter)
- Simultaneous multi-speaker transcription (quality drops with overlapping voices; pause or use a moderator to handle turn-taking)
- Extreme technical jargon (engineering, biotech, niche academia) without a glossary
The cost equation
Live transcription now costs a fraction of a cent per minute per listener. For a 100-person webinar running 60 minutes, that's a couple dollars. The cost stopped being a barrier years ago; the barriers now are:
- Awareness that the feature exists
- Quality at the long tail of languages
- Integration simplicity (one click, not a consultant)
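The cost arithmetic above, made concrete. The per-minute rate below is an assumed placeholder for "a fraction of a cent"; check your provider's current pricing:

```python
def webinar_caption_cost(listeners: int, minutes: int,
                         rate_per_min_per_listener: float) -> float:
    """Total captioning cost in dollars for one session."""
    return listeners * minutes * rate_per_min_per_listener

# Assumed rate: $0.0004 per minute per listener.
cost = webinar_caption_cost(listeners=100, minutes=60,
                            rate_per_min_per_listener=0.0004)
print(f"${cost:.2f}")  # prints: $2.40
```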
TA pilot handles captioning automatically when you enable it on a session. The language you pick at session creation is passed to the model as the transcription hint, so setting "Japanese" at creation gives materially better transcription than leaving it on "auto".
What's coming
Quality at Tier 3 and Tier 4 languages is improving roughly 10% per quarter. By end of 2026 we expect Tier 3 to be at today's Tier 1-2 quality, and Tier 4 to be at today's Tier 3.
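Compounded out, that rate supports the projection, assuming "10% per quarter" means a 10% relative cut in error rate:

```python
def projected_error(error_rate: float, quarters: int,
                    improvement_per_quarter: float = 0.10) -> float:
    """Error rate after N quarters of compounding relative improvement."""
    return error_rate * (1 - improvement_per_quarter) ** quarters

# Tier 3 today: ~85% accuracy, i.e. ~15% error. Four quarters out:
tier3 = projected_error(0.15, quarters=4)
print(f"{1 - tier3:.1%} accuracy")  # lands in today's Tier 2 range (90%+)
```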
The frontier isn't accuracy anymore — it's multi-speaker turn-taking, speaker diarization, and on-the-fly speaker adaptation.
Add TA pilot to Chrome and you're live with a QR in under a minute.