
Live transcription in Mandarin, Spanish, and Japanese: what 2026 AI actually delivers

A field report on live transcription quality across 19 languages using Gemini 2.5 Flash. What works, what still breaks, and where to deploy it.

Tags: live transcription, Mandarin, Spanish, Japanese, accessibility, AI transcription


We've been running live transcription on TA pilot sessions for 18 months, powered by Gemini 2.5 Flash via OpenRouter. Here's what the quality actually looks like across languages in 2026.
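For context, here's roughly the shape of one transcription call. A minimal sketch in TypeScript, assuming OpenRouter's OpenAI-compatible chat endpoint accepts base64 `input_audio` parts for audio-capable models; the chunking, retries, and auth handling in the real pipeline are elided:

```ts
// Sketch: transcribe one buffered audio chunk via OpenRouter.
// Assumes OpenAI-style `input_audio` content parts are accepted for
// audio-capable models; not our exact production code.
async function transcribeChunk(wavBase64: string, language: string): Promise<string> {
  const res = await fetch("https://openrouter.ai/api/v1/chat/completions", {
    method: "POST",
    headers: {
      Authorization: `Bearer ${process.env.OPENROUTER_API_KEY}`,
      "Content-Type": "application/json",
    },
    body: JSON.stringify({
      model: "google/gemini-2.5-flash",
      messages: [{
        role: "user",
        content: [
          { type: "text", text: `Transcribe this audio verbatim. Language hint: ${language}.` },
          { type: "input_audio", input_audio: { data: wavBase64, format: "wav" } },
        ],
      }],
    }),
  });
  const json: any = await res.json();
  return json.choices[0].message.content; // the transcript for this chunk
}
```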

Quality tiers

Tier 1: near-perfect

  • English (native speakers)
  • Spanish
  • French
  • German
  • Italian
  • Portuguese

These transcribe at 95%+ accuracy even with fast speech, technical vocabulary, and moderate accents. The remaining 5% is usually proper nouns (people's names, product names) that could be fixed with a glossary.
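That glossary fix can be as simple as a substitution pass over each caption before display. An illustrative sketch; the entries and the exact matching strategy here are made up for the example:

```ts
// Illustrative glossary pass: map common mis-hearings of proper nouns
// to their canonical spellings. Entries below are invented examples.
const glossary: Record<string, string> = {
  "open rooter": "OpenRouter",
  "jemini": "Gemini",
};

function applyGlossary(caption: string): string {
  let fixed = caption;
  for (const [heard, canonical] of Object.entries(glossary)) {
    fixed = fixed.replace(new RegExp(heard, "gi"), canonical); // case-insensitive
  }
  return fixed;
}
```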

Tier 2: very good

  • Mandarin (in both Traditional and Simplified scripts)
  • Japanese
  • Korean
  • English (non-native speakers with moderate accents)

These transcribe at 90%+ accuracy. The typical errors are:

  • Chinese: wrong character with the same pronunciation, especially in technical terms

  • Japanese: missed particle distinctions (は vs が in rapid speech)
  • Korean: 어 vs 오 confusion in fast speech

All of it is correctable by a human reviewer, and all of it passes the "close enough to follow" bar for live captioning.

Tier 3: good

  • Russian
  • Polish
  • Dutch
  • Turkish
  • Ukrainian

85%+ accuracy. Fine for casual talks, not ready for legal or medical transcription.

Tier 4: passable

  • Arabic
  • Hindi
  • Thai
  • Vietnamese
  • Indonesian

70-85% accuracy depending on dialect. Usable for accessibility (better than no captions) but with noticeable errors. Improving quarter-over-quarter.

Latency

End-to-end latency for any of these is around 1.2-2 seconds from speech to caption on screen. That's fast enough for live captioning, but not for true simultaneous interpretation.
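Where that range comes from, roughly: the budget splits across buffering, model time, and rendering. A minimal sketch with assumed per-stage figures; the exact split varies by network and load:

```ts
// Rough speech-to-caption latency budget. Per-stage figures are
// assumptions for illustration; the real split varies per session.
const budgetMs = {
  audioBuffering: 500,   // buffer ~0.5 s of speech before sending a chunk
  networkAndModel: 600,  // upload plus Gemini 2.5 Flash response time
  captionRender: 100,    // push to clients and update the caption display
};

const totalMs = Object.values(budgetMs).reduce((a, b) => a + b, 0);
console.log(`~${totalMs} ms speech to caption`); // ~1200 ms, the fast end
```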

Where to deploy it

High-confidence deployments

  • Business meetings in Tier 1-2 languages
  • University lectures
  • Webinars
  • Conference talks
  • Podcast recordings (post-processing with a human edit)

Caution zones

  • Legal proceedings (always use human interpreter)
  • Medical consultations (always use human interpreter)
  • Simultaneous multi-speaker transcription (quality drops with overlapping voices; pause or use a moderator to handle turn-taking)

  • Extreme technical jargon (engineering, biotech, niche academia) without a glossary

The cost equation

Live transcription now costs a fraction of a cent per minute per listener; for a 100-person webinar running 60 minutes, that works out to a couple of dollars (worked through in the sketch after this list). Cost stopped being a barrier years ago; the barriers now are:

  1. Awareness that the feature exists
  2. Quality at the long tail of languages
  3. Integration simplicity (one click, not a consultant)
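The arithmetic behind that webinar figure is trivial. A minimal sketch, where the per-listener-minute rate is an assumed illustrative figure rather than a published price:

```ts
// Worked cost example. The rate is an assumed illustrative figure
// ("a fraction of a cent per minute per listener"), not a quoted price.
const ratePerListenerMinuteUsd = 0.0005; // 0.05 cents, assumption
const listeners = 100;
const minutes = 60;

const totalUsd = ratePerListenerMinuteUsd * listeners * minutes;
console.log(`~$${totalUsd.toFixed(2)} for the whole webinar`); // ~$3.00
```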

TA pilot handles captioning automatically when you enable it on a session. The language you pick at session creation is forwarded to the model as a hint, so setting "Japanese" at creation gives materially better transcription than leaving it on "auto".
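If you're curious what that hint looks like structurally, here's a hypothetical sketch; `SessionConfig` and its field names are illustrative stand-ins, not TA pilot's actual API:

```ts
// Hypothetical session config. Names are illustrative, not TA pilot's API.
interface SessionConfig {
  title: string;
  language: string;        // BCP 47 tag, forwarded to the model as a hint
  captionsEnabled: boolean;
}

const session: SessionConfig = {
  title: "Quarterly all-hands",
  language: "ja",          // an explicit "Japanese" beats leaving "auto"
  captionsEnabled: true,
};
```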

What's coming

Quality in Tier 3 and Tier 4 languages is improving by roughly 10% per quarter. By the end of 2026 we expect Tier 3 to reach today's Tier 1-2 quality, and Tier 4 to reach today's Tier 3.

The frontier isn't accuracy anymore — it's multi-speaker turn-taking, speaker diarization, and on-the-fly speaker adaptation.

Ready to run your own live Q&A?

Add TA pilot to Chrome and you're live with a QR in under a minute.