
Live transcription in Mandarin, Spanish, and Japanese: what 2026 AI actually delivers

A field report on live transcription quality across 19 languages using Gemini 2.5 Flash. What works, what still breaks, and where to deploy it.

Tags: live transcription, Mandarin, Spanish, Japanese, accessibility, AI transcription


We've been running live transcription on TA pilot sessions for 18 months, powered by Gemini 2.5 Flash via OpenRouter. Here's what the quality actually looks like across languages in 2026.
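For context, here's roughly the shape of one transcription call. A minimal sketch in TypeScript, assuming OpenRouter's OpenAI-compatible chat endpoint accepts base64 `input_audio` parts for audio-capable models; the chunking, retries, and auth handling in the real pipeline are elided:

```ts
// Sketch: transcribe one buffered audio chunk via OpenRouter.
// Assumes OpenAI-style `input_audio` content parts are accepted for
// audio-capable models; not our exact production code.
async function transcribeChunk(wavBase64: string, language: string): Promise<string> {
  const res = await fetch("https://openrouter.ai/api/v1/chat/completions", {
    method: "POST",
    headers: {
      Authorization: `Bearer ${process.env.OPENROUTER_API_KEY}`,
      "Content-Type": "application/json",
    },
    body: JSON.stringify({
      model: "google/gemini-2.5-flash",
      messages: [{
        role: "user",
        content: [
          { type: "text", text: `Transcribe this audio verbatim. Language hint: ${language}.` },
          { type: "input_audio", input_audio: { data: wavBase64, format: "wav" } },
        ],
      }],
    }),
  });
  const json: any = await res.json();
  return json.choices[0].message.content; // the transcript for this chunk
}
```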

Quality tiers

Tier 1: near-perfect

  • English (native speakers)
  • Spanish
  • French
  • German
  • Italian
  • Portuguese

These transcribe at 95%+ accuracy even with fast speech, technical vocabulary, and moderate accents. The remaining 5% is usually proper nouns (people's names, product names) that could be fixed with a glossary.
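That glossary fix can be as simple as a substitution pass over each caption before display. An illustrative sketch; the entries and the exact matching strategy here are made up for the example:

```ts
// Illustrative glossary pass: map common mis-hearings of proper nouns
// to their canonical spellings. Entries below are invented examples.
const glossary: Record<string, string> = {
  "open rooter": "OpenRouter",
  "jemini": "Gemini",
};

function applyGlossary(caption: string): string {
  let fixed = caption;
  for (const [heard, canonical] of Object.entries(glossary)) {
    fixed = fixed.replace(new RegExp(heard, "gi"), canonical); // case-insensitive
  }
  return fixed;
}
```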

Tier 2: very good

  • Mandarin (in both Traditional and Simplified scripts)
  • Japanese
  • Korean
  • English (non-native speakers with moderate accents)

These transcribe at 90%+ accuracy. The typical errors are:

  • Chinese: wrong character with the same pronunciation, especially in technical terms

  • Japanese: missed particle distinctions (は vs が in rapid speech)
  • Korean: 어 vs 오 confusion in fast speech

All of it is correctable by a human reviewer, and all of it passes the "close enough to follow" bar for live captioning.

Tier 3: good

  • Russian
  • Polish
  • Dutch
  • Turkish
  • Ukrainian

85%+ accuracy. Fine for casual talks, not ready for legal or medical transcription.

Tier 4: passable

  • Arabic
  • Hindi
  • Thai
  • Vietnamese
  • Indonesian

70-85% accuracy depending on dialect. Usable for accessibility (better than no captions) but with noticeable errors. Improving quarter-over-quarter.

Latency

End-to-end latency for any of these is around 1.2-2 seconds from speech to caption on screen. That's fast enough for live captioning, but not for true simultaneous interpretation.
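Where that range comes from, roughly: the budget splits across buffering, model time, and rendering. A minimal sketch with assumed per-stage figures; the exact split varies by network and load:

```ts
// Rough speech-to-caption latency budget. Per-stage figures are
// assumptions for illustration; the real split varies per session.
const budgetMs = {
  audioBuffering: 500,   // buffer ~0.5 s of speech before sending a chunk
  networkAndModel: 600,  // upload plus Gemini 2.5 Flash response time
  captionRender: 100,    // push to clients and update the caption display
};

const totalMs = Object.values(budgetMs).reduce((a, b) => a + b, 0);
console.log(`~${totalMs} ms speech to caption`); // ~1200 ms, the fast end
```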

Where to deploy it

High-confidence deployments

  • Business meetings in Tier 1-2 languages
  • University lectures
  • Webinars
  • Conference talks
  • Podcast recordings (post-processing with a human edit)

Caution zones

  • Legal proceedings (always use human interpreter)
  • Medical consultations (always use human interpreter)
  • Simultaneous multi-speaker transcription (quality drops with overlapping voices; pause or use a moderator to handle turn-taking)

  • Extreme technical jargon (engineering, biotech, niche academia) without a glossary

The cost equation

Live transcription now costs a fraction of a cent per minute per listener; for a 100-person webinar running 60 minutes, that works out to a couple of dollars (worked through in the sketch after this list). Cost stopped being a barrier years ago; the barriers now are:

  1. Awareness that the feature exists
  2. Quality at the long tail of languages
  3. Integration simplicity (one click, not a consultant)
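The arithmetic behind that webinar figure is trivial. A minimal sketch, where the per-listener-minute rate is an assumed illustrative figure rather than a published price:

```ts
// Worked cost example. The rate is an assumed illustrative figure
// ("a fraction of a cent per minute per listener"), not a quoted price.
const ratePerListenerMinuteUsd = 0.0005; // 0.05 cents, assumption
const listeners = 100;
const minutes = 60;

const totalUsd = ratePerListenerMinuteUsd * listeners * minutes;
console.log(`~$${totalUsd.toFixed(2)} for the whole webinar`); // ~$3.00
```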

TA pilot handles captioning automatically when you enable it on a session. The language you pick at session creation is forwarded to the model as a hint, so setting "Japanese" at creation gives materially better transcription than leaving it on "auto".
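If you're curious what that hint looks like structurally, here's a hypothetical sketch; `SessionConfig` and its field names are illustrative stand-ins, not TA pilot's actual API:

```ts
// Hypothetical session config. Names are illustrative, not TA pilot's API.
interface SessionConfig {
  title: string;
  language: string;        // BCP 47 tag, forwarded to the model as a hint
  captionsEnabled: boolean;
}

const session: SessionConfig = {
  title: "Quarterly all-hands",
  language: "ja",          // an explicit "Japanese" beats leaving "auto"
  captionsEnabled: true,
};
```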

What's coming

Quality in Tier 3 and Tier 4 languages is improving by roughly 10% per quarter. By the end of 2026 we expect Tier 3 to reach today's Tier 1-2 quality, and Tier 4 to reach today's Tier 3.

The frontier isn't accuracy anymore — it's multi-speaker turn-taking, speaker diarization, and on-the-fly speaker adaptation.

Ready to run your own live Q&A?

Add TA pilot to Chrome and you're live with a QR in under a minute.