If you've ever shipped a multilingual Next.js app, you know the part nobody talks about: translation quality at scale is a genuine engineering problem, not a content problem.
You can wire up next-intl or i18next in an afternoon. The hard part is what you put inside those JSON files, and whether the result actually reads like the language it's supposed to be.
I've been thinking about this a lot while building a multilingual blog. So when a detailed benchmark dropped comparing LLMs specifically on translation quality, I read it carefully.
Here's what I found.
The Benchmark That Actually Matters for Developers
TokenMix published a comparison of major LLMs on translation in April 2026, using the COMET metric, an industry-standard measure that correlates more closely with human judgment than BLEU scores do.
The headline number for Claude Sonnet 4.6: 0.885 COMET overall. For European languages specifically, that climbs to 0.898. For Chinese-English pairs, it sits at 0.878.
For context: anything above 0.85 on COMET is generally considered professional quality. Anything above 0.90 starts to approach the upper bound of what human reviewers reliably distinguish from a native-speaking professional translator.
The benchmark also measures instruction-following, and that's the number that matters most for production use. When you're translating marketing copy and your prompt says "keep the tone casual, preserve the product name, never translate UI labels," you need the model to actually follow that. GPT-class models hover around 87-91% on instruction compliance in this benchmark; Claude Sonnet 4.6 scores 98%, which means far fewer post-processing exceptions.
What This Means for Your i18n Pipeline
Most Next.js i18n pipelines I've seen look something like this:
1. Write content in English (or your primary language)
2. Run it through an AI translation step
3. Human review for high-stakes strings (pricing, legal, error messages)
4. Auto-approve the rest
5. Ship
The problem is step 4. "Auto-approve the rest" is where tone collapses. A product that sounds warm and conversational in English can read like a formal document in French, or like a machine-translated support ticket in Chinese.
This is where COMET scores start to matter practically. A higher COMET score correlates with better preservation of pragmatic meaning: the feel of a sentence, not just its semantic content.
For European language pairs, Claude Sonnet 4.6's 0.898 score suggests you can expand step 4's scope significantly: most UI copy, blog content, and marketing strings should pass auto-review without obvious degradation.
For Chinese-English pairs, the 0.878 score is still solid for factual content and product UI. Where you'll still want human review: idioms, culturally-specific humor, and anything that relies on connotation rather than denotation. The model handles the structure well; it sometimes miscalibrates the register.
The Cost Equation
The benchmark puts Claude Sonnet 4.6 at approximately $23.40 per million words translated. That number sounds abstract until you do the math on a real content site.
A medium-sized SaaS with 50 pages of marketing copy averages around 75,000 words in English. Supporting 10 languages means translating roughly 750,000 words. At this pricing, that's about $17.55 total for all 10 languages.
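That arithmetic is simple enough to fold into a budgeting helper. A minimal sketch, using the per-million-word rate from the benchmark cited in this post (the function name and rounding behavior are my own):

```typescript
// Back-of-the-envelope AI translation cost, assuming a flat
// per-million-word rate (the benchmark's $23.40/M-word figure).
const RATE_PER_MILLION_WORDS = 23.4;

function translationCostUSD(wordsPerLanguage: number, languageCount: number): number {
  const totalWords = wordsPerLanguage * languageCount;
  // Round to cents for display purposes.
  return Math.round((totalWords / 1_000_000) * RATE_PER_MILLION_WORDS * 100) / 100;
}

console.log(translationCostUSD(75_000, 10)); // ≈ 17.55 for all 10 languages
```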
The human translator alternative would typically run $8,000 to $25,000+ for those 10 languages, depending on the language pairs and quality tier.
For initial translations: AI is clearly cost-justified. For ongoing content (blog posts, changelogs, in-app copy updates): there's essentially no reason not to automate the first pass.
How to Wire This Into Your Next.js App
If you're using next-intl, here's the basic pattern for an automated translation pipeline:
Step 1: Maintain a canonical English file
Keep en.json as your source of truth. Every other locale file is derived from it.
Step 2: Detect what's missing or stale
Write a script that diffs en.json against your other locale files and outputs a list of keys that need translation or review.
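A minimal sketch of that diff step, assuming flat JSON message files (nested namespaces would need a recursive walk); the function and key names are illustrative, not from any specific library:

```typescript
// Flat locale message files: key -> translated string.
type Messages = Record<string, string>;

// Return the keys present in the source locale but missing from the
// target locale. (Staleness detection would additionally compare a
// stored hash of the source string against its current value.)
function diffLocales(source: Messages, target: Messages): string[] {
  return Object.keys(source).filter((key) => !(key in target));
}

const en: Messages = {
  "home.title": "Ship faster",
  "home.cta": "Start free trial",
  "pricing.headline": "Simple pricing",
};

const fr: Messages = {
  "home.title": "Livrez plus vite",
};

console.log(diffLocales(en, fr)); // keys still needing French translation
```

In a real pipeline you'd read `en.json` and each locale file from disk, run this diff, and feed only the missing keys into the translation step.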
Step 3: Run targeted translation calls
Instead of translating the entire file every time, translate only the diff. This cuts costs and avoids overwriting translations that have already been reviewed.
Step 4: Include context in your prompt
This is where most pipelines fail. "Translate this" gives you acceptable results. "Translate this for a B2B SaaS aimed at technical founders. Tone: casual and direct. Preserve product names: VibeCom. Do not translate UI element labels." gives you something you can ship.
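One way to keep that context consistent across calls is to build the prompt from a typed config rather than hand-writing it per language. A sketch, where the field names and prompt wording are my own assumptions, not a prescribed format:

```typescript
// Context that travels with every translation call for a project.
interface TranslationContext {
  targetLanguage: string;
  audience: string;
  tone: string;
  preserveTerms: string[]; // product names, brand terms, etc.
}

// Assemble a prompt that carries tone/audience/terminology constraints
// alongside the text itself.
function buildTranslationPrompt(text: string, ctx: TranslationContext): string {
  return [
    `Translate the following into ${ctx.targetLanguage} for ${ctx.audience}.`,
    `Tone: ${ctx.tone}.`,
    `Preserve these terms exactly, untranslated: ${ctx.preserveTerms.join(", ")}.`,
    `Do not translate UI element labels.`,
    ``,
    text,
  ].join("\n");
}

const prompt = buildTranslationPrompt("VibeCom helps you ship faster.", {
  targetLanguage: "French",
  audience: "technical founders at B2B SaaS companies",
  tone: "casual and direct",
  preserveTerms: ["VibeCom"],
});
```

The payoff is that when you tweak the tone or add a protected term, every language pair picks it up on the next run.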
Step 5: Flag low-confidence translations for review
You can ask the model to return a confidence score alongside each translation. Set a threshold (e.g., flag anything below 0.80) and route those strings to a human review queue.
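The routing itself is a simple partition. A sketch, where the response shape (text plus a self-reported confidence number) is an assumption; you'd have to prompt the model to emit it, e.g. as JSON:

```typescript
// One translated string plus the model's self-reported confidence.
interface TranslationResult {
  key: string;
  text: string;
  confidence: number; // 0..1, as requested in the prompt
}

const REVIEW_THRESHOLD = 0.8;

// Split results into auto-approved strings and a human review queue.
function partitionByConfidence(results: TranslationResult[]) {
  const approved: TranslationResult[] = [];
  const needsReview: TranslationResult[] = [];
  for (const r of results) {
    (r.confidence >= REVIEW_THRESHOLD ? approved : needsReview).push(r);
  }
  return { approved, needsReview };
}

const { approved, needsReview } = partitionByConfidence([
  { key: "home.title", text: "Livrez plus vite", confidence: 0.95 },
  { key: "pricing.pun", text: "Tarification toute simple", confidence: 0.62 },
]);
```

Treat the threshold as a dial: start conservative for a new language, then lower it as spot checks confirm quality.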
Where I'd Still Use Human Review
Despite the strong benchmark numbers, there are three categories where I'd still route to a human:
Legal and compliance copy: error-rate tolerance is zero, and even 98% instruction compliance means roughly 1 in 50 strings drifts from spec.
Culturally-loaded marketing copy: slogans, taglines, and anything that relies on a cultural reference or pun. The model translates meaning; it doesn't always find the culturally equivalent punchline.
First-time translations into a new high-stakes language: before you trust the auto-review gate for a new language, run a sample batch and have a native speaker evaluate it. Set your quality baseline first, then scale.
The Practical Takeaway
If you're building a multilingual Next.js product and you've been deferring the i18n work because "AI translation isn't good enough," that objection is increasingly outdated.
For 80% of use cases (product UI, documentation, marketing copy in European languages, factual content in Chinese), Claude Sonnet 4.6 at a 0.885 COMET score is good enough to auto-approve. The cost is low enough that there's no reason to batch the translations; you can do it continuously as content changes.
The remaining 20% (tone-sensitive copy, cultural adaptation, high-stakes legal strings) still needs human eyes. But you've just taken 80% of the translation workload off your plate.
The i18n pipeline that used to require a localization agency or a team of contractors is now a well-prompted API call and a review queue. That's a genuine change in what a solo developer can ship.
The TokenMix benchmark data referenced in this post is from their April 2026 LLM translation comparison at tokenmix.ai/blog/best-llm-for-translation.
