If you've ever shipped a multilingual Next.js app, you know the part nobody talks about: translation quality at scale is a genuine engineering problem, not a content problem.
You can wire up next-intl or i18next in an afternoon. The hard part is what you put inside those JSON files, and whether the result actually reads like the language it's supposed to be.
I've been thinking about this a lot while building a multilingual blog. So when a detailed benchmark dropped comparing LLMs specifically on translation quality, I read it carefully.
Here's what I found.
The Benchmark That Actually Matters for Developers
TokenMix published a comparison of major LLMs on translation in April 2026, using the COMET metric, an industry-standard measure that correlates more closely with human judgment than BLEU scores do.
The headline number for Claude Sonnet 4.6: 0.885 COMET overall. For European languages specifically, that climbs to 0.898. For Chinese-English pairs, it sits at 0.878.
For context: anything above 0.85 on COMET is generally considered professional quality. Anything above 0.90 starts to approach the upper bound of what human reviewers reliably distinguish from a native-speaking professional translator.
The benchmark also measures instruction-following, and that's the number that matters most for production use. When you're translating marketing copy and your prompt says "keep the tone casual, preserve the product name, never translate UI labels," you need the model to actually follow that. GPT-class models hover around 87-91% on instruction compliance in this benchmark; Claude Sonnet 4.6 scores 98%, which means far fewer post-processing exceptions.
What This Means for Your i18n Pipeline
Most Next.js i18n pipelines I've seen look something like this:
1. Write content in English (or your primary language)
2. Run it through an AI translation step
3. Human review for high-stakes strings (pricing, legal, error messages)
4. Auto-approve the rest
5. Ship
The problem is step 4. "Auto-approve the rest" is where tone collapses. A product that sounds warm and conversational in English can read like a formal document in French, or like a machine-translated support ticket in Chinese.
This is where COMET scores start to matter practically. A higher COMET score correlates with better preservation of pragmatic meaning: the feel of a sentence, not just its semantic content.
For European language pairs, Claude Sonnet 4.6's 0.898 score suggests you can expand step 4's scope significantly: most UI copy, blog content, and marketing strings should pass auto-review without obvious degradation.
For Chinese-English pairs, the 0.878 score is still solid for factual content and product UI. Where you'll still want human review: idioms, culturally-specific humor, and anything that relies on connotation rather than denotation. The model handles the structure well; it sometimes miscalibrates the register.
The Cost Equation
The benchmark puts Claude Sonnet 4.6 at approximately $23.40 per million words translated. That number sounds abstract until you do the math on a real content site.
A medium-sized SaaS with 50 pages of marketing copy averages around 75,000 words in English. Supporting 10 languages means translating roughly 750,000 words. At this pricing, that's about $17.55 total for all 10 languages.
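That arithmetic is simple enough to fold into a budgeting helper. A minimal sketch, using the per-million-word rate from the benchmark cited in this post (the function name and rounding behavior are my own):

```typescript
// Back-of-the-envelope AI translation cost, assuming a flat
// per-million-word rate (the benchmark's $23.40/M-word figure).
const RATE_PER_MILLION_WORDS = 23.4;

function translationCostUSD(wordsPerLanguage: number, languageCount: number): number {
  const totalWords = wordsPerLanguage * languageCount;
  // Round to cents for display purposes.
  return Math.round((totalWords / 1_000_000) * RATE_PER_MILLION_WORDS * 100) / 100;
}

console.log(translationCostUSD(75_000, 10)); // ≈ 17.55 for all 10 languages
```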
The human translator alternative would typically run $8,000 to $25,000+ for those 10 languages, depending on the language pairs and quality tier.
For initial translations: AI is clearly cost-justified. For ongoing content (blog posts, changelogs, in-app copy updates): there's essentially no reason not to automate the first pass.
How to Wire This Into Your Next.js App
If you're using next-intl, here's the basic pattern for an automated translation pipeline:
Step 1: Maintain a canonical English file
Keep en.json as your source of truth. Every other locale file is derived from it.
Step 2: Detect what's missing or stale
Write a script that diffs en.json against your other locale files and outputs a list of keys that need translation or review.
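A minimal sketch of that diff step, assuming flat JSON message files (nested namespaces would need a recursive walk); the function and key names are illustrative, not from any specific library:

```typescript
// Flat locale message files: key -> translated string.
type Messages = Record<string, string>;

// Return the keys present in the source locale but missing from the
// target locale. (Staleness detection would additionally compare a
// stored hash of the source string against its current value.)
function diffLocales(source: Messages, target: Messages): string[] {
  return Object.keys(source).filter((key) => !(key in target));
}

const en: Messages = {
  "home.title": "Ship faster",
  "home.cta": "Start free trial",
  "pricing.headline": "Simple pricing",
};

const fr: Messages = {
  "home.title": "Livrez plus vite",
};

console.log(diffLocales(en, fr)); // keys still needing French translation
```

In a real pipeline you'd read `en.json` and each locale file from disk, run this diff, and feed only the missing keys into the translation step.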
Step 3: Run targeted translation calls
Instead of translating the entire file every time, translate only the diff. This cuts costs and avoids overwriting translations that have already been reviewed.
Step 4: Include context in your prompt
This is where most pipelines fail. "Translate this" gives you acceptable results. "Translate this for a B2B SaaS aimed at technical founders. Tone: casual and direct. Preserve product names: VibeCom. Do not translate UI element labels." gives you something you can ship.
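One way to keep that context consistent across calls is to build the prompt from a typed config rather than hand-writing it per language. A sketch, where the field names and prompt wording are my own assumptions, not a prescribed format:

```typescript
// Context that travels with every translation call for a project.
interface TranslationContext {
  targetLanguage: string;
  audience: string;
  tone: string;
  preserveTerms: string[]; // product names, brand terms, etc.
}

// Assemble a prompt that carries tone/audience/terminology constraints
// alongside the text itself.
function buildTranslationPrompt(text: string, ctx: TranslationContext): string {
  return [
    `Translate the following into ${ctx.targetLanguage} for ${ctx.audience}.`,
    `Tone: ${ctx.tone}.`,
    `Preserve these terms exactly, untranslated: ${ctx.preserveTerms.join(", ")}.`,
    `Do not translate UI element labels.`,
    ``,
    text,
  ].join("\n");
}

const prompt = buildTranslationPrompt("VibeCom helps you ship faster.", {
  targetLanguage: "French",
  audience: "technical founders at B2B SaaS companies",
  tone: "casual and direct",
  preserveTerms: ["VibeCom"],
});
```

The payoff is that when you tweak the tone or add a protected term, every language pair picks it up on the next run.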
Step 5: Flag low-confidence translations for review
You can ask the model to return a confidence score alongside each translation. Set a threshold (e.g., flag anything below 0.80) and route those strings to a human review queue.
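The routing itself is a simple partition. A sketch, where the response shape (text plus a self-reported confidence number) is an assumption; you'd have to prompt the model to emit it, e.g. as JSON:

```typescript
// One translated string plus the model's self-reported confidence.
interface TranslationResult {
  key: string;
  text: string;
  confidence: number; // 0..1, as requested in the prompt
}

const REVIEW_THRESHOLD = 0.8;

// Split results into auto-approved strings and a human review queue.
function partitionByConfidence(results: TranslationResult[]) {
  const approved: TranslationResult[] = [];
  const needsReview: TranslationResult[] = [];
  for (const r of results) {
    (r.confidence >= REVIEW_THRESHOLD ? approved : needsReview).push(r);
  }
  return { approved, needsReview };
}

const { approved, needsReview } = partitionByConfidence([
  { key: "home.title", text: "Livrez plus vite", confidence: 0.95 },
  { key: "pricing.pun", text: "Tarification toute simple", confidence: 0.62 },
]);
```

Treat the threshold as a dial: start conservative for a new language, then lower it as spot checks confirm quality.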
Where I'd Still Use Human Review
Despite the strong benchmark numbers, there are three categories where I'd still route to a human:
Legal and compliance copy: error-rate tolerance is zero, and even 98% instruction compliance means roughly 1 in 50 strings drifts from spec.
Culturally-loaded marketing copy: slogans, taglines, and anything that relies on a cultural reference or pun. The model translates meaning; it doesn't always find the culturally equivalent punchline.
First-time translations into a new high-stakes language: before you trust the auto-review gate for a new language, run a sample batch and have a native speaker evaluate it. Set your quality baseline first, then scale.
The Practical Takeaway
If you're building a multilingual Next.js product and you've been deferring the i18n work because "AI translation isn't good enough," that objection is increasingly outdated.
For 80% of use cases (product UI, documentation, marketing copy in European languages, factual content in Chinese), Claude Sonnet 4.6 at a 0.885 COMET score is good enough to auto-approve. The cost is low enough that there's no reason to batch the translations; you can do it continuously as content changes.
The remaining 20% (tone-sensitive copy, cultural adaptation, high-stakes legal strings) still needs human eyes. But you've just taken 80% of the translation workload off your plate.
The i18n pipeline that used to require a localization agency or a team of contractors is now a well-prompted API call and a review queue. That's a genuine change in what a solo developer can ship.
The TokenMix benchmark data referenced in this post is from their April 2026 LLM translation comparison at tokenmix.ai/blog/best-llm-for-translation.
