Japanese-Chinese Translation with GenAI: What Works and What Doesn’t
How we developed a Gen-AI-powered translation browser extension for daily reading
The post Japanese-Chinese Translation with GenAI: What Works and What Doesn’t appeared first on Towards Data Science.
Mar 27, 2025 - 20:57
Authors
Alex (Qian) Wan: Alex (Qian) is a designer specializing in AI for B2B products. She is currently working at Microsoft, focusing on machine learning and Copilot for data analysis. Previously, she was the Gen AI design lead at VMware. Eli Ruoyong Hong: Eli is a design lead at Robert Bosch specializing in AI and immersive technology, developing systems that bridge technical innovation with human social dynamics to create more culturally aware and socially responsive technologies.
Background
Imagine you’re scrolling through social media and come across a post about a house makeover written in another language. Here’s a direct, word-for-word translation:
Finally, cleaned up this house completely and adjusted the design plan. Next, just waiting for the construction team to come in. Looking forward to the final result! Hope everything goes smoothly!
Illustration by Qian (Alex) Wan.
If you were the English translator, how would you translate this? Gen AI responded with:
I finally finished cleaning up this house and have adjusted the design plan. Now, I’m just waiting for the construction team to come in. I’m really looking forward to the final result and hope everything goes smoothly!
The translation seems clear and grammatically perfect. However, what if I told you this is a social post from a person who is notorious for exaggerating their wealth? They don’t own the house; they simply left out the subject to make it seem like they do. Gen AI mistakenly added “I” without acknowledging the ambiguity. A better translation would be:
The house has finally been cleaned up, and the design plan has been adjusted. Now, just waiting for the construction team to come in. Looking forward to seeing the final result—hope everything goes smoothly!
Languages in which “unstated” context plays an important role in literature and daily life are called “high-context languages”.
Translating high-context languages such as Chinese and Japanese is uniquely challenging. Because these languages omit pronouns and use metaphors closely tied to history and culture, translators depend heavily on context and are expected to have deep knowledge of culture, history, and even regional differences to ensure accuracy.
This has been a long-standing issue in traditional translation tools such as Google Translate and DeepL. Fortunately, in the era of Gen AI, translation has improved significantly thanks to context awareness, and Gen AI can generate much more human-like content. Motivated by this technological advancement, we decided to develop a Gen-AI-powered translation browser extension for daily reading.
Our extension calls Gen AI models through their APIs. One of the challenges we encountered was choosing the AI model. Given the diverse options on the market, this became a multi-month battle. We realized there might be many people like us, not especially technical, on a budget, but interested in using Gen AI to bridge the language gap, so we tested 10 models in the hope of bringing useful insights to this audience.
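As a concrete illustration, below is a minimal sketch of how a single translation request might be assembled for an OpenAI-compatible chat-completions endpoint. The model name, prompt wording, and temperature are illustrative assumptions, not a documented setup.

```python
# Sketch: assembling one translation request payload. Model name, system
# prompt, and temperature are illustrative assumptions, not our exact setup.

def build_translation_request(text: str, source_lang: str, target_lang: str,
                              model: str = "gemini-1.5-flash") -> dict:
    """Assemble the JSON payload sent for one paragraph."""
    system_prompt = (
        f"You are a professional {source_lang}-to-{target_lang} translator. "
        f"Translate the user's text into {target_lang}. "
        "Preserve omitted subjects instead of inventing pronouns, "
        "and keep proper nouns consistent across the document."
    )
    return {
        "model": model,
        "messages": [
            {"role": "system", "content": system_prompt},
            {"role": "user", "content": text},
        ],
        "temperature": 0.3,  # lower temperature for more stable terminology
    }

payload = build_translation_request("人並みにはできますよ!", "Japanese",
                                    "Simplified Chinese")
```

The system-prompt instructions about omitted subjects and consistent proper nouns anticipate the quality issues discussed later in this article.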
This article documents our journey of testing different models for Chinese-Japanese translation, evaluating the results against specific criteria, and sharing practical tips and tricks to resolve issues and increase translation quality.
Who might be interested in this article?
Anyone who works with, or is interested in, multilingual generative AI for use cases like ours: maybe you are a team member at an AI-model company looking for potential improvements. This article will help you understand the key factors that uniquely and significantly impact the accuracy of Chinese and Japanese translations.
It may also inspire you if you’re developing a Gen AI agent dedicated to language translation. If you’re looking for a high-quality Gen AI model for your daily reading translation, this article will guide you in selecting AI models based on your needs. You’ll also find tips and tricks for writing better prompts that can significantly improve translation output quality.
Heads up
This article is primarily based on our own experience. We focused on certain Gen AI models as of Feb 2, 2025 (when Gemini 2.0 and DeepSeek were released), so some of our observations may differ from current performance as AI models keep evolving.
We are not experts, and we tried our best to present accurate information based on research and real testing. The work we did is purely for fun, self-learning, and sharing, but we hope it sparks discussion about Gen AI’s cultural awareness.
Many examples in this article are generated with the help of Gen AI to avoid copyright concerns.
Our initial Gen AI model selection
Our initial consideration was straightforward. Since our translation needs involve Chinese, Japanese, and English, performance across these three languages was the priority. However, very few companies detail this ability specifically in their documentation. The only one we found was Gemini, which specifies its multilingual performance:
Capability: Multilingual
Benchmark: Global MMLU (Lite)
Description: MMLU translated by human translators into 15 languages. The lite version includes 200 Culturally Sensitive and 200 Culturally Agnostic samples per language.
Second, but equally important, is the price. We were cautious about our budget and tried not to go bankrupt under a usage-based pricing model, so Gemini 1.5 Flash became our primary choice at the time. Other reasons we proceeded with this model: it is the most beginner-friendly option thanks to its well-documented instructions, and it has a user-friendly testing environment, Google AI Studio, which further reduces friction when deploying and scaling our project.
Backup models
Although Gemini 1.5 Flash set a strong foundation, during our first dry run we found it has some limitations. To ensure a smooth translation and reading experience, we evaluated a few other models as backups:
Grok-beta (xAI): In late 2024, Grok was not as famous as OpenAI’s models, but what attracted us was its lack of content filters (over-filtering is one of the issues we observed in AI models during translation, discussed later). Grok offered $20 of free credits per month before 2025, which made it an attractive, budget-friendly option for frugal users like us.
Deepseek-V3: We integrated Deepseek right after its entry into the market because it has richer Chinese training data than the alternatives (the team collaborated with staff from Peking University on data labeling). Another reason is its jaw-droppingly low price: with the discount, it was nearly 1/100 the price of Grok-beta. However, its high response time was a big issue.
OpenAI GPT-4o: It has good documentation and strong performance, but we didn’t seriously consider it as an option because there is no free tier, which didn’t fit our low-budget constraints. We used it as a reference rather than relying on it, and integrated it later purely for testing purposes.
We also explored a hybrid solution – providers that offer multiple models:
Groq w/ Deepseek: Groq was one of the first integrated model platforms to deploy Deepseek. The hosted version is distilled from Meta’s Llama; its 70B size makes it less powerful, but latency is acceptable. Groq offered a free tier, though with noticeable tokens-per-minute (TPM) constraints.
Siliconflow: A platform with many Chinese model choices, and they offered free credits.
Translation quality issues
When using those models for daily translation (mostly between Simplified Chinese, Japanese, and English), we found many noticeable issues.
1. Inconsistent translation of proper nouns/terminology
When a word or phrase has no official translation (or has several competing official translations), AI models tend to produce inconsistent renderings within the same document.
For example, the Japanese name “Asuka” has multiple potential translations in Chinese. Human translators usually choose one based on the character’s setting (in some cases, there is a Japanese kanji reference, and the translator can simply use the Chinese version). For example, a female character could be translated as “明日香”, and a male character might be translated as “飞鸟” (more meaning-based) or “阿斯卡” (more phonetics-based). However, AI output sometimes switches between different versions within the same text.
There are also many different official translations for the same noun across Chinese-speaking regions. One example is the spell “Expecto Patronum” in Harry Potter, which has two accepted translations: “呼神护卫” in the mainland Simplified Chinese edition and “疾疾，护法现身” in the Taiwan Traditional Chinese edition.
Although we prompt the AI explicitly to translate into Simplified Chinese, it sometimes goes back and forth between the Simplified and Traditional versions.
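One workaround we can sketch (a hypothetical post-processing step, not a feature of any model’s API) is a glossary pass that rewrites alternate renderings of a proper noun to one preferred form:

```python
# Hypothetical post-processing pass: force one preferred rendering per
# proper noun. The glossary reuses the "Asuka" example from above; entries
# would normally be curated per document.

def enforce_glossary(translated: str, glossary: dict[str, list[str]]) -> str:
    """Replace alternate renderings of each term with the preferred one."""
    for preferred, variants in glossary.items():
        for variant in variants:
            translated = translated.replace(variant, preferred)
    return translated

# Preferred rendering -> renderings the model may emit instead
GLOSSARY = {"明日香": ["飞鸟", "阿斯卡"]}
```

Note that naive substring replacement can over-match when a variant is part of a longer word, so a real implementation would need context or boundary checks.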
2. Overuse of pronouns
One thing Gen AI often struggles with when translating from a higher-context language to a lower-context one is adding extra pronouns.
In Chinese and Japanese literature, there are a few ways to refer to a person. As in many other languages, third-person pronouns like she/her are commonly used. To avoid ambiguity or repetition, the two approaches below are also very common:
Use character names.
Descriptive phrases (“the girl”, “the teacher”).
This writing preference is why pronouns appear much less frequently in Japanese and Chinese. In Chinese literature, explicit pronouns account for only about 20-30% of person references, and in Japanese the number can be even lower.
What I also want to emphasize is this: there is nothing inherently right or wrong about how frequently, when, and where to add a pronoun (in fact, it’s common practice for translators). But it carries risks: it can make the translated sentence unnatural and out of line with readers’ habits, or worse, misinterpret the intended meaning and cause mistranslation.
Below is a Japanese-to-English translation:
Original Japanese sentence (pronoun omitted)
Jack sees the CEO entering the building. With confidence, excitement, and strong hope in heart, go to conference room.
AI-generated translation (w/ incorrect pronoun)
Jack sees the CEO entering the building. With confidence, excitement, and strong hope in his heart, he goes to the conference room.
In this case, the author intentionally omits the pronoun, leaving room for interpretation. However, because the AI tries to follow English grammar rules, its output conflicts with the author’s design.
Better translation that preserves the original intent
Jack sees the CEO entering the building. With confidence, excitement, and strong hope in heart, heads to the conference room.
3. Incorrect pronoun usage in AI translation
Added pronouns can also lead to a higher rate of incorrect pronouns caused by biased training data; often, these are gender-based errors. In the example above, the CEO is actually a woman, so the translation is incorrect. AI often defaults to male pronouns unless explicitly prompted otherwise:
Jack sees the CEO entering the building. With confidence, excitement, and strong hope in her heart, she goes to the conference room.
Another common issue is that AI overuses “I” in translations. This issue persists across almost all models, including GPT-4o, Gemini 1.5, Gemini 2.0, and Grok: Gen AI models default to first-person pronouns when the subject is unclear.
4. Mixing Kanji, Simplified Chinese, and Traditional Chinese
Another issue we encountered was AI models mixing Simplified Chinese, Traditional Chinese, and Kanji in the output. Because of historical and linguistic reasons, many modern Kanji characters are visually similar to Chinese but have regional or semantic differences.
Some mixed use is incorrect but might still be acceptable. For example:
These three characters look visually similar and share certain meanings, so the mix could be acceptable in casual scenarios, but not in formal or professional communication.
However, other cases can lead to serious translation issues. Below is an example:
If AI directly uses this word when converting Japanese to Chinese (in a modern context), the sentence “Jane received a letter from her distant family” could end up as “Jane received toilet paper from her distant family,” which is both incorrect and unintentionally funny.
Please note that browser-rendered text can also display incorrectly when characters are missing from the system font library.
5. Punctuation
Gen AI sometimes doesn’t do a great job of distinguishing punctuation conventions among Chinese, Japanese, and English. Below is one example of how different languages write dialogue (in the modern, common writing style):
This might seem minor but could impact professionalism.
6. False content filtering triggers
We also found that Gen AI content filters might be more sensitive to Japanese and Chinese (this happened when using Gemini 1.5 Flash), even when the content was completely harmless. For example:
人並みにはできますよ!
I can do it at an average level!
Roughly 2 out of 26 samples triggered false content filters, and the issue showed up randomly.
Evaluating Gen AI models
Completely out of curiosity and to better understand the Chinese/Japanese translation ability of different Gen AI models, we conducted structured testing on 10 models from 7 providers.
Testing setup
Task: Each AI model was used to translate an article written in Japanese into simplified Chinese through our translation extension. The Gen AI models were connected through API.
Sample: We selected a 30-paragraph, third-person article. Each paragraph is one sample, with lengths varying from 4 to 120 characters.
Processed result: Each model was tested three times, and we used the median result for analysis.
Evaluation metrics
We fully respect that the quality of translation is subjective, so we picked three metrics that are quantifiable and represent the challenges of high-context language translation.
Pronoun error rate
This metric represents the frequency of erroneous pronouns that appeared in the translated sample, which includes the following cases:
Gender pronoun incorrectness (e.g., using “he” instead of “she”).
Mistaken switches from a third-person pronoun to another perspective.
A paragraph was marked as affected (+1) if any incorrect pronoun was detected.
Non-Chinese return rate
Some models randomly output Kanji, Hiragana, or Katakana in their responses. We initially planned to count every sample containing any of those, but every paragraph contained at least one non-Chinese character, so we adjusted the evaluation to make it more meaningful:
If the returned translation contains Hiragana, Katakana, or Kanji that affect readability, it is counted as a translation error. For example, if the AI outputs 対 instead of 对, it is not flagged, since the two are visually similar and the meaning is unaffected.
Our translation extension has a built-in non-Chinese character detection function. If non-Chinese characters are detected, the system retranslates the text up to three times; if they remain, it displays an error message.
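The kana part of this detection can be sketched with Unicode ranges: Hiragana occupies U+3040–U+309F and Katakana U+30A0–U+30FF, so residual kana is easy to flag. (Kanji is harder, since it shares code points with Chinese hanzi and requires the readability judgment described above.) In the sketch below, `translate_fn` is a hypothetical stand-in for the actual model call:

```python
import re

# Kana ranges: Hiragana (U+3040-U+309F) and Katakana (U+30A0-U+30FF).
KANA = re.compile(r"[\u3040-\u309f\u30a0-\u30ff]")

def contains_kana(text: str) -> bool:
    """True if any Hiragana or Katakana character remains in the output."""
    return bool(KANA.search(text))

def translate_with_retry(text: str, translate_fn, max_attempts: int = 3):
    """Retranslate up to `max_attempts` times, as the extension does.

    `translate_fn` is a hypothetical stand-in for the model call; None
    signals that the caller should display an error message instead.
    """
    for _ in range(max_attempts):
        result = translate_fn(text)
        if not contains_kana(result):
            return result
    return None
```

A shared CJK-ideograph check could not work this way, which is why the evaluation above falls back to a readability judgment for kanji.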
Pronoun Addition Rate
If the translated sample contains any pronoun that doesn’t exist in the original paragraph, it will be flagged.
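This flagging could be partially automated for Japanese-to-Chinese samples along the following lines; the pronoun lists are small illustrative subsets, not the exhaustive lists a real checker would need:

```python
# Sketch: flag a translation (+1) when it contains target-language pronouns
# while the source paragraph contains none. Pronoun lists are illustrative
# subsets; substring matching is naive and a real checker would tokenize.

JA_PRONOUNS = ["私", "僕", "俺", "彼女", "彼", "あなた"]
ZH_PRONOUNS = ["我", "你", "他", "她", "它"]

def has_added_pronoun(source_ja: str, translated_zh: str) -> bool:
    if any(p in source_ja for p in JA_PRONOUNS):
        return False  # source already contains a pronoun; not an addition
    return any(p in translated_zh for p in ZH_PRONOUNS)
```

For example, a pronoun-free Japanese source whose Chinese translation contains 他 (“he”) would be flagged, while a source that already uses 彼 would not.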
Scoring formula
All three metrics were calculated using the following formula.
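Given the per-sample flagging described above, each metric reduces to the share of flagged samples out of all samples, i.e. rate = flagged / total × 100%. A one-line sketch of this assumed formula:

```python
def metric_rate(flagged: int, total: int) -> float:
    """Percentage of samples flagged for a given issue (our assumed formula)."""
    if total <= 0:
        raise ValueError("total must be positive")
    return flagged / total * 100
```

For example, with the 30-paragraph sample set, 3 flagged paragraphs would give a 10% rate for that metric.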