English-Accented Large Language Models
When OpenAI introduced GPT-1 in 2018, it marked the start of an AI arms race, and the GPT series quickly grew from a research curiosity into a technology much of the public was eager to experiment with. While these models initially enhanced productivity by outperforming humans at repetitive tasks, over time they also became a source of dependency. In the years that followed, other major players such as Google, Meta, and DeepSeek built on openly published research and released their own optimized models. Given that most of these organizations are headquartered in the United States, their training data have been overwhelmingly in English. This article offers a sociolinguistic reflection on the implications of this Anglo-dominance in the development of large language models (LLMs), drawing on insights from Guo et al.'s recent paper.
According to Guo et al., although most LLMs claim to be multilingual, they often respond with an “English accent” due to a “reliance on English as an internal pivot language” (2). The authors analyzed a number of LLMs, such as Qwen, Mistral, and LLaMA, developed by different organizations and presumably trained on varying data sources. They evaluated the lexical and syntactic naturalness of each model's output and compared it against human baselines.
[Table from Guo et al. (5): lexical divergence between model-generated and human text in English, Chinese, and French; higher values indicate greater divergence. Table not reproduced here.]
The table above reports the lexical divergence between model-generated and human language across three languages: English, Chinese, and French (5). Higher values indicate greater divergence and, therefore, lower naturalness. Unsurprisingly, every model's output diverged measurably from the human baselines. What stands out is that the divergence scores were consistently highest for Chinese, followed by French, and lowest for English. This trend can be explained not only by the dominance of English in training corpora but also by linguistic proximity: English and French share far greater structural and lexical affinities than English and Chinese do.
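To make the notion of lexical divergence concrete, here is a minimal sketch of one way such a score could be computed. The article does not specify Guo et al.'s exact metric, so this stand-in compares the word-unigram distributions of human and model text using Jensen-Shannon divergence; the toy corpora, whitespace tokenization, and metric choice are all illustrative assumptions, not the paper's method.

```python
# Illustrative lexical-divergence score: Jensen-Shannon divergence between
# the word-unigram distributions of two corpora. 0.0 means identical
# lexical distributions; higher values mean the model text is lexically
# less like the human text.
import math
from collections import Counter

def unigram_dist(texts):
    """Relative frequency of each whitespace token across a list of texts."""
    counts = Counter(tok for text in texts for tok in text.lower().split())
    total = sum(counts.values())
    return {tok: c / total for tok, c in counts.items()}

def js_divergence(p, q):
    """Jensen-Shannon divergence (base 2, in bits) between two distributions."""
    vocab = set(p) | set(q)
    m = {t: 0.5 * (p.get(t, 0.0) + q.get(t, 0.0)) for t in vocab}
    def kl(a):
        # Kullback-Leibler divergence from a to the mixture m.
        return sum(a[t] * math.log2(a[t] / m[t]) for t in a if a[t] > 0)
    return 0.5 * kl(p) + 0.5 * kl(q)

# Hypothetical usage: tiny human-written vs. model-generated French samples.
human_fr = ["les enfants jouent dans le jardin", "il fait beau aujourd'hui"]
model_fr = ["les enfants sont en train de jouer dans le jardin",
            "le temps est beau aujourd'hui"]
score = js_divergence(unigram_dist(human_fr), unigram_dist(model_fr))
print(f"lexical divergence (JSD, bits): {score:.3f}")
```

In practice such a score would be computed over large corpora and, as in the paper's comparison, against a human-vs-human baseline, so that any residual divergence between human samples sets the floor models are measured against.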
Interestingly, Qwen, a model developed in China, performed the worst at generating natural Chinese, even trailing behind LLaMA (6). The authors suggest this may be due to Qwen's reliance on synthetic data during training, whereas LLaMA largely avoided such data. Although synthetic data can be useful in low-resource settings, it often fails to capture the nuance of natural language, limiting a model's ability to improve on measures of naturalness. As the authors note, it remains difficult to avoid synthetic data altogether, since “non-translated, … Supervised Fine-Tuning datasets for non-English languages are almost non-existent” (7).
In an effort to mitigate these accented outputs, the authors experimented with modifying the training procedure. Given the scarcity of high-quality fine-tuning data in non-English languages, they used paraphrasing and back-translation to generate training samples, which the models were then explicitly trained to reject. This rejection-based method led to models that produced output with significantly lower lexical and syntactic divergence.
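One plausible reading of this rejection-based setup is a preference dataset in which natural, non-translated text is the preferred example and its back-translated counterpart, which tends to carry the English accent, is the rejected one. The sketch below shows how such pairs might be assembled; the chosen/rejected field names follow the convention common in preference-tuning libraries, and the translation functions are hypothetical stand-ins rather than anything from Guo et al.'s code.

```python
# Sketch: building preference pairs where natural target-language text is
# "chosen" and its English round-trip (back-translation) is "rejected".
# translate_to_english / translate_from_english are hypothetical placeholders
# for whatever machine-translation system is available.

def back_translate(text, translate_to_english, translate_from_english):
    """Round-trip a sentence through English to induce an 'English accent'."""
    return translate_from_english(translate_to_english(text))

def build_preference_pairs(prompts, human_responses, to_en, from_en):
    """Pair each natural response (chosen) with its back-translation (rejected)."""
    pairs = []
    for prompt, response in zip(prompts, human_responses):
        pairs.append({
            "prompt": prompt,
            "chosen": response,  # natural, non-translated target-language text
            "rejected": back_translate(response, to_en, from_en),  # accented variant
        })
    return pairs

# Hypothetical usage; identity functions stand in for real MT systems,
# which would return a round-tripped (and usually accented) variant.
pairs = build_preference_pairs(
    ["Décris ta journée."],
    ["J'ai flâné au marché ce matin."],
    to_en=lambda s: s,
    from_en=lambda s: s,
)
print(pairs[0]["chosen"], "|", pairs[0]["rejected"])
```

A dataset of such pairs could then drive any preference-based fine-tuning scheme, steering the model away from translationese without requiring scarce human-written supervision in the target language.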
Yet, while such algorithmic strategies mark important technical progress, they do not resolve the structural biases at the heart of LLM development. The persistent imbalance in linguistic representation, in which English remains the implicit default and other languages are treated as peripheral, raises deeper questions about whose voices are being prioritized and universalized through AI. Even the most sophisticated training adjustments cannot fully compensate for the underrepresentation of certain languages in the training data. As LLMs continue to be deployed across global contexts, it becomes imperative to move beyond mere technical fixes and instead to foreground multilingual equity as a central design principle.