DayFR Euro

ChatGPT and OpenAI think in Chinese and no one understands why

An intriguing phenomenon, but not isolated

Several users reported that O1, in the middle of a reasoning process, appeared to structure parts of its responses in Chinese, Persian, or Hindi before providing its final answer in English. On X (formerly Twitter), one user even asked: “Why did O1 suddenly start thinking in Chinese?”

This phenomenon is not isolated: some researchers and AI professionals see it as a consequence of the biases introduced by the data and methods of training language models.

Hypotheses around linguistic emergence

The influence of training data and labeling

According to Clément Delangue, CEO of Hugging Face, and Ted Xiao, researcher at Google DeepMind, the origin of this behavior could be attributed to the data used to train the model. OpenAI, like other AI labs, outsources some of the data labeling to companies in China or other countries where specialized labor is plentiful and affordable. These labels – annotations that help the model understand and categorize information – could therefore carry inherent linguistic biases.

Ted Xiao also points out that certain complex data, particularly in mathematics or science, often come from sources written or translated into Chinese. This could push the model to favor this language for reasoning in these areas.

Linguistic efficiency and structuring of ideas

Other experts put forward a different theory. According to AI researcher Matthew Guzdial, O1 does not distinguish between languages the way a human would. It processes information as tokens (units of text) and may “choose” a language that optimizes its reasoning process. Indeed, certain languages, such as Chinese, can convey complex ideas in fewer characters, which the model may treat as more efficient.
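The density gap this theory relies on is easy to see. The short sketch below compares character counts for a few English phrases and their Chinese equivalents; this is only a rough proxy, since models actually count tokens rather than characters, and the example phrases are our own illustrations, not drawn from O1's training data.

```python
# Rough illustration of linguistic density: Chinese often encodes the
# same idea in far fewer characters than English. (Language models count
# tokens, not characters, but character count gives a first-order sense
# of the gap Guzdial's theory points to.)
pairs = [
    ("All humans are mortal", "人皆有死"),
    ("artificial intelligence", "人工智能"),
]

for english, chinese in pairs:
    ratio = len(english) / len(chinese)
    print(f"{english!r}: {len(english)} chars | "
          f"{chinese!r}: {len(chinese)} chars (~{ratio:.1f}x denser)")
```

Running this shows each Chinese phrase using roughly a fifth of the characters of its English counterpart, which hints at why a token-optimizing model might drift toward denser scripts for intermediate reasoning.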

A problem of transparency

Luca Soldaini, researcher at the Allen Institute for AI, laments that the inner workings of large models remain opaque, making an in-depth analysis of the causes of this phenomenon impossible. He emphasizes: “This is yet another example of why transparency is crucial in the design of AI systems.”

A turbulent context for OpenAI

OpenAI, a leader in the field of artificial intelligence, faces major challenges in 2024 and 2025. While its innovations like ChatGPT, DALL-E and now O1 continue to capture global attention, obstacles are piling up:

Growing financial pressure

With estimated revenue of $3.7 billion in 2024 but a projected loss of $5 billion, OpenAI is under unprecedented financial pressure. The company is targeting an ambitious $11.6 billion in revenue for 2025, but this push toward profitability forced the organization to move toward for-profit status, a move that drew criticism.

Criticisms and controversies

Copyright infringement accusations: Eight major US newspapers sued OpenAI for using copyrighted content to train its models without permission.
Strategic departures: Chief Technology Officer Mira Murati has left her role, marking a significant loss for the company.
Reliability issues: Models like ChatGPT continue to generate biased or incorrect responses, damaging the company's reputation.

What does the O1 case reveal?

The linguistic behavior of O1 highlights broader issues around the design and transparency of artificial intelligence models:

Bias management: Whether through labeling or training data, cultural and linguistic biases influence the models' reasoning.
The need for transparency: Users and researchers demand clear information about models' internal processes to better understand their decisions and behavior.
The limits of model universality: Although they are presented as capable of handling multiple languages and contexts, large models like O1 show that their results depend heavily on their training data.

Conclusion: an opportunity for development

The linguistic phenomenon observed with O1 is not just a technical anomaly. It raises fundamental questions about the ethics, design and impact of artificial intelligence models. For OpenAI, this is an opportunity to strengthen its communication and improve the design of its tools.

How the company handles these challenges, including promoting greater transparency and correcting bias, could well determine its future in an increasingly competitive industry. But above all, it will have to answer a crucial question: how to guarantee that artificial intelligence remains at the service of users, without compromising on integrity and reliability?
