Strokes of genius: why DeepSeek’s AI edge may come from its Chinese lessons


Rich language training data and a colourful cast of characters help power AI into the ‘era of Chinese’, experts say. — SCMP

As China’s home-grown AI development firm DeepSeek shakes up the global tech and investment landscape, domestic discussion has begun to focus on what has given its low-cost language model a surprise edge over global competitors such as ChatGPT.

The artificial intelligence startup has earned praise for its strong performance, affordability and open-source architecture, but there is a growing sense in online communities that much of its success is due to its incorporation of Chinese characters during its pre-training phase.

The assumption is that the higher information density of Chinese training data improved DeepSeek’s logical abilities, allowing it to handle complex concepts more effectively. Proponents of this theory argue that training on Chinese allowed DeepSeek to sharpen its language comprehension. Because Chinese characters are ideograms, they can convey meaning even when written incorrectly, so readers can still understand the text.
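As a toy illustration of the density argument, the snippet below compares how many characters the same sentences need in English and Chinese. The sentence pairs are illustrative examples only, not drawn from DeepSeek’s undisclosed training data:

```python
# Toy comparison of surface length for parallel English/Chinese
# sentence pairs (illustrative examples only).
pairs = [
    ("Artificial intelligence is changing the world.", "人工智能正在改变世界。"),
    ("Knowledge is power.", "知识就是力量。"),
]

for en, zh in pairs:
    # The Chinese rendering typically needs far fewer characters;
    # whether that advantage survives subword tokenisation is a
    # separate, empirical question.
    print(f"EN {len(en):2d} chars | ZH {len(zh):2d} chars")
```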

“Chinese characters achieve maximum information transmission at minimal cost. As an efficient encoding of information, Chinese has greatly improved efficiency and reduced costs in artificial intelligence processing,” said Xiang Ligang, a telecommunications industry analyst and public opinion leader, on his social media account on Monday.

“AI is entering the era of Chinese.”

Others argue that Chinese characters are closely linked with multifaceted information such as images and audio. Traditional Chinese poetry is often paired with paintings or music, which, they say, provided DeepSeek with rich multimodal learning material.

In a report from DeepTech, a technology media portal, Yale University assistant professor Yang Zhuoran stressed the importance of data quality in training large models. Not only does data quality impact a model’s ability to acquire and express knowledge, but it also affects the style and accuracy of the generated content, he said.


DeepSeek’s training data sources remain undisclosed, but some suggest that the model’s Chinese training sources include classical literature, internet slang, academic papers, government documents, and regional dialects.

The speculation recalls concerns raised when ChatGPT first gained popularity. Critics feared that Chinese internet censorship would lead to a scarcity of Chinese-language data, which could in turn hold back China’s AI sector.

Some now argue, however, that the abstract nature of internet language – shaped by China’s keyword censorship – may have played a beneficial role in the model’s training data.

Chinese internet users often use homophones or indirect expressions to bypass censorship, adding layers of linguistic complexity. A single character can carry multiple meanings, which initially makes such text challenging for AI. But according to one user’s comment, with more training the model learns to understand and generate these cryptic expressions, improving its capabilities.

DeepSeek’s ability to handle Chinese seems to have impressed many. People have used it to write in classical Chinese, generate couplets, translate dialects, and even draft official documents, with several users commending it for surpassing the abilities of previous AI models.

The academic community tends to hold that using the Chinese language and Chinese sources for training is nothing new, and that DeepSeek’s training approach should therefore not be considered entirely original. They believe the more decisive factors are high-quality training data, training strategies and extensive iterative optimisation.

Chinese tech blog Shi Yu Xing Kong points out that in the field of artificial intelligence there is no inherent language barrier in understanding human knowledge. In other words, regardless of whether it is Chinese or English, AI learns the same knowledge.

One notable example is that users interacting with DeepSeek’s AI in English may occasionally see Chinese characters appear in the conversation. The phenomenon has been observed both in DeepSeek-R1 and in the latest version of OpenAI’s o3-mini.

According to the DeepSeek-R1 technical report, the training process consisted of two stages. In the first stage, the research team collected a large amount of chain-of-thought (CoT) data. This “cold start” data was used to fine-tune the DeepSeek-V3 base model to ensure that it had a certain level of reasoning ability before entering the reinforcement learning (RL) stage.
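A minimal sketch of what that cold-start fine-tuning step could look like, using the Hugging Face transformers library, is shown below. The <think> reasoning template follows the R1 report, but the model identifier, the toy example and the hyperparameters are placeholders rather than DeepSeek’s actual setup:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Placeholder model id; the real run fine-tuned DeepSeek-V3-Base on a
# much larger curated chain-of-thought corpus.
model_id = "deepseek-ai/DeepSeek-V3-Base"
tok = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id)

# One toy cold-start example, wrapped in the <think> reasoning template
# described in the R1 report.
text = "What is 2 + 2?\n<think>2 plus 2 equals 4.</think>\nThe answer is 4."
batch = tok(text, return_tensors="pt")

# Standard causal-LM fine-tuning step: labels are the inputs themselves,
# so the model learns to reproduce the reasoning trace token by token.
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-5)
loss = model(**batch, labels=batch["input_ids"]).loss
loss.backward()
optimizer.step()
```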

The second phase, RL, involved the researchers designing rewards for accuracy and formatting. These reward signals, which provided feedback on each generated response, guided the model’s optimisation and helped it adjust its generation strategy over time.
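The report describes these as simple rule-based rewards rather than a learned reward model. Below is a minimal sketch of what such rules might look like: the <think> tag check mirrors the format reward described in the report, while the answer extraction is a simplifying assumption, since real pipelines use task-specific verifiers for maths or code:

```python
import re

def format_reward(response: str) -> float:
    # 1.0 if the reasoning is wrapped in <think>...</think> tags before
    # the final answer, as the R1 format reward requires; else 0.0.
    return 1.0 if re.match(r"(?s)\s*<think>.+?</think>", response) else 0.0

def accuracy_reward(response: str, reference: str) -> float:
    # Naively treat everything after the closing tag as the final answer
    # and compare it with the reference; an illustrative stand-in for a
    # proper task-specific verifier.
    answer = response.split("</think>")[-1].strip()
    return 1.0 if answer == reference.strip() else 0.0

def reward(response: str, reference: str) -> float:
    # Combined rule-based signal fed back for each sampled response.
    return accuracy_reward(response, reference) + format_reward(response)

print(reward("<think>2 plus 2 equals 4.</think>4", "4"))  # 2.0
```

– South China Morning Post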

