As China’s home-grown AI development firm DeepSeek shakes up the global tech and investment landscape, domestic discussion has begun to focus on what has given the lower-cost language model its surprise edge over global competitors like ChatGPT.
The artificial intelligence start-up has earned praise for its strong performance, affordability and open-source architecture, but there is a growing sense in online communities that much of its success is due to its incorporation of Chinese characters during its pre-training phase.
The assumption is that the higher information density of Chinese training data improved DeepSeek’s logical abilities, allowing it to handle complex concepts more effectively. Proponents of this theory argue that training on Chinese allowed DeepSeek to sharpen its language comprehension. Chinese characters, they say, are ideograms that convey meaning even when written incorrectly, so readers can still understand the text.
“Chinese characters achieve maximum information transmission with minimal cost. As an efficient information encoding, Chinese has greatly improved efficiency and reduced costs in the processing of artificial intelligence,” said Xiang Ligang, a telecommunications industry analyst and public opinion leader, on his social media account on Monday.
“AI is entering the era of Chinese.”
Others argue that Chinese characters are closely linked with multifaceted information such as images and audio. Traditional Chinese poetry is often paired with paintings or music, which, they say, provided DeepSeek with rich multimodal learning material.
In a report from DeepTech, a technology media portal, Yale University assistant professor Yang Zhuoran stressed the importance of data quality in training large models. Not only does data quality impact a model’s ability to acquire and express knowledge, but it also affects the style and accuracy of the generated content, he said.
DeepSeek’s training data sources remain undisclosed, but some suggest that the model’s Chinese training sources include classical literature, internet slang, academic papers, government documents, and regional dialects.