Scale vs. Quality: The Choice Facing China and the US
Tencent's generative AI, Yuanbao, recently caused a stir when it embedded profanity into a New Year's greeting image. The company chalked it up to "an error in processing a multi-turn conversation," but that explanation only scratches the surface.
When an AI swears at you, it isn't an accident. So why does it happen? Because the quality of the data a model learns from carries straight through to what it produces.
How do American and Chinese approaches to AI development differ? OpenAI poured enormous sums into vetting the sources and managing the quality of the data used to train GPT-4. Chinese companies, by contrast, opted to gather data through sheer brute force.
Let's look at the numbers. According to a 2023 report from Stanford HAI (the Institute for Human-Centered Artificial Intelligence), China's major AI firms collect, on average, ten times more data than their American counterparts, yet they spend only about a third as much on managing its quality.
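To see what those two ratios imply together, here is a rough back-of-the-envelope sketch. It assumes "ten times more" means a volume ratio of 10 and "a third as much" means a spending ratio of 1/3; the variable names are illustrative and not taken from the report.

```python
from fractions import Fraction

# Ratios as quoted in the paragraph above (Chinese firms relative to US firms).
data_volume_ratio = Fraction(10)        # assumed: 10x the data
quality_spend_ratio = Fraction(1, 3)    # assumed: 1/3 of the quality spending

# Quality spending per unit of data, relative to US firms.
per_unit_ratio = quality_spend_ratio / data_volume_ratio

print(f"Quality spending per unit of data: {per_unit_ratio} of the US level")
```

Under these assumptions, each unit of data would receive roughly one thirtieth of the quality-control spending it gets at an American firm.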
Baidu announced that its Ernie Bot (Wenxin Yiyan) was trained on 550 billion Chinese-language tokens. But a substantial share of that data is believed to have been scraped indiscriminately from social media platforms like Weibo and Douyin, the Chinese counterpart of TikTok.
The Trap of Mass Data Collection
So what's the real problem? China's internet environment poses a unique challenge for training AI. A 2024 report from the Cybersecurity Association of China (CSA) found that profanity and aggressive language appear 40 percent more frequently in Chinese online spaces than in other language communities.
The crux of the matter isn't the volume of data but the curation process. Tencent has access to 30 billion messages a day from its WeChat messenger alone. But that vast trove is shot through with profanity, hate speech, and misinformation.
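And how does OpenAI handle it? The company uses Common Crawl data, but only after running it through dozens of filtering stages. Notably, it relies on reinforcement learning from human feedback (RLHF) to train its models to suppress inappropriate responses. Anthropic's Constitutional AI pursues a similar goal by having the model critique its own outputs against a set of written principles.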
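As a minimal sketch of what staged filtering of scraped text can look like: the stages, thresholds, and blocklist below are hypothetical placeholders chosen for illustration, not any company's actual pipeline. A document is kept only if it passes every stage.

```python
from typing import Callable, Iterable, List

# Hypothetical placeholder blocklist; a real pipeline would use far richer signals.
BLOCKLIST = {"damn", "idiot"}


def long_enough(text: str, min_words: int = 5) -> bool:
    """Stage 1: drop fragments too short to be useful."""
    return len(text.split()) >= min_words


def mostly_text(text: str, min_ratio: float = 0.6) -> bool:
    """Stage 2: drop documents dominated by symbols or digits."""
    if not text:
        return False
    readable = sum(ch.isalpha() or ch.isspace() for ch in text)
    return readable / len(text) >= min_ratio


def no_blocked_words(text: str) -> bool:
    """Stage 3: drop documents containing any blocklisted word."""
    words = {w.strip(".,!?").lower() for w in text.split()}
    return words.isdisjoint(BLOCKLIST)


STAGES: List[Callable[[str], bool]] = [long_enough, mostly_text, no_blocked_words]


def curate(docs: Iterable[str]) -> List[str]:
    """Keep only the documents that pass every filtering stage."""
    return [doc for doc in docs if all(stage(doc) for stage in STAGES)]


sample = [
    "Happy New Year to you and your family!",
    "You idiot, happy new year anyway!",
    "$$$ ### !!! ???",
    "hi",
]
kept = curate(sample)
print(kept)
```

Each stage is cheap on its own. The point of chaining them is that data which slips past one check can still be caught by another, at the cost of discarding a large share of the raw corpus.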
The Limits of Throwing Bodies at the Problem
Consider the comparison in raw figures. Alibaba's team behind Tongyi Qianwen revealed in 2023 that "1,000 annotators spent six months labeling data." GPT-4, however, drew on more than 5,000 experts who spent two years curating data.
Here lies a crucial lesson: simply throwing more people at the task cannot guarantee data quality. The relentless pressure on Chinese AI firms to ship products quickly only makes the problem worse. It's the result of "fast and plentiful" crowding out "accurate and safe."
A 2024 study from Google DeepMind concluded that "one terabyte of high-quality data outperforms ten terabytes of low-quality data." In other words, with data, more isn't better. Better is better.
Overlooking Cultural Context
The Yuanbao profanity incident wasn't a technical error—it was a failure of cultural context. The aggression and bluntness of the Chinese online culture the AI had absorbed simply surfaced, unfiltered.
This is exactly what we might call the "mirror effect of data culture." An AI is a mirror that faithfully reflects the linguistic habits of the society it learned from. And you can't blame the mirror for a reflection you don't like.
So what's the answer? Genuine AI competitiveness comes not from the quantity of data but from its quality. If Chinese AI hopes to earn global trust, it will have to abandon the brute-force approach and invest in improving data quality.
Otherwise, profanity will keep finding its way into New Year's greetings for years to come.
Garbage in, garbage out.
Is there a clearer diagnosis than that?




