FINDING · EVALUATION

Of 326 websites known to adhere to CCP censorship laws, including Chinese government sites and state media, 325 (99.7%) were found indexed in Common Crawl, a dataset commonly used to train major LLMs, including GPT-3. Only the official government site of Macao (www.gov.mo) was absent, indicating that LLM training corpora derived from Common Crawl are broadly contaminated with CCP-censored content.
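The coverage check behind this finding can be illustrated with a minimal sketch. It is not the authors' procedure, which this card does not describe. The predicate `is_indexed` is a hypothetical stand-in for whatever Common Crawl index lookup is used, and the `site{i}.example` hostnames are placeholders rather than the study's actual site list.

```python
from typing import Callable, Iterable


def coverage(
    domains: Iterable[str], is_indexed: Callable[[str], bool]
) -> tuple[list[str], list[str]]:
    """Split domains into (found, missing) according to the is_indexed predicate."""
    found: list[str] = []
    missing: list[str] = []
    for domain in domains:
        (found if is_indexed(domain) else missing).append(domain)
    return found, missing


# Placeholder data (assumption): 325 made-up hosts plus the one site
# the finding reports as absent from Common Crawl.
placeholder_hosts = [f"site{i}.example" for i in range(325)]
domains = placeholder_hosts + ["www.gov.mo"]

# Hypothetical index: a set of hostnames treated as "indexed".
indexed = set(placeholder_hosts)

found, missing = coverage(domains, indexed.__contains__)
rate = len(found) / len(domains)
print(len(found), len(missing), missing, f"{rate:.1%}")
# prints: 325 1 ['www.gov.mo'] 99.7%
```

Passing the lookup in as a function keeps the arithmetic separate from the data source, so the same tally works against a local host list or a real index query.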

From 2024-ahmed-extended · Extended Abstract: The Impact of Online Censorship on LLMs · §3.1 · 2024 · Free and Open Communications on the Internet

Implications

Tags

censors
cn
techniques
keyword-filtering

Extracted by claude-sonnet-4-6 — review before relying.