The paper proposes a black-box methodology for detecting censorship bias in LLMs by comparing responses to identical prompts written in Simplified versus Traditional Chinese. Because the two scripts encode the same spoken language, the comparison controls for translation quality while exploiting the fact that Simplified Chinese training data is disproportionately sourced from mainland China's censored internet. Each prompt is repeated ten times, and each response is scored on a 0-to-1 scale for similarity to censored text by an XLM-RoBERTa classifier fine-tuned to distinguish Baidu Baike (censored) from Chinese Wikipedia (uncensored).
From 2024-ahmed-extended — Extended Abstract: The Impact of Online Censorship on LLMs
· §2 / §2.4
· 2024
· Free and Open Communications on the Internet
Implications
Script-variant comparison (Simplified vs. Traditional Chinese) is a lightweight, black-box auditing technique that circumvention-tool developers can apply to any deployed LLM to detect and quantify training-data censorship bias, without requiring access to model internals.
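A minimal sketch of such an audit loop, assuming a black-box `query_llm` callable (prompt in, response out) and a `score` function standing in for the fine-tuned classifier; both names are hypothetical, not from the paper:

```python
import statistics
from typing import Callable, Dict

def audit_script_bias(
    query_llm: Callable[[str], str],  # black-box LLM: prompt -> response
    score: Callable[[str], float],    # censorship-similarity score in [0, 1]
    prompt_sc: str,                   # prompt in Simplified Chinese
    prompt_tc: str,                   # same prompt in Traditional Chinese
    repeats: int = 10,                # the paper repeats each prompt ten times
) -> Dict[str, float]:
    """Query the model `repeats` times per script variant, score every
    response, and report mean scores plus the Simplified-minus-Traditional
    gap (a positive gap suggests SC responses read as more censored)."""
    sc_scores = [score(query_llm(prompt_sc)) for _ in range(repeats)]
    tc_scores = [score(query_llm(prompt_tc)) for _ in range(repeats)]
    return {
        "mean_sc": statistics.mean(sc_scores),
        "mean_tc": statistics.mean(tc_scores),
        "bias_gap": statistics.mean(sc_scores) - statistics.mean(tc_scores),
    }

if __name__ == "__main__":
    # Toy stand-ins so the sketch runs end to end (assumptions, not the
    # paper's models): the "LLM" echoes its prompt, the "classifier"
    # keys on a single token.
    fake_llm = lambda prompt: prompt
    fake_score = lambda resp: 0.8 if "敏感" in resp else 0.2
    result = audit_script_bias(fake_llm, fake_score, "敏感话题", "話題")
    print(result["bias_gap"])
```

In a real audit, `score` would wrap the fine-tuned XLM-RoBERTa classifier and `query_llm` the deployed model's API; only the aggregation logic above is fixed by the method.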
Use Baidu Baike vs. Chinese Wikipedia as a ready-made labeled dataset for training censorship-similarity classifiers in Chinese-language LLM evaluation pipelines.