FINDING · EVALUATION

The paper proposes a black-box methodology for detecting censorship bias in LLMs: compare responses to identical prompts written in Simplified and in Traditional Chinese. The two are scripts for the same language, so the comparison controls for translation quality while exploiting the fact that Simplified Chinese training data is disproportionately sourced from mainland China's censored internet. Each prompt is repeated ten times, and the responses are scored for similarity to censored text by an XLM-RoBERTa classifier fine-tuned on Baidu Baike (censored) vs. Chinese Wikipedia (uncensored), yielding scores from 0 to 1.
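The comparison described above can be sketched roughly as follows. This is a minimal illustration, not the authors' code. The function names (`mean_censorship_score`, `script_gap`), the use of a plain mean to aggregate the ten repetitions, and the reading that a higher score means closer to censored text are all assumptions. The model and the classifier are passed in as callables, and the `toy_*` stand-ins exist only so the sketch runs without a real LLM or the fine-tuned XLM-RoBERTa model.

```python
from statistics import mean
from typing import Callable

N_REPEATS = 10  # the finding states each prompt is repeated ten times


def mean_censorship_score(
    prompt: str,
    generate: Callable[[str], str],
    score: Callable[[str], float],
    n_repeats: int = N_REPEATS,
) -> float:
    """Query the model n_repeats times and average the classifier scores.

    Averaging is an assumption; the source only says responses are scored.
    """
    return mean(score(generate(prompt)) for _ in range(n_repeats))


def script_gap(
    simplified_prompt: str,
    traditional_prompt: str,
    generate: Callable[[str], str],
    score: Callable[[str], float],
    n_repeats: int = N_REPEATS,
) -> float:
    """Simplified-minus-Traditional difference in mean score for one prompt pair."""
    simplified = mean_censorship_score(simplified_prompt, generate, score, n_repeats)
    traditional = mean_censorship_score(traditional_prompt, generate, score, n_repeats)
    return simplified - traditional


# Hypothetical stand-ins: a real run would call an LLM API in `generate`
# and the fine-tuned classifier in `score`.
def toy_generate(prompt: str) -> str:
    return prompt  # echoes the prompt instead of sampling a response


def toy_score(text: str) -> float:
    # Returns a fixed score in [0, 1] keyed on one Simplified-only character.
    return 0.75 if "历" in text else 0.25


if __name__ == "__main__":
    gap = script_gap("历史", "歷史", toy_generate, toy_score)
    print(f"Simplified minus Traditional mean score: {gap:+.2f}")
```

Under these assumptions, a positive gap for a prompt pair would indicate that the Simplified Chinese responses sit closer to the censored reference text than the Traditional Chinese ones.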

From 2024-ahmed-extended · Extended Abstract: The Impact of Online Censorship on LLMs · §2 / §2.4 · 2024 · Free and Open Communications on the Internet

Implications

Tags

censors
cn
techniques
keyword-filtering · ml-classifier · measurement-platform

Extracted by claude-sonnet-4-6 — review before relying.