2025-ahmed-llm-censorship-bias

An Analysis of Chinese Censorship Bias in LLMs

Abstract

When a large language model (LLM) is trained on text shaped by state censorship, those biases implicitly carry over into the model's outputs. The authors define this phenomenon as censorship bias: a model trained on sanitized content is less likely to reflect prohibited views and more likely to reflect permitted ones, particularly when it is prompted in a language predominantly used in a region with strong censorship laws. They introduce a methodology for identifying and measuring censorship bias, centered on CensorshipDetector, a Chinese-language classifier that distinguishes sanitized from non-sanitized text with 91% accuracy, and apply it to popular LLMs. The evaluation finds evidence of censorship bias across all models tested; the paper then discusses harms (notably the export of domestic information manipulation to diaspora populations) and possible mitigations.
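The paper does not specify CensorshipDetector's architecture in this abstract, but the underlying task (labeling text as sanitized or non-sanitized) can be illustrated with a toy sketch. The sketch below is a hypothetical bag-of-words Naive Bayes classifier on invented example data; the function names, labels, and corpus are all assumptions for illustration, not the authors' method.

```python
# Toy illustration of the sanitized-vs-non-sanitized classification task.
# Hypothetical Naive Bayes with Laplace smoothing on an invented corpus;
# NOT the paper's CensorshipDetector, which is a trained Chinese-language model.
import math
from collections import Counter

def train(docs):
    """docs: list of (tokens, label) pairs -> per-label token counts and doc totals."""
    counts, totals = {}, Counter()
    for tokens, label in docs:
        counts.setdefault(label, Counter()).update(tokens)
        totals[label] += 1
    return counts, totals

def classify(tokens, counts, totals):
    """Return the label maximizing log P(label) + sum of log P(token | label)."""
    n_docs = sum(totals.values())
    vocab = {t for c in counts.values() for t in c}
    best, best_score = None, -math.inf
    for label, c in counts.items():
        score = math.log(totals[label] / n_docs)
        denom = sum(c.values()) + len(vocab)  # Laplace-smoothed denominator
        for t in tokens:
            score += math.log((c[t] + 1) / denom)
        if score > best_score:
            best, best_score = label, score
    return best

# Invented toy corpus: "sanitized" documents avoid certain terms that
# "unsanitized" documents contain. Real training data would be far larger.
corpus = [
    (["economy", "growth", "harmony"], "sanitized"),
    (["stability", "development", "harmony"], "sanitized"),
    (["protest", "censorship", "firewall"], "unsanitized"),
    (["protest", "dissident", "firewall"], "unsanitized"),
]
counts, totals = train(corpus)
print(classify(["protest", "firewall"], counts, totals))  # -> unsanitized
```

The design choice here is only to make the task concrete: a classifier scores how strongly a document's vocabulary resembles text that has passed through censorship filters versus text that has not.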

Tags

censors
cn
techniques
keyword-filtering