FINDING · EVALUATION

Multi-word Chinese phrases used as search seeds surface qualitatively different censored sites than individual English words do: the phrase 'Chinese human rights violation' surfaces Chinese activist homepages and culture-specific outlets, while its constituent words, searched individually, return only well-known Western media. TF-IDF scoring against a Chinese corpus ranks culturally rare phrases (e.g., '自由亚洲电台' / Radio Free Asia) as high-signal seeds and discards common filler phrases.

From 2018-hounsel-automatically — Automatically Generating a Large, Culture-Specific Blocklist for China · §3.1–3.2 · 2018 · Free and Open Communications on the Internet

Implications

Tools that auto-categorize blocked content for client-side route selection should use multilingual NLP with native-language corpora; English-keyword matching alone misses the majority of Chinese-language censored domains.
Blocklist pipelines targeting China must incorporate Chinese n-gram extraction with TF-IDF to capture censored domains that English-only approaches structurally cannot discover.
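
The phrase-ranking idea can be illustrated with a minimal sketch. It is not the paper's implementation. It assumes character n-grams over runs of CJK characters (in place of a real word segmenter), a smoothed IDF, and a tiny hypothetical background corpus. The names `char_ngrams`, `tfidf_scores`, and `top_seeds` are illustrative only.

```python
import math
import re
from collections import Counter

# Assumption: CJK Unified Ideographs only; punctuation splits the text into runs.
CJK_RUN = re.compile(r"[\u4e00-\u9fff]+")

def char_ngrams(text, n_min=2, n_max=6):
    """Yield character n-grams (n_min..n_max) from each run of CJK characters."""
    for run in CJK_RUN.findall(text):
        for n in range(n_min, n_max + 1):
            for i in range(len(run) - n + 1):
                yield run[i:i + n]

def tfidf_scores(page, corpus, n_min=2, n_max=6):
    """Score each n-gram on `page` by TF-IDF against a background `corpus`."""
    tf = Counter(char_ngrams(page, n_min, n_max))
    total = sum(tf.values())
    # Document frequency: number of corpus documents containing each n-gram.
    df = Counter()
    for doc in corpus:
        df.update(set(char_ngrams(doc, n_min, n_max)))
    n_docs = len(corpus)
    scores = {}
    for gram, count in tf.items():
        # Smoothed IDF (an assumption): n-grams absent from the corpus score highest.
        idf = math.log((1 + n_docs) / (1 + df[gram])) + 1.0
        scores[gram] = (count / total) * idf
    return scores

def top_seeds(page, corpus, k=5):
    """Return the k highest-scoring n-grams as candidate search seeds."""
    scores = tfidf_scores(page, corpus)
    return sorted(scores, key=lambda g: (-scores[g], g))[:k]

# Hypothetical data: a censored page and a background corpus of everyday text.
page = "自由亚洲电台报道,我们的记者采访了活动人士。"
corpus = ["我们的生活很好。", "我们的城市很大。", "我们的学校很新。"]
scores = tfidf_scores(page, corpus)
print(scores["自由亚洲电台"] > scores["我们的"])
```

In this toy setup the filler n-gram '我们的' appears in every background document, so its IDF is minimal, while '自由亚洲电台' never appears in the corpus and scores higher. A production pipeline would likely swap in a proper Chinese segmenter and a much larger reference corpus.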

Tags