Used as search seeds, multi-word Chinese phrases discover qualitatively different censored sites than individual English words do: the phrase 'Chinese human rights violation' surfaces Chinese activist homepages and culture-specific outlets, while its constituent words, searched individually, return only well-known Western media. TF-IDF scoring against a Chinese corpus ranks phrases that are rare in general Chinese text (e.g., '自由亚洲电台', Radio Free Asia) as high-signal seeds and discards common filler phrases.
From 2018-hounsel-automatically: Automatically Generating a Large, Culture-Specific Blocklist for China · §3.1–3.2 · 2018 · Free and Open Communications on the Internet
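The TF-IDF ranking described above can be illustrated with a minimal sketch. It is not the paper's implementation: the function name `tfidf_scores`, the smoothed IDF formula, and the toy page and corpus are all assumptions made for illustration. Term frequency is counted on the page being mined, and document frequency on a reference corpus of ordinary Chinese text.

```python
import math

def tfidf_scores(page_text, candidates, corpus_docs):
    """Score candidate phrases: frequent on the page, rare in the corpus.

    TF is the phrase's occurrence count in page_text. IDF is a smoothed
    log inverse document frequency over corpus_docs (an assumed variant,
    not necessarily the weighting used in the paper).
    """
    n_docs = len(corpus_docs)
    scores = {}
    for phrase in candidates:
        tf = page_text.count(phrase)
        df = sum(1 for doc in corpus_docs if phrase in doc)
        idf = math.log((1 + n_docs) / (1 + df)) + 1.0
        scores[phrase] = tf * idf
    return scores

# Hypothetical toy data: a page mentioning Radio Free Asia, and a tiny
# reference corpus of everyday sentences.
page = "我们收听自由亚洲电台,我们阅读自由亚洲电台的报道"
corpus = ["我们今天去公园", "我们喜欢音乐", "他们和我们一起吃饭"]
candidates = ["自由亚洲电台", "我们"]

scores = tfidf_scores(page, candidates, corpus)
ranked = sorted(candidates, key=scores.get, reverse=True)
print(ranked)
```

Both phrases occur twice on the toy page, so the ranking is decided entirely by IDF: the filler phrase '我们' ("we") appears in every corpus document and scores low, while '自由亚洲电台' appears in none and scores high.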
Implications
Tools that auto-categorize blocked content for client-side route selection should use multilingual NLP with native-language corpora; English-keyword matching alone misses the majority of Chinese-language censored domains.
Blocklist pipelines targeting China must incorporate Chinese n-gram extraction with TF-IDF to capture censored domains that English-only approaches structurally cannot discover.
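A minimal sketch of the n-gram extraction step such a pipeline would need, under loud assumptions: it uses character-level n-grams over runs of CJK characters rather than a proper Chinese word segmenter, and the function name `char_ngrams`, the length bounds, and the sample text are hypothetical. The resulting counts would then feed a TF-IDF scorer as candidate phrases.

```python
import re
from collections import Counter

# Runs of characters in the basic CJK Unified Ideographs block only
# (an assumption of this sketch; extension blocks are ignored).
CJK_RUN = re.compile(r"[\u4e00-\u9fff]+")

def char_ngrams(text, n_min=2, n_max=6):
    """Count character n-grams inside each contiguous run of CJK text.

    N-grams never span punctuation, whitespace, or Latin text, because
    those characters split the text into separate runs.
    """
    counts = Counter()
    for run in CJK_RUN.findall(text):
        for n in range(n_min, n_max + 1):
            for i in range(len(run) - n + 1):
                counts[run[i:i + n]] += 1
    return counts

# Hypothetical page fragment mixing Chinese and Latin text.
sample = "收听自由亚洲电台, visit example.org"
ngrams = char_ngrams(sample)
print(ngrams["自由亚洲电台"], ngrams["自由"])
```

Character n-grams over-generate (most substrings are not meaningful phrases), which is why a scoring step against a reference corpus is needed afterwards; a real pipeline would more likely segment words first.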