FINDING · EVALUATION
Among inaccessible URLs that also triggered OONI anomalies, approximately 58% were generated by the Top2Vec-Trends pipeline (combining Top2Vec topic modeling with Google Trends keyword expansion), while LDA-TFIDF and Top2Vec alone each accounted for only 13–14%. BERTopic-generated pages were least effective at producing censored candidates.
From 2024-tang-automatic — Automatic Generation of Web Censorship Probe Lists · §5.4 · 2024 · Privacy Enhancing Technologies
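The percentages above are shares within one subset: URLs that were both inaccessible and flagged as OONI anomalies. A minimal sketch of that arithmetic follows. It assumes a hypothetical record layout (URL, generating pipeline, inaccessible flag, anomaly flag) and uses toy values, not the paper's data; the function name `pipeline_shares` is also illustrative.

```python
from collections import Counter

# Hypothetical records: (url, generating pipeline, inaccessible?, OONI anomaly?)
# Toy values for illustration only; they are not the paper's data.
records = [
    ("https://a.example", "Top2Vec-Trends", True, True),
    ("https://b.example", "Top2Vec-Trends", True, True),
    ("https://c.example", "Top2Vec-Trends", True, False),
    ("https://d.example", "LDA-TFIDF", True, True),
    ("https://e.example", "Top2Vec", False, False),
    ("https://f.example", "BERTopic", True, True),
]


def pipeline_shares(rows):
    """Return each pipeline's share of URLs that are inaccessible AND anomalous."""
    hits = Counter(
        pipeline
        for _, pipeline, inaccessible, anomaly in rows
        if inaccessible and anomaly
    )
    total = sum(hits.values())
    if total == 0:
        return {}  # no qualifying URLs, so no shares to report
    return {pipeline: count / total for pipeline, count in hits.items()}


shares = pipeline_shares(records)
for pipeline, share in sorted(shares.items(), key=lambda kv: -kv[1]):
    print(f"{pipeline}: {share:.0%}")
```

Note that the denominator is the count of qualifying URLs across all pipelines, not the number of URLs each pipeline generated, so a pipeline that produces more candidates overall can show a larger share without a higher per-URL hit rate.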
Implications
- Incorporating real-time trend signals (e.g., Google Trends) into keyword expansion substantially increases discovery of newly censored content; circumvention services should monitor trending topics in censored regions to anticipate blocking events before they affect users.
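- Top2Vec alone accounted for roughly the same share as LDA-TFIDF (13–14% each), so the gain appears to come from pairing Top2Vec's unified vector space with trend-based keyword expansion rather than from the embedding approach by itself; future probe list automation should weight trend-aware embedding pipelines over TF-IDF keyword extraction.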
Extracted by claude-sonnet-4-6 — review before relying.