Among inaccessible URLs that also triggered OONI anomalies, approximately 58% were generated by the Top2Vec-Trends pipeline (combining Top2Vec topic modeling with Google Trends keyword expansion), while LDA-TFIDF and Top2Vec alone each accounted for only 13–14%. BERTopic-generated pages were least effective at producing censored candidates.
From 2024-tang-automatic: Automatic Generation of Web Censorship Probe Lists · §5.4 · 2024 · Privacy Enhancing Technologies
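The per-pipeline percentages above are shares of the URLs that were both inaccessible and flagged by an OONI anomaly. A minimal sketch of that tally follows. The record layout, the field names (`pipeline`, `inaccessible`, `ooni_anomaly`), and the toy records are all illustrative assumptions, not the paper's data or code.

```python
from collections import Counter

def pipeline_shares(records):
    """Return each pipeline's share of URLs that are both inaccessible
    and flagged by an OONI anomaly.

    `records` is a hypothetical list of dicts; the field names are
    assumptions made for this sketch.
    """
    confirmed = [
        r["pipeline"]
        for r in records
        if r["inaccessible"] and r["ooni_anomaly"]
    ]
    total = len(confirmed)
    if total == 0:
        return {}
    return {name: count / total for name, count in Counter(confirmed).items()}

# Synthetic toy records, invented purely to exercise the function.
TOY_RECORDS = [
    {"pipeline": "Top2Vec-Trends", "inaccessible": True, "ooni_anomaly": True},
    {"pipeline": "Top2Vec-Trends", "inaccessible": True, "ooni_anomaly": True},
    {"pipeline": "LDA-TFIDF", "inaccessible": True, "ooni_anomaly": True},
    {"pipeline": "Top2Vec", "inaccessible": True, "ooni_anomaly": True},
    {"pipeline": "BERTopic", "inaccessible": True, "ooni_anomaly": False},
    {"pipeline": "Top2Vec-Trends", "inaccessible": False, "ooni_anomaly": False},
]

shares = pipeline_shares(TOY_RECORDS)
```

Only records that satisfy both conditions enter the denominator, so a pipeline that produces many inaccessible pages without matching OONI anomalies contributes nothing to the shares.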
Implications
Incorporating real-time trend signals (e.g., Google Trends) into keyword expansion substantially increases discovery of newly censored content. Circumvention services should therefore monitor trending topics in censored regions to anticipate blocking events before they affect users.
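When paired with trend-based keyword expansion, topic modeling that uses a unified vector space (Top2Vec) outperforms the bag-of-words LDA-TFIDF approach for surfacing censored content. On its own, Top2Vec performed about the same as LDA-TFIDF (13–14% each). Future probe list automation should therefore weight trend-aware embeddings over TF-IDF keyword extraction.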
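The trend-aware keyword expansion discussed above can be sketched roughly as follows. This is a simplified illustration, not the authors' pipeline: `expand_keywords`, the `fetch_related` callable, and the `FAKE_TRENDS` lookup are hypothetical names. In practice `fetch_related` would wrap a trends data source, and the topic keywords would come from a topic model such as Top2Vec.

```python
from typing import Callable, List

def expand_keywords(
    topic_keywords: List[str],
    fetch_related: Callable[[str], List[str]],
    per_keyword: int = 3,
    limit: int = 10,
) -> List[str]:
    """Expand topic-model keywords with related trending queries.

    `fetch_related` is a hypothetical callable mapping a keyword to a
    list of related queries, most popular first.
    """
    seen = set()
    expanded = []
    for keyword in topic_keywords:
        # Keep the seed keyword, then up to `per_keyword` related queries.
        for candidate in [keyword, *fetch_related(keyword)[:per_keyword]]:
            normalized = candidate.strip().lower()
            if normalized and normalized not in seen:
                seen.add(normalized)
                expanded.append(normalized)
    return expanded[:limit]

# Stand-in for a real trends lookup; the entries are invented for this sketch.
FAKE_TRENDS = {
    "vpn": ["VPN download", "free vpn"],
    "protest": ["protest live", "vpn download"],
}

def fake_related(keyword: str) -> List[str]:
    return FAKE_TRENDS.get(keyword, [])

queries = expand_keywords(["vpn", "protest", "election"], fake_related)
```

The expanded queries preserve seed order and drop case-insensitive duplicates, so a related query that surfaces under several topics is searched only once when candidate pages are collected.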