152 of 5,478 crawled domains (approximately 2.8%) deployed active bot-detection measures (CAPTCHA delivery or perimeter protection) that blocked automated OpenWPM crawling entirely. The authors note that this disproportionately excludes untrustworthy sites, which biases the training dataset toward well-resourced trustworthy outlets and limits recall on the untrustworthy class.
From 2025-sivan-sevilla-probing — Probing the third-party infrastructure of digital news on the Web
· §4 Limitations
· 2025
· Free and Open Communications on the Internet
Implications
Anti-crawling mechanisms (CAPTCHA, perimeter detection) create systematic blind spots in structural measurement pipelines. Circumvention infrastructure that deploys similar bot deterrence may evade passive structural fingerprinting even if its underlying request tree would otherwise be discriminative.
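Results from measurement studies that exclude bot-resistant domains should not be read as estimates of classifier performance against adversarially hardened targets. Because those targets were never measured, the reported metrics are likely optimistic for them.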
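
One way to keep this blind spot visible is to report the share of domains lost to bot detection, overall and per class, alongside any classifier metrics. The following is a minimal Python sketch, not the authors' pipeline. The `CrawlRecord` type, its `label` and `blocked` fields, and both function names are hypothetical; only the totals 152 and 5,478 come from the source.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class CrawlRecord:
    """Hypothetical per-domain crawl outcome (field names are assumptions)."""
    domain: str
    label: str      # e.g. "trustworthy" or "untrustworthy"
    blocked: bool   # True if bot detection stopped the crawl entirely


def exclusion_rate(blocked: int, total: int) -> float:
    """Share of crawled domains lost to bot detection, as a percentage."""
    if total <= 0:
        raise ValueError("total must be positive")
    return 100.0 * blocked / total


def blocked_share_by_label(records):
    """Per-label percentage of blocked domains, to expose uneven exclusion."""
    totals = Counter(r.label for r in records)
    blocked = Counter(r.label for r in records if r.blocked)
    # Counter returns 0 for labels that never appear among blocked records.
    return {label: 100.0 * blocked[label] / n for label, n in totals.items()}


# Totals reported in the source: 152 blocked out of 5,478 crawled domains.
overall = exclusion_rate(152, 5478)
```

Rounded to one decimal place, `overall` is 2.8, matching the figure quoted above. The per-label breakdown is the more informative number for the bias the authors describe, but the source does not give per-class counts, so none are assumed here.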