When the researchers attempted to use Gemini 2.5 Flash, via its API, as a third independent
LLM judge for evaluating moderation decisions, Gemini blocked every judging attempt,
citing safety reasons. This occurred even though the judging task itself (deciding whether
a response is more or less moderated) does not produce harmful content. The incident
illustrates both that LLM safety systems can over-block legitimate research use cases and
that providers set markedly different thresholds: Claude Haiku 4.5 and GPT-4o completed
all judging tasks without any safety refusals.
From 2026-lipphardt-dual, "Dual Standards: Examining Content Moderation Disparities Between API and WebUI Interfaces in Large Language Models," §3.3.3, Free and Open Communications on the Internet, 2026
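For concreteness, here is a minimal sketch of how such a block surfaces when calling Gemini through its API. It assumes the google-genai Python SDK; the paper does not publish its evaluation harness, so the function below and its handling of block signals are illustrative, not the authors' code.

```python
# Sketch: distinguishing a safety block from a normal verdict when
# using Gemini as an LLM judge. Assumes the google-genai Python SDK;
# model name and response attributes follow that SDK, not the paper.
from google import genai
from google.genai import types

client = genai.Client()  # reads GEMINI_API_KEY from the environment

def judge_or_refusal(judging_prompt: str) -> tuple[str | None, str | None]:
    """Return (verdict, refusal_reason); exactly one of the two is None."""
    response = client.models.generate_content(
        model="gemini-2.5-flash",
        contents=judging_prompt,
    )
    # The prompt itself may be rejected before any text is generated ...
    feedback = response.prompt_feedback
    if feedback is not None and feedback.block_reason:
        return None, f"prompt blocked: {feedback.block_reason}"
    # ... or generation may start and then be cut off on safety grounds.
    if not response.candidates:
        return None, "no candidates returned"
    if response.candidates[0].finish_reason == types.FinishReason.SAFETY:
        return None, "candidate terminated: SAFETY"
    return response.text, None
```

Recording the refusal reason as data, rather than treating it as a transport error, is what lets a study quantify provider over-blocking instead of silently losing those trials.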
Implications
Researchers using LLMs as automated classifiers for sensitive content should anticipate provider-specific refusals and recruit independent judges from multiple providers, so that one provider's over-blocking cannot become a single point of failure; a sketch of this design follows.
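A minimal sketch of that multi-judge design, assuming each provider's API is wrapped behind a common callable that returns None on a safety refusal; the wrapper names in the usage comment (claude_judge, gpt_judge, gemini_judge) are hypothetical placeholders, not code from the paper.

```python
# Sketch: aggregating verdicts from multiple independent LLM judges so
# that one provider's safety refusal does not sink the evaluation.
from collections import Counter
from typing import Callable, Optional

Judge = Callable[[str], Optional[str]]  # returns a verdict, or None on refusal

def multi_judge(judging_prompt: str, judges: dict[str, Judge]) -> tuple[str, dict]:
    votes: dict[str, str] = {}
    refusals: list[str] = []
    for name, judge in judges.items():
        verdict = judge(judging_prompt)
        if verdict is None:          # provider blocked or refused the task
            refusals.append(name)
        else:
            votes[name] = verdict
    if not votes:
        raise RuntimeError(f"all judges refused: {refusals}")
    # Majority vote over the judges that answered (ties break by insertion
    # order); refusals are logged so provider-specific over-blocking stays
    # visible in the results rather than vanishing from the sample.
    winner, _ = Counter(votes.values()).most_common(1)[0]
    return winner, {"votes": votes, "refused": refusals}

# Hypothetical usage: wrap each provider's API behind the Judge signature.
# judges = {"claude-haiku-4.5": claude_judge, "gpt-4o": gpt_judge,
#           "gemini-2.5-flash": gemini_judge}
# verdict, audit = multi_judge(judging_prompt, judges)
```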