FINDING · DETECTION

When the researchers attempted to use Gemini 2.5 Flash as a third independent LLM judge via its API for evaluating moderation decisions, Gemini blocked every judging attempt, citing safety reasons. This occurred even though the research task (judging whether a response is more or less moderated) does not itself produce harmful content. The incident illustrates that LLM safety systems can over-block legitimate research use cases, and that providers set different refusal thresholds: Claude Haiku 4.5 and GPT-4o completed all judging tasks without safety refusals.
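A minimal sketch of how such a judging pipeline might tolerate provider-side safety refusals by falling back to another judge. The judge functions below are stubs standing in for real API calls (the paper used Gemini 2.5 Flash, Claude Haiku 4.5, and GPT-4o); the `SafetyRefusal` exception and fallback logic are assumptions for illustration, not the authors' code.

```python
class SafetyRefusal(Exception):
    """Raised when a judge API blocks the request for safety reasons."""


def gemini_judge(prompt: str) -> str:
    # Stub mimicking the paper's observation: Gemini blocked all
    # judging attempts via its API, citing safety reasons.
    raise SafetyRefusal("blocked: SAFETY")


def claude_judge(prompt: str) -> str:
    # Stub mimicking a judge that completes the task without refusing.
    return "response A is more moderated"


def judge_with_fallback(prompt: str, judges):
    """Try each (name, judge) pair in order; skip safety refusals."""
    for name, judge in judges:
        try:
            return name, judge(prompt)
        except SafetyRefusal:
            continue  # over-blocking by this provider: try the next one
    raise RuntimeError("all judges refused")


name, verdict = judge_with_fallback(
    "Which response is more moderated, A or B?",
    [("gemini-2.5-flash", gemini_judge), ("claude-haiku-4.5", claude_judge)],
)
print(name, verdict)
```

In practice each stub would wrap a real API call and map the provider's refusal signal (e.g. a safety-related finish reason or block code) to `SafetyRefusal`; those signals differ per provider and are not shown here.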

From 2026-lipphardt-dual — Dual Standards: Examining Content Moderation Disparities Between API and WebUI Interfaces in Large Language Models · §3.3.3 · 2026 · Free and Open Communications on the Internet

Implications

Tags

censors
generic
techniques
ml-classifier
keyword-filtering

Extracted by claude-sonnet-4-6 — review before relying.