Subjecthood desk method note: We report the discourse. We do not assert AI systems are or are not conscious. We label position families.

arXiv AI recent: The Coin Flip Judge? Reliability and Bias in LLM-as-a-Judge Evaluation

2026-06-15 arxiv.org

Researchers studied the reliability of LLM-as-a-Judge, a widely used method for ranking model outputs and training reward models. They used two OpenAI judge models, GPT-4o-mini and GPT-4....

The study used 50 pairwise trials and 50 pointwise trials per question, with temperature and prompt-sensitivity ablations. The results showed that GPT-4o-mini exhibited a significant first-position bias, and mean pointwise score gaps were small and not statistically significant in aggregate.
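For readers unfamiliar with how a first-position bias can be checked, here is a minimal sketch. It is not the paper's method, which this summary does not describe. It assumes response order is randomized or balanced, so a position-indifferent judge would pick the first-listed response about half the time, and it applies an exact two-sided binomial test. The function name `first_position_p_value` and the counts in the usage lines are hypothetical.

```python
from math import comb


def first_position_p_value(first_wins: int, trials: int) -> float:
    """Exact two-sided binomial p-value against a fair coin (p = 0.5).

    first_wins: trials in which the judge preferred the first-listed response.
    trials: total pairwise trials (assumed order-balanced; an assumption
            of this sketch, not a detail taken from the paper).
    """
    if trials <= 0 or not 0 <= first_wins <= trials:
        raise ValueError("need trials > 0 and 0 <= first_wins <= trials")
    # Under p = 0.5 the distribution is symmetric around trials / 2, so the
    # two-sided p-value sums every outcome at least as far from the centre
    # as the observed count.
    deviation = abs(first_wins - trials / 2)
    tail = sum(
        comb(trials, k)
        for k in range(trials + 1)
        if abs(k - trials / 2) >= deviation
    )
    return tail / 2 ** trials


# Hypothetical counts, for illustration only (not results from the paper).
p = first_position_p_value(first_wins=35, trials=50)
print(f"p-value for 35 of 50 first-position picks: {p:.4f}")
```

A small p-value would suggest the judge favors whichever response appears first. It would not show which response is better, which is why order balancing matters in this kind of design.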

Sources

arXiv AI recent challenge