arXiv AI recent: The Coin Flip Judge? Reliability and Bias in LLM-as-a-Judge Evaluation
Researchers studied the reliability of LLM-as-a-Judge, a widely used method for ranking model outputs and training reward models. They used two OpenAI judge models, GPT-4o-mini and GPT-4....
The study ran 50 pairwise trials and 50 pointwise trials per question, along with temperature and prompt-sensitivity ablations. GPT-4o-mini exhibited a significant first-position bias, while mean pointwise score gaps were small and not statistically significant in aggregate.
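
To make the first-position claim concrete, here is a minimal sketch of how such a bias could be checked over 50 pairwise trials. It assumes an exact two-sided binomial test against a 50% null (a judge with no position preference); the summary does not say which test the authors used. The counts (35 of 50) and the names `binomial_two_sided_p` and `first_position_rate` are hypothetical and are not taken from the paper.

```python
from math import comb


def binomial_two_sided_p(k: int, n: int, p: float = 0.5) -> float:
    """Exact two-sided binomial p-value.

    Sums the probability of every outcome that is no more likely
    than the observed count k under the null success rate p.
    """
    pmf = [comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(n + 1)]
    threshold = pmf[k] * (1 + 1e-9)  # small tolerance for float ties
    return min(1.0, sum(q for q in pmf if q <= threshold))


def first_position_rate(choices: list[str]) -> float:
    """Fraction of trials in which the judge picked the first-listed answer."""
    return sum(c == "first" for c in choices) / len(choices)


# Hypothetical data (NOT from the paper): in 50 pairwise trials,
# the judge preferred the first-listed answer 35 times.
choices = ["first"] * 35 + ["second"] * 15

rate = first_position_rate(choices)
p_value = binomial_two_sided_p(choices.count("first"), len(choices))

print(f"first-position rate: {rate:.2f}")
print(f"two-sided p-value vs. 0.5: {p_value:.4f}")
```

In practice the candidate order would also be swapped across trials, so that a preference for the first slot can be separated from a genuine preference for one answer.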