arXiv AI recent: Metric Match: A Subset Selection Approach to Evaluating LLM Judge Reliability
The authors introduce Metric Match, a method that selects a subset of samples for human annotation in order to estimate correlation-based reliability metrics of LLM judges. Empirical results show...
Metric Match chooses a subset whose reliability, computed from synthetic labels, matches the same metric computed over the full population. Experiments covered four correlation metrics and 15 datasets, and report performance improvements and a $1,041.67 saving in a medical annotation scenario.
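The matching idea can be illustrated with a minimal sketch. This is not the authors' algorithm, which the summary does not specify. It assumes Pearson correlation as the reliability metric and a plain random search over candidate subsets. The names `pearson`, `select_matching_subset`, `judge_scores`, `synthetic_labels`, `k`, and `n_trials` are all hypothetical.

```python
import random
from math import sqrt


def pearson(x, y):
    """Pearson correlation of two equal-length sequences (0.0 if either is constant)."""
    n = len(x)
    mx = sum(x) / n
    my = sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    if vx == 0 or vy == 0:
        return 0.0
    return cov / sqrt(vx * vy)


def select_matching_subset(judge_scores, synthetic_labels, k, n_trials=2000, seed=0):
    """Random-search stand-in (an assumption, not the paper's procedure).

    Draws n_trials random subsets of size k and keeps the one whose
    judge-vs-synthetic-label correlation is closest to the correlation
    computed over the full population.
    """
    rng = random.Random(seed)
    n = len(judge_scores)
    target = pearson(judge_scores, synthetic_labels)  # population metric
    best_idx, best_gap = None, float("inf")
    for _ in range(n_trials):
        idx = rng.sample(range(n), k)
        r = pearson([judge_scores[i] for i in idx],
                    [synthetic_labels[i] for i in idx])
        gap = abs(r - target)
        if gap < best_gap:
            best_idx, best_gap = sorted(idx), gap
    return best_idx, target, best_gap


if __name__ == "__main__":
    # Toy data: synthetic labels are a noisy copy of the judge scores.
    rng = random.Random(1)
    judge = [rng.random() for _ in range(200)]
    synth = [j + rng.gauss(0, 0.3) for j in judge]
    idx, target, gap = select_matching_subset(judge, synth, k=20)
    print(f"population r = {target:.3f}, subset gap = {gap:.5f}")
    print("indices to send for human annotation:", idx)
```

In this sketch only the selected indices would go to human annotators, and the judge's reliability would then be estimated from those human labels. Other correlation metrics could be swapped in for `pearson` without changing the search loop.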