§ evals · storyline
Nine Judges, Two Effective Votes: Correlated Errors Undermine LLM Evaluation Panels
Apple Machine Learning Research publishes a study showing that correlated errors among the judges in multi-judge LLM evaluation panels reduce a panel's effective voting power from nine judges to two.
§ sources · 1 publication · timeline below
- Apple Machine Learning Research · Nine Judges, Two Effective Votes: Correlated Errors Undermine LLM Evaluation Panels · primary