§ feed · storyline

Clémentine Fourrier on LLM evals

Hugging Face's Clémentine Fourrier presents at ICLR on LLM evaluation methods, covering automated benchmarking, human judges, and model-as-judge approaches alongside their limitations.

May 24 · 01:34:22 · primary fetch · 1 source · updated May 24 · 01:34:22

Clémentine Fourrier of Hugging Face presented at ICLR on GAIA, work done with Meta, and shared insights on LLM evaluation methods. The blog post outlines three main evaluation approaches: automated benchmarking, which scores a model on sample inputs and reference outputs using a metric; human judges, who grade or rank outputs through vibe checks, arena-style comparisons, or systematic annotation; and models as judges, where a generalist or specialist model does the scoring and brings its own noted biases.
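To make the automated benchmarking approach concrete, here is a minimal sketch in Python. It is not taken from the talk or the blog: the exact-match metric, the `toy_model` stand-in, and the three sample questions are all assumptions chosen for illustration. Real benchmarks use far larger sample sets and often more forgiving metrics.

```python
from typing import Callable, Sequence, Tuple


def normalize(text: str) -> str:
    """Lowercase and collapse whitespace so trivial formatting differences do not count as errors."""
    return " ".join(text.lower().split())


def exact_match_accuracy(
    model: Callable[[str], str],
    samples: Sequence[Tuple[str, str]],
) -> float:
    """Fraction of samples where the model output equals the reference after normalization."""
    if not samples:
        raise ValueError("samples must not be empty")
    correct = sum(
        normalize(model(prompt)) == normalize(reference)
        for prompt, reference in samples
    )
    return correct / len(samples)


def toy_model(prompt: str) -> str:
    """Hypothetical stand-in for an LLM call; it answers from a fixed lookup table."""
    answers = {
        "2 + 2 =": "4",
        "Capital of France?": "paris",
        "Largest planet?": "Saturn",  # deliberately wrong
    }
    return answers.get(prompt, "")


# Hypothetical benchmark: (input, reference output) pairs.
SAMPLES = [
    ("2 + 2 =", "4"),
    ("Capital of France?", "Paris"),
    ("Largest planet?", "Jupiter"),
]

score = exact_match_accuracy(toy_model, SAMPLES)
print(f"exact match: {score:.2f}")
```

The toy model gets two of three samples right, so the metric comes out at about 0.67. Swapping `toy_model` for a function that calls a real model leaves the scoring loop unchanged, which is what makes this style of evaluation cheap to rerun when checking for regressions.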

Challenges include data contamination, subjectivity, and bias in scoring. These evaluations help prevent regressions, rank models, and track progress in the field.
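Data contamination is the easiest of these challenges to illustrate in code. The sketch below uses word-level n-gram overlap between a benchmark item and a training document, which is one common heuristic and not a method attributed to the talk. The function names, the 5-gram window, and the example strings are assumptions for illustration only.

```python
from typing import Set, Tuple


def ngrams(text: str, n: int) -> Set[Tuple[str, ...]]:
    """Set of word-level n-grams in the text, lowercased and split on whitespace."""
    tokens = text.lower().split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def overlap_ratio(benchmark_item: str, corpus_doc: str, n: int = 5) -> float:
    """Share of the benchmark item's n-grams that also appear in the corpus document."""
    item_grams = ngrams(benchmark_item, n)
    if not item_grams:
        return 0.0  # item is shorter than n words, so nothing can be compared
    return len(item_grams & ngrams(corpus_doc, n)) / len(item_grams)


# Hypothetical benchmark item and two hypothetical training documents.
item = "the quick brown fox jumps over the lazy dog"
leaky_doc = "he said the quick brown fox jumps over the fence"
clean_doc = "an unrelated sentence about evaluating language models on held-out data"

leaky = overlap_ratio(item, leaky_doc)
clean = overlap_ratio(item, clean_doc)
print(f"leaky: {leaky:.2f}  clean: {clean:.2f}")
```

A high ratio suggests the item, or something close to it, may have been seen during training, which would inflate a benchmark score. The threshold for flagging an item is a judgment call, and paraphrased leaks will slip past an exact n-gram check like this one.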

read full article on news.smol.ai ↗

§ sources · 1 publication · timeline below

news.smol.ai · Clémentine Fourrier on LLM evals · primary · 01:34:22