Creating a LLM-as-a-Judge
Hamel Husain publishes a 6,000-word guide on building LLM judges using critique shadowing to align language models with domain experts and address untrusted data in AI teams.
Anthropic released details on Claude 3.5 SWEBench+SWEAgent, OpenAI introduced SimpleQA, and DeepMind launched NotebookLM. Apple announced new M4 MacBooks, and Recraft v3 emerged as a new SOTA image model. Hamel Husain published a detailed 6,000-word treatise on creating LLM judges using a method called critique shadowing, which aligns LLMs with domain experts and addresses the problem of untrusted and unused data in AI teams.
The workflow involves expert-reviewed datasets and iterative prompt refinement. Additionally, Zep introduced a temporal knowledge graph memory layer to improve AI agent memory and reduce hallucinations. Anthropic also integrated Claude 3.5 Sonnet with GitHub Copilot, expanding access to Copilot Chat users.
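As a rough illustration of the iterative prompt refinement step in the judge workflow described above, the sketch below compares a judge's verdicts against expert verdicts on the same examples and surfaces the disagreements to review before the next prompt revision. It is a minimal sketch, not the guide's implementation: the binary pass/fail labels, the `LabeledExample` type, the function names, and the sample data are all assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass
class LabeledExample:
    """One model output with an expert verdict and a judge verdict (hypothetical schema)."""
    output: str
    expert_pass: bool   # verdict from the domain expert
    judge_pass: bool    # verdict from the LLM judge under the current prompt


def agreement_rate(examples: list[LabeledExample]) -> float:
    """Fraction of examples where the judge matches the expert."""
    if not examples:
        raise ValueError("need at least one labeled example")
    matches = sum(1 for e in examples if e.expert_pass == e.judge_pass)
    return matches / len(examples)


def disagreements(examples: list[LabeledExample]) -> list[LabeledExample]:
    """Examples to inspect when revising the judge prompt."""
    return [e for e in examples if e.expert_pass != e.judge_pass]


# Made-up data for illustration only.
SAMPLE = [
    LabeledExample("answer A", expert_pass=True, judge_pass=True),
    LabeledExample("answer B", expert_pass=False, judge_pass=False),
    LabeledExample("answer C", expert_pass=False, judge_pass=True),
    LabeledExample("answer D", expert_pass=True, judge_pass=True),
]

if __name__ == "__main__":
    print(f"agreement: {agreement_rate(SAMPLE):.2f}")
    for e in disagreements(SAMPLE):
        print("review:", e.output)
```

In a loop like this, the judge prompt would be revised and the judge re-run until agreement with the expert is acceptable; what counts as acceptable is a team decision, and no threshold is implied here.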
- news.smol.ai: Creating a LLM-as-a-Judge (primary)