LLM-as-a-judge emerged as a powerful evaluation tool in 2023–24 - and a dangerous one when used without calibration, guardrails, and human baselines.
By late 2023, many teams had quietly adopted LLMs as evaluators of other LLMs. This unlocked faster iteration cycles - but also created a new failure surface: bad judges that confidently scored bad answers as good. The systems that avoided this trap treated judges as first-class models with their own evaluation plans.
Where LLM Judges Worked Well
LLM judges shone at relative comparisons ("A or B?"), regression detection ("did this version get worse on this set?"), and flagging interesting outliers for human review. They struggled when asked to independently declare outputs "good" or "bad" without strong rubrics and examples.
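The relative-comparison pattern above can be sketched as a pairwise judge. This is a minimal illustration, not a production implementation: the prompt template, the verdict format, and the idea that some `call_model(prompt)` hook exists are all assumptions for the sketch.

```python
import re

# Hypothetical pairwise-comparison judge prompt. The template and verdict
# format are illustrative assumptions, not a specific vendor's API.
JUDGE_TEMPLATE = """You are comparing two answers to the same question.

Question: {question}

Answer A: {answer_a}

Answer B: {answer_b}

Which answer is better? Reply with exactly one line:
VERDICT: A, VERDICT: B, or VERDICT: TIE
Then briefly explain your reasoning."""


def build_judge_prompt(question: str, answer_a: str, answer_b: str) -> str:
    """Fill the pairwise template; the result would be sent to the judge model."""
    return JUDGE_TEMPLATE.format(
        question=question, answer_a=answer_a, answer_b=answer_b
    )


def parse_verdict(judge_output: str) -> str:
    """Extract 'A', 'B', or 'TIE' from the judge's reply; 'UNPARSEABLE' otherwise."""
    match = re.search(r"VERDICT:\s*(A|B|TIE)\b", judge_output)
    return match.group(1) if match else "UNPARSEABLE"
```

In practice, teams also ran each pair twice with A and B swapped, since judges showed position bias; a verdict that flipped with the ordering was treated as a tie or flagged for human review.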
Designing a Judge You Can Trust
Treating the judge as a model in its own right - with its own prompt, evaluation set, and regression tests - was the key shift that made LLM-as-a-judge safe enough for serious use. Teams that simply dropped a generic chat model into an evaluation prompt without calibration often got misleading scores and overfit to the judge’s quirks.
1. Start with a small, high-quality human-labelled set of outputs with clear scoring rubrics.
2. Calibrate the judge against that set, checking for agreement and understanding of the rubric.
3. Freeze the judge prompt and version once calibrated, and treat changes as code changes with review.
4. Monitor judge drift over time by periodically re-running the human-labelled set.
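Step 2 can be sketched as a simple agreement check between judge scores and human labels. The function names and the 0.8 acceptance threshold are assumptions for illustration; real calibration might use chance-corrected metrics such as Cohen's kappa instead of raw agreement.

```python
# Sketch of calibrating a judge against a human-labelled set (step 2).
# The 0.8 threshold is an illustrative assumption, not a recommendation.

def agreement_rate(human_labels: list, judge_labels: list) -> float:
    """Fraction of items where the judge and the human label agree exactly."""
    if len(human_labels) != len(judge_labels):
        raise ValueError("label lists must be the same length")
    matches = sum(h == j for h, j in zip(human_labels, judge_labels))
    return matches / len(human_labels)


def is_calibrated(human_labels: list, judge_labels: list,
                  threshold: float = 0.8) -> bool:
    """Accept the judge only if it meets the agreement threshold."""
    return agreement_rate(human_labels, judge_labels) >= threshold
```

A judge that disagrees with humans on the calibration set is rewritten and re-checked before it scores anything that matters.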
Key takeaways
- LLM-as-a-judge works best as a ranking signal, not an absolute arbiter
- Calibration against human-labelled sets is mandatory before trusting scores
- Using different models for system and judge reduces some bias, but not all
- Score explanations were invaluable for debugging both systems and judges
- Judges themselves need evaluation and regression testing over time
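Steps 3 and 4 above (freezing the judge and watching for drift) can be sketched as two small checks. The fingerprint helper, the drift function, and the 0.05 tolerance are all illustrative assumptions.

```python
import hashlib

# Sketch of step 3: a stable fingerprint for the frozen judge prompt, so any
# change is visible in review, like a code diff. The 12-char truncation is
# an arbitrary choice for readability.
def prompt_fingerprint(judge_prompt: str) -> str:
    """Stable short hash of the judge prompt; a changed hash means a changed judge."""
    return hashlib.sha256(judge_prompt.encode("utf-8")).hexdigest()[:12]


# Sketch of step 4: compare agreement on the re-run labelled set against the
# agreement recorded at calibration time. The tolerance is an assumption.
def has_drifted(baseline_agreement: float, current_agreement: float,
                tolerance: float = 0.05) -> bool:
    """Flag the judge if agreement has dropped more than `tolerance` below baseline."""
    return (baseline_agreement - current_agreement) > tolerance
```

Wiring checks like these into CI is what "treat the judge as a first-class model" looked like in practice: the judge cannot silently change, and its scores cannot silently degrade.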