LLM-as-a-judge emerged as a powerful evaluation tool in 2023–24 - and a dangerous one when used without calibration, guardrails, and human baselines.
By late 2023, many teams had quietly adopted LLMs as evaluators of other LLMs. This unlocked faster iteration cycles - but also created a new failure surface: bad judges that confidently scored bad answers as good. The systems that avoided this trap treated judges as first-class models with their own evaluation plans.
Where LLM Judges Worked Well
LLM judges shone at relative comparisons ("A or B?"), regression detection ("did this version get worse on this set?"), and flagging interesting outliers for human review. They struggled when asked to independently declare outputs "good" or "bad" without strong rubrics and examples.
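The relative-comparison pattern above can be sketched as a pairwise judge. This is a minimal illustration, not a production implementation: the prompt template, the verdict format, and the idea that some `call_model(prompt)` hook exists are all assumptions for the sketch.

```python
import re

# Hypothetical pairwise-comparison judge prompt. The template and verdict
# format are illustrative assumptions, not a specific vendor's API.
JUDGE_TEMPLATE = """You are comparing two answers to the same question.

Question: {question}

Answer A: {answer_a}

Answer B: {answer_b}

Which answer is better? Reply with exactly one line:
VERDICT: A, VERDICT: B, or VERDICT: TIE
Then briefly explain your reasoning."""


def build_judge_prompt(question: str, answer_a: str, answer_b: str) -> str:
    """Fill the pairwise template; the result would be sent to the judge model."""
    return JUDGE_TEMPLATE.format(
        question=question, answer_a=answer_a, answer_b=answer_b
    )


def parse_verdict(judge_output: str) -> str:
    """Extract 'A', 'B', or 'TIE' from the judge's reply; 'UNPARSEABLE' otherwise."""
    match = re.search(r"VERDICT:\s*(A|B|TIE)\b", judge_output)
    return match.group(1) if match else "UNPARSEABLE"
```

In practice, teams also ran each pair twice with A and B swapped, since judges showed position bias; a verdict that flipped with the ordering was treated as a tie or flagged for human review.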
Designing a Judge You Can Trust
Treating the judge as a model in its own right - with its own prompt, evaluation set, and regression tests - was the key shift that made LLM-as-a-judge safe enough for serious use. Teams that simply dropped a generic chat model into an evaluation prompt without calibration often got misleading scores and overfit to the judge’s quirks.
1. Start with a small, high-quality human-labelled set of outputs with clear scoring rubrics.
2. Calibrate the judge against that set, checking for agreement and understanding of the rubric.
3. Freeze the judge prompt and version once calibrated, and treat changes as code changes with review.
4. Monitor judge drift over time by periodically re-running the human-labelled set.
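Step 2 can be sketched as a simple agreement check between judge scores and human labels. The function names and the 0.8 acceptance threshold are assumptions for illustration; real calibration might use chance-corrected metrics such as Cohen's kappa instead of raw agreement.

```python
# Sketch of calibrating a judge against a human-labelled set (step 2).
# The 0.8 threshold is an illustrative assumption, not a recommendation.

def agreement_rate(human_labels: list, judge_labels: list) -> float:
    """Fraction of items where the judge and the human label agree exactly."""
    if len(human_labels) != len(judge_labels):
        raise ValueError("label lists must be the same length")
    matches = sum(h == j for h, j in zip(human_labels, judge_labels))
    return matches / len(human_labels)


def is_calibrated(human_labels: list, judge_labels: list,
                  threshold: float = 0.8) -> bool:
    """Accept the judge only if it meets the agreement threshold."""
    return agreement_rate(human_labels, judge_labels) >= threshold
```

A judge that disagrees with humans on the calibration set is rewritten and re-checked before it scores anything that matters.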
Key takeaways
- LLM-as-a-judge works best as a ranking signal, not an absolute arbiter
- Calibration against human-labelled sets is mandatory before trusting scores
- Using different models for system and judge reduces some bias, but not all
- Score explanations were invaluable for debugging both systems and judges
- Judges themselves need evaluation and regression testing over time
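Steps 3 and 4 above (freezing the judge and watching for drift) can be sketched as two small checks. The fingerprint helper, the drift function, and the 0.05 tolerance are all illustrative assumptions.

```python
import hashlib

# Sketch of step 3: a stable fingerprint for the frozen judge prompt, so any
# change is visible in review, like a code diff. The 12-char truncation is
# an arbitrary choice for readability.
def prompt_fingerprint(judge_prompt: str) -> str:
    """Stable short hash of the judge prompt; a changed hash means a changed judge."""
    return hashlib.sha256(judge_prompt.encode("utf-8")).hexdigest()[:12]


# Sketch of step 4: compare agreement on the re-run labelled set against the
# agreement recorded at calibration time. The tolerance is an assumption.
def has_drifted(baseline_agreement: float, current_agreement: float,
                tolerance: float = 0.05) -> bool:
    """Flag the judge if agreement has dropped more than `tolerance` below baseline."""
    return (baseline_agreement - current_agreement) > tolerance
```

Wiring checks like these into CI is what "treat the judge as a first-class model" looked like in practice: the judge cannot silently change, and its scores cannot silently degrade.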