RAG (Retrieval-Augmented Generation) went from niche pattern to default enterprise choice between 2022 and 2024. Along the way, teams discovered that "it seems to work" was not good enough, especially in legal, healthcare, and financial services. The difference between safe and scary deployments was rarely in the retrieval stack itself, and almost always in how carefully the system was evaluated.
Generic RAGAS scores were a good starting point in 2023. The teams that avoided painful incidents went further, adding domain-specific metrics for faithfulness, context precision, and hallucination rate.
Early pilots were often judged on "does it answer something?" rather than "does it answer correctly, consistently, and traceably?" Once systems hit real production traffic, this gap showed up as hallucinated citations, confidently wrong answers, and support teams quietly reverting to manual workflows.
Four Metrics That Predict Real-World Performance
1. Faithfulness
Faithfulness measures whether each factual statement in the answer can be grounded in retrieved context. In 2023–24, the loudest RAG failures happened when systems silently mixed model prior knowledge with enterprise documents. Faithfulness forces you to ask: "Can we point to where this came from?"
Practical way to score it
Break the answer into atomic claims, then check each one against the retrieved passages. The score is the percentage of claims that can be directly supported. A score below roughly 85% in a regulated context should be treated as a red flag.
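The scoring loop above can be sketched as follows. This is a minimal illustration, not a production implementation: it splits the answer into sentences as a stand-in for atomic claim extraction (real systems typically use an LLM for this), and uses token overlap as a stand-in for an NLI model or LLM judge. All function names and the 0.5 overlap threshold are illustrative choices, not from the article.

```python
import re

def extract_claims(answer: str) -> list[str]:
    # Naive claim extraction: split on sentence boundaries.
    # Production systems usually ask an LLM to decompose the answer
    # into atomic factual claims instead.
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", answer) if s.strip()]

def is_supported(claim: str, passages: list[str], threshold: float = 0.5) -> bool:
    # Stand-in grounding check: fraction of claim tokens that appear
    # in the best-matching retrieved passage. Swap in an NLI model
    # or LLM judge for real use.
    tokens = {t.lower() for t in re.findall(r"\w+", claim)}
    if not tokens or not passages:
        return False
    best = max(
        len(tokens & {t.lower() for t in re.findall(r"\w+", p)}) / len(tokens)
        for p in passages
    )
    return best >= threshold

def faithfulness(answer: str, passages: list[str]) -> float:
    # Percentage of claims directly supported by retrieved context.
    claims = extract_claims(answer)
    if not claims:
        return 0.0
    return sum(is_supported(c, passages) for c in claims) / len(claims)
```

A fully grounded answer scores 1.0; an answer mixing one grounded and one fabricated claim scores 0.5, which would already fail the ~85% red-flag threshold.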
2. Context Precision
If you retrieve the wrong passages, even a perfect generator will answer badly. Precision measures how many of the top-k retrieved chunks genuinely help answer the question. Low precision is usually a chunking, metadata, or embedding choice problem - not a model issue.
In 2023, many teams discovered that their "smart" embeddings were less of a problem than their document structure. Giant, multi-page chunks with mixed topics made it hard to retrieve exactly what mattered. Simple fixes - like splitting by headings and injecting rich metadata - often lifted precision dramatically.
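Measured against labelled ground truth, context precision reduces to precision@k. A minimal sketch, assuming you have chunk IDs for retrieved results and a human-labelled set of genuinely relevant chunks per question (the function name and signature are illustrative):

```python
def context_precision(retrieved_ids: list[str], relevant_ids: set[str], k: int = 5) -> float:
    # Precision@k: fraction of the top-k retrieved chunks that ground truth
    # marks as actually useful for answering the question.
    top_k = retrieved_ids[:k]
    if not top_k:
        return 0.0
    return sum(cid in relevant_ids for cid in top_k) / len(top_k)
```

Averaging this over the evaluation set, before and after a chunking or metadata change, is the quickest way to confirm whether a fix like heading-based splitting actually lifted precision.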
3. Answer Relevance
Relevance asks whether the answer actually addressed the user's question. We saw many systems that confidently rephrased irrelevant passages. These looked fine on superficial checks but failed when compared against what the user actually needed.
- Does the answer stay on topic, or wander into loosely related content?
- Does it address every part of multi‑clause questions?
- Does it clearly say "I don’t know" when the corpus lacks an answer?
4. Hallucination Rate
Hallucinations aren’t just "wrong answers" - they’re fabricated facts delivered with confidence. For regulated domains, even a small hallucination rate is unacceptable. The best teams treated hallucination as a measurable, reducible property of the system, not an inevitable side-effect of LLMs.
A practical target we saw emerge by 2024: keep hallucination rates below 1-2% on high‑stakes query sets, and below 5% on general internal knowledge assistants - with clear signalling when confidence is low.
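Those targets translate directly into a measurable gate. A minimal sketch, assuming each evaluated answer has been labelled (by human review or a grounding checker) as containing a fabrication or not; the 1% and 5% thresholds come from the targets above, and the stricter end of the 1–2% band is an illustrative choice:

```python
def hallucination_rate(labels: list[bool]) -> float:
    # labels[i] is True if answer i contained at least one fabricated fact.
    return sum(labels) / len(labels) if labels else 0.0

def passes_hallucination_gate(rate: float, high_stakes: bool) -> bool:
    # Below 1% on high-stakes query sets, below 5% on general
    # internal knowledge assistants.
    limit = 0.01 if high_stakes else 0.05
    return rate < limit
```

The same 2% rate that is acceptable on an internal assistant fails the high-stakes gate, which is exactly the distinction the targets above are drawing.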
Designing an Evaluation Suite That Surfaces Real Failures
Metric names are not enough; what matters is the dataset you run them on. The strongest RAG evaluation suites we saw between 2022 and 2025 had three parts: curated edge cases from production, representative everyday questions from real users, and adversarial prompts designed with domain experts.
1. Start with 50–100 real questions from tickets, chats, or search logs.
2. Work with domain experts to label correct answers and common traps.
3. Add known "gotcha" queries (similar titles, overlapping policies, ambiguous phrasing).
4. Refresh 10–20% of the set every quarter as products, policies, and user behaviour change.
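The suite described above can be represented as a small labelled dataset with a quarterly refresh step. A sketch under assumed field names (the `EvalCase` schema and the 15% refresh fraction are illustrative, within the 10–20% band from step 4):

```python
import random
from dataclasses import dataclass, field

@dataclass
class EvalCase:
    question: str
    expected_answer: str
    source: str                        # e.g. "ticket", "chat", "search_log", "adversarial"
    tags: list[str] = field(default_factory=list)  # e.g. ["gotcha", "multi-clause"]

def quarterly_refresh(cases: list[EvalCase], fraction: float = 0.15,
                      seed: int = 0) -> list[EvalCase]:
    # Select ~10-20% of the suite to retire and replace each quarter,
    # keeping pace with product, policy, and user-behaviour changes.
    rng = random.Random(seed)
    n = max(1, round(len(cases) * fraction))
    return rng.sample(cases, n)
```

A fixed seed makes the selection reproducible for audit purposes; in practice you would also weight retirement toward stale policies rather than sampling uniformly.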
From One‑Off Evaluation to Continuous Assurance
By late 2024, mature teams weren’t just running evaluation suites at launch; they wired them into CI. Every change to prompts, chunking, retrievers, or models triggered a run against the evaluation set, with hard gates on key metrics.
Deployment rule of thumb
Don’t ship a change that regresses faithfulness or context precision by more than a few percentage points on your critical evaluation sets - even if other metrics improve.
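In CI, that rule of thumb becomes a hard gate comparing candidate metrics against the deployed baseline. A minimal sketch, where the guarded metric names and the 3-percentage-point tolerance ("a few percentage points") are illustrative:

```python
def regression_gate(baseline: dict[str, float], candidate: dict[str, float],
                    guarded: tuple[str, ...] = ("faithfulness", "context_precision"),
                    max_drop: float = 0.03) -> bool:
    # Block the deploy if any guarded metric drops by more than max_drop
    # on the critical evaluation sets, even if other metrics improved.
    return all(candidate[m] >= baseline[m] - max_drop for m in guarded)
```

Wired into the pipeline, every change to prompts, chunking, retrievers, or models runs the evaluation set and calls this gate before the change can ship.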
Treating RAG evaluation as an ongoing discipline rather than a launch checklist was the difference between assistants that quietly got better over time and those that gradually drifted until nobody trusted them.
Key takeaways
- Generic RAG benchmarks hide the failure modes that matter in regulated domains
- Faithfulness and context precision are more important than fluency
- Evaluation sets must be built from real user questions, not synthetic prompts
- Human-labelled baselines are still the gold standard for calibration
- Evaluation needs to run before and after every deployment, not just at launch