Stack Truth · May 2026 · 6 min

The smallest RAG eval loop that actually catches drift

Most RAG pipelines pass the demo and fail in production — and the failures are quiet: an answer that sounds grounded but isn't, the right documents retrieved in the wrong order, a chunk cut one line short of the answer. You can't catch any of it by eyeballing outputs. Here's the smallest eval loop that does.

Why RAG passes demos and fails production

The failures are predictable: answers that appear grounded in the retrieved context but make claims it does not support, retrieval that returns the right documents in the wrong order, and chunks that contain the answer but get cut at the wrong boundary (Prem AI). None of these shows up when you read a handful of outputs. With roughly 70% of engineering teams now running RAG in production or planning to within a year, eval infrastructure is a prerequisite, not a nice-to-have.

Two surfaces, four metrics

A RAG system fails in two distinct places, and you have to measure both:

Retrieval: did you fetch the right context?

Generation: did the answer use the context correctly?

Good retrieval can't save a bad generation step, and a perfect generator can't rescue garbage retrieval. Four canonical metrics cover the common failure modes when read together as a panel:

Metric	What it checks	Common alert threshold
Context precision	Of what was retrieved, how much was relevant	< 0.70
Context recall	Of what was needed, how much was retrieved	< 0.80
Faithfulness	Are the answer's claims supported by the context	< 0.75
Answer relevancy	Does the answer actually address the question	< 0.80

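RAGAS-style metrics, scored by an LLM-as-judge, largely without labelled ground truth. Context recall is the usual exception: it is typically scored against a reference answer, which the golden set supplies. Thresholds are common starting points (see Sources), not universal, so calibrate them against your own golden set. Faithfulness is computed as supported claims ÷ total claims.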
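As a minimal sketch of the arithmetic, assume the judge has already labelled each answer claim as supported or not and each retrieved chunk as relevant or not. The function names and the `THRESHOLDS` dict are illustrative and not taken from any library, and the precision here is a simple unranked ratio, whereas RAGAS-style implementations typically weight by rank.

```python
from typing import Dict, List

# Alert thresholds from the table above; starting points, not universal values.
THRESHOLDS: Dict[str, float] = {
    "context_precision": 0.70,
    "context_recall": 0.80,
    "faithfulness": 0.75,
    "answer_relevancy": 0.80,
}

def ratio(hits: int, total: int) -> float:
    """Return hits / total, treating an empty denominator as 0.0."""
    return hits / total if total else 0.0

def faithfulness(claim_supported: List[bool]) -> float:
    """Supported claims divided by total claims."""
    return ratio(sum(claim_supported), len(claim_supported))

def context_precision(chunk_relevant: List[bool]) -> float:
    """Share of retrieved chunks judged relevant (simplified, unranked)."""
    return ratio(sum(chunk_relevant), len(chunk_relevant))

def alerts(scores: Dict[str, float]) -> List[str]:
    """Names of metrics scoring below their threshold; a missing metric counts as failing."""
    return [name for name, bar in THRESHOLDS.items() if scores.get(name, 0.0) < bar]
```

An answer with four claims, three of them supported, scores 0.75 on faithfulness. That sits exactly on the bar, so it does not raise an alert under the strict `<` comparison used here.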

The smallest loop that works

You do not need a heavyweight platform to start. The minimum viable loop is four moving parts:

1. A golden set. 30–100 real questions with known-good answers, drawn from actual usage. This is the whole foundation — build it first.
2. An LLM judge. Score the four metrics with a strong model. For faithfulness, the judge breaks each answer into claims and checks each claim against the retrieved chunks. It runs without ground-truth labels and costs roughly $0.001–$0.003 per test case (datavlab), which is cheap enough to run constantly.
3. A CI gate. Run the panel on every change to prompts, chunking, the index, or the model. If faithfulness drops below your bar, the change doesn't ship. This is what turns "it felt fine" into a number.
4. A production flywheel. Sample real traffic, turn the failures into new golden-set cases, re-run. Every failure becomes a permanent test.
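The CI gate in step 3 can be sketched as a single function, under some stated assumptions. The `judge` callable is a hypothetical stand-in for whatever LLM-judge call you use, returning a metric-to-score dict per case. The golden-set case shape is also illustrative, and the gate compares the mean of each metric across the golden set against its bar.

```python
from statistics import mean
from typing import Callable, Dict, List

Case = Dict[str, str]                        # e.g. {"question": ..., "expected": ...}
Judge = Callable[[Case], Dict[str, float]]   # hypothetical: returns metric -> score

def run_gate(golden_set: List[Case], judge: Judge, bars: Dict[str, float]) -> Dict[str, float]:
    """Score every golden-set case; exit non-zero if any mean metric is below its bar."""
    if not golden_set:
        raise ValueError("golden set is empty")
    per_case = [judge(case) for case in golden_set]
    # Average each gated metric across all cases.
    means = {metric: mean(scores[metric] for scores in per_case) for metric in bars}
    failed = {metric: value for metric, value in means.items() if value < bars[metric]}
    if failed:
        # SystemExit with a message prints it to stderr and exits with status 1,
        # which is enough to fail a CI job.
        raise SystemExit(f"eval gate failed: {failed}")
    return means
```

Gating on the mean is one design choice among several. A stricter variant might fail the build when any single case drops below the bar, at the cost of more noise from judge variance.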

Catching drift

Drift is the reason the loop has to live in production, not just pre-launch. Your documents change, the underlying model gets updated, and usage patterns shift — and any of the three can quietly degrade answers that used to be fine (Braintrust). The flywheel catches it because the same panel runs continuously against fresh traffic, so a faithfulness dip after a model update shows up as a failing metric instead of a wave of user complaints three weeks later.
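One way to sketch the drift check and the flywheel step, assuming sampled production records have already been scored by the judge and carry `question` and `faithfulness` keys. The record shape, the function names, and the 0.05 tolerance are all illustrative assumptions rather than recommended values.

```python
from statistics import mean
from typing import Dict, List

def drifted(baseline: List[float], recent: List[float], max_drop: float = 0.05) -> bool:
    """True when the recent mean score is more than max_drop below the baseline mean."""
    return mean(baseline) - mean(recent) > max_drop

def promote_failures(scored: List[Dict], bar: float, golden_set: List[Dict]) -> int:
    """Append production cases scoring below `bar` to the golden set; return how many were added."""
    known = {case["question"] for case in golden_set}
    added = 0
    for record in scored:
        if record["faithfulness"] < bar and record["question"] not in known:
            # The known-good answer is left empty for a human reviewer to fill in.
            golden_set.append({"question": record["question"], "expected": None})
            known.add(record["question"])
            added += 1
    return added
```

Comparing window means is deliberately crude. It flags a sustained dip after a model or document update, but it will miss a regression confined to a small slice of traffic, which is why failing cases are also promoted into the golden set one by one.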

A RAG system without an eval loop isn't stable. It's untested, and quietly drifting toward the first answer that costs you trust.

The lesson

The smallest loop — golden set, LLM judge, CI gate, production flywheel — is a few days of setup and the difference between a RAG system you can change with confidence and one you're afraid to touch. Start small. The panel matters more than the platform.

You can't eyeball faithfulness. Measure it, gate on it, and let production feed the test set.

Sources
