AI | 8 min read

RAG evaluation that actually works in production

Sarah Kemble | Principal AI Engineer | May 28, 2026

Retrieval-augmented generation looks deceptively simple in a demo: embed some documents, retrieve the closest chunks, and let the model answer. The trouble starts the moment real users ask real questions against messy, evolving data.

The teams that succeed treat evaluation as a first-class part of the system, not an afterthought. Before we ship a single answer to production, we build a golden dataset of question/answer pairs drawn from real user intent, and we score retrieval quality and answer faithfulness on every change.
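To make the retrieval half of that scoring concrete, here is a minimal sketch in Python. It assumes each golden example carries a set of chunk IDs that a human marked as relevant, and that the retriever is a plain function returning ranked chunk IDs. `GoldenExample`, `evaluate_retrieval`, and the choice of recall@k and mean reciprocal rank are illustrative assumptions, not a description of the pipeline used here. Answer faithfulness needs its own scorer and is not covered.

```python
from __future__ import annotations

from dataclasses import dataclass
from typing import Callable


@dataclass
class GoldenExample:
    # Hypothetical schema: one question plus the chunk IDs judged relevant.
    question: str
    relevant_ids: set[str]


def recall_at_k(retrieved: list[str], relevant: set[str], k: int) -> float:
    """Fraction of the relevant chunks that appear in the top k results."""
    if not relevant:
        return 0.0
    return len(set(retrieved[:k]) & relevant) / len(relevant)


def reciprocal_rank(retrieved: list[str], relevant: set[str]) -> float:
    """1 / rank of the first relevant chunk, or 0.0 if none was retrieved."""
    for rank, chunk_id in enumerate(retrieved, start=1):
        if chunk_id in relevant:
            return 1.0 / rank
    return 0.0


def evaluate_retrieval(
    golden: list[GoldenExample],
    retrieve: Callable[[str], list[str]],
    k: int = 5,
) -> dict[str, float]:
    """Average recall@k and reciprocal rank over the whole golden set."""
    if not golden:
        raise ValueError("golden set is empty")
    recalls, ranks = [], []
    for example in golden:
        retrieved = retrieve(example.question)
        recalls.append(recall_at_k(retrieved, example.relevant_ids, k))
        ranks.append(reciprocal_rank(retrieved, example.relevant_ids))
    n = len(golden)
    return {"recall_at_k": sum(recalls) / n, "mrr": sum(ranks) / n}
```

Running this on every change and comparing the two numbers against the previous run is one way to catch a retrieval regression before it reaches users.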

Citation accuracy is the metric compliance teams care about most. An answer that is correct but unsupported is still a liability. We instrument every response so that each claim can be traced back to its source, and we fail the build if citation coverage drops.
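A citation coverage gate could look roughly like the sketch below. It assumes responses have already been split into claims, each carrying the source IDs it cites, and it counts a claim as covered when at least one cited ID is among the sources actually retrieved for that response. The `Claim` and `Response` types and the 0.95 threshold are hypothetical. Note that this checks traceability only; it does not verify that the cited source really supports the claim.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class Claim:
    # Hypothetical schema: a single claim and the source IDs it cites.
    text: str
    cited_ids: list[str] = field(default_factory=list)


@dataclass
class Response:
    claims: list[Claim]
    retrieved_ids: set[str]  # sources that were actually passed to the model


def citation_coverage(responses: list[Response]) -> float:
    """Share of claims citing at least one source that was retrieved."""
    total = 0
    covered = 0
    for response in responses:
        for claim in response.claims:
            total += 1
            if any(cid in response.retrieved_ids for cid in claim.cited_ids):
                covered += 1
    # With no claims at all there is nothing uncovered, so report 1.0.
    return covered / total if total else 1.0


def check_coverage(responses: list[Response], threshold: float = 0.95) -> None:
    """Exit non-zero (failing a CI step) when coverage is under threshold."""
    coverage = citation_coverage(responses)
    if coverage < threshold:
        raise SystemExit(
            f"citation coverage {coverage:.1%} is below {threshold:.1%}"
        )
```

Called at the end of an evaluation run, `check_coverage` makes the process exit with a non-zero status, which most CI systems treat as a failed build.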

Finally, evaluation never stops at launch. We log live queries, sample them for human review, and feed regressions back into the golden set. The result is a system that gets measurably better over time instead of silently drifting.
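The feedback loop described above can be sketched as two small steps: draw a reproducible random sample of logged queries for reviewers, then add any answer a reviewer corrected to the golden set. Everything named here (`LoggedQuery`, `sample_for_review`, `promote_regressions`, and a golden set stored as a question-to-answer dict) is an assumption for illustration. A real system would likely stratify the sample and deduplicate near-identical questions.

```python
from __future__ import annotations

import random
from dataclasses import dataclass
from typing import Optional


@dataclass
class LoggedQuery:
    # Hypothetical log record for one production request.
    query_id: str
    question: str
    answer: str


def sample_for_review(
    logs: list[LoggedQuery], rate: float, seed: Optional[int] = None
) -> list[LoggedQuery]:
    """Pick roughly `rate` of the logged queries, uniformly at random."""
    if not 0.0 <= rate <= 1.0:
        raise ValueError("rate must be between 0 and 1")
    count = min(len(logs), round(len(logs) * rate))
    # A seeded generator keeps the sample reproducible for audits.
    return random.Random(seed).sample(logs, count)


def promote_regressions(
    golden: dict[str, str],
    reviewed: list[tuple[LoggedQuery, Optional[str]]],
) -> int:
    """Add reviewer-corrected answers to the golden set.

    Each item pairs a logged query with the reviewer's corrected answer,
    or None when the reviewer accepted the original answer.
    Returns the number of new golden entries.
    """
    added = 0
    for query, corrected in reviewed:
        if corrected is not None and query.question not in golden:
            golden[query.question] = corrected
            added += 1
    return added
```

Once promoted, a failed query becomes a permanent regression test: every later change is scored against it along with the rest of the golden set.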
