
Evaluation checklist

A practical checklist for setting up evals and monitoring.

Evaluation is how you know whether your RAG system works. This checklist covers setting up a robust evaluation practice—not just running evals once, but building them into your development and operations workflow.


Dataset fundamentals

  • Eval dataset exists. You have a defined set of queries with expected results or ground truth answers (a possible schema is sketched after this list).

  • Dataset covers major use cases. Each primary use case your system is designed for is represented in the dataset.

  • Dataset includes edge cases. Queries that are ambiguous, complex, or likely to fail are included, not just the easy cases.

  • Dataset includes negative cases. Queries that should result in refusal ("I don't know") are included to test refusal behavior.

  • Dataset is realistic. Queries resemble what real users actually ask, not just what developers imagine they'll ask.

  • Dataset is versioned. Changes to the dataset are tracked so you can compare results over time.

  • Dataset maintenance is assigned. Someone owns keeping the dataset current as the system evolves.
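
As a concrete starting point, one possible shape for the dataset is sketched below in TypeScript. This is an illustrative schema, not an Unrag API; every type and field name here (EvalCase, kind, tags, groundTruth, and so on) is an assumption to adapt to your own pipeline.

```ts
// evals/dataset.ts (assumed path): illustrative shape for a versioned eval dataset.
// All names in this sketch are assumptions, not a prescribed schema.

export type EvalCaseKind = "standard" | "edge" | "negative"; // negative = should refuse

export interface EvalCase {
  id: string;                // stable ID so results can be compared across runs
  query: string;             // phrased the way real users actually ask
  kind: EvalCaseKind;
  tags: string[];            // use case / content category, used later for slicing
  groundTruth?: GroundTruth; // omitted for negative cases that should refuse
}

export interface GroundTruth {
  expectedAnswer: string;       // what a correct answer looks like
  acceptableAnswers?: string[]; // alternatives, for ambiguous cases
  expectedSourceIds: string[];  // documents that should be retrieved
  reviewedBy?: string;          // who validated the annotation
}

export interface EvalDataset {
  version: string; // bump when cases are added or changed; keep the file in git
  cases: EvalCase[];
}
```

Keeping the dataset in a typed file (or a JSON file validated against a type like this) makes versioning in git and reviewing changes straightforward.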


Ground truth quality

  • Ground truth is documented. For each query, you've recorded what a correct answer looks like (an example entry follows this list).

  • Ground truth includes sources. You know which documents should be retrieved to answer each query.

  • Ground truth is validated. Someone has reviewed the ground truth annotations for accuracy.

  • Ambiguous cases are acknowledged. Where multiple answers could be correct, the evaluation accounts for this.
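
Continuing the sketch above, a single annotated case might look like the following. The content is invented purely for illustration; the point is that the expected answer, acceptable alternatives for ambiguous cases, expected sources, and reviewer all live next to the query.

```ts
import type { EvalCase } from "./dataset"; // the sketch from the previous section

// Invented example case; every value here is illustrative.
const refundPolicyCase: EvalCase = {
  id: "billing-refund-window",
  query: "how long do I have to request a refund?",
  kind: "standard",
  tags: ["billing", "policy"],
  groundTruth: {
    expectedAnswer: "Refunds can be requested within 30 days of purchase.",
    // Ambiguity acknowledged: either phrasing counts as correct.
    acceptableAnswers: ["You have 30 days from the purchase date to request a refund."],
    expectedSourceIds: ["docs/billing/refund-policy.md"],
    reviewedBy: "support-team-lead",
  },
};
```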


Metric selection

  • Retrieval metrics are tracked. You're measuring recall@K, precision@K, MRR, or similar for the retrieval stage (reference implementations are sketched after this list).

  • Generation metrics are tracked. You're measuring faithfulness, correctness, or usefulness for the generation stage.

  • Metrics match product goals. The metrics you're optimizing actually matter for your use case (high-stakes domains prioritize faithfulness; efficiency domains prioritize task completion).

  • Metric definitions are documented. The team agrees on how each metric is calculated and what thresholds are acceptable.
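
For reference, the retrieval metrics named above reduce to a few lines once you have, per query, the ranked list of retrieved document IDs and the set of relevant IDs. A minimal sketch, with no framework assumed:

```ts
// evals/metrics.ts (assumed path): retrieval metrics over ranked document IDs.
// `retrieved` is ranked best-first; `relevant` is the set of ground-truth source IDs.

export function recallAtK(retrieved: string[], relevant: Set<string>, k: number): number {
  if (relevant.size === 0) return 1; // nothing to find; treat as trivially satisfied
  const hits = retrieved.slice(0, k).filter((id) => relevant.has(id)).length;
  return hits / relevant.size;
}

export function precisionAtK(retrieved: string[], relevant: Set<string>, k: number): number {
  const hits = retrieved.slice(0, k).filter((id) => relevant.has(id)).length;
  return hits / k;
}

// Reciprocal rank of the first relevant result; average this over queries to get MRR.
export function reciprocalRank(retrieved: string[], relevant: Set<string>): number {
  const index = retrieved.findIndex((id) => relevant.has(id));
  return index === -1 ? 0 : 1 / (index + 1);
}
```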


Retrieval evaluation

  • You can identify relevant documents. For each test query, you know which document(s) should be retrieved.

  • Retrieval is evaluated independently. You measure retrieval quality separately from generation quality.

  • topK settings are validated. You've confirmed that your topK value balances recall against noise (see the sweep sketched after this list).

  • Thresholds are calibrated. If you use similarity thresholds, you've tested that they filter appropriately.
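
To validate topK (and, similarly, a similarity threshold), it helps to sweep candidate values over the dataset and compare aggregate retrieval metrics. The sketch below assumes a retrieve(query, { topK }) function returning scored document IDs; that signature is an assumption, not Unrag's API, and it reuses the metric functions and dataset types sketched earlier.

```ts
import { precisionAtK, recallAtK } from "./metrics"; // sketched in the previous section
import type { EvalDataset } from "./dataset";

// Assumed retriever signature; adapt to whatever your pipeline exposes.
type Retriever = (
  query: string,
  opts: { topK: number; minScore?: number },
) => Promise<Array<{ documentId: string; score: number }>>;

export async function sweepTopK(
  dataset: EvalDataset,
  retrieve: Retriever,
  candidates: number[] = [3, 5, 10, 20],
) {
  for (const topK of candidates) {
    let recall = 0;
    let precision = 0;
    let counted = 0;
    for (const c of dataset.cases) {
      if (!c.groundTruth) continue; // negative cases have no relevant documents
      const relevant = new Set(c.groundTruth.expectedSourceIds);
      const ids = (await retrieve(c.query, { topK })).map((r) => r.documentId);
      recall += recallAtK(ids, relevant, topK);
      precision += precisionAtK(ids, relevant, topK);
      counted++;
    }
    if (counted === 0) continue;
    console.log(
      `topK=${topK} recall=${(recall / counted).toFixed(3)} precision=${(precision / counted).toFixed(3)}`,
    );
  }
}
```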


Generation evaluation

  • Faithfulness is measured. You check whether generated answers are supported by the provided context.

  • Citation accuracy is verified (if applicable). If your system cites sources, you verify that citations are correct.

  • Refusal behavior is tested. Queries that should refuse are confirmed to refuse appropriately (see the rule-based checks after this list).

  • Format compliance is checked. Answers match expected format (length, structure, style).
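
Several of the checks above can be asserted directly in code, without an LLM judge: refusal detection, citation accuracy against the retrieved set, and basic format compliance. A sketch, where the answer shape (text plus cited document IDs) is an assumption:

```ts
// Rule-based generation checks. The answer shape used here is an assumption;
// adapt it to whatever your pipeline returns.

interface GeneratedAnswer {
  text: string;
  citedDocumentIds: string[];
}

const REFUSAL_PATTERNS = [/i don't know/i, /not enough information/i, /cannot find/i];

export function looksLikeRefusal(answer: GeneratedAnswer): boolean {
  return REFUSAL_PATTERNS.some((pattern) => pattern.test(answer.text));
}

// Every citation should point at a document that was actually retrieved.
export function citationsAreValid(answer: GeneratedAnswer, retrievedIds: string[]): boolean {
  const retrieved = new Set(retrievedIds);
  return answer.citedDocumentIds.every((id) => retrieved.has(id));
}

// Format compliance: keep it cheap and explicit (length limits, required structure).
export function matchesFormat(answer: GeneratedAnswer, maxChars = 1200): boolean {
  return answer.text.length > 0 && answer.text.length <= maxChars;
}
```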


LLM-as-judge setup

  • Rubric is defined. If using LLM-as-judge, you have a clear rubric specifying what constitutes good and bad answers (a judge sketch follows this list).

  • Judge prompt is tested. The judge prompt produces consistent, reasonable scores on sample cases.

  • Judge has access to context. The judge sees the retrieved context, not just the question and answer, to assess faithfulness.

  • Judge scores are validated against humans. A sample of judge scores has been compared to human ratings to verify alignment.

  • Judge model is appropriate. The judge model is capable enough to evaluate responses reliably.
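
A faithfulness judge can be a single prompt that sees the question, the retrieved context, and the answer, and scores the answer against a written rubric. The sketch below uses a generic callLLM placeholder rather than any particular SDK; the rubric wording and the 1-5 scale are illustrative choices, not a standard.

```ts
// LLM-as-judge sketch. `callLLM` is a placeholder for whatever client you use;
// it is assumed to take a prompt string and return the model's text output.
declare function callLLM(prompt: string): Promise<string>;

const RUBRIC = `
Score the ANSWER for faithfulness to the CONTEXT on a 1-5 scale:
5 = every claim is directly supported by the context
3 = mostly supported, with minor unsupported details
1 = contradicts the context or is largely unsupported
Respond with JSON: {"score": <1-5>, "reason": "<one sentence>"}.
`;

export async function judgeFaithfulness(question: string, context: string, answer: string) {
  // The judge sees the retrieved context, not just the question and answer.
  const prompt = `${RUBRIC}\nQUESTION:\n${question}\n\nCONTEXT:\n${context}\n\nANSWER:\n${answer}`;
  const raw = await callLLM(prompt);
  try {
    return JSON.parse(raw) as { score: number; reason: string };
  } catch {
    return { score: NaN, reason: `unparseable judge output: ${raw.slice(0, 200)}` };
  }
}
```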


Regression testing

  • Baseline is established. You have stored results from a known-good configuration to compare against.

  • Regressions are detected. When a metric degrades beyond a threshold, the regression is flagged (see the comparison sketched after this list).

  • Blocking regressions are defined. You've decided which regressions should block deployment and which are acceptable.

  • CI integration is in place (if applicable). Evals run automatically on relevant changes.
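
Regression detection is mostly bookkeeping: store the aggregate metrics from a known-good run, then fail the eval job when a tracked metric drops by more than an agreed tolerance. A sketch, where the baseline file layout and the tolerance value are assumptions:

```ts
import { readFileSync } from "node:fs";

type MetricSummary = Record<string, number>; // e.g. { "recall@5": 0.86, "faithfulness": 4.3 }

// Which drops block deployment versus just warn is a team decision; 0.03 is illustrative.
const BLOCKING_TOLERANCE = 0.03; // absolute drop allowed before failing the run

export function checkRegressions(current: MetricSummary, baselinePath = "evals/baseline.json") {
  const baseline: MetricSummary = JSON.parse(readFileSync(baselinePath, "utf8"));
  const failures: string[] = [];
  for (const [name, baseValue] of Object.entries(baseline)) {
    const value = current[name];
    if (value === undefined) continue; // metric no longer tracked
    if (baseValue - value > BLOCKING_TOLERANCE) {
      failures.push(`${name}: ${baseValue.toFixed(3)} -> ${value.toFixed(3)}`);
    }
  }
  if (failures.length > 0) {
    console.error(`Blocking regressions:\n${failures.join("\n")}`);
    process.exitCode = 1; // lets a CI step fail the build
  }
}
```

Because the check sets a non-zero exit code, it can run as a CI step and block deployment on the regressions you've decided are blocking.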


Slicing and debugging

  • Results are sliceable. You can break down metrics by query type, content category, or other dimensions.

  • Failure cases are reviewable. You can inspect individual failing queries to understand what went wrong.

  • Retrieval and generation failures are distinguished. You can tell whether a bad answer came from bad retrieval or bad generation (see the triage sketch after this list).
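
A simple triage rule covers most debugging sessions: if none of the expected sources made it into the retrieved set, treat the failure as a retrieval problem; if they did and the answer is still wrong, treat it as a generation problem. A sketch, with an assumed per-query result shape, plus a per-tag slice:

```ts
// Per-query eval result; the shape is illustrative.
interface CaseResult {
  caseId: string;
  tags: string[];
  retrievedIds: string[];
  expectedSourceIds: string[];
  answerCorrect: boolean;
}

export type FailureKind = "retrieval" | "generation" | "none";

export function classifyFailure(r: CaseResult): FailureKind {
  if (r.answerCorrect) return "none";
  const retrieved = new Set(r.retrievedIds);
  const foundAnySource = r.expectedSourceIds.some((id) => retrieved.has(id));
  return foundAnySource ? "generation" : "retrieval";
}

// Slice failure rates by tag so you can see which use cases are hurting.
export function failureRateByTag(results: CaseResult[]): Map<string, number> {
  const totals = new Map<string, { failed: number; total: number }>();
  for (const r of results) {
    for (const tag of r.tags) {
      const t = totals.get(tag) ?? { failed: 0, total: 0 };
      t.total++;
      if (!r.answerCorrect) t.failed++;
      totals.set(tag, t);
    }
  }
  return new Map([...totals].map(([tag, t]) => [tag, t.failed / t.total] as [string, number]));
}
```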


Online evaluation

  • Production metrics are tracked. You're measuring something in production (task completion, user satisfaction, escalation rate).

  • User feedback is collected. Users can rate or flag responses, and this data is captured (an event sketch follows this list).

  • Feedback is actionable. Negative feedback can be traced to specific queries and investigated.

  • Feedback loop exists. Production feedback informs dataset updates and system improvements.
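
The design decision that makes feedback actionable is keeping a link from every rating back to the logged query, retrieved context, and answer. A sketch of the kind of event you might record; all names are assumptions:

```ts
// Feedback event, stored alongside the logged query and retrieval trace.
// All field names here are illustrative.
export interface FeedbackEvent {
  responseId: string; // ties back to the logged query, context, and answer
  rating: "up" | "down";
  comment?: string;
  createdAt: string; // ISO timestamp
}

export async function recordFeedback(
  event: FeedbackEvent,
  save: (e: FeedbackEvent) => Promise<void>, // e.g. an insert into your app database
) {
  await save(event);
  if (event.rating === "down") {
    // Negative feedback should be traceable and should feed the eval dataset.
    console.warn(`negative feedback on response ${event.responseId}`);
  }
}
```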


Monitoring and alerting

  • Quality metrics are monitored over time. You can see trends in retrieval and generation quality.

  • Alerts exist for quality degradation. Significant drops in key metrics trigger investigation (see the drift check after this list).

  • Evaluation runs regularly. Evals aren't just for launch—they run periodically to catch drift.
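
Scheduled eval runs plus a simple threshold alert catch most drift. The sketch below compares the latest score to a trailing average and raises an alert on a large relative drop; the notification mechanism is a placeholder:

```ts
// Alert when the latest eval score drops well below the recent trend.
// `notify` is a placeholder for whatever alerting channel you use.
export function checkForDrift(
  history: number[],            // e.g. weekly faithfulness scores, oldest first
  notify: (message: string) => void,
  relativeDrop = 0.1,           // alert on a >10% drop versus the trailing average
) {
  if (history.length < 4) return; // not enough history to establish a trend
  const latest = history[history.length - 1];
  const trailing = history.slice(0, -1).slice(-4);
  const avg = trailing.reduce((a, b) => a + b, 0) / trailing.length;
  if (avg > 0 && (avg - latest) / avg > relativeDrop) {
    notify(`eval score dropped from ~${avg.toFixed(2)} to ${latest.toFixed(2)}`);
  }
}
```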


Process and ownership

  • Eval process is documented. The team knows how to run evals, interpret results, and update datasets.

  • Eval ownership is assigned. Someone is responsible for maintaining the evaluation infrastructure and datasets.

  • Results are reviewed. Eval results are actively used to make decisions, not just collected.

  • Dataset evolves with the product. As features change and usage patterns shift, the eval dataset is updated to match.
