Eval Harness
Measure and improve retrieval quality with deterministic evaluation, metrics, and CI integration.
Evaluation is currently experimental. It’s safe to use, but expect some CLI flags, report fields, and defaults to change as the harness matures.
The eval harness is a battery that adds retrieval evaluation capabilities to your Unrag installation. It gives you a structured way to define test datasets, run your retrieval pipeline against them, compute standard metrics (hit@k, recall@k, precision@k, MRR@k), and track quality changes over time.
Unlike the reranker battery, which adds a new method to your engine, the eval harness is primarily a development and CI tool. You use it to measure how well your retrieval works, catch regressions before they reach production, and make informed decisions when tuning chunking or embeddings, or when adding reranking.
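To make the metrics concrete, here is a small illustration (illustrative arithmetic only, not the harness's code) of how hit@k, precision@k, recall@k, and the reciprocal rank behind MRR@k fall out of a single query's ranked results:

```ts
// Illustrative per-query versions of the metrics the harness reports.
// `retrieved` is the ranked list of document/chunk ids returned for one query;
// `relevant` is the ground-truth set of ids that should have been retrieved.
function metricsAtK(retrieved: string[], relevant: Set<string>, k: number) {
  const topK = retrieved.slice(0, k);
  const hits = topK.filter((id) => relevant.has(id)).length;
  const firstHit = topK.findIndex((id) => relevant.has(id)); // -1 if nothing relevant appears

  return {
    hit: hits > 0 ? 1 : 0,                                    // hit@k: did anything relevant show up?
    precision: hits / k,                                      // precision@k: how much of the top k is relevant?
    recall: relevant.size > 0 ? hits / relevant.size : 0,     // recall@k: how much of the ground truth was found?
    reciprocalRank: firstHit === -1 ? 0 : 1 / (firstHit + 1), // averaged across queries, this becomes MRR@k
  };
}

// Example: ground truth is docs A and C; the pipeline returned B, A, D.
metricsAtK(["B", "A", "D"], new Set(["A", "C"]), 3);
// -> { hit: 1, precision: 1/3, recall: 1/2, reciprocalRank: 1/2 }
```

Aggregated over every query in a dataset, these per-query values become the scores you compare between runs.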
Installing the eval battery
```bash
bunx unrag@latest add battery eval
```

This creates several files, including the eval runner script (scripts/unrag-eval.ts) and a sample dataset (.unrag/eval/datasets/sample.json).
It also adds two npm scripts to your package.json:
```json
{
  "scripts": {
    "unrag:eval": "bun run scripts/unrag-eval.ts -- --dataset .unrag/eval/datasets/sample.json",
    "unrag:eval:ci": "bun run scripts/unrag-eval.ts -- --dataset .unrag/eval/datasets/sample.json --ci"
  }
}
```
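Both scripts point the runner at the bundled sample dataset. As a rough mental model only (the field names below are assumptions; the Dataset Format page documents the real schema), a dataset ties together the three things the harness needs: the documents to ingest, the queries to run against them, and the ground truth each query is expected to retrieve.

```ts
// Hypothetical shape for illustration only; see the Dataset Format docs for
// the actual schema used by files like .unrag/eval/datasets/sample.json.
type EvalDataset = {
  documents: { id: string; content: string }[]; // corpus the harness ingests
  queries: {
    id: string;
    text: string;                  // query sent through your retrieval pipeline
    relevantDocumentIds: string[]; // ground truth used to score the ranked results
  }[];
};
```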
Running your first eval

After installation, run the sample evaluation:
```bash
bun run unrag:eval
```

The harness will ingest the sample documents, run the test queries, and write report files. You'll see output like:

```
[unrag:eval] Wrote report: .unrag/eval/runs/<timestamp>-sample/report.json
[unrag:eval] Wrote summary: .unrag/eval/runs/<timestamp>-sample/summary.md
[unrag:eval] Thresholds: pass
```
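That final line is the quality gate the unrag:eval:ci script relies on: if aggregate metrics fall below the configured thresholds, the run should fail so CI can block the change. As a sketch of the idea only (metric names, threshold values, and the report shape here are assumptions, not the harness's actual implementation):

```ts
// Sketch of a CI-style threshold gate. Metric names and values are placeholders;
// the harness's real report fields and defaults may differ.
type Metrics = Record<string, number>;

function failedThresholds(metrics: Metrics, thresholds: Metrics): string[] {
  return Object.entries(thresholds)
    .filter(([name, min]) => (metrics[name] ?? 0) < min)
    .map(([name, min]) => `${name} ${metrics[name] ?? 0} < required ${min}`);
}

const failures = failedThresholds(
  { "recall@10": 0.82, "mrr@10": 0.61 }, // aggregate results from a run (hypothetical)
  { "recall@10": 0.8, "mrr@10": 0.65 },  // minimum acceptable values (hypothetical)
);

if (failures.length > 0) {
  console.error(failures.join("\n")); // e.g. "mrr@10 0.61 < required 0.65"
  process.exit(1);                    // non-zero exit fails the CI job
}
```

See CI Integration below for the real flags, defaults, and threshold configuration.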
Full documentation

The eval harness is a substantial feature with its own documentation section covering everything from dataset design to CI integration:
Evaluation Overview
Why retrieval evaluation matters and how the harness works
Getting Started
Complete setup guide with your first evaluation
Dataset Format
How to structure documents, queries, and ground truth
Understanding Metrics
What each metric measures and how to interpret results
Running Evals
All configuration options and CLI flags
CI Integration
Automated quality gates and threshold checking
Comparing Runs
Baseline diffs and tracking changes over time
RAG Handbook: Evaluation
Comprehensive guide to measuring RAG quality, building evaluation datasets, and offline/online evaluation strategies
