Eval Harness

Measure and improve retrieval quality with deterministic evaluation, metrics, and CI integration.

Evaluation is currently experimental. It’s safe to use, but expect some CLI flags, report fields, and defaults to change as the harness matures.

The eval harness is a battery that adds retrieval evaluation capabilities to your Unrag installation. It gives you a structured way to define test datasets, run your retrieval pipeline against them, compute standard metrics (hit@k, recall@k, precision@k, MRR@k), and track quality changes over time.
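For readers unfamiliar with these metrics, the following is a minimal sketch of their textbook definitions for a single query. It is not Unrag's implementation: the `metricsAtK` function and the `QueryMetrics` type are hypothetical names used only for illustration, and conventions vary (for example, some tools divide precision by the number of results actually returned rather than by k).

```typescript
// Illustrative only: standard top-k retrieval metrics for one query.
// `metricsAtK` and `QueryMetrics` are hypothetical names, not part of Unrag.
type QueryMetrics = { hit: number; recall: number; precision: number; mrr: number };

function metricsAtK(retrieved: string[], relevant: Set<string>, k: number): QueryMetrics {
  // Drop duplicate IDs (keeping first occurrences), then keep the top k.
  const topK = Array.from(new Set(retrieved)).slice(0, k);
  const matches = topK.filter((id) => relevant.has(id)).length;
  // Zero-based rank of the first relevant result, or -1 if none was found.
  const firstRelevant = topK.findIndex((id) => relevant.has(id));
  return {
    hit: matches > 0 ? 1 : 0,                               // any relevant result in the top k?
    recall: relevant.size > 0 ? matches / relevant.size : 0, // share of relevant items found
    precision: k > 0 ? matches / k : 0,                      // share of the top k that is relevant
    mrr: firstRelevant >= 0 ? 1 / (firstRelevant + 1) : 0,   // reciprocal rank of the first match
  };
}

// Made-up example: one of three relevant documents appears in the top 3, at rank 2.
const example = metricsAtK(["d3", "d1", "d7", "d2"], new Set(["d1", "d2", "d9"]), 3);
console.log(example); // hit = 1, recall = 1/3, precision = 1/3, mrr = 0.5
```

Dataset-level numbers are typically the mean of each per-query value across all test queries.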

Unlike the reranker battery, which adds a new method to your engine, the eval harness is primarily a development and CI tool. You use it to measure how well your retrieval works, catch regressions before they reach production, and make informed decisions when tuning chunking, changing embeddings, or adding reranking.

Installing the eval battery

bunx unrag@latest add battery eval

This creates several files:

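sample.json (the sample dataset, at .unrag/eval/datasets/sample.json)
config.json
unrag-eval.ts (the runner script, at scripts/unrag-eval.ts)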

It also adds two npm scripts to your package.json:

{
  "scripts": {
    "unrag:eval": "bun run scripts/unrag-eval.ts -- --dataset .unrag/eval/datasets/sample.json",
    "unrag:eval:ci": "bun run scripts/unrag-eval.ts -- --dataset .unrag/eval/datasets/sample.json --ci"
  }
}

Running your first eval

After installation, run the sample evaluation:

bun run unrag:eval

The harness will ingest the sample documents, run the test queries, and write report files. You'll see output like:

[unrag:eval] Wrote report: .unrag/eval/runs/<timestamp>-sample/report.json
[unrag:eval] Wrote summary: .unrag/eval/runs/<timestamp>-sample/summary.md
[unrag:eval] Thresholds: pass
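The final line reports whether the run met its configured thresholds. As a rough sketch of what such a gate involves, the snippet below compares a set of metric values against minimums and lists the ones that fall short. This is an assumption about the general technique, not the harness's actual code or config format: `failedThresholds`, the metric keys, and all numbers are hypothetical.

```typescript
// Illustrative only: a minimal threshold gate. `failedThresholds` is a
// hypothetical helper; the metric names and values below are made up.
type Metrics = Record<string, number>;

// Returns the names of metrics that fall below their required minimum.
// A metric missing from `metrics` counts as a failure.
function failedThresholds(metrics: Metrics, minimums: Metrics): string[] {
  return Object.entries(minimums)
    .filter(([name, min]) => (metrics[name] ?? -Infinity) < min)
    .map(([name]) => name);
}

const failures = failedThresholds(
  { "recall@10": 0.82, "mrr@10": 0.61 }, // measured values (made up)
  { "recall@10": 0.8, "mrr@10": 0.65 },  // required minimums (made up)
);
console.log(failures.length === 0 ? "pass" : `fail: ${failures.join(", ")}`);
// prints "fail: mrr@10"
```

In a CI job, a gate like this would usually turn any failure into a non-zero exit code so the build stops before a regression ships.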

Full documentation

The eval harness is a substantial feature with its own documentation section, which covers everything from dataset design to CI integration.
