Eval Harness
Measure and improve retrieval quality with deterministic evaluation, metrics, and CI integration.
Evaluation is currently experimental. It’s safe to use, but expect some CLI flags, report fields, and defaults to change as the harness matures.
The eval harness is a battery that adds retrieval evaluation capabilities to your Unrag installation. It gives you a structured way to define test datasets, run your retrieval pipeline against them, compute standard metrics (hit@k, recall@k, precision@k, MRR@k), and track quality changes over time.
Unlike the reranker battery, which adds a new method to your engine, the eval harness is primarily a development and CI tool. You use it to measure how well your retrieval works, catch regressions before they reach production, and make informed decisions when tuning chunking or embeddings, or when adding reranking.
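To make the metrics concrete, here is a small illustration (illustrative arithmetic only, not the harness's code) of how hit@k, precision@k, recall@k, and the reciprocal rank behind MRR@k fall out of a single query's ranked results:

```ts
// Illustrative per-query versions of the metrics the harness reports.
// `retrieved` is the ranked list of document/chunk ids returned for one query;
// `relevant` is the ground-truth set of ids that should have been retrieved.
function metricsAtK(retrieved: string[], relevant: Set<string>, k: number) {
  const topK = retrieved.slice(0, k);
  const hits = topK.filter((id) => relevant.has(id)).length;
  const firstHit = topK.findIndex((id) => relevant.has(id)); // -1 if nothing relevant appears

  return {
    hit: hits > 0 ? 1 : 0,                                    // hit@k: did anything relevant show up?
    precision: hits / k,                                      // precision@k: how much of the top k is relevant?
    recall: relevant.size > 0 ? hits / relevant.size : 0,     // recall@k: how much of the ground truth was found?
    reciprocalRank: firstHit === -1 ? 0 : 1 / (firstHit + 1), // averaged across queries, this becomes MRR@k
  };
}

// Example: ground truth is docs A and C; the pipeline returned B, A, D.
metricsAtK(["B", "A", "D"], new Set(["A", "C"]), 3);
// -> { hit: 1, precision: 1/3, recall: 1/2, reciprocalRank: 1/2 }
```

Aggregated over every query in a dataset, these per-query values become the scores you compare between runs.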
Installing the eval battery
```bash
bunx unrag@latest add battery eval
```

This creates several files, including the eval runner script (scripts/unrag-eval.ts) and a sample dataset (.unrag/eval/datasets/sample.json).
It also adds two npm scripts to your package.json:
```json
{
  "scripts": {
    "unrag:eval": "bun run scripts/unrag-eval.ts -- --dataset .unrag/eval/datasets/sample.json",
    "unrag:eval:ci": "bun run scripts/unrag-eval.ts -- --dataset .unrag/eval/datasets/sample.json --ci"
  }
}
```
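Both scripts point the runner at the bundled sample dataset. As a rough mental model only (the field names below are assumptions; the Dataset Format page documents the real schema), a dataset ties together the three things the harness needs: the documents to ingest, the queries to run against them, and the ground truth each query is expected to retrieve.

```ts
// Hypothetical shape for illustration only; see the Dataset Format docs for
// the actual schema used by files like .unrag/eval/datasets/sample.json.
type EvalDataset = {
  documents: { id: string; content: string }[]; // corpus the harness ingests
  queries: {
    id: string;
    text: string;                  // query sent through your retrieval pipeline
    relevantDocumentIds: string[]; // ground truth used to score the ranked results
  }[];
};
```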
Running your first eval

After installation, run the sample evaluation:
```bash
bun run unrag:eval
```

The harness will ingest the sample documents, run the test queries, and write report files. You'll see output like:

```
[unrag:eval] Wrote report: .unrag/eval/runs/<timestamp>-sample/report.json
[unrag:eval] Wrote summary: .unrag/eval/runs/<timestamp>-sample/summary.md
[unrag:eval] Thresholds: pass
```
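That final line is the quality gate the unrag:eval:ci script relies on: if aggregate metrics fall below the configured thresholds, the run should fail so CI can block the change. As a sketch of the idea only (metric names, threshold values, and the report shape here are assumptions, not the harness's actual implementation):

```ts
// Sketch of a CI-style threshold gate. Metric names and values are placeholders;
// the harness's real report fields and defaults may differ.
type Metrics = Record<string, number>;

function failedThresholds(metrics: Metrics, thresholds: Metrics): string[] {
  return Object.entries(thresholds)
    .filter(([name, min]) => (metrics[name] ?? 0) < min)
    .map(([name, min]) => `${name} ${metrics[name] ?? 0} < required ${min}`);
}

const failures = failedThresholds(
  { "recall@10": 0.82, "mrr@10": 0.61 }, // aggregate results from a run (hypothetical)
  { "recall@10": 0.8, "mrr@10": 0.65 },  // minimum acceptable values (hypothetical)
);

if (failures.length > 0) {
  console.error(failures.join("\n")); // e.g. "mrr@10 0.61 < required 0.65"
  process.exit(1);                    // non-zero exit fails the CI job
}
```

See CI Integration below for the real flags, defaults, and threshold configuration.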
Full documentation

The eval harness is a substantial feature with its own documentation section covering everything from dataset design to CI integration:
Evaluation Overview
Why retrieval evaluation matters and how the harness works
Getting Started
Complete setup guide with your first evaluation
Dataset Format
How to structure documents, queries, and ground truth
Understanding Metrics
What each metric measures and how to interpret results
Running Evals
All configuration options and CLI flags
CI Integration
Automated quality gates and threshold checking
Comparing Runs
Baseline diffs and tracking changes over time
RAG Handbook: Evaluation
Comprehensive guide to measuring RAG quality, building evaluation datasets, and offline/online evaluation strategies
