
Performance

Understanding and optimizing ingest and retrieve performance.

Unrag is designed to be fast by default, but understanding where time goes helps you optimize for your specific workload. Let's look at the performance characteristics of both operations and the knobs you can turn.

Where time goes during ingestion

When you call engine.ingest(), the timing breakdown typically looks like this:

{
  totalMs: 1523,
  chunkingMs: 2,      // Almost instant
  embeddingMs: 1456,  // Dominates
  storageMs: 65       // Fast with connection pooling
}

Chunking is nearly instant—splitting text is cheap. Unless you're processing enormous documents with complex custom chunkers, this will never be your bottleneck.

Performance tip

Embedding is where the time goes. A naive approach would call the embedding API once per chunk, but Unrag is smarter about this. When your embedding provider supports batch embedding (via embedMany()), Unrag groups chunks together and sends them in batches, dramatically reducing API overhead. For providers that don't support batching, Unrag runs individual embedding calls concurrently.

For OpenAI's embedding API, a single embedding call takes roughly 100-200ms. With batching enabled (the default when supported), a 30-chunk document might require only a single API call instead of 30 individual calls. When batching isn't available, concurrent requests still keep total latency low—a 10-chunk document processes in about 300-500ms total rather than 1-2 seconds sequentially.
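Conceptually, the batch-or-fallback strategy looks like the sketch below. This is a simplified illustration, not Unrag's actual internals, and the EmbeddingProvider shape here is an assumption for the example:

// Simplified sketch of the batch-or-fallback strategy described above.
// The EmbeddingProvider shape is illustrative, not Unrag's real interface.
type EmbeddingProvider = {
  embed: (text: string) => Promise<number[]>;
  embedMany?: (texts: string[]) => Promise<number[][]>;
};

async function embedChunks(
  texts: string[],
  provider: EmbeddingProvider,
  batchSize = 32,
): Promise<number[][]> {
  if (provider.embedMany) {
    // Batch path: one API call per group of up to `batchSize` chunks.
    const results: number[][] = [];
    for (let i = 0; i < texts.length; i += batchSize) {
      results.push(...(await provider.embedMany(texts.slice(i, i + batchSize))));
    }
    return results;
  }
  // Fallback path: one call per chunk, issued concurrently.
  // (Unrag also caps concurrency here; omitted for brevity.)
  return Promise.all(texts.map((text) => provider.embed(text)));
}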

Storage is fast if your database connection is healthy. Writing a document with 10 chunks and embeddings takes 20-100ms with a warmed connection pool. Cold connections or distant databases add latency.

Tuning embedding throughput

Unrag gives you two knobs to control embedding performance: concurrency and batch size. These live in your unrag.config.ts under defaults.embedding:

export const unrag = defineUnragConfig({
  defaults: {
    embedding: {
      concurrency: 4,  // Max concurrent embedding requests
      batchSize: 32,   // Chunks per embedMany call (when supported)
    },
  },
  // ...
});

Concurrency controls how many embedding requests run in parallel. This applies to both text embeddings and image embeddings. The default of 4 is conservative—it keeps you safely under most providers' rate limits. If you're on a tier with generous rate limits, bumping this to 8 or 10 can speed up large ingestions. If you're hitting rate limit errors, lower it to 2.

Batch size controls how many text chunks go into each embedMany() call. When your embedding provider supports batch embedding (most do), Unrag groups chunks together rather than calling the API once per chunk. This reduces HTTP overhead and often costs less, since many providers charge per-request in addition to per-token. The default of 32 works well for most models; larger values may hit token limits on some providers.

The combination matters. With concurrency: 4 and batchSize: 32, Unrag can embed up to 128 chunks simultaneously across 4 concurrent batch requests. That's enough for most real-time ingestion. For bulk imports where you're processing thousands of documents, you might increase concurrency while adding pauses between documents to avoid overwhelming your provider.
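For a rough sense of what those numbers mean for bulk work, here is a back-of-the-envelope estimate. It uses the per-request latency quoted above and assumes a batch request takes about as long as a single one:

// Rough throughput estimate for a bulk import; numbers are illustrative.
const chunks = 10_000;
const batchSize = 32;
const concurrency = 4;
const msPerRequest = 200; // pessimistic per-request latency

const requests = Math.ceil(chunks / batchSize);          // 313 batch calls
const waves = Math.ceil(requests / concurrency);         // 79 waves of 4 parallel calls
const estimatedSeconds = (waves * msPerRequest) / 1000;  // ~15.8 seconds of embedding time

console.log({ requests, waves, estimatedSeconds });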

Optimizing ingestion

If ingestion is too slow for your use case, the tuning options above are your first stop. Beyond that, here are additional strategies:

Tune concurrency and batch size. Before reaching for architectural changes, try adjusting defaults.embedding.concurrency and defaults.embedding.batchSize. Increasing concurrency from 4 to 8 can nearly double throughput if your provider allows it. If you're seeing rate limit errors, lower concurrency instead.

Use a background job queue. For user-triggered ingestion (like file uploads), return immediately and process in a background job. This keeps your API response times fast while still getting content indexed. See Next.js Production Recipe for a complete example using QStash or BullMQ.
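A minimal sketch of that split is below. The queue helper, the route shape, and the exact engine.ingest() signature are placeholders, not Unrag's or any specific queue's API:

// Sketch of the enqueue-then-ingest split. The declared helpers are
// placeholders; wire them to your actual queue and engine setup.
declare const engine: {
  ingest: (input: { sourceId: string; content: string }) => Promise<unknown>;
};
declare function enqueueIngestJob(
  job: { sourceId: string; content: string },
): Promise<void>;

// Upload handler: respond immediately, defer the heavy work.
export async function POST(request: Request) {
  const { sourceId, content } = await request.json();
  await enqueueIngestJob({ sourceId, content });
  return Response.json({ status: "queued" });
}

// Background worker: runs the actual ingest off the request path.
export async function handleIngestJob(job: { sourceId: string; content: string }) {
  await engine.ingest({ sourceId: job.sourceId, content: job.content });
}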

Adjust chunk size. Larger chunks mean fewer embedding calls. If you're embedding hundreds of documents and retrieval quality is acceptable with 300-word chunks instead of 150-word chunks, you'll cut your embedding calls roughly in half.

Consider local embedding models. If you're embedding at scale and have GPU resources, running your own embedding model eliminates API latency entirely. Models like nomic-embed-text via Ollama run fast on modest hardware and produce good results for many use cases. See Ollama for setup.

Where time goes during retrieval

Retrieval is typically faster than ingestion:

{
  totalMs: 234,
  embeddingMs: 189,   // Single embedding call
  retrievalMs: 45     // Database query
}

Embedding the query is a single API call—just your search string, not multiple chunks. This takes 100-200ms with a cloud embedding provider.

The database query runs a vector similarity search and returns results. With proper indexing and a well-configured database, this takes 10-100ms depending on your data size and query complexity.

Optimizing retrieval

Cache query embeddings. If users frequently search for the same terms, cache the embedding vectors. The embedding API call is the slowest part—skipping it for common queries dramatically improves response times.

// In-memory cache keyed on the exact query string.
const cache = new Map<string, number[]>();

async function embedWithCache(query: string): Promise<number[]> {
  const cached = cache.get(query);
  if (cached) return cached;

  // Remaining embed() options elided; pass whatever your provider requires.
  const embedding = await embeddingProvider.embed({ text: query, ... });
  cache.set(query, embedding);
  return embedding;
}

Where time goes during reranking

If you're using the reranker battery, there's an additional stage after retrieval:

{
  rerankMs: 187,   // Reranker API call
  totalMs: 192     // Total including preprocessing
}

The reranker adds 100-300ms depending on the number of candidates and their text lengths. The Cohere reranker processes all candidates in a single API call, so the latency doesn't scale linearly with candidate count—30 candidates might take 180ms while 10 candidates take 120ms.

Optimizing reranking

Tune your candidate count. Retrieve 20-50 candidates for reranking, not hundreds. The best results are almost always in the initial top 30-50 from vector search. Retrieving 100+ candidates increases rerank time without proportional quality improvement.

Consider async reranking. If your UI can progressively display results, return vector search results immediately and rerank in the background. This gives users fast initial results while improved rankings load.

Cache reranked results. For frequently repeated queries, cache the full reranked result. The cache key should include both the query and some fingerprint of the candidates (like a hash of their IDs), since content updates change retrieval results.
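Here is a sketch of one way to build that key, using Node's built-in crypto module (the candidate-ID input shape is an assumption):

import { createHash } from "node:crypto";

// Cache key = query + fingerprint of the candidate IDs, so entries are
// naturally invalidated when retrieval returns a different candidate set.
function rerankCacheKey(query: string, candidateIds: string[]): string {
  const fingerprint = createHash("sha256")
    .update(candidateIds.slice().sort().join("|"))
    .digest("hex");
  return `${query}::${fingerprint}`;
}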

Skip reranking for simple queries. If analytics show certain query patterns get excellent results from vector search alone, bypass reranking for those cases. Very specific queries ("error code E-1234") often don't benefit from reranking.

Add a vector index

For datasets with more than 50,000 chunks, add an HNSW index to speed up similarity search:

create index embeddings_hnsw_idx 
on embeddings using hnsw (embedding vector_cosine_ops);

This trades some recall accuracy for dramatically faster queries. For most applications, the tradeoff is worthwhile.

Tune topK. Don't retrieve more results than you need. Each additional result adds (small) overhead to the database query and increases the data transferred over the wire.

Connection pooling

The generated unrag.config.ts uses a singleton pattern for database connections:

const pool = (globalThis as any).__unragPool ?? new Pool({ connectionString });
(globalThis as any).__unragPool = pool;

This prevents connection pool exhaustion during development with hot reloading, where modules get re-executed on every change. In production, it ensures you reuse connections across requests.

For serverless environments (Vercel, AWS Lambda, etc.), consider using:

  • Neon's serverless driver for automatic connection pooling
  • Supabase's connection pooler endpoint
  • AWS RDS Proxy for RDS connections
  • PgBouncer if you're managing your own infrastructure

These tools pool connections across function invocations, preventing the "too many connections" errors common in serverless deployments.
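With Neon, for example, the change can be as small as constructing the pool from the serverless driver instead of pg. How it slots into your generated unrag.config.ts depends on your adapter, so treat this as a sketch:

import { Pool } from "@neondatabase/serverless";

// pg-compatible Pool from Neon's serverless driver; connections are proxied
// so serverless invocations don't exhaust Postgres connection slots.
const pool = new Pool({ connectionString: process.env.DATABASE_URL });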

Monitoring and debugging

The timing information returned by both ingest() and retrieve() is your primary tool for understanding performance. Log these values in production:

const result = await engine.retrieve({ query, topK: 10 });

console.log({
  query,
  resultCount: result.chunks.length,
  embeddingMs: result.durations.embeddingMs,
  retrievalMs: result.durations.retrievalMs,
  totalMs: result.durations.totalMs,
});
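The same is worth doing for ingestion. A sketch, mirroring the example above (the ingest() argument shape and the durations property name are assumptions; the field names come from the breakdown at the top of this page):

const ingestResult = await engine.ingest({
  sourceId: "docs/performance.md", // hypothetical identifier
  content: documentText,
});

console.log({
  chunkingMs: ingestResult.durations?.chunkingMs,
  embeddingMs: ingestResult.durations?.embeddingMs,
  storageMs: ingestResult.durations?.storageMs,
  totalMs: ingestResult.durations?.totalMs,
});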

Watch for:

  • Sudden increases in embedding time: Usually indicates API issues or rate limiting
  • High storage/retrieval times: Check database connection health, missing indexes, or lock contention
  • Consistent slowness: May indicate undersized database instance or network latency issues

The explicit timing breakdown makes it straightforward to identify which component needs attention.
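If you want these conditions to surface automatically rather than by eyeballing logs, a simple budget check on the returned durations is enough. The thresholds below are illustrative; derive yours from your observed baseline:

// Warn when a stage blows past its expected budget.
// Thresholds are illustrative; set them from your own observed p95 values.
const BUDGET_MS = { embedding: 400, retrieval: 150, total: 600 };

function checkRetrieveBudget(durations: {
  embeddingMs: number;
  retrievalMs: number;
  totalMs: number;
}) {
  if (durations.embeddingMs > BUDGET_MS.embedding) {
    console.warn("slow embedding", durations.embeddingMs);
  }
  if (durations.retrievalMs > BUDGET_MS.retrieval) {
    console.warn("slow retrieval", durations.retrievalMs);
  }
  if (durations.totalMs > BUDGET_MS.total) {
    console.warn("slow retrieve() call", durations.totalMs);
  }
}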

Production operations

For a comprehensive guide to operating RAG systems at scale—including latency budgets, cost controls, observability patterns, and scaling strategies—see Module 8: Production and Operations in the RAG Handbook.
