Architecture
How the ingest, retrieve, delete, and rerank pipelines work from end to end.
Understanding how Unrag processes your content helps you make good decisions about chunking parameters, embedding models, and performance optimization. Let's walk through each pipeline in detail.
The components
Every Unrag instance is built from three pluggable pieces:
The embedding provider turns text into vectors. It has one job: take a string, call an embedding model (local or remote), and return an array of numbers representing that text's position in semantic space. The default provider uses the Vercel AI SDK to call OpenAI models, but you can implement the interface for any embedding service.
The store adapter handles persistence. It knows how to write documents, chunks, and their embeddings to your database, and how to query for similar vectors. Unrag ships adapters for Drizzle, Prisma, and raw SQL—all targeting Postgres with pgvector.
The chunker splits documents into smaller pieces. The default implementation uses token-based recursive chunking with the o200k_base tokenizer (same as GPT-5, GPT-4o). It splits at natural boundaries (paragraphs, sentences, clauses) while respecting token limits. For specialized content, you can install plugin chunkers (markdown, code, semantic) or provide your own chunker function.
These components are assembled in your unrag.config.ts file to create a ContextEngine instance. The engine coordinates the components but doesn't contain business logic itself—it just calls the right methods in the right order.
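Here's a rough sketch of what that assembly might look like. The factory and option names below (createContextEngine, openAIEmbedding, drizzleStore) are illustrative assumptions, not Unrag's verified API — check your generated unrag.config.ts for the real names.
// unrag.config.ts — illustrative sketch only; identifiers are assumptions
import { createContextEngine } from "unrag"; // hypothetical import path
import { openAIEmbedding } from "unrag/embedding"; // hypothetical
import { drizzleStore } from "unrag/store"; // hypothetical
import { db } from "./db";

export const engine = createContextEngine({
  embedding: openAIEmbedding({ model: "text-embedding-3-small" }),
  store: drizzleStore(db),
  chunking: { chunkSize: 512, chunkOverlap: 50 },
});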
The ingest pipeline
When you call engine.ingest({ sourceId, content, metadata, assets }), here's what happens:
Chunking
The content string is passed to the chunker function along with your configured chunk size and overlap. The chunker returns an array of chunk objects, each containing the text, its position index, and an approximate token count.
For example, with the default chunk size of 512 tokens and 50-token overlap, a 1500-token document becomes roughly 3-4 chunks. Each chunk shares some text with its neighbors, which helps preserve context across chunk boundaries.
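Conceptually, the chunker is just a function. A minimal sketch of its contract, with property names assumed for illustration:
type Chunk = {
  text: string; // the chunk's text
  index: number; // position within the source document
  tokenCount: number; // approximate token count
};

type Chunker = (
  content: string,
  options: { chunkSize: number; chunkOverlap: number }
) => Chunk[];
A custom chunker or plugin chunker would implement a contract along these lines.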
Asset processing (optional)
If you provide assets (or a connector like Notion or Google Drive does), Unrag can also turn rich media into chunks:
- PDFs: if enabled via assetProcessing.pdf.llmExtraction.enabled, Unrag sends the PDF to an LLM (default: Gemini via AI Gateway) to extract text, then chunks and embeds that extracted text like any other content.
- Images: if your embedding provider supports image embeddings (embedImage), Unrag can embed the image directly in the same vector space as text queries. If not, it falls back to embedding captions when available.
- Other asset kinds (audio/video/files): in v1 they are skipped (or can be configured to fail ingest) unless you implement additional extraction.
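As a hedged example — the config shape around the documented assetProcessing.pdf.llmExtraction.enabled flag and the asset object fields are assumptions for illustration — ingesting a PDF asset might look like:
import { readFile } from "node:fs/promises";
// engine comes from your unrag.config.ts

// Config fragment (only the flag path is documented; the nesting shown is assumed):
// assetProcessing: { pdf: { llmExtraction: { enabled: true } } }

const pdfBuffer = await readFile("./benefits.pdf");

await engine.ingest({
  sourceId: "handbook:benefits",
  content: "Benefits handbook overview...",
  assets: [
    { kind: "pdf", name: "benefits.pdf", data: pdfBuffer }, // asset field names are assumptions
  ],
});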
Embedding
Each chunk is sent to the embedding provider. This happens concurrently to minimize total latency. The provider returns a vector (array of floats) for each chunk. These vectors encode the semantic meaning of the chunk text.
If you're using OpenAI's text-embedding-3-small model, each vector has 1,536 dimensions. Larger models produce higher-dimensional vectors (text-embedding-3-large uses 3,072 dimensions).
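A minimal sketch of the provider contract (embedImage is mentioned above; the rest of the shape is an assumption):
interface EmbeddingProvider {
  embed(text: string): Promise<number[]>; // e.g. 1,536 floats for text-embedding-3-small
  embedImage?(image: Uint8Array): Promise<number[]>; // optional multimodal support
}

// During ingest, chunks are embedded concurrently, roughly:
// const vectors = await Promise.all(chunks.map((c) => provider.embed(c.text)));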
Storage
The store adapter receives the chunks with their embeddings and writes them to your database in a transaction:
- Insert (or update) a row in the documents table with the full content and metadata
- Insert rows in the chunks table for each chunk
- Insert rows in the embeddings table with the vector for each chunk
If you ingest with a sourceId that already exists, the adapter updates the existing document. The old chunks and embeddings are replaced with new ones.
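In practice that means re-ingesting under the same sourceId behaves like an upsert (firstDraft and revisedDraft are placeholder strings):
await engine.ingest({ sourceId: "doc-1", content: firstDraft });
await engine.ingest({ sourceId: "doc-1", content: revisedDraft }); // replaces the old chunks and embeddings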
The ingest method returns detailed timing information. Embedding is typically the slowest step due to API latency.
const result = await engine.ingest({ sourceId: "doc-1", content: "..." });
console.log(result.durations);
// { totalMs: 1523, chunkingMs: 2, embeddingMs: 1456, storageMs: 65 }
The retrieve pipeline
When you call engine.retrieve({ query, topK, scope }), the flow is simpler:
Query embedding
Your query string is passed to the same embedding provider used during ingestion. This produces a vector representing the query's semantic meaning.
Using the same embedding model for queries and documents is essential. Different models produce vectors in different semantic spaces that aren't directly comparable.
Similarity search
The store adapter receives the query embedding, the desired number of results (topK), and any scope filters. It runs a SQL query that computes the distance between the query vector and every stored chunk embedding, then returns the closest matches.
With pgvector, this uses the <=> operator for cosine distance. Lower distances mean higher similarity. The adapter sorts ascending and limits to topK results.
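The exact SQL depends on the adapter, but a sketch of the shape (table and column names assumed) looks like:
const similaritySql = `
  SELECT c.id, c.text, e.embedding <=> $1 AS distance
  FROM chunks c
  JOIN embeddings e ON e.chunk_id = c.id
  ORDER BY distance ASC
  LIMIT $2
`;
// $1 is the query vector (passed as a pgvector literal), $2 is topK.
// Scope filters would be added as WHERE clauses on document metadata.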
Response assembly
The engine packages the chunks, their scores, timing information, and metadata into a response object and returns it.
const result = await engine.retrieve({ query: "how does auth work?", topK: 5 });
console.log(result.durations);
// { totalMs: 234, embeddingMs: 189, retrievalMs: 45 }
The delete pipeline
When you call engine.delete({ sourceId }) or engine.delete({ sourceIdPrefix }), the engine removes content from the database:
Identify documents
For exact deletion (sourceId), the adapter finds the single document with that source ID. For prefix deletion (sourceIdPrefix), it finds all documents whose source ID starts with the given prefix.
Cascade delete
The adapter deletes the matching document rows. Thanks to foreign key constraints with ON DELETE CASCADE, the database automatically removes the associated chunks and embeddings. This keeps your index consistent without requiring multiple queries.
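For illustration, a Drizzle-style schema with those cascades might look like the sketch below; Unrag's actual table and column names may differ.
import { pgTable, text, integer, uuid, vector } from "drizzle-orm/pg-core";

export const documents = pgTable("documents", {
  id: uuid("id").primaryKey().defaultRandom(),
  sourceId: text("source_id").notNull(),
  content: text("content").notNull(),
});

export const chunks = pgTable("chunks", {
  id: uuid("id").primaryKey().defaultRandom(),
  documentId: uuid("document_id")
    .notNull()
    .references(() => documents.id, { onDelete: "cascade" }), // deleting a document removes its chunks
  index: integer("index").notNull(),
  text: text("text").notNull(),
});

export const embeddings = pgTable("embeddings", {
  id: uuid("id").primaryKey().defaultRandom(),
  chunkId: uuid("chunk_id")
    .notNull()
    .references(() => chunks.id, { onDelete: "cascade" }), // ...and their embeddings
  embedding: vector("embedding", { dimensions: 1536 }).notNull(),
});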
Deletion is useful for:
- Removing outdated content after updates
- Honoring user deletion requests (GDPR/privacy compliance)
- Cleaning up test data
- Removing entire namespaces (e.g., everything under a tenant:acme: prefix)
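For example, using the signatures shown above (the source IDs are placeholders):
// Exact delete: one document
await engine.delete({ sourceId: "doc-1" });

// Prefix delete: everything under a tenant namespace
await engine.delete({ sourceIdPrefix: "tenant:acme:" });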
Improving retrieval with reranking
Vector similarity search is fast but imprecise. The embedding distance between a query and a chunk is a useful approximation of relevance, but it's not perfect. Chunks that are semantically related to your query might rank higher than chunks that directly answer it.
Reranking addresses this with a two-stage approach. First, you retrieve a larger set of candidates using fast vector search—maybe 30 chunks instead of 10. Then you run those candidates through a reranker that directly scores each one against the query. The reranker is more expensive (it sees both the query and candidate text), but it makes much better relevance judgments.
// Stage 1: Fast vector retrieval
const retrieved = await engine.retrieve({ query, topK: 30 });
// Stage 2: Precise reranking
const reranked = await engine.rerank({
query,
candidates: retrieved.chunks,
topK: 10,
});

Reranking is optional and requires installing the reranker battery. The default implementation uses Cohere's rerank-v3.5 model, but you can bring your own reranker. See the Reranker documentation for details.
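If you do bring your own reranker, its job reduces to scoring each candidate against the query and returning an ordering. A hypothetical contract sketch — the real interface is defined in the Reranker documentation, and these names are assumptions:
type Reranker = (args: {
  query: string;
  candidates: { text: string }[]; // field name assumed
  topK: number;
}) => Promise<Array<{ index: number; score: number }>>;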
Where RAG happens
It's worth noting what Unrag doesn't do: it doesn't call an LLM, build prompts, or generate answers. The "retrieval" in RAG is what Unrag handles. The "augmented generation" part—taking retrieved chunks and using them as context for an LLM—is your application's responsibility.
This separation is intentional. Prompt engineering, context window management, streaming responses, and model selection vary dramatically by use case. Unrag gives you the retrieval primitive and trusts you to build the right generation layer for your application.
A typical RAG flow looks like:
1. User submits a question
2. Call engine.retrieve() to get relevant chunks
3. Format chunks into a context string
4. Build a prompt with system instructions, context, and the user's question
5. Call your LLM and stream the response
Unrag handles step 2. Everything else is your code, using whatever patterns and libraries you prefer.
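Here's one way those steps could fit together, using the Vercel AI SDK purely as an example; the chunk field name and the engine import path are assumptions, and any LLM client works.
import { generateText } from "ai";
import { openai } from "@ai-sdk/openai";
import { engine } from "./unrag.config"; // wherever your ContextEngine lives

export async function answer(question: string) {
  // Step 2: Unrag's retrieval primitive
  const { chunks } = await engine.retrieve({ query: question, topK: 5 });

  // Step 3: format chunks into a context string (chunk field name assumed)
  const context = chunks.map((c, i) => `[${i + 1}] ${c.text}`).join("\n\n");

  // Steps 4-5: build the prompt and call your LLM
  const { text } = await generateText({
    model: openai("gpt-4o"),
    system: "Answer using only the provided context. Cite chunk numbers.",
    prompt: `Context:\n${context}\n\nQuestion: ${question}`,
  });

  return text;
}
For streaming (step 5), you would swap generateText for the AI SDK's streamText, or use whatever client your generation layer already relies on.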
Deep dive: RAG pipelines
For a comprehensive guide to RAG architecture—including the quality-latency-cost tradeoffs at each stage, common failure modes, and production considerations—see the RAG Handbook. The Orientation module covers the two-pipeline mental model in depth.
