Architecture
How the ingest, retrieve, delete, and rerank pipelines work from end to end.
Understanding how Unrag processes your content helps you make good decisions about chunking parameters, embedding models, and performance optimization. Let's walk through each pipeline in detail.
The components
Every Unrag instance is built from three pluggable pieces:
The embedding provider turns text into vectors. It has one job: take a string, call an embedding model (local or remote), and return an array of numbers representing that text's position in semantic space. The default provider uses the Vercel AI SDK to call OpenAI models, but you can implement the interface for any embedding service.
The store adapter handles persistence. It knows how to write documents, chunks, and their embeddings to your database, and how to query for similar vectors. Unrag ships adapters for Drizzle, Prisma, and raw SQL—all targeting Postgres with pgvector.
The chunker splits documents into smaller pieces. The default implementation uses token-based recursive chunking with the o200k_base tokenizer (same as GPT-5, GPT-4o). It splits at natural boundaries (paragraphs, sentences, clauses) while respecting token limits. For specialized content, you can install plugin chunkers (markdown, code, semantic) or provide your own chunker function.
These components are assembled in your unrag.config.ts file to create a ContextEngine instance. The engine coordinates the components but doesn't contain business logic itself—it just calls the right methods in the right order.
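Here's a rough sketch of what that assembly might look like. The factory and option names below (createContextEngine, openAIEmbedding, drizzleStore) are illustrative assumptions, not Unrag's verified API — check your generated unrag.config.ts for the real names.
// unrag.config.ts — illustrative sketch only; identifiers are assumptions
import { createContextEngine } from "unrag"; // hypothetical import path
import { openAIEmbedding } from "unrag/embedding"; // hypothetical
import { drizzleStore } from "unrag/store"; // hypothetical
import { db } from "./db";

export const engine = createContextEngine({
  embedding: openAIEmbedding({ model: "text-embedding-3-small" }),
  store: drizzleStore(db),
  chunking: { chunkSize: 512, chunkOverlap: 50 },
});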
The ingest pipeline
When you call engine.ingest({ sourceId, content, metadata, assets }), here's what happens:
Chunking
The content string is passed to the chunker function along with your configured chunk size and overlap. The chunker returns an array of chunk objects, each containing the text, its position index, and an approximate token count.
For example, with the default chunk size of 512 tokens and 50-token overlap, a 1500-token document becomes roughly 3-4 chunks. Each chunk shares some text with its neighbors, which helps preserve context across chunk boundaries.
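Conceptually, the chunker is just a function. A minimal sketch of its contract, with property names assumed for illustration:
type Chunk = {
  text: string; // the chunk's text
  index: number; // position within the source document
  tokenCount: number; // approximate token count
};

type Chunker = (
  content: string,
  options: { chunkSize: number; chunkOverlap: number }
) => Chunk[];
A custom chunker or plugin chunker would implement a contract along these lines.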
Asset processing (optional)
If you provide assets (or a connector like Notion or Google Drive does), Unrag can also turn rich media into chunks:
- PDFs: if enabled via assetProcessing.pdf.llmExtraction.enabled, Unrag sends the PDF to an LLM (default: Gemini via AI Gateway) to extract text, then chunks and embeds that extracted text like any other content.
- Images: if your embedding provider supports image embeddings (embedImage), Unrag can embed the image directly in the same vector space as text queries. If not, it falls back to embedding captions when available.
- Other asset kinds (audio/video/files): in v1 they are skipped (or can be configured to fail ingest) unless you implement additional extraction.
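As a hedged example — the config shape around the documented assetProcessing.pdf.llmExtraction.enabled flag and the asset object fields are assumptions for illustration — ingesting a PDF asset might look like:
import { readFile } from "node:fs/promises";
// engine comes from your unrag.config.ts

// Config fragment (only the flag path is documented; the nesting shown is assumed):
// assetProcessing: { pdf: { llmExtraction: { enabled: true } } }

const pdfBuffer = await readFile("./benefits.pdf");

await engine.ingest({
  sourceId: "handbook:benefits",
  content: "Benefits handbook overview...",
  assets: [
    { kind: "pdf", name: "benefits.pdf", data: pdfBuffer }, // asset field names are assumptions
  ],
});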
Embedding
Each chunk is sent to the embedding provider. This happens concurrently to minimize total latency. The provider returns a vector (array of floats) for each chunk. These vectors encode the semantic meaning of the chunk text.
If you're using OpenAI's text-embedding-3-small model, each vector has 1,536 dimensions. Larger models produce higher-dimensional vectors (text-embedding-3-large uses 3,072 dimensions).
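A minimal sketch of the provider contract (embedImage is mentioned above; the rest of the shape is an assumption):
interface EmbeddingProvider {
  embed(text: string): Promise<number[]>; // e.g. 1,536 floats for text-embedding-3-small
  embedImage?(image: Uint8Array): Promise<number[]>; // optional multimodal support
}

// During ingest, chunks are embedded concurrently, roughly:
// const vectors = await Promise.all(chunks.map((c) => provider.embed(c.text)));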
Storage
The store adapter receives the chunks with their embeddings and writes them to your database in a transaction:
- Insert (or update) a row in the documents table with the full content and metadata
- Insert rows in the chunks table for each chunk
- Insert rows in the embeddings table with the vector for each chunk
If you ingest with a sourceId that already exists, the adapter updates the existing document. The old chunks and embeddings are replaced with new ones.
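In practice that means re-ingesting under the same sourceId behaves like an upsert (firstDraft and revisedDraft are placeholder strings):
await engine.ingest({ sourceId: "doc-1", content: firstDraft });
await engine.ingest({ sourceId: "doc-1", content: revisedDraft }); // replaces the old chunks and embeddings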
The ingest method returns detailed timing information. Embedding is typically the slowest step due to API latency.
const result = await engine.ingest({ sourceId: "doc-1", content: "..." });
console.log(result.durations);
// { totalMs: 1523, chunkingMs: 2, embeddingMs: 1456, storageMs: 65 }
The retrieve pipeline
When you call engine.retrieve({ query, topK, scope }), the flow is simpler:
Query embedding
Your query string is passed to the same embedding provider used during ingestion. This produces a vector representing the query's semantic meaning.
Using the same embedding model for queries and documents is essential. Different models produce vectors in different semantic spaces that aren't directly comparable.
Similarity search
The store adapter receives the query embedding, the desired number of results (topK), and any scope filters. It runs a SQL query that computes the distance between the query vector and every stored chunk embedding, then returns the closest matches.
With pgvector, this uses the <=> operator for cosine distance. Lower distances mean higher similarity. The adapter sorts ascending and limits to topK results.
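The exact SQL depends on the adapter, but a sketch of the shape (table and column names assumed) looks like:
const similaritySql = `
  SELECT c.id, c.text, e.embedding <=> $1 AS distance
  FROM chunks c
  JOIN embeddings e ON e.chunk_id = c.id
  ORDER BY distance ASC
  LIMIT $2
`;
// $1 is the query vector (passed as a pgvector literal), $2 is topK.
// Scope filters would be added as WHERE clauses on document metadata.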
Response assembly
The engine packages the chunks, their scores, timing information, and metadata into a response object and returns it.
const result = await engine.retrieve({ query: "how does auth work?", topK: 5 });
console.log(result.durations);
// { totalMs: 234, embeddingMs: 189, retrievalMs: 45 }
The delete pipeline
When you call engine.delete({ sourceId }) or engine.delete({ sourceIdPrefix }), the engine removes content from the database:
Identify documents
For exact deletion (sourceId), the adapter finds the single document with that source ID. For prefix deletion (sourceIdPrefix), it finds all documents whose source ID starts with the given prefix.
Cascade delete
The adapter deletes the matching document rows. Thanks to foreign key constraints with ON DELETE CASCADE, the database automatically removes the associated chunks and embeddings. This keeps your index consistent without requiring multiple queries.
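For illustration, a Drizzle-style schema with those cascades might look like the sketch below; Unrag's actual table and column names may differ.
import { pgTable, text, integer, uuid, vector } from "drizzle-orm/pg-core";

export const documents = pgTable("documents", {
  id: uuid("id").primaryKey().defaultRandom(),
  sourceId: text("source_id").notNull(),
  content: text("content").notNull(),
});

export const chunks = pgTable("chunks", {
  id: uuid("id").primaryKey().defaultRandom(),
  documentId: uuid("document_id")
    .notNull()
    .references(() => documents.id, { onDelete: "cascade" }), // deleting a document removes its chunks
  index: integer("index").notNull(),
  text: text("text").notNull(),
});

export const embeddings = pgTable("embeddings", {
  id: uuid("id").primaryKey().defaultRandom(),
  chunkId: uuid("chunk_id")
    .notNull()
    .references(() => chunks.id, { onDelete: "cascade" }), // ...and their embeddings
  embedding: vector("embedding", { dimensions: 1536 }).notNull(),
});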
Deletion is useful for:
- Removing outdated content after updates
- Honoring user deletion requests (GDPR/privacy compliance)
- Cleaning up test data
- Removing entire namespaces (e.g., everything under a tenant:acme: prefix)
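For example, using the signatures shown above (the source IDs are placeholders):
// Exact delete: one document
await engine.delete({ sourceId: "doc-1" });

// Prefix delete: everything under a tenant namespace
await engine.delete({ sourceIdPrefix: "tenant:acme:" });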
Improving retrieval with reranking
Vector similarity search is fast but imprecise. The embedding distance between a query and a chunk is a useful approximation of relevance, but it's not perfect. Chunks that are semantically related to your query might rank higher than chunks that directly answer it.
Reranking addresses this with a two-stage approach. First, you retrieve a larger set of candidates using fast vector search—maybe 30 chunks instead of 10. Then you run those candidates through a reranker that directly scores each one against the query. The reranker is more expensive (it sees both the query and candidate text), but it makes much better relevance judgments.
// Stage 1: Fast vector retrieval
const retrieved = await engine.retrieve({ query, topK: 30 });
// Stage 2: Precise reranking
const reranked = await engine.rerank({
query,
candidates: retrieved.chunks,
topK: 10,
});

Reranking is optional and requires installing the reranker battery. The default implementation uses Cohere's rerank-v3.5 model, but you can bring your own reranker. See the Reranker documentation for details.
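If you do bring your own reranker, its job reduces to scoring each candidate against the query and returning an ordering. A hypothetical contract sketch — the real interface is defined in the Reranker documentation, and these names are assumptions:
type Reranker = (args: {
  query: string;
  candidates: { text: string }[]; // field name assumed
  topK: number;
}) => Promise<Array<{ index: number; score: number }>>;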
Where RAG happens
It's worth noting what Unrag doesn't do: it doesn't call an LLM, build prompts, or generate answers. The "retrieval" in RAG is what Unrag handles. The "augmented generation" part—taking retrieved chunks and using them as context for an LLM—is your application's responsibility.
This separation is intentional. Prompt engineering, context window management, streaming responses, and model selection vary dramatically by use case. Unrag gives you the retrieval primitive and trusts you to build the right generation layer for your application.
A typical RAG flow looks like:
1. User submits a question
2. Call engine.retrieve() to get relevant chunks
3. Format chunks into a context string
4. Build a prompt with system instructions, context, and the user's question
5. Call your LLM and stream the response
Unrag handles step 2. Everything else is your code, using whatever patterns and libraries you prefer.
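Here's one way those steps could fit together, using the Vercel AI SDK purely as an example; the chunk field name and the engine import path are assumptions, and any LLM client works.
import { generateText } from "ai";
import { openai } from "@ai-sdk/openai";
import { engine } from "./unrag.config"; // wherever your ContextEngine lives

export async function answer(question: string) {
  // Step 2: Unrag's retrieval primitive
  const { chunks } = await engine.retrieve({ query: question, topK: 5 });

  // Step 3: format chunks into a context string (chunk field name assumed)
  const context = chunks.map((c, i) => `[${i + 1}] ${c.text}`).join("\n\n");

  // Steps 4-5: build the prompt and call your LLM
  const { text } = await generateText({
    model: openai("gpt-4o"),
    system: "Answer using only the provided context. Cite chunk numbers.",
    prompt: `Context:\n${context}\n\nQuestion: ${question}`,
  });

  return text;
}
For streaming (step 5), you would swap generateText for the AI SDK's streamText, or use whatever client your generation layer already relies on.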
Deep dive: RAG pipelines
For a comprehensive guide to RAG architecture—including the quality-latency-cost tradeoffs at each stage, common failure modes, and production considerations—see the RAG Handbook. The Orientation module covers the two-pipeline mental model in depth.
