Use with Chat
Integrate Unrag retrieval into a conversational AI interface.
Unrag handles the "retrieval" in RAG. The "augmented generation" part—using retrieved chunks to ground an LLM's responses—is where you take over. This guide shows common patterns for building chat interfaces that use Unrag for context.
The basic RAG chat flow
A RAG-powered chat follows this sequence:
1. User asks a question. The user submits a query through your chat interface.
2. Retrieve relevant chunks. Your app calls Unrag to retrieve relevant chunks based on the question.
3. Build the prompt. Your app formats the chunks as context and builds a system prompt.
4. Call the LLM. Your app sends the prompt and question to an LLM.
5. Stream the response. The LLM response is streamed back to the user for a better experience.
Here's a minimal implementation:
import { createUnragEngine } from "@unrag/config";
import { openai } from "@ai-sdk/openai";
import { streamText } from "ai";

export async function answerQuestion(question: string) {
  const engine = createUnragEngine();

  // Step 1: Retrieve relevant context
  const { chunks } = await engine.retrieve({
    query: question,
    topK: 5,
  });

  // Step 2: Format context for the prompt
  const context = chunks
    .map((chunk) => `[Source: ${chunk.sourceId}]\n${chunk.content}`)
    .join("\n\n---\n\n");

  // Step 3: Build the prompt
  const systemPrompt = `You are a helpful assistant. Answer the user's question based on the provided context. If the context doesn't contain relevant information, say so.
Context:
${context}`;

  // Step 4: Stream the response
  const result = await streamText({
    model: openai("gpt-4o"),
    system: systemPrompt,
    prompt: question,
  });

  return result;
}
Building effective prompts
How you format the context significantly impacts response quality. Here are patterns that work well:
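The snippets below assume a retrieved chunk shaped roughly like the following; treat this as a sketch, since the exact fields and score semantics come from your Unrag installation and vector store:
// Assumed chunk shape (a sketch; check your Unrag types for the exact fields)
interface Chunk {
  content: string; // the chunk text that goes into the prompt
  sourceId: string; // identifier of the ingested source document
  score: number; // distance-style score: lower means a closer match in these examples
  metadata: Record<string, unknown> & { title?: string };
}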
Structured context blocks make it clear where each piece of information comes from:
const formatContext = (chunks: Chunk[]) => {
  return chunks.map((chunk, i) => {
    const source = chunk.metadata.title ?? chunk.sourceId;
    return `### Source ${i + 1}: ${source}\n${chunk.content}`;
  }).join("\n\n");
};
Include relevance signals so the model knows which chunks are most pertinent:
const formatContextWithScores = (chunks: Chunk[]) => {
  return chunks.map((chunk) => {
    // Scores here are distances: lower means a closer match
    const relevance = chunk.score < 0.3 ? "HIGH" : chunk.score < 0.5 ? "MEDIUM" : "LOW";
    return `[Relevance: ${relevance}]\n${chunk.content}`;
  }).join("\n\n---\n\n");
};
Truncate intelligently when context might exceed token limits:
const MAX_CONTEXT_CHARS = 8000;

const formatContextWithLimit = (chunks: Chunk[]) => {
  let total = 0;
  const included: string[] = [];
  for (const chunk of chunks) {
    if (total + chunk.content.length > MAX_CONTEXT_CHARS) break;
    included.push(chunk.content);
    total += chunk.content.length;
  }
  return included.join("\n\n---\n\n");
};
Streaming responses in Next.js
For chat interfaces, streaming provides a better user experience. Here's a complete Next.js route handler:
// app/api/chat/route.ts
import { createUnragEngine } from "@unrag/config";
import { openai } from "@ai-sdk/openai";
import { streamText } from "ai";

export async function POST(request: Request) {
  const { messages } = await request.json();
  const lastMessage = messages[messages.length - 1];
  const question = lastMessage.content;

  // Retrieve context
  const engine = createUnragEngine();
  const { chunks } = await engine.retrieve({
    query: question,
    topK: 5,
  });

  const context = chunks
    .map((c) => `Source: ${c.sourceId}\n${c.content}`)
    .join("\n\n---\n\n");

  const systemPrompt = `You are a helpful assistant for our product documentation. Answer questions based on the provided context. Cite sources when possible.
Context:
${context}`;

  const result = streamText({
    model: openai("gpt-4o"),
    system: systemPrompt,
    messages,
  });

  return result.toDataStreamResponse();
}
Use with the Vercel AI SDK's useChat hook on the frontend:
"use client";
import { useChat } from "ai/react";
export function Chat() {
const { messages, input, handleInputChange, handleSubmit } = useChat();
return (
<div>
{messages.map((m) => (
<div key={m.id}>
<strong>{m.role}:</strong> {m.content}
</div>
))}
<form onSubmit={handleSubmit}>
<input value={input} onChange={handleInputChange} />
<button type="submit">Send</button>
</form>
</div>
);
}Returning sources with responses
Users often want to know where information came from. Include source references in your response:
export async function POST(request: Request) {
  const { question } = await request.json();

  const engine = createUnragEngine();
  const { chunks } = await engine.retrieve({ query: question, topK: 5 });

  // ... call the LLM with the retrieved context (as above) to produce generatedText ...

  return Response.json({
    answer: generatedText,
    sources: chunks.map((c) => ({
      id: c.sourceId,
      title: c.metadata.title,
      excerpt: c.content.substring(0, 200) + "...",
      score: c.score,
    })),
  });
}
Display sources alongside the response so users can verify or explore further.
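For example, a minimal React sketch that renders the sources array returned by the route above (the SourceList component and Source type are illustrative names, not part of Unrag):
// Hypothetical component: renders the `sources` array returned by the route above
type Source = { id: string; title?: string; excerpt: string; score: number };

export function SourceList({ sources }: { sources: Source[] }) {
  if (sources.length === 0) return null;
  return (
    <ul>
      {sources.map((s) => (
        <li key={s.id}>
          <strong>{s.title ?? s.id}</strong>
          <p>{s.excerpt}</p>
        </li>
      ))}
    </ul>
  );
}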
Conversational context
For multi-turn conversations, you might want to retrieve based on the full conversation context, not just the latest message:
function extractSearchQuery(messages: Message[]): string {
  // Option 1: Just use the last message
  const lastMessage = messages[messages.length - 1];
  return lastMessage.content;

  // Option 2: Combine recent messages for context
  // const recentMessages = messages.slice(-3);
  // return recentMessages.map((m) => m.content).join(" ");

  // Option 3: Use an LLM to generate a standalone search query
  // (more expensive, but can be more accurate; a sketch follows below)
}
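// Option 3, sketched (an assumption, not part of Unrag): use the AI SDK's generateText
// to rewrite the conversation into a standalone search query before retrieval.
// Assumes `generateText` is imported from "ai" and `openai` from "@ai-sdk/openai".
async function rewriteSearchQuery(messages: Message[]): Promise<string> {
  const { text } = await generateText({
    model: openai("gpt-4o-mini"),
    system:
      "Rewrite the user's latest message as a standalone search query. Resolve pronouns and references using the conversation. Reply with the query only.",
    prompt: messages.map((m) => `${m.role}: ${m.content}`).join("\n"),
  });
  return text.trim();
}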
// In your handler
const searchQuery = extractSearchQuery(messages);
const { chunks } = await engine.retrieve({ query: searchQuery, topK: 5 });
Handling "I don't know"
When retrieved context doesn't contain the answer, the model should admit it rather than hallucinate. Guide this in your system prompt:
const systemPrompt = `You are a helpful assistant. Answer based ONLY on the provided context.
IMPORTANT RULES:
- If the context doesn't contain information to answer the question, say "I don't have information about that in my knowledge base."
- Don't make up information that isn't in the context.
- If you're unsure, say so.
Context:
${context}`;
You can also check retrieval quality before calling the LLM:
const { chunks } = await engine.retrieve({ query: question, topK: 5 });

// Check if we got good matches (scores are distances, so lower is better)
const bestScore = chunks[0]?.score ?? 1;

if (bestScore > 0.7 || chunks.length === 0) {
  return Response.json({
    answer: "I don't have information about that topic in my knowledge base. Could you try rephrasing your question?",
    sources: [],
  });
}

// Proceed with generation
Scoping chat to specific content
For applications with multiple content collections, scope retrieval to relevant sources:
const engine = createUnragEngine();

// Help chat only searches help articles
const helpEngine = () => ({
  retrieve: (query: string) =>
    engine.retrieve({ query, topK: 5, scope: { sourceId: "help:" } })
});

// Docs chat searches documentation
const docsEngine = () => ({
  retrieve: (query: string) =>
    engine.retrieve({ query, topK: 5, scope: { sourceId: "docs:" } })
});

// Route based on chat context
const scopedRetrieve = chatType === "help" ? helpEngine() : docsEngine();
const { chunks } = await scopedRetrieve.retrieve(question);
Improving context quality with reranking
Vector similarity search is fast but imprecise. When you retrieve 5 chunks for an LLM prompt, the most relevant chunk might not be first—it could be third or fourth. This matters for chat because LLMs tend to weight information at the beginning of the context more heavily.
If you've installed the reranker battery, use it to ensure your best chunks come first:
const engine = createUnragEngine();

// Retrieve more candidates than we need
const retrieved = await engine.retrieve({
  query: question,
  topK: 20,
});

// Rerank to get the best 5
const reranked = await engine.rerank({
  query: question,
  candidates: retrieved.chunks,
  topK: 5,
  onMissingReranker: "skip", // Fall back to vector order if no reranker
});

// Use reranked chunks in the prompt
const context = reranked.chunks
  .map((c) => `Source: ${c.sourceId}\n${c.content}`)
  .join("\n\n---\n\n");
This adds 100-300ms of latency but significantly improves the quality of context you provide to the LLM. The best chunks come first, which helps the model focus on the most relevant information.
For chat applications, this tradeoff is usually worth it—the LLM generation step takes much longer anyway, so the reranking latency is barely noticeable.
See the Reranker documentation for installation and configuration.
Performance considerations
Chat interfaces need to feel responsive. The latency breakdown is typically:
- Retrieval (embedding + database): 200-400ms
- Reranking (if enabled): 100-300ms
- LLM generation: 500-2000ms+ depending on response length
To improve perceived performance:
- Stream responses so users see text appearing immediately
- Cache embeddings or full retrieval results for common questions (see the sketch below)
- Prefetch context when you can predict what the user might ask
- Show "thinking" indicators while retrieval happens
The retrieval and reranking steps are usually fast enough that users don't notice them if you stream the generation.
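As one sketch of the caching idea: keep recent retrieval results in memory, keyed by the normalized question, so repeated questions skip the embedding call and database query entirely. The cachedRetrieve helper and the Map-based cache below are hypothetical, not Unrag features; they reuse the Chunk shape and createUnragEngine from earlier snippets, and a process-local Map only works for a single instance (use Redis or similar across instances).
// Hypothetical in-memory cache for retrieval results; not part of Unrag
const retrievalCache = new Map<string, { chunks: Chunk[]; cachedAt: number }>();
const CACHE_TTL_MS = 5 * 60 * 1000; // keep entries for 5 minutes

async function cachedRetrieve(question: string): Promise<Chunk[]> {
  const key = question.trim().toLowerCase();
  const hit = retrievalCache.get(key);
  if (hit && Date.now() - hit.cachedAt < CACHE_TTL_MS) {
    return hit.chunks; // served from cache: no embedding call, no database query
  }

  const engine = createUnragEngine();
  const { chunks } = await engine.retrieve({ query: question, topK: 5 });
  retrievalCache.set(key, { chunks, cachedAt: Date.now() });
  return chunks;
}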
Learn more about RAG for chat
For a comprehensive guide to building RAG-powered chat interfaces—including prompt engineering, handling hallucinations, citation strategies, and agent patterns—see Module 6: Generation in the RAG Handbook.
