Build a Search Endpoint

A complete guide to building a production-ready semantic search API.

This guide walks through building a search endpoint from scratch, covering input validation, response shaping, error handling, and common enhancements. By the end, you'll have a search API ready for production.

The minimal version

Let's start with the simplest possible search endpoint:

// app/api/search/route.ts (Next.js)
import { createUnragEngine } from "@unrag/config";

export async function GET(request: Request) {
  const { searchParams } = new URL(request.url);
  const query = searchParams.get("q") ?? "";

  const engine = createUnragEngine();
  const result = await engine.retrieve({ query, topK: 10 });

  return Response.json(result);
}

This works, but it's not production-ready. Let's improve it step by step.

Input validation

Never trust user input. Query strings can be empty, too long, or malicious:

const MAX_QUERY_LENGTH = 500;
const MIN_QUERY_LENGTH = 2;

export async function GET(request: Request) {
  const { searchParams } = new URL(request.url);
  const rawQuery = searchParams.get("q") ?? "";
  const query = rawQuery.trim();

  // Validate query exists
  if (!query) {
    return Response.json(
      { error: "Query parameter 'q' is required" },
      { status: 400 }
    );
  }

  // Validate minimum length
  if (query.length < MIN_QUERY_LENGTH) {
    return Response.json(
      { error: `Query must be at least ${MIN_QUERY_LENGTH} characters` },
      { status: 400 }
    );
  }

  // Validate maximum length
  if (query.length > MAX_QUERY_LENGTH) {
    return Response.json(
      { error: `Query cannot exceed ${MAX_QUERY_LENGTH} characters` },
      { status: 400 }
    );
  }

  // ... rest of handler
}

The maximum length prevents abuse—embedding extremely long queries wastes API calls and can degrade results.

Shaping the response

The raw Unrag response includes internal details (document IDs, indices) that your frontend probably doesn't need. Create a cleaner response shape:

export async function GET(request: Request) {
  // ... validation ...

  const engine = createUnragEngine();
  const result = await engine.retrieve({ query, topK: 10 });

  return Response.json({
    query,
    results: result.chunks.map((chunk) => ({
      id: chunk.id,
      content: chunk.content,
      source: chunk.sourceId,
      score: chunk.score,
      // Include useful metadata
      ...(chunk.metadata.title && { title: chunk.metadata.title }),
      ...(chunk.metadata.url && { url: chunk.metadata.url }),
    })),
    meta: {
      totalResults: result.chunks.length,
      searchTimeMs: result.durations.totalMs,
    },
  });
}

This response is frontend-friendly: relevant fields, predictable structure, and no internal implementation details leaking through.
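
If your frontend is also in TypeScript, it can help to pin this shape down as a shared type so the client and the route handler stay in sync. A minimal sketch (the field names mirror the response above; the types/search.ts path is just a suggestion):

// types/search.ts
export interface SearchResult {
  id: string;
  content: string;
  source: string;
  score: number;
  title?: string;
  url?: string;
}

export interface SearchResponse {
  query: string;
  results: SearchResult[];
  meta: {
    totalResults: number;
    searchTimeMs: number;
  };
}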

Adding scope support

Let users search within specific collections:

export async function GET(request: Request) {
  const { searchParams } = new URL(request.url);
  const query = searchParams.get("q")?.trim() ?? "";
  const collection = searchParams.get("collection");
  const topK = Math.min(parseInt(searchParams.get("limit") ?? "10", 10) || 10, 50);

  // ... validation ...

  const engine = createUnragEngine();
  
  const result = await engine.retrieve({
    query,
    topK,
    scope: collection ? { sourceId: collection } : undefined,
  });

  return Response.json({
    query,
    collection: collection ?? "all",
    results: result.chunks.map(/* ... */),
  });
}

Now /api/search?q=auth&collection=docs searches only documentation, while /api/search?q=auth searches everything.
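
If the set of collections is known ahead of time, you may also want to reject unknown values instead of silently returning zero results. A sketch, assuming your collections are named "docs", "blog", and "changelog" (substitute your own identifiers):

const KNOWN_COLLECTIONS = new Set(["docs", "blog", "changelog"]);

if (collection && !KNOWN_COLLECTIONS.has(collection)) {
  return Response.json(
    { error: `Unknown collection '${collection}'` },
    { status: 400 }
  );
}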

Error handling

Wrap everything in try-catch and return meaningful errors:

export async function GET(request: Request) {
  try {
    // ... validation and retrieval ...
    return Response.json({ query, results: /* ... */ });
  } catch (error) {
    console.error("Search error:", error);

    // Narrow the unknown error to a message we can match on
    const message = error instanceof Error ? error.message : String(error);

    // Handle specific error types
    if (message.includes("rate limit")) {
      return Response.json(
        { error: "Search is temporarily rate limited. Please try again." },
        { status: 429 }
      );
    }

    if (error.message?.includes("connection")) {
      return Response.json(
        { error: "Database temporarily unavailable" },
        { status: 503 }
      );
    }

    // Generic error for unexpected failures
    return Response.json(
      { error: "Search failed. Please try again." },
      { status: 500 }
    );
  }
}

Never expose internal error messages to users—they might contain sensitive information like connection strings or query details.

The complete endpoint

Here's everything combined:

// app/api/search/route.ts
import { createUnragEngine } from "@unrag/config";

const MAX_QUERY_LENGTH = 500;
const MIN_QUERY_LENGTH = 2;
const MAX_RESULTS = 50;
const DEFAULT_RESULTS = 10;

export async function GET(request: Request) {
  try {
    const { searchParams } = new URL(request.url);
    
    // Parse parameters
    const rawQuery = searchParams.get("q") ?? "";
    const query = rawQuery.trim();
    const collection = searchParams.get("collection") ?? undefined;
    const parsedLimit = parseInt(searchParams.get("limit") ?? "", 10);
    const limit = Math.min(
      Math.max(Number.isNaN(parsedLimit) ? DEFAULT_RESULTS : parsedLimit, 1),
      MAX_RESULTS
    );

    // Validate query
    if (!query) {
      return Response.json(
        { error: "Query parameter 'q' is required" },
        { status: 400 }
      );
    }

    if (query.length < MIN_QUERY_LENGTH) {
      return Response.json(
        { error: `Query must be at least ${MIN_QUERY_LENGTH} characters` },
        { status: 400 }
      );
    }

    if (query.length > MAX_QUERY_LENGTH) {
      return Response.json(
        { error: `Query cannot exceed ${MAX_QUERY_LENGTH} characters` },
        { status: 400 }
      );
    }

    // Execute search
    const engine = createUnragEngine();
    const result = await engine.retrieve({
      query,
      topK: limit,
      scope: collection ? { sourceId: collection } : undefined,
    });

    // Format response
    return Response.json({
      query,
      collection: collection ?? null,
      results: result.chunks.map((chunk) => ({
        id: chunk.id,
        content: chunk.content,
        source: chunk.sourceId,
        score: chunk.score,
        metadata: chunk.metadata,
      })),
      meta: {
        totalResults: result.chunks.length,
        searchTimeMs: Math.round(result.durations.totalMs),
        embeddingTimeMs: Math.round(result.durations.embeddingMs),
      },
    });
  } catch (error) {
    console.error("Search error:", error);

    const message = error instanceof Error ? error.message : String(error);

    if (message.includes("rate limit")) {
      return Response.json(
        { error: "Rate limited. Please try again shortly." },
        { status: 429 }
      );
    }

    return Response.json(
      { error: "Search failed. Please try again." },
      { status: 500 }
    );
  }
}
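
To see the endpoint from the consumer's side, here's a minimal client call. This is a sketch, not tied to any framework; it only assumes the response and error shapes defined above:

// Anywhere in your frontend code
async function search(query: string, collection?: string) {
  const params = new URLSearchParams({ q: query });
  if (collection) params.set("collection", collection);

  const response = await fetch(`/api/search?${params}`);
  const body = await response.json();

  if (!response.ok) {
    // The endpoint always returns { error } with a 4xx/5xx status
    throw new Error(body.error ?? "Search failed");
  }

  return body; // { query, collection, results, meta }
}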

Adding reranking for better precision

Vector similarity search is fast but imprecise. The first result isn't always the most relevant—it might be third or seventh. Reranking fixes this by adding a second stage that reorders results using a more expensive relevance model.

If you've installed the reranker battery, add reranking to your search endpoint:

// Execute search with reranking
const engine = createUnragEngine();

// Step 1: Retrieve more candidates than we need
const retrieved = await engine.retrieve({
  query,
  topK: 30, // Retrieve more for reranking
  scope: collection ? { sourceId: collection } : undefined,
});

// Step 2: Rerank to get the best results
const reranked = await engine.rerank({
  query,
  candidates: retrieved.chunks,
  topK: limit,
  onMissingReranker: "skip", // Graceful fallback if reranker not configured
});

// Use reranked.chunks instead of retrieved.chunks
return Response.json({
  query,
  results: reranked.chunks.map(/* ... */),
  meta: {
    totalResults: reranked.chunks.length,
    searchTimeMs: Math.round(retrieved.durations.totalMs),
    rerankTimeMs: Math.round(reranked.durations.rerankMs),
    reranked: reranked.meta.rerankerName !== "none",
  },
});

The pattern is straightforward: retrieve more candidates than you need (20-50), then rerank down to your target count. This typically adds 100-300ms of latency but significantly improves result quality.

See the Reranker documentation for installation and configuration details.

Enhancements to consider

Once your basic search works, you might want to add:

Reranking: Improve precision by reordering results with a more expensive relevance model. See the section above and the Reranker documentation.

Caching: Cache results for repeated queries. The embedding call is usually the slowest part, so even a simple cache helps; a small sketch follows this list.

Rate limiting: Protect your embedding API costs by limiting requests per user or IP.

Analytics: Log queries to understand what users are searching for. This helps you improve content and identify gaps.

Highlighting: Return which parts of the content matched the query. This requires additional processing but improves UX.

Facets: If your content has categories or tags, return counts of results per facet so users can filter.

Autocomplete: Build a separate endpoint that suggests queries based on past searches or content titles.

Each enhancement adds complexity, so add them as your needs require rather than all at once.
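
As an example of the caching idea above, a cache can start as small as an in-memory map in front of the engine. This is a sketch assuming exact-match query caching in a single long-lived server process; a shared store like Redis is the usual next step for multi-instance deployments:

// Naive in-memory cache keyed by collection + query
const cache = new Map<string, { body: unknown; expiresAt: number }>();
const CACHE_TTL_MS = 60_000; // one minute

function getCached(key: string): unknown | undefined {
  const entry = cache.get(key);
  if (!entry || entry.expiresAt < Date.now()) return undefined;
  return entry.body;
}

function setCached(key: string, body: unknown): void {
  cache.set(key, { body, expiresAt: Date.now() + CACHE_TTL_MS });
}

// In the handler, before running the search:
//   const key = `${collection ?? "all"}:${query}`;
//   const cached = getCached(key);
//   if (cached) return Response.json(cached);
// ...and after building the response body, call setCached(key, responseBody).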

For deeper coverage of retrieval strategies—including hybrid retrieval, query rewriting, and performance optimization—see Module 4: Retrieval in the RAG Handbook.
