Model Selection

Choosing an embedding model and understanding the tradeoffs.

The embedding model you choose affects retrieval quality, latency, and cost. There's no universally best model—the right choice depends on your content, query patterns, and budget. Here's how to think about the decision.

This page covers the conceptual considerations for choosing a model. For setup instructions for specific providers, see Providers.

The key tradeoffs

Dimensions vs. performance: Higher-dimensional embeddings capture more nuance but require more storage and slightly slower similarity calculations. OpenAI's text-embedding-3-small produces 1536-dimensional vectors; text-embedding-3-large produces 3072 dimensions. For most use cases, the quality difference doesn't justify doubling your storage.
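
To put the storage side in concrete terms, here's a rough back-of-the-envelope calculation. It assumes 4 bytes per dimension (float32), which is how most vector stores persist embeddings; your real footprint will be higher once indexes and metadata are included.

// Rough storage estimate for the raw vector data only.
// Assumes 4-byte float32 values per dimension.
const BYTES_PER_DIMENSION = 4;

function vectorStorageGiB(chunkCount: number, dimensions: number): number {
  return (chunkCount * dimensions * BYTES_PER_DIMENSION) / 1024 ** 3;
}

console.log(vectorStorageGiB(1_000_000, 1536).toFixed(1)); // ~5.7 GiB (text-embedding-3-small)
console.log(vectorStorageGiB(1_000_000, 3072).toFixed(1)); // ~11.4 GiB (text-embedding-3-large)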

Quality vs. cost: Better models cost more per embedding. If you're embedding millions of documents, the cost difference between models adds up. Start with a cheaper model and upgrade only if retrieval quality demonstrably suffers.

Latency: Most cloud embedding APIs have similar latency (typically 100-300ms per call), but some models process tokens faster than others. For real-time applications, the embedding call is usually your bottleneck.

Recommended configurations

For documentation, articles, and support content:

export const unrag = defineUnragConfig({
  // ...
  embedding: {
    provider: "openai",
    config: {
      model: "text-embedding-3-small",
      timeoutMs: 15_000,
    },
  },
} as const);

This model is fast, cheap, and produces good results for most English text. It's the default for a reason.

When quality is paramount or you have multilingual content:

export const unrag = defineUnragConfig({
  // ...
  embedding: {
    provider: "cohere",
    config: {
      model: "embed-multilingual-v3.0",
      timeoutMs: 20_000,
    },
  },
} as const);

Cohere's multilingual model handles multiple languages well and is optimized for retrieval tasks.

For cost control or privacy requirements, use Ollama:

export const unrag = defineUnragConfig({
  // ...
  embedding: {
    provider: "ollama",
    config: {
      model: "nomic-embed-text",
      timeoutMs: 30_000,
    },
  },
} as const);

Running the model locally eliminates API costs, avoids the network round-trip, and keeps your data private. See Ollama Provider for setup instructions.

Keeping embeddings consistent

Critical

All vectors in your database must come from the same embedding model. You cannot mix embeddings from different models and expect similarity search to work correctly.

Different models produce vectors in different semantic spaces. A query embedded with model A cannot be meaningfully compared with chunks embedded with model B. The math still produces numbers, but those numbers are meaningless.

This means:

  1. Pick a model and stick with it. Don't change models on a whim.

  2. If you change models, re-embed everything. There's no shortcut. Delete your existing embeddings and regenerate them with the new model.

  3. Store which model was used. Unrag records embedding_dimension in the database, and the embeddingModel field in responses tells you which model was active.

Detecting model changes

Unrag tracks the embedding model name in every response:

const result = await engine.ingest({ sourceId: "doc-1", content: "..." });
console.log(result.embeddingModel);
// "openai:text-embedding-3-small"

const retrieved = await engine.retrieve({ query: "test" });
console.log(retrieved.embeddingModel);
// Should match!

If you're switching models, verify that your retrieval is using the same model that was used for ingestion. A mismatch produces results, but they'll be essentially random rather than semantically meaningful.
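
One lightweight way to catch this early is a health check that ingests a throwaway document and confirms retrieval reports the same model. A minimal sketch, assuming the ingest and retrieve responses expose embeddingModel as shown above and using the same engine instance as the snippet above (the sourceId and query strings are just placeholders):

// Hypothetical health check: ingest a throwaway document, then confirm
// that ingestion and retrieval report the same embedding model.
async function assertConsistentEmbeddingModel() {
  const ingested = await engine.ingest({
    sourceId: "healthcheck:embedding-model",
    content: "embedding model consistency check",
  });
  const retrieved = await engine.retrieve({ query: "embedding model consistency check", topK: 1 });

  if (ingested.embeddingModel !== retrieved.embeddingModel) {
    throw new Error(
      `Model mismatch: ingest used ${ingested.embeddingModel}, retrieve used ${retrieved.embeddingModel}`,
    );
  }
}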

Re-embedding after model changes

When you decide to switch models, the safest approach is:

  1. Update your config to the new model, but don't use it for live retrieval yet
  2. Run a re-ingestion job that processes all your content with the new model
  3. Verify retrieval quality with test queries
  4. Delete the old embeddings once you're confident the new ones are correct

// Re-ingestion script
async function reembed() {
  const engine = createUnragEngine(); // Now using new model
  
  // Fetch all documents from your database
  const docs = await fetchAllDocuments();
  
  for (const doc of docs) {
    await engine.ingest({
      sourceId: doc.sourceId,
      content: doc.content,
      metadata: doc.metadata,
    });
    console.log(`Re-embedded: ${doc.sourceId}`);
  }
}

For large datasets, batch this work and track progress. Re-embedding 100,000 documents takes hours and costs money—plan accordingly.
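
One way to structure that work is to process documents in fixed-size batches and log progress as you go. A sketch building on the script above (the batch size is arbitrary; tune it to your provider's rate limits):

// Batched re-embedding with a simple progress log.
const BATCH_SIZE = 20; // arbitrary starting point; adjust for your rate limits

async function reembedInBatches() {
  const engine = createUnragEngine(); // now using the new model
  const docs = await fetchAllDocuments();

  for (let i = 0; i < docs.length; i += BATCH_SIZE) {
    const batch = docs.slice(i, i + BATCH_SIZE);
    await Promise.all(
      batch.map((doc) =>
        engine.ingest({ sourceId: doc.sourceId, content: doc.content, metadata: doc.metadata }),
      ),
    );
    console.log(`Re-embedded ${Math.min(i + BATCH_SIZE, docs.length)} of ${docs.length} documents`);
  }
}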

Evaluating retrieval quality

How do you know if a model change improved things? You need test queries with expected results:

const testCases = [
  { query: "how do I reset my password?", expectedSourceId: "docs:auth" },
  { query: "pricing for enterprise", expectedSourceId: "docs:pricing" },
  // ... more cases
];

for (const { query, expectedSourceId } of testCases) {
  const result = await engine.retrieve({ query, topK: 5 });
  const found = result.chunks.some((c) => c.sourceId.includes(expectedSourceId));
  console.log(`${query}: ${found ? "PASS" : "FAIL"}`);
}

Build a set of representative queries and track whether the expected documents appear in results. Run this before and after model changes to quantify the impact.
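
To turn that into a number you can compare before and after a model change, compute the fraction of test queries whose expected document shows up in the top results. A small sketch reusing the testCases array above:

// Recall-style score over the test set: the share of queries whose
// expected source appears in the top-K retrieved chunks.
async function recallAtK(topK: number): Promise<number> {
  let hits = 0;
  for (const { query, expectedSourceId } of testCases) {
    const result = await engine.retrieve({ query, topK });
    if (result.chunks.some((c) => c.sourceId.includes(expectedSourceId))) {
      hits += 1;
    }
  }
  return hits / testCases.length;
}

const score = await recallAtK(5);
console.log(`recall@5: ${(score * 100).toFixed(0)}%`);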

Dimension truncation

Some embedding models support dimension truncation—you can request fewer dimensions to save space while retaining most of the semantic information. OpenAI's text-embedding-3 models support this through the dimensions config option:

embedding: {
  provider: "openai",
  config: {
    model: "text-embedding-3-small",
    dimensions: 512,  // Truncate from 1536 to 512
  },
},

Truncation can reduce storage by 60-70% with a modest quality tradeoff. If you use dimension truncation, make sure your database column is sized appropriately and that you're consistent—all vectors in your store should have the same dimensionality.
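
If you want to guard against accidentally mixing dimensionalities, a startup check is cheap. The sketch below is hypothetical: it assumes you supply a small helper that reads the stored embedding_dimension value (mentioned earlier) from your database.

// Hypothetical startup guard: compare the configured dimensionality
// against whatever is already stored, and fail fast on a mismatch.
const CONFIGURED_DIMENSIONS = 512; // keep in sync with embedding.config.dimensions

async function assertStoredDimensionsMatch(
  readStoredEmbeddingDimension: () => Promise<number | null>, // your own DB lookup
): Promise<void> {
  const stored = await readStoredEmbeddingDimension();
  if (stored !== null && stored !== CONFIGURED_DIMENSIONS) {
    throw new Error(
      `Config uses ${CONFIGURED_DIMENSIONS} dimensions but the store holds ${stored}-dimensional vectors; re-embed before querying.`,
    );
  }
}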

Local embedding models

For cost control or privacy requirements, consider running embedding models locally with Ollama. Models like nomic-embed-text run on modest hardware and produce reasonable results:

embedding: {
  provider: "ollama",
  config: {
    model: "nomic-embed-text",
  },
},

Local models eliminate API costs entirely, reduce latency (no network round-trip), and keep your data private. The quality may be lower than cloud models for some use cases, but it's often sufficient. See Ollama Provider for details.

Multimodal embedding models

When your content includes images that carry semantic meaning (diagrams, charts, screenshots), consider a multimodal embedding model. These models embed both text and images into the same vector space, allowing text queries to match image content directly.

Currently, Voyage is the only built-in provider with multimodal support:

embedding: {
  provider: "voyage",
  config: {
    type: "multimodal",
    model: "voyage-multimodal-3",
  },
},

Multimodal models must embed both text and images into the same semantic space. You cannot mix a text-only model for documents with a separate vision model for images—the embeddings wouldn't be comparable.

When to use multimodal

Use multimodal when:

  • Your content has diagrams, charts, or photos that carry information
  • You want "show me the architecture diagram" to find actual diagrams
  • Image captions aren't detailed enough for text-only search

Stick with text-only when:

  • Your content is primarily text
  • Images are decorative (logos, memes)
  • You want to minimize embedding costs

Fallback behavior

If you use a text-only embedding provider but your ingest includes images, Unrag falls back to embedding image captions. This works well when images have descriptive alt text or captions. See Multimodal Embeddings for details.

Next steps

  • Providers - Setup instructions for each embedding provider
  • Performance - How embedding affects overall performance
  • Reindexing - How to safely re-embed when changing models
