Model Selection
Choosing an embedding model and understanding the tradeoffs.
The embedding model you choose affects retrieval quality, latency, and cost. There's no universally best model—the right choice depends on your content, query patterns, and budget. Here's how to think about the decision.
This page covers the conceptual considerations for choosing a model. For setup instructions for specific providers, see Providers.
The key tradeoffs
Dimensions vs. performance: Higher-dimensional embeddings capture more nuance but require more storage and slightly slower similarity calculations. OpenAI's text-embedding-3-small produces 1536-dimensional vectors; text-embedding-3-large produces 3072 dimensions. For most use cases, the quality difference doesn't justify doubling your storage.
Quality vs. cost: Better models cost more per embedding. If you're embedding millions of documents, the cost difference between models adds up. Start with a cheaper model and upgrade only if retrieval quality demonstrably suffers.
Latency: All cloud embedding APIs have similar latency (100-300ms per call), but some models process tokens faster than others. For real-time applications, the embedding call is usually your bottleneck.
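To put the storage side of that tradeoff in concrete terms, here is a rough estimate assuming float32 vectors (4 bytes per dimension) and ignoring index and metadata overhead; the chunk counts below are placeholder inputs, not measurements.

// Rough storage estimate: dimensions * 4 bytes (float32) per chunk, ignoring index overhead.
function estimateVectorStorageMB(chunkCount: number, dimensions: number): number {
  const bytesPerVector = dimensions * 4; // float32
  return (chunkCount * bytesPerVector) / (1024 * 1024);
}

console.log(estimateVectorStorageMB(1_000_000, 1536)); // ~5,859 MB
console.log(estimateVectorStorageMB(1_000_000, 3072)); // ~11,719 MB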
Recommended starting points
For documentation, articles, and support content:
export const unrag = defineUnragConfig({
  // ...
  embedding: {
    provider: "openai",
    config: {
      model: "text-embedding-3-small",
      timeoutMs: 15_000,
    },
  },
} as const);

This model is fast, cheap, and produces good results for most English text. It's the default for a reason.
When quality is paramount or you have multilingual content:
export const unrag = defineUnragConfig({
  // ...
  embedding: {
    provider: "cohere",
    config: {
      model: "embed-multilingual-v3.0",
      timeoutMs: 20_000,
    },
  },
} as const);

Cohere's multilingual model handles multiple languages well and is optimized for retrieval tasks.
For cost control or privacy requirements, use Ollama:
export const unrag = defineUnragConfig({
  // ...
  embedding: {
    provider: "ollama",
    config: {
      model: "nomic-embed-text",
      timeoutMs: 30_000,
    },
  },
} as const);

Running locally eliminates API costs, reduces latency, and keeps your data private. See Ollama Provider for setup instructions.
Keeping embeddings consistent
Critical
All vectors in your database must come from the same embedding model. You cannot mix embeddings from different models and expect similarity search to work correctly.
Different models produce vectors in different semantic spaces. A query embedded with model A cannot meaningfully compare to chunks embedded with model B. The math produces numbers, but those numbers are meaningless.
This means:
- Pick a model and stick with it. Don't change models on a whim.
- If you change models, re-embed everything. There's no shortcut. Delete your existing embeddings and regenerate them with the new model.
- Store which model was used. Unrag records embedding_dimension in the database, and the embeddingModel field in responses tells you which model was active.
Detecting model changes
Unrag tracks the embedding model name in every response:
const result = await engine.ingest({ sourceId: "doc-1", content: "..." });
console.log(result.embeddingModel);
// "openai:text-embedding-3-small"
const retrieved = await engine.retrieve({ query: "test" });
console.log(retrieved.embeddingModel);
// Should match!

If you're switching models, verify that your retrieval is using the same model that was used for ingestion. A mismatch produces results, but they'll be essentially random rather than semantically meaningful.
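If you'd rather catch a mismatch programmatically than by eye, a small guard works. This continues from the engine above; the expectedModel constant and assertEmbeddingModel helper are illustrative names, not Unrag APIs, and the only field assumed is the embeddingModel shown above.

// Sketch: fail loudly if retrieval is running on a different model than expected.
// `expectedModel` and `assertEmbeddingModel` are hypothetical, not part of Unrag.
const expectedModel = "openai:text-embedding-3-small";

function assertEmbeddingModel(actual: string): void {
  if (actual !== expectedModel) {
    throw new Error(`Embedding model mismatch: expected ${expectedModel}, got ${actual}`);
  }
}

const check = await engine.retrieve({ query: "test" });
assertEmbeddingModel(check.embeddingModel);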
Re-embedding after model changes
When you decide to switch models, the safest approach is:
- Deploy the new model to your config but don't use it yet
- Run a re-ingestion job that processes all your content with the new model
- Verify retrieval quality with test queries
- Delete old embeddings if you're confident the new ones are correct
// Re-ingestion script
async function reembed() {
  const engine = createUnragEngine(); // Now using new model

  // Fetch all documents from your database
  const docs = await fetchAllDocuments();

  for (const doc of docs) {
    await engine.ingest({
      sourceId: doc.sourceId,
      content: doc.content,
      metadata: doc.metadata,
    });
    console.log(`Re-embedded: ${doc.sourceId}`);
  }
}

For large datasets, batch this work and track progress. Re-embedding 100,000 documents takes hours and costs money—plan accordingly.
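For larger corpora, a batched variant of the same script keeps progress visible and makes failures easier to resume. This is a sketch under the same assumptions as above: fetchAllDocuments and the batch size are placeholders to adapt to your own storage.

// Batched re-ingestion (sketch). Fixed-size batches make progress visible and
// let you resume from the last logged batch if the job fails partway through.
const BATCH_SIZE = 100;

async function reembedInBatches() {
  const engine = createUnragEngine(); // Now using new model
  const docs = await fetchAllDocuments(); // placeholder: your own data access

  for (let i = 0; i < docs.length; i += BATCH_SIZE) {
    const batch = docs.slice(i, i + BATCH_SIZE);
    await Promise.all(
      batch.map((doc) =>
        engine.ingest({
          sourceId: doc.sourceId,
          content: doc.content,
          metadata: doc.metadata,
        })
      )
    );
    console.log(`Re-embedded ${Math.min(i + BATCH_SIZE, docs.length)} / ${docs.length}`);
  }
}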
Evaluating retrieval quality
How do you know if a model change improved things? You need test queries with expected results:
const testCases = [
  { query: "how do I reset my password?", expectedSourceId: "docs:auth" },
  { query: "pricing for enterprise", expectedSourceId: "docs:pricing" },
  // ... more cases
];

for (const { query, expectedSourceId } of testCases) {
  const result = await engine.retrieve({ query, topK: 5 });
  const found = result.chunks.some((c) => c.sourceId.includes(expectedSourceId));
  console.log(`${query}: ${found ? "PASS" : "FAIL"}`);
}

Build a set of representative queries and track whether the expected documents appear in results. Run this before and after model changes to quantify the impact.
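To compare models with a single number instead of scanning PASS/FAIL lines, the same test cases can be aggregated into a hit rate (recall at k). This builds on the loop above; the evaluate helper is an illustrative name, not an Unrag API.

// Aggregate the test cases above into a single hit rate (recall@k).
// `testCases` and `engine` come from the snippet above; `evaluate` is a hypothetical helper.
async function evaluate(topK: number): Promise<number> {
  let hits = 0;
  for (const { query, expectedSourceId } of testCases) {
    const result = await engine.retrieve({ query, topK });
    if (result.chunks.some((c) => c.sourceId.includes(expectedSourceId))) {
      hits += 1;
    }
  }
  return hits / testCases.length;
}

// Run this before and after a model change and compare the two numbers.
console.log(`recall@5: ${(await evaluate(5)).toFixed(2)}`);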
Dimension truncation
Some embedding models support dimension truncation—you can request fewer dimensions to save space while retaining most of the semantic information. OpenAI's text-embedding-3 models support this through the dimensions config option:
embedding: {
  provider: "openai",
  config: {
    model: "text-embedding-3-small",
    dimensions: 512, // Truncate from 1536 to 512
  },
},

Truncation can reduce storage by 60-70% with a modest quality tradeoff. If you use dimension truncation, make sure your database column is sized appropriately and that you're consistent—all vectors in your store should have the same dimensionality.
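One way to stay consistent is to define the truncated dimension once and reference it from both the embedding config and whatever code sizes your vector column. The EMBEDDING_DIMENSIONS constant below is a naming suggestion, not something Unrag requires.

// Sketch: a single shared constant keeps the embedding config and your schema
// or migration code from drifting apart. EMBEDDING_DIMENSIONS is hypothetical.
export const EMBEDDING_DIMENSIONS = 512;

export const unrag = defineUnragConfig({
  // ...
  embedding: {
    provider: "openai",
    config: {
      model: "text-embedding-3-small",
      dimensions: EMBEDDING_DIMENSIONS,
    },
  },
} as const);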
Local embedding models
For cost control or privacy requirements, consider running embedding models locally with Ollama. Models like nomic-embed-text run on modest hardware and produce reasonable results:
embedding: {
  provider: "ollama",
  config: {
    model: "nomic-embed-text",
  },
},

Local models eliminate API costs entirely, reduce latency (no network round-trip), and keep your data private. The quality may be lower than cloud models for some use cases, but it's often sufficient. See Ollama Provider for details.
Multimodal embedding models
When your content includes images that carry semantic meaning (diagrams, charts, screenshots), consider a multimodal embedding model. These models embed both text and images into the same vector space, allowing text queries to match image content directly.
Currently, Voyage is the only built-in provider with multimodal support:
embedding: {
  provider: "voyage",
  config: {
    type: "multimodal",
    model: "voyage-multimodal-3",
  },
},

Multimodal models must embed both text and images into the same semantic space. You cannot mix a text-only model for documents with a separate vision model for images—the embeddings wouldn't be comparable.
When to use multimodal
Use multimodal when:
- Your content has diagrams, charts, or photos that carry information
- You want "show me the architecture diagram" to find actual diagrams
- Image captions aren't detailed enough for text-only search
Stick with text-only when:
- Your content is primarily text
- Images are decorative (logos, memes)
- You want to minimize embedding costs
Fallback behavior
If you use a text-only embedding provider but your ingest includes images, Unrag falls back to embedding image captions. This works well when images have descriptive alt text or captions. See Multimodal Embeddings for details.
Related
- Providers - Setup instructions for each embedding provider
- Performance - How embedding affects overall performance
- Reindexing - How to safely re-embed when changing models
