Multimodal Embeddings
Embed images directly alongside text in the same vector space.
By default, Unrag's embedding providers handle text only. But some embedding models can embed both text and images into the same vector space—meaning a text query can semantically match image content.
This page explains how multimodal embeddings work and how to enable them.
What multimodal means
In a multimodal embedding space:
- Text is embedded as usual (query strings, document chunks)
- Images are embedded directly from their pixels
- Both live in the same vector space with the same dimensions
This means a query like "architecture diagram" can retrieve an actual architecture diagram image, not just text that mentions one. The embedding model understands the semantic content of images.
Enabling multimodal mode
Currently, Voyage AI is the only built-in provider with multimodal support. Configure it with type: "multimodal":
```ts
import { defineUnragConfig } from "./lib/unrag/core";

export const unrag = defineUnragConfig({
  // ...
  embedding: {
    provider: "voyage",
    config: {
      type: "multimodal",
      model: "voyage-multimodal-3",
      timeoutMs: 30_000,
    },
  },
} as const);
```
With multimodal enabled, the provider exposes an embedImage function that the ingest pipeline uses for image assets.
Which models support multimodal
Not all embedding models support images. Among Unrag's built-in providers, only Voyage currently offers multimodal embedding:
| Provider | Model | Multimodal |
|---|---|---|
| Voyage | voyage-multimodal-3 | ✓ |
| OpenAI | text-embedding-3-small | — |
| OpenAI | text-embedding-3-large | — |
| Google | gemini-embedding-001 | — |
| Cohere | embed-english-v3.0 | — |
If you need multimodal embeddings with a different provider, you can implement a custom provider that supports the embedImage interface (documented below under "The image embedding interface").
The embedding model must embed both text and images into the same space. Using different models for text and images would create incompatible embedding spaces—retrieval wouldn't work correctly.
How image embedding works
When you ingest an image asset with multimodal enabled:
- Bytes: Image bytes are passed directly to the embedding provider
- URLs: The URL is fetched server-side using assetProcessing.fetch settings, then the resulting bytes are passed to the provider
- The provider calls the multimodal model's image embedding endpoint
- A vector is returned representing the image's semantic content
- This vector is stored alongside text chunk vectors
During retrieval, your text query is embedded and compared against all vectors—both text chunks and image embeddings.
Security note: Image URLs are never passed directly to embedding providers. Unrag fetches the bytes server-side first, which means your assetProcessing.fetch allowlist and security settings apply to image embedding—preventing internal or signed URLs from being leaked to third-party APIs.
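To make that flow concrete, here is an illustrative sketch in code. This is not Unrag's actual ingest code: fetchImageBytes and provider are hypothetical stand-ins, and only the input fields mirror the ImageEmbeddingInput shape documented later on this page.

```ts
// Illustrative sketch of the image-asset path at ingest time (not Unrag's internals).
declare function fetchImageBytes(url: string): Promise<Uint8Array>; // stand-in for Unrag's server-side fetch (assetProcessing.fetch)
declare const provider: {
  embedImage?: (input: {
    data: Uint8Array;
    mediaType?: string;
    metadata: Record<string, unknown>;
    sourceId: string;
    documentId: string;
  }) => Promise<number[]>;
};

async function embedOneImage(url: string): Promise<number[]> {
  // 1. Fetch the bytes server-side; the raw URL is never sent to the embedding API.
  const data = await fetchImageBytes(url);

  // 2. Only multimodal providers expose embedImage; otherwise the caption fallback applies (see below).
  if (!provider.embedImage) throw new Error("provider is not multimodal");

  // 3. The model embeds the pixels and returns a vector in the same space as text embeddings;
  //    Unrag stores it alongside text chunk vectors.
  return provider.embedImage({
    data,
    mediaType: "image/jpeg",
    metadata: {},
    sourceId: "product:widget-x",
    documentId: "doc-1",
  });
}
```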
Resolving an image result back to bytes
Retrieval returns standard chunks. For image matches, Unrag stores references to the originating asset in chunk.metadata (not the image bytes):
- chunk.metadata.assetKind === "image"
- chunk.metadata.assetId (stable id emitted at ingest time)
- optional chunk.metadata.assetUri / chunk.metadata.assetMediaType
- chunk.metadata.extractor === "image:embed"
chunk.content will be the image caption/alt text (if provided) and may be an empty string (for example if you didn't provide a caption, or if you disabled storage.storeChunkContent).
To get the actual image, resolve it via your asset store:
- If you stored a URL/URI: fetch chunk.metadata.assetUri (note: connector URLs like Notion can expire).
- If you store bytes yourself: look up the image by chunk.metadata.assetId.
For convenience, you can use getChunkAssetRef to extract a typed reference:
```ts
import { getChunkAssetRef, type ChunkAssetRef } from "@unrag/core";

for (const chunk of result.chunks) {
  const ref = getChunkAssetRef(chunk);
  if (ref?.assetKind === "image") {
    console.log("image asset", ref.assetId, ref.assetUri);
  }
}
```
Example: resolve bytes from a retrieved asset chunk (URL-based)
This pattern works when assetUri is a stable, fetchable URL (or a signed URL that hasn't expired):
```ts
import { getChunkAssetRef, type ChunkAssetRef } from "@unrag/core";

async function fetchAssetBytes(ref: ChunkAssetRef): Promise<Uint8Array> {
  if (!ref.assetUri) throw new Error(`No assetUri for assetId=${ref.assetId}`);
  const res = await fetch(ref.assetUri);
  if (!res.ok) {
    throw new Error(`Failed to fetch asset (${res.status}) assetId=${ref.assetId}`);
  }
  return new Uint8Array(await res.arrayBuffer());
}

for (const chunk of result.chunks) {
  const ref = getChunkAssetRef(chunk);
  if (ref?.assetKind !== "image") continue;
  const bytes = await fetchAssetBytes(ref);
  console.log("image bytes length", bytes.length);
}
```
Example: resolve bytes from your own blob store (assetId-based)
If you ingest images as bytes, Unrag embeds them but does not persist the bytes. To make results resolvable later, store the bytes yourself keyed by assetId (or put your blob key in assets[].metadata and read it back from chunk.metadata):
```ts
import { getChunkAssetRef } from "@unrag/core";

for (const chunk of result.chunks) {
  const ref = getChunkAssetRef(chunk);
  if (ref?.assetKind !== "image") continue;

  // Pseudocode: implement this in your app
  const bytes = await myAssetStore.getBytes(ref.assetId);
  console.log("image bytes length", bytes.length);
}
```
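For that lookup to work, the bytes have to be stored at ingest time. Below is a minimal sketch of the ingest side, using the same hypothetical myAssetStore as above; the bytes-style data payload is an assumption mirroring the url variant from the complete example, so check your Unrag version's asset types for the exact shape.

```ts
import { readFile } from "node:fs/promises";

// `myAssetStore` is a placeholder for your own blob store (S3, Postgres bytea, ...).
declare const myAssetStore: {
  putBytes: (key: string, bytes: Uint8Array) => Promise<void>;
};

const assetId = "hero-image";
const bytes = new Uint8Array(await readFile("./hero.png"));

// 1. Persist the bytes under the same assetId you pass to ingest, so the
//    retrieval-time lookup (myAssetStore.getBytes above) can find them again.
await myAssetStore.putBytes(assetId, bytes);

// 2. Ingest the image with that assetId. The `data` shape here is assumed,
//    not confirmed by this page; adjust to your asset types.
await engine.ingest({
  sourceId: "product:widget-x",
  content: "The Widget X is our flagship product...",
  assets: [{ assetId, kind: "image", data: { kind: "bytes", bytes } }],
});
```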
The image embedding interface
The multimodal provider adds an embedImage function:
```ts
type ImageEmbeddingInput = {
  data: Uint8Array; // Image bytes (URLs are fetched server-side first)
  mediaType?: string; // e.g., "image/jpeg"
  metadata: Metadata;
  assetId?: string;
  sourceId: string;
  documentId: string;
};

type EmbeddingProvider = {
  name: string;
  dimensions?: number;
  embed: (input: EmbeddingInput) => Promise<number[]>;
  embedImage?: (input: ImageEmbeddingInput) => Promise<number[]>;
};
```
The ingest pipeline checks for embedImage and uses it when processing image assets. For URL-based images, Unrag fetches the bytes using assetProcessing.fetch before calling embedImage.
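A custom provider only needs to satisfy this shape. The sketch below assumes a hypothetical myModelClient SDK whose text and image endpoints return vectors in the same space; the exact fields on EmbeddingInput (assumed here to include a text field) and the way you wire a custom provider into your config depend on your Unrag version.

```ts
// Sketch of a custom multimodal provider matching the EmbeddingProvider shape above.
// `myModelClient` is hypothetical; replace with your model's real API calls.
declare const myModelClient: {
  embedText: (text: string) => Promise<number[]>;
  embedImage: (bytes: Uint8Array, mediaType?: string) => Promise<number[]>;
};

const myMultimodalProvider = {
  name: "my-multimodal-model",
  dimensions: 1024, // text and image vectors must share dimensions (same space)

  // Text chunks and queries take the regular embed path.
  // Assumes EmbeddingInput carries the text to embed; the field name may differ.
  embed: async (input: { text: string }) => myModelClient.embedText(input.text),

  // Image assets take embedImage; both paths must land in the same vector space.
  embedImage: async (input: { data: Uint8Array; mediaType?: string }) =>
    myModelClient.embedImage(input.data, input.mediaType),
};
```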
Customizing image embedding
For advanced use cases with Voyage's multimodal mode, you can customize how image values are formatted:
```ts
embedding: {
  provider: "voyage",
  config: {
    type: "multimodal",
    model: "voyage-multimodal-3",
    // Custom formatter for image embedding values
    image: {
      value: (input) => ({
        image: [input.data], // Provider-specific format
      }),
    },
  },
},
```
The default behavior works for most cases, but this escape hatch lets you adapt to API changes or special requirements.
Fallback behavior
If your embedding provider doesn't support multimodal (no embedImage function), images fall back to caption embedding:
- Unrag checks if the image has a text field (caption/alt text)
- If present, the caption is chunked and embedded as text
- If not, the image is skipped with a warning
This means you can use a text-only embedding model and still get some value from images—as long as they have descriptive captions.
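For example, with a text-only provider you can still make images searchable by giving them captions. In the snippet below the caption goes in the asset's text field, as described above; the rest of the asset shape follows the complete example later on this page.

```ts
// With a text-only embedding provider (no embedImage), only the caption is embedded.
await engine.ingest({
  sourceId: "product:widget-x",
  content: "The Widget X is our flagship product...",
  assets: [
    {
      assetId: "hero-image",
      kind: "image",
      data: { kind: "url", url: "https://..." },
      // Caption/alt text: this is what gets chunked and embedded by a text-only
      // provider; without it, the image would be skipped with a warning.
      text: "Photo of the Widget X in brushed aluminum on a white background",
    },
  ],
});
```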
When to use text-only vs multimodal
Use text-only when:
- Your content is primarily text
- You don't need semantic image search
- You want to minimize embedding costs
- Your chosen text model produces better results for your domain
Use multimodal when:
- Your content includes diagrams, charts, or photos
- You want "show me the architecture diagram" to actually find diagrams
- Image captions aren't descriptive enough for text search
- Visual content is as important as text content
Cost considerations
Multimodal embedding models often cost more per embedding than text-only models. Consider:
- Image frequency: How many images are you ingesting?
- Query volume: Every retrieval still embeds the query as text (cheap)
- Caption quality: Could good captions give you 80% of the value at lower cost?
For many use cases, high-quality captions with a text-only model work well. Multimodal is most valuable when images contain information that captions can't capture.
Performance with images
Image embedding calls are typically slower than text embeddings, and unlike text embeddings, they don't support batching (each image requires its own API call). Unrag respects the same concurrency limit for image embeddings as for text, so your defaults.embedding.concurrency setting controls how many images embed in parallel.
If you're ingesting many images and hitting rate limits or timeouts, consider lowering concurrency:
```ts
defaults: {
  embedding: {
    concurrency: 2, // Conservative for image-heavy ingestion
  },
},
```
See Performance for more details on tuning embedding throughput.
Complete example
Here's a config that enables multimodal embedding with Voyage:
```ts
// unrag.config.ts
import { defineUnragConfig } from "./lib/unrag/core";
import { createDrizzleVectorStore } from "./lib/unrag/store/drizzle";
import { drizzle } from "drizzle-orm/node-postgres";
import { Pool } from "pg";

export const unrag = defineUnragConfig({
  defaults: {
    chunking: { chunkSize: 200, chunkOverlap: 40 },
    retrieval: { topK: 8 },
  },
  embedding: {
    provider: "voyage",
    config: {
      type: "multimodal",
      model: "voyage-multimodal-3",
      timeoutMs: 30_000,
    },
  },
  engine: {
    // ... other config
  },
} as const);

export function createUnragEngine() {
  const pool = new Pool({ connectionString: process.env.DATABASE_URL });
  const db = drizzle(pool);
  const store = createDrizzleVectorStore(db);
  return unrag.createEngine({ store });
}
```
Now when you ingest images, they're embedded directly:
```ts
await engine.ingest({
  sourceId: "product:widget-x",
  content: "The Widget X is our flagship product...",
  assets: [
    {
      assetId: "hero-image",
      kind: "image",
      data: { kind: "url", url: "https://..." },
      // Caption is optional with multimodal—the image itself is embedded
    },
  ],
});

// Later, this query can find the image:
const result = await engine.retrieve({
  query: "widget product photo",
});
```
More on Voyage
See Voyage AI Provider for complete configuration options and available models.
