Multimodal Embeddings
Embed images directly alongside text in the same vector space.
By default, Unrag's embedding providers handle text only. But some embedding models can embed both text and images into the same vector space—meaning a text query can semantically match image content.
This page explains how multimodal embeddings work and how to enable them.
What multimodal means
In a multimodal embedding space:
- Text is embedded as usual (query strings, document chunks)
- Images are embedded directly from their pixels
- Both live in the same vector space with the same dimensions
This means a query like "architecture diagram" can retrieve an actual architecture diagram image, not just text that mentions one. The embedding model understands the semantic content of images.
Enabling multimodal mode
Currently, Voyage AI is the only built-in provider with multimodal support. Configure it with type: "multimodal":
```ts
import { defineUnragConfig } from "./lib/unrag/core";

export const unrag = defineUnragConfig({
  // ...
  embedding: {
    provider: "voyage",
    config: {
      type: "multimodal",
      model: "voyage-multimodal-3",
      timeoutMs: 30_000,
    },
  },
} as const);
```
With multimodal enabled, the provider exposes an embedImage function that the ingest pipeline uses for image assets.
Which models support multimodal
Not all embedding models support images. Among Unrag's built-in providers, only Voyage currently offers multimodal embedding:
| Provider | Model | Multimodal |
|---|---|---|
| Voyage | voyage-multimodal-3 | ✓ |
| OpenAI | text-embedding-3-small | — |
| OpenAI | text-embedding-3-large | — |
| Google | gemini-embedding-001 | — |
| Cohere | embed-english-v3.0 | — |
If you need multimodal embeddings with a different provider, you can implement a custom provider that supports the embedImage interface (documented below under "The image embedding interface").
The embedding model must embed both text and images into the same space. Using different models for text and images would create incompatible embedding spaces—retrieval wouldn't work correctly.
How image embedding works
When you ingest an image asset with multimodal enabled:
- Bytes: Image bytes are passed directly to the embedding provider
- URLs: The URL is fetched server-side using assetProcessing.fetch settings, then the resulting bytes are passed to the provider
- The provider calls the multimodal model's image embedding endpoint
- A vector is returned representing the image's semantic content
- This vector is stored alongside text chunk vectors
During retrieval, your text query is embedded and compared against all vectors—both text chunks and image embeddings.
Security note: Image URLs are never passed directly to embedding providers. Unrag fetches the bytes server-side first, which means your assetProcessing.fetch allowlist and security settings apply to image embedding—preventing internal or signed URLs from being leaked to third-party APIs.
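To make that flow concrete, here is an illustrative sketch in code. This is not Unrag's actual ingest code: fetchImageBytes and provider are hypothetical stand-ins, and only the input fields mirror the ImageEmbeddingInput shape documented later on this page.

```ts
// Illustrative sketch of the image-asset path at ingest time (not Unrag's internals).
declare function fetchImageBytes(url: string): Promise<Uint8Array>; // stand-in for Unrag's server-side fetch (assetProcessing.fetch)
declare const provider: {
  embedImage?: (input: {
    data: Uint8Array;
    mediaType?: string;
    metadata: Record<string, unknown>;
    sourceId: string;
    documentId: string;
  }) => Promise<number[]>;
};

async function embedOneImage(url: string): Promise<number[]> {
  // 1. Fetch the bytes server-side; the raw URL is never sent to the embedding API.
  const data = await fetchImageBytes(url);

  // 2. Only multimodal providers expose embedImage; otherwise the caption fallback applies (see below).
  if (!provider.embedImage) throw new Error("provider is not multimodal");

  // 3. The model embeds the pixels and returns a vector in the same space as text embeddings;
  //    Unrag stores it alongside text chunk vectors.
  return provider.embedImage({
    data,
    mediaType: "image/jpeg",
    metadata: {},
    sourceId: "product:widget-x",
    documentId: "doc-1",
  });
}
```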
Resolving an image result back to bytes
Retrieval returns standard chunks. For image matches, Unrag stores references to the originating asset in chunk.metadata (not the image bytes):
- chunk.metadata.assetKind === "image"
- chunk.metadata.assetId (stable id emitted at ingest time)
- optional chunk.metadata.assetUri / chunk.metadata.assetMediaType
- chunk.metadata.extractor === "image:embed"
chunk.content will be the image caption/alt text (if provided) and may be an empty string (for example if you didn't provide a caption, or if you disabled storage.storeChunkContent).
To get the actual image, resolve it via your asset store:
- If you stored a URL/URI: fetch chunk.metadata.assetUri (note: connector URLs like Notion can expire).
- If you store bytes yourself: look up the image by chunk.metadata.assetId.
For convenience, you can use getChunkAssetRef to extract a typed reference:
```ts
import { getChunkAssetRef, type ChunkAssetRef } from "@unrag/core";

for (const chunk of result.chunks) {
  const ref = getChunkAssetRef(chunk);
  if (ref?.assetKind === "image") {
    console.log("image asset", ref.assetId, ref.assetUri);
  }
}
```
Example: resolve bytes from a retrieved asset chunk (URL-based)
This pattern works when assetUri is a stable, fetchable URL (or a signed URL that hasn't expired):
```ts
import { getChunkAssetRef, type ChunkAssetRef } from "@unrag/core";

async function fetchAssetBytes(ref: ChunkAssetRef): Promise<Uint8Array> {
  if (!ref.assetUri) throw new Error(`No assetUri for assetId=${ref.assetId}`);
  const res = await fetch(ref.assetUri);
  if (!res.ok) {
    throw new Error(`Failed to fetch asset (${res.status}) assetId=${ref.assetId}`);
  }
  return new Uint8Array(await res.arrayBuffer());
}

for (const chunk of result.chunks) {
  const ref = getChunkAssetRef(chunk);
  if (ref?.assetKind !== "image") continue;
  const bytes = await fetchAssetBytes(ref);
  console.log("image bytes length", bytes.length);
}
```
Example: resolve bytes from your own blob store (assetId-based)
If you ingest images as bytes, Unrag embeds them but does not persist the bytes. To make results resolvable later, store the bytes yourself keyed by assetId (or put your blob key in assets[].metadata and read it back from chunk.metadata):
```ts
import { getChunkAssetRef } from "@unrag/core";

for (const chunk of result.chunks) {
  const ref = getChunkAssetRef(chunk);
  if (ref?.assetKind !== "image") continue;

  // Pseudocode: implement this in your app
  const bytes = await myAssetStore.getBytes(ref.assetId);
  console.log("image bytes length", bytes.length);
}
```
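For that lookup to work, the bytes have to be stored at ingest time. Below is a minimal sketch of the ingest side, using the same hypothetical myAssetStore as above; the bytes-style data payload is an assumption mirroring the url variant from the complete example, so check your Unrag version's asset types for the exact shape.

```ts
import { readFile } from "node:fs/promises";

// `myAssetStore` is a placeholder for your own blob store (S3, Postgres bytea, ...).
declare const myAssetStore: {
  putBytes: (key: string, bytes: Uint8Array) => Promise<void>;
};

const assetId = "hero-image";
const bytes = new Uint8Array(await readFile("./hero.png"));

// 1. Persist the bytes under the same assetId you pass to ingest, so the
//    retrieval-time lookup (myAssetStore.getBytes above) can find them again.
await myAssetStore.putBytes(assetId, bytes);

// 2. Ingest the image with that assetId. The `data` shape here is assumed,
//    not confirmed by this page; adjust to your asset types.
await engine.ingest({
  sourceId: "product:widget-x",
  content: "The Widget X is our flagship product...",
  assets: [{ assetId, kind: "image", data: { kind: "bytes", bytes } }],
});
```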
The image embedding interface
The multimodal provider adds an embedImage function:
```ts
type ImageEmbeddingInput = {
  data: Uint8Array; // Image bytes (URLs are fetched server-side first)
  mediaType?: string; // e.g., "image/jpeg"
  metadata: Metadata;
  assetId?: string;
  sourceId: string;
  documentId: string;
};

type EmbeddingProvider = {
  name: string;
  dimensions?: number;
  embed: (input: EmbeddingInput) => Promise<number[]>;
  embedImage?: (input: ImageEmbeddingInput) => Promise<number[]>;
};
```
The ingest pipeline checks for embedImage and uses it when processing image assets. For URL-based images, Unrag fetches the bytes using assetProcessing.fetch before calling embedImage.
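A custom provider only needs to satisfy this shape. The sketch below assumes a hypothetical myModelClient SDK whose text and image endpoints return vectors in the same space; the exact fields on EmbeddingInput (assumed here to include a text field) and the way you wire a custom provider into your config depend on your Unrag version.

```ts
// Sketch of a custom multimodal provider matching the EmbeddingProvider shape above.
// `myModelClient` is hypothetical; replace with your model's real API calls.
declare const myModelClient: {
  embedText: (text: string) => Promise<number[]>;
  embedImage: (bytes: Uint8Array, mediaType?: string) => Promise<number[]>;
};

const myMultimodalProvider = {
  name: "my-multimodal-model",
  dimensions: 1024, // text and image vectors must share dimensions (same space)

  // Text chunks and queries take the regular embed path.
  // Assumes EmbeddingInput carries the text to embed; the field name may differ.
  embed: async (input: { text: string }) => myModelClient.embedText(input.text),

  // Image assets take embedImage; both paths must land in the same vector space.
  embedImage: async (input: { data: Uint8Array; mediaType?: string }) =>
    myModelClient.embedImage(input.data, input.mediaType),
};
```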
Customizing image embedding
For advanced use cases with Voyage's multimodal mode, you can customize how image values are formatted:
```ts
embedding: {
  provider: "voyage",
  config: {
    type: "multimodal",
    model: "voyage-multimodal-3",
    // Custom formatter for image embedding values
    image: {
      value: (input) => ({
        image: [input.data], // Provider-specific format
      }),
    },
  },
},
```
The default behavior works for most cases, but this escape hatch lets you adapt to API changes or special requirements.
Fallback behavior
If your embedding provider doesn't support multimodal (no embedImage function), images fall back to caption embedding:
- Unrag checks if the image has a text field (caption/alt text)
- If present, the caption is chunked and embedded as text
- If not, the image is skipped with a warning
This means you can use a text-only embedding model and still get some value from images—as long as they have descriptive captions.
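For example, with a text-only provider you can still make images searchable by giving them captions. In the snippet below the caption goes in the asset's text field, as described above; the rest of the asset shape follows the complete example later on this page.

```ts
// With a text-only embedding provider (no embedImage), only the caption is embedded.
await engine.ingest({
  sourceId: "product:widget-x",
  content: "The Widget X is our flagship product...",
  assets: [
    {
      assetId: "hero-image",
      kind: "image",
      data: { kind: "url", url: "https://..." },
      // Caption/alt text: this is what gets chunked and embedded by a text-only
      // provider; without it, the image would be skipped with a warning.
      text: "Photo of the Widget X in brushed aluminum on a white background",
    },
  ],
});
```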
When to use text-only vs multimodal
Use text-only when:
- Your content is primarily text
- You don't need semantic image search
- You want to minimize embedding costs
- Your chosen text model produces better results for your domain
Use multimodal when:
- Your content includes diagrams, charts, or photos
- You want "show me the architecture diagram" to actually find diagrams
- Image captions aren't descriptive enough for text search
- Visual content is as important as text content
Cost considerations
Multimodal embedding models often cost more per embedding than text-only models. Consider:
- Image frequency: How many images are you ingesting?
- Query volume: Every retrieval still embeds the query as text (cheap)
- Caption quality: Could good captions give you 80% of the value at lower cost?
For many use cases, high-quality captions with a text-only model work well. Multimodal is most valuable when images contain information that captions can't capture.
Performance with images
Image embedding calls are typically slower than text embeddings, and unlike text embeddings, they don't support batching (each image requires its own API call). Unrag respects the same concurrency limit for image embeddings as for text, so your defaults.embedding.concurrency setting controls how many images embed in parallel.
If you're ingesting many images and hitting rate limits or timeouts, consider lowering concurrency:
```ts
defaults: {
  embedding: {
    concurrency: 2, // Conservative for image-heavy ingestion
  },
},
```
See Performance for more details on tuning embedding throughput.
Complete example
Here's a config that enables multimodal embedding with Voyage:
```ts
// unrag.config.ts
import { defineUnragConfig } from "./lib/unrag/core";
import { createDrizzleVectorStore } from "./lib/unrag/store/drizzle";
import { drizzle } from "drizzle-orm/node-postgres";
import { Pool } from "pg";

export const unrag = defineUnragConfig({
  defaults: {
    chunking: { chunkSize: 200, chunkOverlap: 40 },
    retrieval: { topK: 8 },
  },
  embedding: {
    provider: "voyage",
    config: {
      type: "multimodal",
      model: "voyage-multimodal-3",
      timeoutMs: 30_000,
    },
  },
  engine: {
    // ... other config
  },
} as const);

export function createUnragEngine() {
  const pool = new Pool({ connectionString: process.env.DATABASE_URL });
  const db = drizzle(pool);
  const store = createDrizzleVectorStore(db);
  return unrag.createEngine({ store });
}
```
Now when you ingest images, they're embedded directly:
```ts
await engine.ingest({
  sourceId: "product:widget-x",
  content: "The Widget X is our flagship product...",
  assets: [
    {
      assetId: "hero-image",
      kind: "image",
      data: { kind: "url", url: "https://..." },
      // Caption is optional with multimodal—the image itself is embedded
    },
  ],
});

// Later, this query can find the image:
const result = await engine.retrieve({
  query: "widget product photo",
});
```
More on Voyage
See Voyage AI Provider for complete configuration options and available models.
