
image:embed Extractor

Embed images directly into the vector space for visual similarity search.

The image:embed extractor sends image data to a multimodal embedding model, producing a vector that represents the image's visual content. This vector lives in the same space as text embeddings, enabling cross-modal retrieval.

How it works

  1. For bytes: Image data is passed directly to the multimodal embedding model
  2. For URLs: The URL is fetched server-side using assetProcessing.fetch settings, then the bytes are passed to the model
  3. The model returns a vector representing the image's semantic content
  4. The vector is stored alongside text chunk embeddings
  5. Text queries can match image embeddings and vice versa
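Step 5 works because both modalities share one vector space: a text query vector can be scored directly against stored image vectors. As a toy illustration (assuming your vector store ranks by cosine similarity, which is typical but store-dependent):

// Toy sketch: how a text query vector is compared against an image vector.
// Both must come from the same multimodal model.
function cosineSimilarity(a: number[], b: number[]): number {
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}

// cosineSimilarity(textQueryVector, imageVector); higher means more similar.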

Security note: Image URLs are fetched server-side before anything is sent to the embedding provider, so only the image bytes reach third-party APIs; internal or signed URLs never leave your infrastructure. These fetches honor your assetProcessing.fetch settings, so fetch.allowedHosts applies to image embedding and restricts which hosts images may be loaded from.
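For example, to lock image fetching to a single CDN host (a sketch; the exact nesting of assetProcessing inside defineUnragConfig and the limit values are assumptions to adapt to your setup):

export const unrag = defineUnragConfig({
  // ...
  assetProcessing: {
    fetch: {
      enabled: true,
      // Only these hosts may be fetched; internal hosts stay unreachable.
      allowedHosts: ["cdn.example.com"],
      maxBytes: 10_000_000, // refuse oversized downloads
      timeoutMs: 10_000,    // abort slow fetches
    },
  },
} as const);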

Requirements

You need a multimodal embedding model that supports image inputs:

import { createAiEmbeddingProvider } from "@unrag/embedding/ai";

const embedding = createAiEmbeddingProvider({
  type: "multimodal",
  model: "cohere/embed-v4.0",
  timeoutMs: 30_000,
});

Supported models

Provider  Model                Notes
Cohere    embed-v4.0           Recommended. High quality, supports images + text
Voyage    voyage-multimodal-3  Images supported in multimodal mode

The model must embed both text and images into the same vector space. Using different models for text and images would create incompatible embeddings.
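Incompatible models usually surface first as a dimension mismatch. A minimal guard (an assumption worth stating: equal dimensions are necessary but not sufficient, since two different models can share a dimension yet produce unrelated vectors):

// Sanity check: vectors from mismatched models often differ in length.
function assertCompatibleDimensions(textVec: number[], imageVec: number[]): void {
  if (textVec.length !== imageVec.length) {
    throw new Error(
      `Embedding dimension mismatch: text=${textVec.length}, image=${imageVec.length}; ` +
        "embed text and images with the same multimodal model",
    );
  }
}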

Configuration

Image embedding is enabled automatically when your embedding provider supports it. No additional configuration is needed.

To explicitly use multimodal mode in your config:

export const unrag = defineUnragConfig({
  // ...
  embedding: {
    provider: "ai",
    config: {
      type: "multimodal",
      model: "cohere/embed-v4.0",
      timeoutMs: 30_000,
    },
  },
} as const);

Usage example

Ingesting images

import { createUnragEngine } from "@unrag/config";

const engine = createUnragEngine();

await engine.ingest({
  sourceId: "products:widget-x",
  content: "The Widget X is our flagship product with a sleek design.",
  assets: [
    {
      assetId: "hero-image",
      kind: "image",
      data: {
        kind: "url",
        url: "https://cdn.example.com/products/widget-x-hero.jpg",
        mediaType: "image/jpeg",
      },
      uri: "https://cdn.example.com/products/widget-x-hero.jpg",
      text: "Widget X product photo", // Optional caption (stored in chunk.content)
    },
    {
      assetId: "diagram",
      kind: "image",
      data: {
        kind: "url",
        url: "https://cdn.example.com/products/widget-x-diagram.png",
        mediaType: "image/png",
      },
      uri: "https://cdn.example.com/products/widget-x-diagram.png",
      text: "Technical diagram showing internal components",
    },
  ],
});

Ingesting images from bytes

import { readFile } from "node:fs/promises";

const imageBytes = await readFile("./images/photo.jpg");

await engine.ingest({
  sourceId: "photos:vacation-2024",
  content: "Photos from summer vacation",
  assets: [
    {
      assetId: "beach-sunset",
      kind: "image",
      data: {
        kind: "bytes",
        bytes: new Uint8Array(imageBytes),
        mediaType: "image/jpeg",
        filename: "beach-sunset.jpg",
      },
      text: "Sunset over the ocean at Malibu beach",
    },
  ],
});

Retrieving image matches

Text queries find relevant images:

import { getChunkAssetRef } from "@unrag/core";

const result = await engine.retrieve({
  query: "product diagram showing components",
  topK: 10,
});

for (const chunk of result.chunks) {
  const ref = getChunkAssetRef(chunk);
  
  if (ref?.assetKind === "image" && ref.extractor === "image:embed") {
    console.log("Found image via embedding:");
    console.log(`  Score: ${chunk.score}`);
    console.log(`  Caption: ${chunk.content}`);
    console.log(`  Asset ID: ${ref.assetId}`);
    console.log(`  URL: ${ref.assetUri}`);
  }
}

Resolving the original image

The chunk contains references, not bytes. To get the actual image:

import { getChunkAssetRef, type ChunkAssetRef } from "@unrag/core";

async function resolveImageBytes(ref: ChunkAssetRef): Promise<Uint8Array> {
  // Option 1: Fetch from stored URI
  if (ref.assetUri) {
    const res = await fetch(ref.assetUri);
    if (!res.ok) throw new Error(`Failed to fetch image: ${res.status}`);
    return new Uint8Array(await res.arrayBuffer());
  }
  
  // Option 2: Look up from your own asset store by ID
  // (myAssetStore is your own storage layer, not provided by Unrag)
  return await myAssetStore.getImage(ref.assetId);
}

// Usage
const ref = getChunkAssetRef(chunk);
if (ref?.assetKind === "image") {
  const bytes = await resolveImageBytes(ref);
  // Use the image bytes...
}

URL expiration: Some connectors (like Notion) provide signed URLs that expire. If you need long-term access, download and store images in your own storage during ingestion, then use assetId for resolution.
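One hedged sketch of that pattern, with uploadToMyStorage standing in for whatever storage you own (S3, local disk, etc.); the helper name and the durable-URL shape are assumptions, not part of Unrag:

import { createUnragEngine } from "@unrag/config";

// Hypothetical: your own storage layer; returns a durable URL for the stored bytes.
declare function uploadToMyStorage(key: string, bytes: Uint8Array): Promise<string>;

const engine = createUnragEngine();

async function ingestWithStableStorage(signedUrl: string) {
  // Download while the signed URL is still valid.
  const res = await fetch(signedUrl);
  if (!res.ok) throw new Error(`Download failed: ${res.status}`);
  const bytes = new Uint8Array(await res.arrayBuffer());

  // Persist the bytes somewhere you control.
  const stableUri = await uploadToMyStorage("products/widget-x-hero.jpg", bytes);

  await engine.ingest({
    sourceId: "products:widget-x",
    content: "Widget X product page",
    assets: [
      {
        assetId: "hero-image",
        kind: "image",
        data: { kind: "bytes", bytes, mediaType: "image/jpeg" },
        uri: stableUri, // resolvable long after the signed URL expires
      },
    ],
  });
}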

What gets stored

For each image chunk, Unrag stores:

Field                          Content
chunk.content                  Caption text (from assets[].text), may be empty
chunk.metadata.assetKind       "image"
chunk.metadata.assetId         Your provided asset ID
chunk.metadata.assetUri        URL (if provided)
chunk.metadata.assetMediaType  MIME type (if provided)
chunk.metadata.extractor       "image:embed"
embedding                      Vector from multimodal model

The image bytes are not stored in the database. You're responsible for storing/resolving them if needed after retrieval.

Cost considerations

Multimodal embedding typically costs more than text embedding. A few ways to keep costs down:

  • Embed once, query many: Images are embedded at ingest time; queries are text-only (cheap)
  • Batch ingestion: Group multiple images in a single ingest call when possible
  • Caption fallback: For less important images, good captions with text embedding may suffice (see the sketch below)
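A sketch of the caption fallback, assuming a text-only embedding provider is configured. The skip warning's name (asset_skipped_image_no_multimodal_and_no_caption, see Troubleshooting below) suggests an image asset that carries a caption is still indexed via its caption text:

await engine.ingest({
  sourceId: "docs:setup-guide",
  content: "Installation walkthrough",
  assets: [
    {
      assetId: "wiring-figure",
      kind: "image",
      data: {
        kind: "url",
        url: "https://cdn.example.com/docs/wiring.png",
        mediaType: "image/png",
      },
      // With a text-only model, this caption is what gets embedded;
      // without it, the asset would be skipped.
      text: "Wiring diagram: power supply to controller board pinout",
    },
  ],
});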

Troubleshooting

Images being skipped

If images are skipped, check result.warnings:

const result = await engine.ingest({ ... });
for (const w of result.warnings) {
  if (w.code === "asset_skipped_image_no_multimodal_and_no_caption") {
    console.log(`Image ${w.assetId} skipped: no multimodal support and no caption`);
  }
  if (w.code === "asset_processing_error" && w.stage === "fetch") {
    console.log(`Image ${w.assetId} skipped: URL fetch failed`);
  }
}

Fix for "no multimodal and no caption": Either switch to a multimodal embedding model, or provide captions.

Fix for "URL fetch failed": Check your assetProcessing.fetch settings:

  • Is fetch.enabled set to true?
  • Is the image host in fetch.allowedHosts (if configured)?
  • Is the URL accessible and within fetch.maxBytes / fetch.timeoutMs limits?

Poor retrieval quality

If text queries aren't finding relevant images:

  1. Check that images and text use the same embedding model
  2. Try more specific queries
  3. Consider whether the model actually "understands" your image types (e.g., photographs vs. dense technical diagrams)
  4. Add captions to supplement visual embedding
