Multimodal Ingestion
How Unrag handles images, PDFs, and other rich media alongside text.
Traditional RAG systems work with text: you chunk documents, embed the chunks, and retrieve similar text. But real-world content isn't just text. Your Notion pages have embedded PDFs. Your documentation includes diagrams. Your knowledge base might contain scanned contracts, audio transcripts, or product screenshots.
Unrag handles this through multimodal ingestion—a system that turns rich media into chunks that live alongside your text in a unified embedding space.
The easiest way to enable multimodal capabilities is during setup. Run bunx unrag@latest init --rich-media and the CLI will configure everything—multimodal embeddings, extractors, and the right assetProcessing flags. If you've already run init, you can re-run it with --rich-media to add multimodal support.
The unified space model
The key insight behind Unrag's multimodal support is simple: everything becomes chunks with embeddings. Whether the original content was text, a PDF, or an image, the end result is the same—vectors in your Postgres database that can be retrieved with text queries.
This means:
- A text query like "what are the Q3 revenue figures?" can retrieve chunks from a PDF financial report
- A question about "the architecture diagram" can find an embedded image with that caption
- Your retrieval code doesn't need to know whether a chunk came from text or extracted PDF content
The alternative—separate embedding spaces for different modalities—would require multiple searches and complex result merging. Unrag keeps it simple: one space, one query, mixed results.
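To make that concrete, here is a minimal sketch of querying the unified space. The retrieve method name, its options, and the result fields are assumptions for illustration, not documented API:

```ts
// Hypothetical retrieval call against the unified embedding space.
// The method name, options, and result shape are illustrative assumptions.
const results = await engine.retrieve({
  query: "what are the Q3 revenue figures?",
  topK: 5,
});

// Plain text chunks, extracted PDF text, and directly embedded images
// all come back in a single ranked list.
for (const chunk of results.chunks) {
  console.log(chunk.sourceId, chunk.score);
}
```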
How assets flow through the pipeline
When you call engine.ingest(), you can include an assets array alongside your text content:
```ts
await engine.ingest({
  sourceId: "docs:quarterly-report",
  content: "Q3 2024 Financial Summary...", // Text content
  assets: [
    {
      assetId: "pdf-1",
      kind: "pdf",
      data: { kind: "url", url: "https://..." },
    },
    {
      assetId: "chart-1",
      kind: "image",
      data: { kind: "url", url: "https://..." },
      text: "Revenue growth chart showing 15% YoY increase",
    },
  ],
});
```

The ingest pipeline processes these in order:
- Text chunking: Your content string is chunked and embedded as usual
- Asset processing: Each asset is processed based on its kind and your assetProcessing config
- Embedding: All resulting chunks (text + extracted/embedded assets) go through the embedding provider
- Storage: Everything lands in your database with metadata indicating the source
Connectors like Notion and Google Drive handle asset extraction automatically—they emit assets for you based on the content they encounter (images, PDFs, file embeds, etc.).
Text extraction vs direct embedding
Not all assets are handled the same way:
PDFs are turned into text through LLM extraction. By default, Unrag sends the PDF to Gemini (via Vercel AI Gateway) with a prompt asking it to extract all readable text. The extracted text is then chunked and embedded like any other text content.
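For a sense of what this looks like under the hood, here is a rough sketch of LLM-based PDF extraction using the Vercel AI SDK. The prompt, model id, and surrounding function are illustrative assumptions, not Unrag's exact extractor:

```ts
import { generateText } from "ai";

// Rough sketch of LLM-based PDF extraction, not Unrag's actual extractor.
// Assumes AI SDK v5-style gateway model strings and file message parts.
async function extractPdfText(pdfUrl: string): Promise<string> {
  const { text } = await generateText({
    model: "google/gemini-2.5-flash", // routed via Vercel AI Gateway (illustrative model id)
    messages: [
      {
        role: "user",
        content: [
          { type: "text", text: "Extract all readable text from this PDF. Return plain text only." },
          { type: "file", data: new URL(pdfUrl), mediaType: "application/pdf" },
        ],
      },
    ],
  });
  return text; // chunked and embedded like any other text afterwards
}
```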
Images can go two routes:
- If your embedding provider supports multimodal (you've set type: 'multimodal'), the image is embedded directly. This means text queries can semantically match against image content.
- If not, Unrag falls back to embedding any caption or alt text provided with the image.
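A simplified sketch of that branching, assuming a hypothetical embedding-provider interface (the property and method names are illustrative):

```ts
// Simplified branching for image assets; the provider interface here is hypothetical.
type ImageAsset = { assetId: string; url: string; text?: string };
type EmbeddingProvider = {
  type: "text" | "multimodal";
  embedText: (text: string) => Promise<number[]>;
  embedImage?: (url: string) => Promise<number[]>;
};

async function embedImageAsset(asset: ImageAsset, provider: EmbeddingProvider) {
  if (provider.type === "multimodal" && provider.embedImage) {
    return provider.embedImage(asset.url); // direct image embedding
  }
  if (asset.text) {
    return provider.embedText(asset.text); // caption / alt-text fallback
  }
  return null; // nothing usable to embed
}
```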
Audio, video, and generic files are not extracted in v1. By default, they're skipped during ingestion. You can configure Unrag to fail ingestion if an unsupported asset is encountered, which is useful for ensuring you don't silently miss content.
When to use multimodal
Multimodal ingestion shines when your content includes:
- Documentation with embedded PDFs — API specs, design docs, legal agreements
- Knowledge bases with diagrams — architecture diagrams, flowcharts, screenshots
- Notion workspaces — where people embed all kinds of media alongside text
- Research or reports — where figures and tables are as important as prose
If your content is pure text (markdown files, plain articles), you don't need to think about this—the text-only path works exactly as before.
What's supported in v1
| Asset Kind | Extraction Method | Status |
|---|---|---|
| pdf | LLM extraction (Gemini default) | Supported, opt-in |
| image | Direct embedding or caption fallback | Supported |
| audio | Not extracted | Skipped by default |
| video | Not extracted | Skipped by default |
| file | Not extracted | Skipped by default |
The onUnsupportedAsset config controls what happens when Unrag encounters an asset kind it can't process: "skip" (default) continues without that asset, "fail" throws an error.
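For example, a config along these lines (only the option names come from this page; the nesting shown is an assumption):

```ts
// Hypothetical config shape: only the option names (assetProcessing,
// onUnsupportedAsset) appear on this page; the nesting is assumed.
const config = {
  assetProcessing: {
    pdf: { enabled: true },
    image: { enabled: true },
    onUnsupportedAsset: "fail", // or "skip" (default) to continue without the asset
  },
};
```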
Cost considerations
PDF extraction uses LLM API calls, which have costs. A few things to keep in mind:
- Rich media is opt-in — if you run init without --rich-media, the generated config is cost-safe: text-only embeddings, no extractors registered, all assetProcessing flags disabled.
- Limits are configurable — you can set maxBytes to skip large PDFs, or maxOutputChars to cap extraction length.
- Per-ingest overrides — you can disable extraction for bulk imports and enable it for important documents.
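For instance, a per-ingest override might look roughly like this. The maxBytes and maxOutputChars names come from this page, but the exact shape of an assetProcessing field on ingest() is an assumption for illustration:

```ts
// Hypothetical per-ingest override: maxBytes and maxOutputChars are named on
// this page, but the exact shape of the assetProcessing field is assumed.
await engine.ingest({
  sourceId: "imports:bulk-2024-10",
  content: "...",
  assets: [/* ... */],
  assetProcessing: {
    pdf: { enabled: false }, // skip costly LLM extraction for a bulk import
  },
});

// For an important document, keep extraction on but bound its cost.
await engine.ingest({
  sourceId: "docs:master-services-agreement",
  content: "...",
  assets: [/* ... */],
  assetProcessing: {
    pdf: {
      enabled: true,
      maxBytes: 20_000_000,    // skip PDFs larger than ~20 MB
      maxOutputChars: 100_000, // cap extracted text length
    },
  },
});
```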
For most use cases, the cost is reasonable—a typical PDF extraction is comparable to a short chat completion. But if you're ingesting thousands of PDFs, plan accordingly.
When you run init --rich-media, the CLI enables multimodal embeddings and configures the extractors you selected. If you want to start cost-safe, just run init without the flag—you can always re-run with --rich-media when you're ready to add rich media.
