Multimodal Ingestion
How Unrag handles images, PDFs, and other rich media alongside text.
Traditional RAG systems work with text: you chunk documents, embed the chunks, and retrieve similar text. But real-world content isn't just text. Your Notion pages have embedded PDFs. Your documentation includes diagrams. Your knowledge base might contain scanned contracts, audio transcripts, or product screenshots.
Unrag handles this through multimodal ingestion—a system that turns rich media into chunks that live alongside your text in a unified embedding space.
The easiest way to enable multimodal capabilities is during setup. Run bunx unrag@latest init --rich-media and the CLI will configure everything—multimodal embeddings, extractors, and the right assetProcessing flags. If you've already run init, you can re-run it with --rich-media to add multimodal support.
The unified space model
The key insight behind Unrag's multimodal support is simple: everything becomes chunks with embeddings. Whether the original content was text, a PDF, or an image, the end result is the same—vectors in your Postgres database that can be retrieved with text queries.
This means:
- A text query like "what are the Q3 revenue figures?" can retrieve chunks from a PDF financial report
- A question about "the architecture diagram" can find an embedded image with that caption
- Your retrieval code doesn't need to know whether a chunk came from text or extracted PDF content
The alternative—separate embedding spaces for different modalities—would require multiple searches and complex result merging. Unrag keeps it simple: one space, one query, mixed results.
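To make that concrete, here is a minimal sketch of querying the unified space. The retrieve method name, its options, and the result fields are assumptions for illustration, not documented API:

```ts
// Hypothetical retrieval call against the unified embedding space.
// The method name, options, and result shape are illustrative assumptions.
const results = await engine.retrieve({
  query: "what are the Q3 revenue figures?",
  topK: 5,
});

// Plain text chunks, extracted PDF text, and directly embedded images
// all come back in a single ranked list.
for (const chunk of results.chunks) {
  console.log(chunk.sourceId, chunk.score);
}
```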
How assets flow through the pipeline
When you call engine.ingest(), you can include an assets array alongside your text content:
```ts
await engine.ingest({
  sourceId: "docs:quarterly-report",
  content: "Q3 2024 Financial Summary...", // Text content
  assets: [
    {
      assetId: "pdf-1",
      kind: "pdf",
      data: { kind: "url", url: "https://..." },
    },
    {
      assetId: "chart-1",
      kind: "image",
      data: { kind: "url", url: "https://..." },
      text: "Revenue growth chart showing 15% YoY increase",
    },
  ],
});
```

The ingest pipeline processes these in order:
- Text chunking: Your content string is chunked and embedded as usual
- Asset processing: Each asset is processed based on its kind and your assetProcessing config
- Embedding: All resulting chunks (text + extracted/embedded assets) go through the embedding provider
- Storage: Everything lands in your database with metadata indicating the source
Connectors like Notion and Google Drive handle asset extraction automatically—they emit assets for you based on the content they encounter (images, PDFs, file embeds, etc.).
Text extraction vs direct embedding
Not all assets are handled the same way:
PDFs are turned into text through LLM extraction. By default, Unrag sends the PDF to Gemini (via Vercel AI Gateway) with a prompt asking it to extract all readable text. The extracted text is then chunked and embedded like any other text content.
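For a sense of what this looks like under the hood, here is a rough sketch of LLM-based PDF extraction using the Vercel AI SDK. The prompt, model id, and surrounding function are illustrative assumptions, not Unrag's exact extractor:

```ts
import { generateText } from "ai";

// Rough sketch of LLM-based PDF extraction, not Unrag's actual extractor.
// Assumes AI SDK v5-style gateway model strings and file message parts.
async function extractPdfText(pdfUrl: string): Promise<string> {
  const { text } = await generateText({
    model: "google/gemini-2.5-flash", // routed via Vercel AI Gateway (illustrative model id)
    messages: [
      {
        role: "user",
        content: [
          { type: "text", text: "Extract all readable text from this PDF. Return plain text only." },
          { type: "file", data: new URL(pdfUrl), mediaType: "application/pdf" },
        ],
      },
    ],
  });
  return text; // chunked and embedded like any other text afterwards
}
```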
Images can go two routes:
- If your embedding provider supports multimodal (you've set type: 'multimodal'), the image is embedded directly. This means text queries can semantically match against image content.
- If not, Unrag falls back to embedding any caption or alt text provided with the image.
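A simplified sketch of that branching, assuming a hypothetical embedding-provider interface (the property and method names are illustrative):

```ts
// Simplified branching for image assets; the provider interface here is hypothetical.
type ImageAsset = { assetId: string; url: string; text?: string };
type EmbeddingProvider = {
  type: "text" | "multimodal";
  embedText: (text: string) => Promise<number[]>;
  embedImage?: (url: string) => Promise<number[]>;
};

async function embedImageAsset(asset: ImageAsset, provider: EmbeddingProvider) {
  if (provider.type === "multimodal" && provider.embedImage) {
    return provider.embedImage(asset.url); // direct image embedding
  }
  if (asset.text) {
    return provider.embedText(asset.text); // caption / alt-text fallback
  }
  return null; // nothing usable to embed
}
```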
Audio, video, and generic files are not extracted in v1. By default, they're skipped during ingestion. You can configure Unrag to fail ingestion if an unsupported asset is encountered, which is useful for ensuring you don't silently miss content.
When to use multimodal
Multimodal ingestion shines when your content includes:
- Documentation with embedded PDFs — API specs, design docs, legal agreements
- Knowledge bases with diagrams — architecture diagrams, flowcharts, screenshots
- Notion workspaces — where people embed all kinds of media alongside text
- Research or reports — where figures and tables are as important as prose
If your content is pure text (markdown files, plain articles), you don't need to think about this—the text-only path works exactly as before.
What's supported in v1
| Asset Kind | Extraction Method | Status |
|---|---|---|
| pdf | LLM extraction (Gemini default) | Supported, opt-in |
| image | Direct embedding or caption fallback | Supported |
| audio | Not extracted | Skipped by default |
| video | Not extracted | Skipped by default |
| file | Not extracted | Skipped by default |
The onUnsupportedAsset config controls what happens when Unrag encounters an asset kind it can't process: "skip" (default) continues without that asset, "fail" throws an error.
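For example, a config along these lines (only the option names come from this page; the nesting shown is an assumption):

```ts
// Hypothetical config shape: only the option names (assetProcessing,
// onUnsupportedAsset) appear on this page; the nesting is assumed.
const config = {
  assetProcessing: {
    pdf: { enabled: true },
    image: { enabled: true },
    onUnsupportedAsset: "fail", // or "skip" (default) to continue without the asset
  },
};
```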
Cost considerations
PDF extraction uses LLM API calls, which have costs. A few things to keep in mind:
- Rich media is opt-in — if you run init without --rich-media, the generated config is cost-safe: text-only embeddings, no extractors registered, all assetProcessing flags disabled.
- Limits are configurable — you can set maxBytes to skip large PDFs, or maxOutputChars to cap extraction length.
- Per-ingest overrides — you can disable extraction for bulk imports and enable it for important documents.
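For instance, a per-ingest override might look roughly like this. The maxBytes and maxOutputChars names come from this page, but the exact shape of an assetProcessing field on ingest() is an assumption for illustration:

```ts
// Hypothetical per-ingest override: maxBytes and maxOutputChars are named on
// this page, but the exact shape of the assetProcessing field is assumed.
await engine.ingest({
  sourceId: "imports:bulk-2024-10",
  content: "...",
  assets: [/* ... */],
  assetProcessing: {
    pdf: { enabled: false }, // skip costly LLM extraction for a bulk import
  },
});

// For an important document, keep extraction on but bound its cost.
await engine.ingest({
  sourceId: "docs:master-services-agreement",
  content: "...",
  assets: [/* ... */],
  assetProcessing: {
    pdf: {
      enabled: true,
      maxBytes: 20_000_000,    // skip PDFs larger than ~20 MB
      maxOutputChars: 100_000, // cap extracted text length
    },
  },
});
```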
For most use cases, the cost is reasonable—a typical PDF extraction is comparable to a short chat completion. But if you're ingesting thousands of PDFs, plan accordingly.
When you run init --rich-media, the CLI enables multimodal embeddings and configures the extractors you selected. If you want to start cost-safe, just run init without the flag—you can always re-run with --rich-media when you're ready to add rich media.
