Extractors Overview

Extract text, metadata, and embeddings from rich media assets.

Extractors transform rich media (PDFs, images, audio, video, and document files) into searchable content. Each asset kind has one or more extractors that produce text chunks or direct embeddings.

How extractors work

When you ingest content with assets, Unrag's ingest pipeline routes each asset to the appropriate extractor based on its kind. The extractor produces either:

  • Text chunks: extracted/transcribed text that's chunked and embedded like normal text
  • Direct embeddings: vector representations of the asset itself (e.g., multimodal image embeddings)

The extractor used is recorded in chunk.metadata.extractor so you can identify the source during retrieval.
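As a rough illustration of using that field, the sketch below groups retrieved chunks by the extractor that produced them. The `ChunkLike` type is a local stand-in rather than Unrag's real chunk type, and the `"text"` fallback key for chunks with no extractor recorded is an assumption made for this example.

```typescript
// Local stand-in for illustration; the real chunk type comes from Unrag.
type ChunkLike = {
  content: string;
  metadata: { extractor?: string };
};

// Group chunks by the extractor recorded in chunk.metadata.extractor.
// Chunks without an extractor go under "text" (an assumed fallback key).
function groupByExtractor(chunks: ChunkLike[]): Map<string, ChunkLike[]> {
  const groups = new Map<string, ChunkLike[]>();
  for (const chunk of chunks) {
    const key = chunk.metadata.extractor ?? "text";
    const bucket = groups.get(key) ?? [];
    bucket.push(chunk);
    groups.set(key, bucket);
  }
  return groups;
}

const sample: ChunkLike[] = [
  { content: "Quarterly revenue grew.", metadata: { extractor: "pdf:llm" } },
  { content: "A bar chart of revenue.", metadata: { extractor: "image:caption" } },
  { content: "Plain paragraph.", metadata: {} },
  { content: "Appendix table.", metadata: { extractor: "pdf:llm" } },
];

const grouped = groupByExtractor(sample);
```

Grouping like this makes it easy to weight or display, say, LLM-extracted PDF text differently from image captions.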

Extractor metadata

Every chunk produced by an extractor includes metadata fields you can use during retrieval:

import { getChunkAssetRef } from "@unrag/core";

const ref = getChunkAssetRef(chunk);
if (ref) {
  console.log(ref.assetKind);    // "pdf" | "image"
  console.log(ref.assetId);      // stable identifier from ingest
  console.log(ref.extractor);    // "pdf:llm" | "image:embed" | "image:caption" | ...
  console.log(ref.assetUri);     // optional URL/path
  console.log(ref.assetMediaType); // optional MIME type
}
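
One possible use of these fields is building a source label to show next to a retrieved chunk. The sketch below is only an illustration: `AssetRefLike` mirrors the fields shown above but is a local type, not the one exported by `@unrag/core`, and `formatSourceLabel` is a hypothetical helper.

```typescript
// Local mirror of the fields shown above; the real type comes from @unrag/core.
type AssetRefLike = {
  assetKind: string;
  assetId: string;
  extractor: string;
  assetUri?: string;
  assetMediaType?: string;
};

// Build a short human-readable label, preferring the URI over the ID when present.
function formatSourceLabel(ref: AssetRefLike): string {
  const location = ref.assetUri ?? ref.assetId;
  const mediaType = ref.assetMediaType ? `, ${ref.assetMediaType}` : "";
  return `[${ref.assetKind}${mediaType}] ${location} (via ${ref.extractor})`;
}

const label = formatSourceLabel({
  assetKind: "pdf",
  assetId: "asset-123",
  extractor: "pdf:llm",
  assetUri: "https://example.com/report.pdf",
  assetMediaType: "application/pdf",
});
// label is "[pdf, application/pdf] https://example.com/report.pdf (via pdf:llm)"
```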

Configuring extractors

Extractors are configured via assetProcessing in your unrag.config.ts. See Asset Processing Reference for the full configuration schema.

export const unrag = defineUnragConfig({
  // ...
  engine: {
    // ...
    assetProcessing: {
      onUnsupportedAsset: "skip",
      onError: "skip",
      pdf: {
        llmExtraction: {
          enabled: true,
          model: "google/gemini-2.0-flash",
          // ...
        },
      },
    },
  },
} as const);

Installing extractors

The easiest way to install extractors is during setup:

bunx unrag@latest init --rich-media

This prompts you to select which extractors you want, then installs and configures them automatically. The CLI handles importing the extractors, registering them in your config, and enabling the corresponding assetProcessing flags.

If you've already run init, you can re-run with --rich-media to add extractor support. Your existing configuration is preserved.

Manual installation

If you prefer to install extractors one at a time, or want to add more after the initial setup, use the CLI's add command:

bunx unrag@latest add extractor pdf-llm

This copies the extractor source files to lib/unrag/extractors/pdf-llm/ and adds any required dependencies to your package.json.

After manual installation, you need to register the extractor in unrag.config.ts:

import { createPdfLlmExtractor } from "./lib/unrag/extractors/pdf-llm";

export const unrag = defineUnragConfig({
  // ...
  engine: {
    // ...
    extractors: [
      createPdfLlmExtractor(),
      // Add more extractors here as you install them
    ],
  },
} as const);

You'll also need to enable the corresponding assetProcessing flag. For example, the pdf-llm extractor reads its settings from assetProcessing.pdf.llmExtraction:

export const unrag = defineUnragConfig({
  // ...
  engine: {
    // ...
    assetProcessing: {
      pdf: {
        llmExtraction: {
          enabled: true,
          model: "google/gemini-2.0-flash",
          timeoutMs: 60_000,
          // ... other settings
        },
      },
    },
  },
} as const);

What happens without extractors?

If an asset's kind has no registered extractor:

  • The asset is skipped (by default, controlled by onUnsupportedAsset)
  • A warning is emitted in result.warnings so you can monitor for missed content
  • Other assets and text content are processed normally

This is intentional: extraction has cost and complexity implications, so you must opt in explicitly.
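
The behavior described above can be sketched roughly as follows. This is not Unrag's actual pipeline code: the `Asset`, `Extractor`, and `RouteResult` types and the `routeAssets` function are hypothetical, `supports` is simplified to take only the asset, and the `"fail"` policy value is assumed purely to contrast with `"skip"`.

```typescript
type Asset = { assetId: string; kind: string };
type Extractor = { name: string; supports: (asset: Asset) => boolean };
type UnsupportedPolicy = "skip" | "fail"; // "fail" is an assumed value for illustration

type RouteResult = {
  routed: { assetId: string; extractor: string }[];
  warnings: string[];
};

// Route each asset to the first extractor that supports it.
// Unsupported assets are skipped with a warning, or throw under "fail".
function routeAssets(
  assets: Asset[],
  extractors: Extractor[],
  onUnsupportedAsset: UnsupportedPolicy = "skip",
): RouteResult {
  const result: RouteResult = { routed: [], warnings: [] };
  for (const asset of assets) {
    const match = extractors.find((e) => e.supports(asset));
    if (match) {
      result.routed.push({ assetId: asset.assetId, extractor: match.name });
      continue;
    }
    const message = `No extractor registered for asset ${asset.assetId} (kind: ${asset.kind})`;
    if (onUnsupportedAsset === "fail") throw new Error(message);
    result.warnings.push(`${message}; skipped`);
  }
  return result;
}

const pdfOnly: Extractor[] = [{ name: "pdf:llm", supports: (a) => a.kind === "pdf" }];
const assets: Asset[] = [
  { assetId: "a1", kind: "pdf" },
  { assetId: "a2", kind: "audio" },
];

const routeResult = routeAssets(assets, pdfOnly);
```

In this sketch the PDF is routed and the audio asset only produces a warning, so the rest of the ingest continues.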

Available extractor modules

| Module | Install command | Extractor name | Description |
| --- | --- | --- | --- |
| pdf-text-layer | unrag@latest add extractor pdf-text-layer | pdf:text-layer | Extract built-in PDF text layer (fast/cheap) |
| pdf-llm | unrag@latest add extractor pdf-llm | pdf:llm | Extract text from PDFs using an LLM |
| pdf-ocr | unrag@latest add extractor pdf-ocr | pdf:ocr | OCR PDFs by rasterizing pages (worker-only) |
| image-ocr | unrag@latest add extractor image-ocr | image:ocr | OCR images into searchable text |
| image-caption-llm | unrag@latest add extractor image-caption-llm | image:caption-llm | Generate image captions via LLM |
| audio-transcribe | unrag@latest add extractor audio-transcribe | audio:transcribe | Transcribe audio into text chunks |
| video-transcribe | unrag@latest add extractor video-transcribe | video:transcribe | Transcribe video audio track into text chunks |
| video-frames | unrag@latest add extractor video-frames | video:frames | Sample frames + extract text per frame (worker-only) |
| file-text | unrag@latest add extractor file-text | file:text | Decode text-ish files (txt/md/html/json/csv) |
| file-docx | unrag@latest add extractor file-docx | file:docx | Extract raw text from .docx |
| file-pptx | unrag@latest add extractor file-pptx | file:pptx | Extract slide text from .pptx |
| file-xlsx | unrag@latest add extractor file-xlsx | file:xlsx | Extract sheet content from .xlsx |

Image handling (image:embed and image:caption) is built into the core engine and doesn't require an extractor module. It's controlled by your embedding provider configuration (type: "multimodal" vs type: "text"). Additional installable image extractors (image:ocr, image:caption-llm) can produce extra text chunks when enabled.
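
Read literally, the paragraph above suggests a simple mapping from embedding provider type to built-in image handling. The sketch below encodes that reading. It assumes that `"multimodal"` leads to `image:embed` and `"text"` leads to `image:caption`; the function name is hypothetical and the engine's real selection logic may involve more conditions.

```typescript
type EmbeddingProviderType = "multimodal" | "text";
type BuiltInImageExtractor = "image:embed" | "image:caption";

// Assumed mapping: multimodal providers embed the image directly,
// text-only providers fall back to embedding caption text.
function builtInImageExtractor(type: EmbeddingProviderType): BuiltInImageExtractor {
  return type === "multimodal" ? "image:embed" : "image:caption";
}

const forMultimodal = builtInImageExtractor("multimodal");
const forText = builtInImageExtractor("text");
```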

Creating custom extractors

For advanced use cases, you can create custom extractors by implementing the AssetExtractor interface:

import type { AssetExtractor } from "@unrag/core";

// Note: transcribeAudio is a placeholder for your own transcription helper.
// It is not provided by Unrag; import or define it before using this extractor.

export function createMyExtractor(): AssetExtractor {
  return {
    name: "my:custom", // Unique identifier (stored in chunk metadata)

    supports: ({ asset, ctx }) => {
      // Return true if this extractor handles this asset
      return asset.kind === "audio" && ctx.assetProcessing.audio.transcription.enabled;
    },

    extract: async ({ asset, ctx }) => {
      // Perform extraction and return text segments
      const transcription = await transcribeAudio(asset);

      return {
        texts: [
          {
            label: "transcription",
            content: transcription,
          },
        ],
        // Example values; report whatever your extractor actually used
        diagnostics: {
          model: "whisper-large-v3",
          seconds: 12.5,
        },
      };
    },
  };
}

Then register it like any other extractor:

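import { createPdfLlmExtractor } from "./lib/unrag/extractors/pdf-llm";
// Hypothetical path; import createMyExtractor from wherever you defined it
import { createMyExtractor } from "./lib/unrag/extractors/my-extractor";

export const unrag = defineUnragConfig({
  // ...
  engine: {
    // ...
    extractors: [createPdfLlmExtractor(), createMyExtractor()],
  },
} as const);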

See Core Types Reference for the full AssetExtractor interface.
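
Before registering a custom extractor, it can help to exercise `supports` and `extract` in isolation. The sketch below is one way to do that under stated assumptions: the `Asset`, `Ctx`, and `ExtractorLike` types are simplified local stand-ins for the real `@unrag/core` types, and the transcriber is injected as a parameter (a design choice made here, not something the interface requires) so the check runs without a model or network.

```typescript
// Simplified local stand-ins; the real types live in @unrag/core.
type Asset = { kind: string; assetId: string };
type Ctx = { assetProcessing: { audio: { transcription: { enabled: boolean } } } };
type ExtractResult = { texts: { label: string; content: string }[] };
type ExtractorLike = {
  name: string;
  supports: (args: { asset: Asset; ctx: Ctx }) => boolean;
  extract: (args: { asset: Asset; ctx: Ctx }) => Promise<ExtractResult>;
};

// Same shape as the custom extractor, with the transcriber passed in
// so a fake can be substituted during checks.
function createAudioExtractor(
  transcribe: (asset: Asset) => Promise<string>,
): ExtractorLike {
  return {
    name: "my:custom",
    supports: ({ asset, ctx }) =>
      asset.kind === "audio" && ctx.assetProcessing.audio.transcription.enabled,
    extract: async ({ asset }) => ({
      texts: [{ label: "transcription", content: await transcribe(asset) }],
    }),
  };
}

const extractor = createAudioExtractor(async () => "hello world");
const enabledCtx: Ctx = { assetProcessing: { audio: { transcription: { enabled: true } } } };
const disabledCtx: Ctx = { assetProcessing: { audio: { transcription: { enabled: false } } } };
const audio: Asset = { kind: "audio", assetId: "a1" };
const pdf: Asset = { kind: "pdf", assetId: "p1" };

// supports() should accept audio only when transcription is enabled.
const acceptsAudio = extractor.supports({ asset: audio, ctx: enabledCtx });
const rejectsWhenDisabled = !extractor.supports({ asset: audio, ctx: disabledCtx });
const rejectsPdf = !extractor.supports({ asset: pdf, ctx: enabledCtx });

// extract() resolves to the text segments produced by the fake transcriber.
const pending = extractor.extract({ asset: audio, ctx: enabledCtx });
```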
