Extractors Overview

Extractors transform rich media (PDFs, images, audio, video, and document files) into searchable content. Each asset type has one or more extractors that produce text chunks or direct embeddings.

How extractors work

When you ingest content with assets, Unrag's ingest pipeline routes each asset to the appropriate extractor based on its kind. The extractor produces either:

Text chunks: extracted/transcribed text that's chunked and embedded like normal text
Direct embeddings: vector representations of the asset itself (e.g., multimodal image embeddings)

The extractor used is recorded in chunk.metadata.extractor so you can identify the source during retrieval.
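
The routing step can be pictured roughly as follows. This is a conceptual sketch, not Unrag's pipeline code: `SketchAsset`, `SketchExtractor`, and `selectExtractor` are hypothetical names used only for illustration, and the "first match wins" rule is an assumption.

```typescript
// Conceptual sketch: these names are hypothetical, not Unrag's API.
type SketchAsset = { assetId: string; kind: string };
type SketchExtractor = {
  name: string;
  supports: (asset: SketchAsset) => boolean;
};

// Return the first registered extractor that claims the asset, or
// undefined when nothing is registered for its kind.
// Assumption: the first match wins; the real pipeline may differ.
function selectExtractor(
  asset: SketchAsset,
  extractors: SketchExtractor[],
): SketchExtractor | undefined {
  return extractors.find((e) => e.supports(asset));
}

const registry: SketchExtractor[] = [
  { name: "pdf:llm", supports: (a) => a.kind === "pdf" },
  { name: "image:ocr", supports: (a) => a.kind === "image" },
];

const pdfMatch = selectExtractor({ assetId: "doc-1", kind: "pdf" }, registry);
const audioMatch = selectExtractor({ assetId: "clip-1", kind: "audio" }, registry);
console.log(pdfMatch?.name);   // "pdf:llm"
console.log(audioMatch?.name); // undefined (no extractor registered for audio)
```

The name of the selected extractor is what ends up in chunk.metadata.extractor.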

Extractor metadata

Every chunk produced by an extractor includes metadata fields you can use during retrieval:

import { getChunkAssetRef } from "@unrag/core";

const ref = getChunkAssetRef(chunk);
if (ref) {
  console.log(ref.assetKind);    // "pdf" | "image" | ...
  console.log(ref.assetId);      // stable identifier from ingest
  console.log(ref.extractor);    // "pdf:llm" | "image:embed" | "image:caption" | ...
  console.log(ref.assetUri);     // optional URL/path
  console.log(ref.assetMediaType); // optional MIME type
}
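
One way to use this metadata is to group retrieved chunks by the extractor that produced them, for example to show PDF text and image captions in separate sections. The sketch below works on a minimal local `RetrievedChunk` shape rather than Unrag's real chunk type; `groupByExtractor` and the "text" fallback key are hypothetical, and it assumes chunks from plain text content carry no extractor field.

```typescript
// Minimal local shape for illustration; real chunks carry more fields.
type RetrievedChunk = {
  content: string;
  metadata: { extractor?: string };
};

// Group retrieved chunks by the extractor that produced them.
// Assumption: chunks without an extractor came from plain text content,
// so they are collected under the key "text".
function groupByExtractor(chunks: RetrievedChunk[]): Map<string, RetrievedChunk[]> {
  const groups = new Map<string, RetrievedChunk[]>();
  for (const chunk of chunks) {
    const key = chunk.metadata.extractor ?? "text";
    const bucket = groups.get(key) ?? [];
    bucket.push(chunk);
    groups.set(key, bucket);
  }
  return groups;
}

const sampleChunks: RetrievedChunk[] = [
  { content: "Invoice total: $120", metadata: { extractor: "pdf:llm" } },
  { content: "A chart of quarterly revenue", metadata: { extractor: "image:caption" } },
  { content: "Plain paragraph", metadata: {} },
  { content: "Page 2 text", metadata: { extractor: "pdf:llm" } },
];

const grouped = groupByExtractor(sampleChunks);
console.log(Array.from(grouped.keys())); // ["pdf:llm", "image:caption", "text"]
```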

Configuring extractors

Extractors are configured via assetProcessing in your unrag.config.ts. See Asset Processing Reference for the full configuration schema.

export const unrag = defineUnragConfig({
  // ...
  engine: {
    // ...
    assetProcessing: {
      onUnsupportedAsset: "skip",
      onError: "skip",
      pdf: {
        llmExtraction: {
          enabled: true,
          model: "google/gemini-2.0-flash",
          // ...
        },
      },
    },
  },
} as const);

Installing extractors

The easiest way to install extractors is during setup:

bunx unrag@latest init --rich-media

This prompts you to select which extractors you want, then installs and configures them automatically. The CLI handles importing the extractors, registering them in your config, and enabling the corresponding assetProcessing flags.

If you've already run init, you can re-run it with --rich-media to add extractor support. Your existing configuration is preserved.

Manual installation

If you prefer to install extractors one at a time, or want to add more after the initial setup, use the CLI's add command:

bunx unrag@latest add extractor pdf-llm

This copies the extractor source files to lib/unrag/extractors/pdf-llm/ and adds any required dependencies to your package.json.

After manual installation, you need to register the extractor in unrag.config.ts:

import { createPdfLlmExtractor } from "./lib/unrag/extractors/pdf-llm";

export const unrag = defineUnragConfig({
  // ...
  engine: {
    // ...
    extractors: [
      createPdfLlmExtractor(),
      // Add more extractors here as you install them
    ],
  },
} as const);

You'll also need to enable the corresponding assetProcessing flag. For example, the pdf-llm extractor reads its settings from assetProcessing.pdf.llmExtraction:

export const unrag = defineUnragConfig({
  // ...
  engine: {
    // ...
    assetProcessing: {
      pdf: {
        llmExtraction: {
          enabled: true,
          model: "google/gemini-2.0-flash",
          timeoutMs: 60_000,
          // ... other settings
        },
      },
    },
  },
} as const);

What happens without extractors?

If an asset's kind has no registered extractor:

The asset is skipped (by default, controlled by onUnsupportedAsset)
A warning is emitted in result.warnings so you can monitor for missed content
Other assets and text content are processed normally

This is intentional: extraction has cost and complexity implications, so you opt in explicitly.
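
To monitor for skipped assets, you can inspect the warnings after each ingest. The sketch below relies only on the presence of a warnings array; the shape of each entry is not shown in this page, so entries are treated as opaque values. `IngestResultLike`, `formatWarnings`, and the sample warning object are hypothetical.

```typescript
// Only the presence of a `warnings` array is taken from the text above;
// the shape of each entry is unknown here, so entries are treated as opaque.
type IngestResultLike = { warnings?: unknown[] };

// Turn each warning into a printable line, whatever its shape.
function formatWarnings(result: IngestResultLike): string[] {
  return (result.warnings ?? []).map((w) =>
    typeof w === "string" ? w : JSON.stringify(w),
  );
}

// Hypothetical result object standing in for what ingest returns.
const exampleResult: IngestResultLike = {
  warnings: [{ assetId: "clip-1", reason: "no extractor registered" }],
};

const warningLines = formatWarnings(exampleResult);
if (warningLines.length > 0) {
  console.warn(`${warningLines.length} asset warning(s) during ingest:`);
  for (const line of warningLines) console.warn(`  ${line}`);
}
```

In a real pipeline you would likely send these lines to your logger or metrics system instead of the console.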

Available extractor modules

Module	Install command	Extractor name	Description
`pdf-text-layer`	`bunx unrag@latest add extractor pdf-text-layer`	`pdf:text-layer`	Extract built-in PDF text layer (fast/cheap)
`pdf-llm`	`bunx unrag@latest add extractor pdf-llm`	`pdf:llm`	Extract text from PDFs using an LLM
`pdf-ocr`	`bunx unrag@latest add extractor pdf-ocr`	`pdf:ocr`	OCR PDFs by rasterizing pages (worker-only)
`image-ocr`	`bunx unrag@latest add extractor image-ocr`	`image:ocr`	OCR images into searchable text
`image-caption-llm`	`bunx unrag@latest add extractor image-caption-llm`	`image:caption-llm`	Generate image captions via LLM
`audio-transcribe`	`bunx unrag@latest add extractor audio-transcribe`	`audio:transcribe`	Transcribe audio into text chunks
`video-transcribe`	`bunx unrag@latest add extractor video-transcribe`	`video:transcribe`	Transcribe video audio track into text chunks
`video-frames`	`bunx unrag@latest add extractor video-frames`	`video:frames`	Sample frames + extract text per frame (worker-only)
`file-text`	`bunx unrag@latest add extractor file-text`	`file:text`	Decode text-ish files (txt/md/html/json/csv)
`file-docx`	`bunx unrag@latest add extractor file-docx`	`file:docx`	Extract raw text from `.docx`
`file-pptx`	`bunx unrag@latest add extractor file-pptx`	`file:pptx`	Extract slide text from `.pptx`
`file-xlsx`	`bunx unrag@latest add extractor file-xlsx`	`file:xlsx`	Extract sheet content from `.xlsx`

Image handling (image:embed and image:caption) is built into the core engine and doesn't require an extractor module. It's controlled by your embedding provider configuration (type: "multimodal" vs type: "text"). Additional installable image extractors (image:ocr, image:caption-llm) can produce extra text chunks when enabled.

Creating custom extractors

For advanced use cases, you can create custom extractors by implementing the AssetExtractor interface:

import type { AssetExtractor } from "@unrag/core";

export function createMyExtractor(): AssetExtractor {
  return {
    name: "my:custom", // Unique identifier (stored in chunk metadata)
    
    supports: ({ asset, ctx }) => {
      // Return true if this extractor handles this asset
      return asset.kind === "audio" && ctx.assetProcessing.audio.transcription.enabled;
    },
    
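    extract: async ({ asset, ctx }) => {
      // Perform extraction and return text segments.
      // transcribeAudio is a placeholder for your own transcription call;
      // it is assumed to resolve to the transcript as a string.
      const transcription = await transcribeAudio(asset);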
      
      return {
        texts: [
          {
            label: "transcription",
            content: transcription,
          },
        ],
        diagnostics: {
          // Illustrative values; report what your transcription call actually used
          model: "whisper-large-v3",
          seconds: 12.5,
        },
      };
    },
  };
}

Then register it like any other extractor:

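import { createPdfLlmExtractor } from "./lib/unrag/extractors/pdf-llm";
// Hypothetical path: import from wherever you defined createMyExtractor
import { createMyExtractor } from "./lib/unrag/extractors/my-extractor";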

export const unrag = defineUnragConfig({
  // ...
  engine: {
    // ...
    extractors: [createPdfLlmExtractor(), createMyExtractor()],
  },
} as const);

See Core Types Reference for the full AssetExtractor interface.
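
Before registering a custom extractor, it can help to exercise it in isolation. The sketch below is one possible approach, not an Unrag-provided test utility: it uses local structural types (`AssetLike`, `CtxLike`, `ExtractorLike`) that mirror only the fields used in the example above, and a hypothetical `createAudioExtractor` factory that takes the transcription function as a parameter so a fake can be swapped in.

```typescript
// Local structural types for illustration; the real AssetExtractor
// interface may have more fields than shown here.
type AssetLike = { kind: string; assetId: string };
type CtxLike = {
  assetProcessing: { audio: { transcription: { enabled: boolean } } };
};
type ExtractorLike = {
  name: string;
  supports: (args: { asset: AssetLike; ctx: CtxLike }) => boolean;
  extract: (args: { asset: AssetLike; ctx: CtxLike }) => Promise<{
    texts: { label: string; content: string }[];
  }>;
};

// Hypothetical factory: the transcription function is injected so a
// fake can replace the real service during tests.
function createAudioExtractor(
  transcribe: (asset: AssetLike) => Promise<string>,
): ExtractorLike {
  return {
    name: "my:custom",
    supports: ({ asset, ctx }) =>
      asset.kind === "audio" && ctx.assetProcessing.audio.transcription.enabled,
    extract: async ({ asset }) => ({
      texts: [{ label: "transcription", content: await transcribe(asset) }],
    }),
  };
}

// Exercise the extractor with a fake transcriber and a hand-built context.
const fakeExtractor = createAudioExtractor(async () => "hello world");
const enabledCtx: CtxLike = {
  assetProcessing: { audio: { transcription: { enabled: true } } },
};
const audioAsset: AssetLike = { kind: "audio", assetId: "a1" };

const canHandleAudio = fakeExtractor.supports({ asset: audioAsset, ctx: enabledCtx });
const extraction = fakeExtractor.extract({ asset: audioAsset, ctx: enabledCtx });
extraction.then((result) => console.log(canHandleAudio, result.texts[0].content));
```

Injecting the transcription function keeps the extractor's routing logic testable without network calls or model credentials.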