# pdf:llm Extractor
Extract text from PDFs using a vision-capable LLM.
The pdf:llm extractor sends PDF files to a vision-capable LLM (like Google's Gemini) to extract readable text. The LLM "reads" the PDF visually and outputs plain text or markdown, which Unrag then chunks and embeds like normal text content.
## Installation
The easiest way to install this extractor is during setup:
```bash
bunx unrag@latest init --rich-media
```

Select `pdf-llm` from the list and the CLI handles everything: installation, registration, and enabling the right `assetProcessing` flags.
### Manual installation
If you prefer to install it separately:
```bash
bunx unrag@latest add extractor pdf-llm
```

Then register it in your `unrag.config.ts` and enable the `assetProcessing.pdf.llmExtraction.enabled` flag:

```ts
import { createPdfLlmExtractor } from "./lib/unrag/extractors/pdf-llm";

export const unrag = defineUnragConfig({
  // ...
  engine: {
    // ...
    extractors: [createPdfLlmExtractor()],
  },
} as const);
```

Without installing and registering the extractor, PDF assets will be skipped during ingestion (with warnings emitted).
## How it works

- PDF bytes are sent to the configured LLM as a file attachment
- The LLM processes the document and extracts text content
- Extracted text is chunked using your configured chunking settings
- Each chunk is embedded and stored with `metadata.extractor: "pdf:llm"`
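The steps above can be sketched in plain TypeScript. Everything here is an illustrative stand-in, not Unrag's actual internals: `llmExtract` fakes the model call and `chunkText` is a naive fixed-size chunker standing in for your configured chunking strategy.

```typescript
// Illustrative sketch of the pdf:llm pipeline. These helpers are
// hypothetical stand-ins, not Unrag's real internals.
type Chunk = { content: string; metadata: { extractor: string } };

// Stand-in for the LLM call: in Unrag this sends the PDF bytes to the model.
async function llmExtract(pdfBytes: Uint8Array): Promise<string> {
  return "Page 1 text...\nPage 2 text..."; // pretend the model read the PDF
}

// Naive fixed-size chunker standing in for the configured chunking settings.
function chunkText(text: string, size: number): string[] {
  const chunks: string[] = [];
  for (let i = 0; i < text.length; i += size) {
    chunks.push(text.slice(i, i + size));
  }
  return chunks;
}

async function processPdf(pdfBytes: Uint8Array): Promise<Chunk[]> {
  const text = await llmExtract(pdfBytes); // steps 1-2: LLM reads the PDF
  return chunkText(text, 16).map((content) => ({ // step 3: chunk
    content,
    metadata: { extractor: "pdf:llm" }, // step 4: tag before embedding
  }));
}
```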
## Configuration
Enable and configure the extractor in your unrag.config.ts:
```ts
export const unrag = defineUnragConfig({
  // ...
  engine: {
    // ...
    assetProcessing: {
      pdf: {
        llmExtraction: {
          enabled: true,
          model: "google/gemini-2.0-flash",
          prompt:
            "Extract all readable text from this PDF as faithfully as possible. Preserve structure with headings and lists when obvious. Output plain text or markdown only. Do not add commentary.",
          timeoutMs: 60_000,
          maxBytes: 15 * 1024 * 1024,
          maxOutputChars: 200_000,
        },
      },
    },
  },
} as const);
```

### Configuration options

| Prop | Type |
|---|---|
| `enabled` | `boolean` |
| `model` | `string` |
| `prompt` | `string` |
| `timeoutMs` | `number` |
| `maxBytes` | `number` |
| `maxOutputChars` | `number` |
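Unrag enforces `maxBytes` itself, but a local pre-flight check lets you fail fast (or route oversized files elsewhere) before an ingest call ever reaches the LLM. The helper below is hypothetical, not part of Unrag's API; it just mirrors the `maxBytes` value from the config above.

```typescript
// Hypothetical pre-flight check mirroring the maxBytes limit above,
// so oversized PDFs can be caught before ingestion.
const MAX_BYTES = 15 * 1024 * 1024; // same value as maxBytes in the config

function withinSizeLimit(pdfBytes: Uint8Array, maxBytes = MAX_BYTES): boolean {
  return pdfBytes.byteLength <= maxBytes;
}
```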
## Supported models

The extractor uses the Vercel AI SDK and works with any model that supports file inputs:

| Provider | Model | Notes |
|---|---|---|
| Google | `gemini-2.0-flash` | Recommended. Fast, good quality, handles large PDFs |
| Google | `gemini-1.5-pro` | Higher quality, slower, more expensive |
| Anthropic | `claude-3-5-sonnet` | Excellent quality, supports PDF attachments |
OpenAI models (GPT-4o, etc.) don't currently support direct PDF file inputs in the AI SDK. Use Google or Anthropic models for PDF extraction.
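If model ids follow the `provider/model` shape used in the config above, a small config-time guard can catch an unsupported provider early. This is an application-level sketch, not an Unrag or AI SDK feature:

```typescript
// Hypothetical guard: reject model ids whose provider can't accept PDF
// file inputs via the AI SDK. Per the table above, that means anything
// other than Google or Anthropic. Assumes "provider/model" id format.
const PDF_CAPABLE_PROVIDERS = ["google", "anthropic"];

function assertPdfCapable(modelId: string): void {
  const provider = modelId.split("/")[0];
  if (!PDF_CAPABLE_PROVIDERS.includes(provider)) {
    throw new Error(`Model ${modelId} does not support PDF file inputs`);
  }
}
```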
## Usage example

### Ingesting a PDF from disk

```ts
import { createUnragEngine } from "@unrag/config";
import { readFile } from "node:fs/promises";

const engine = createUnragEngine();

const pdfBytes = await readFile("./documents/contract.pdf");

const result = await engine.ingest({
  sourceId: "contracts:2024-001",
  content: "", // No text content, just the PDF
  assets: [
    {
      assetId: "contract-pdf",
      kind: "pdf",
      data: {
        kind: "bytes",
        bytes: new Uint8Array(pdfBytes),
        mediaType: "application/pdf",
        filename: "contract.pdf",
      },
    },
  ],
});

console.log(`Extracted ${result.chunkCount} chunks from PDF`);

// Check for warnings (e.g., empty extraction)
if (result.warnings.length > 0) {
  console.warn("Warnings:", result.warnings);
}
```

### Ingesting a PDF from URL
```ts
const result = await engine.ingest({
  sourceId: "reports:quarterly-q4",
  content: "Q4 2024 Financial Report", // Optional text summary
  assets: [
    {
      assetId: "q4-report",
      kind: "pdf",
      data: {
        kind: "url",
        url: "https://example.com/reports/q4-2024.pdf",
        mediaType: "application/pdf",
      },
      uri: "https://example.com/reports/q4-2024.pdf", // Stored in metadata
    },
  ],
});
```

### Retrieving PDF content
PDF chunks are retrieved like any other content:
```ts
const result = await engine.retrieve({
  query: "What are the payment terms?",
  topK: 10,
});

for (const chunk of result.chunks) {
  const ref = getChunkAssetRef(chunk);
  if (ref?.extractor === "pdf:llm") {
    console.log("From PDF:", chunk.content);
    console.log("Asset ID:", ref.assetId);
  }
}
```

## Handling extraction failures
When extraction fails or produces empty output, Unrag emits warnings:
```ts
const result = await engine.ingest({ ... });

for (const warning of result.warnings) {
  if (warning.code === "asset_skipped_pdf_empty_extraction") {
    console.warn(`PDF ${warning.assetId} produced no text (scanned/empty?)`);
  }
  if (warning.code === "asset_processing_error") {
    console.error(`PDF ${warning.assetId} failed: ${warning.message}`);
  }
}
```

### Common issues
| Warning code | Cause | Solution |
|---|---|---|
| `asset_skipped_pdf_llm_extraction_disabled` | `enabled: false` | Set `enabled: true` in config |
| `asset_skipped_pdf_empty_extraction` | LLM returned no text | PDF may be image-only or corrupted |
| `asset_processing_error` | LLM call failed | Check API key, model availability, file size |
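The table above can be folded into a small triage helper. The warning shape here (`code`, `assetId`) matches the earlier example, but the helper itself is hypothetical application code:

```typescript
// Hypothetical triage helper built from the warning codes in the table above.
type AssetWarning = { code: string; assetId?: string; message?: string };

const WARNING_HINTS: Record<string, string> = {
  asset_skipped_pdf_llm_extraction_disabled: "Set enabled: true in config",
  asset_skipped_pdf_empty_extraction: "PDF may be image-only or corrupted",
  asset_processing_error: "Check API key, model availability, file size",
};

// Map a warning to a human-readable hint for logs or dashboards.
function triage(warning: AssetWarning): string {
  const hint = WARNING_HINTS[warning.code] ?? "Unknown warning code";
  return `${warning.assetId ?? "unknown asset"}: ${hint}`;
}
```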
## Cost considerations
LLM extraction incurs API costs for each PDF processed. To control costs:
- Pre-filter PDFs: Only ingest PDFs you actually need searchable
- Use efficient models: Gemini Flash is faster and cheaper than Pro
- Set size limits: Use `maxBytes` to skip oversized files
- Batch during off-peak: Run bulk ingestion when you can monitor costs
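The first and third points can be combined into a pre-filter that runs before any ingest call, so unneeded or oversized PDFs never incur an LLM call. The `PdfFile` shape is assumed purely for this sketch:

```typescript
// Illustrative pre-filter: drop files that are over the size limit or not
// needed for search before ingestion, so they never reach the LLM.
// The PdfFile shape is an assumption for this sketch.
type PdfFile = { path: string; bytes: number; searchable: boolean };

function selectForIngestion(files: PdfFile[], maxBytes: number): PdfFile[] {
  return files.filter((f) => f.searchable && f.bytes <= maxBytes);
}
```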
## Customizing the prompt
The default prompt is designed for faithful text extraction. You can customize it for specific use cases:
```ts
// Extract structured data
prompt: "Extract all text from this PDF. For tables, output as markdown tables. For forms, output as key: value pairs."

// Focus on specific content
prompt: "Extract only the terms and conditions section from this contract PDF."

// Include metadata
prompt: "Extract the text and identify: document title, date, author (if visible). Format as YAML frontmatter followed by content."
```

Keep prompts deterministic. Avoid instructions like "summarize" or "explain", which produce inconsistent output across runs.
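If you maintain several prompt variants like these, it can help to keep them in one place and select one per document type. This is an application-level pattern, not an Unrag API; the names here are illustrative:

```typescript
// Sketch: keep prompt variants in one place and pick one per document type.
// Application-level pattern; not part of Unrag's API.
const PROMPTS = {
  default:
    "Extract all readable text from this PDF as faithfully as possible. Output plain text or markdown only. Do not add commentary.",
  contract:
    "Extract only the terms and conditions section from this contract PDF.",
  tabular:
    "Extract all text from this PDF. For tables, output as markdown tables.",
} as const;

type DocKind = keyof typeof PROMPTS;

// Fall back to the default prompt for unrecognized document kinds.
function promptFor(kind: string): string {
  return PROMPTS[(kind in PROMPTS ? kind : "default") as DocKind];
}
```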
