
pdf:llm Extractor

Extract text from PDFs using a vision-capable LLM.

The pdf:llm extractor sends PDF files to a vision-capable LLM (like Google's Gemini) to extract readable text. The LLM "reads" the PDF visually and outputs plain text or markdown, which Unrag then chunks and embeds like normal text content.

Installation

The easiest way to install this extractor is during setup:

bunx unrag@latest init --rich-media

Select pdf-llm from the list and the CLI handles everything: installation, registration, and enabling the right assetProcessing flags.

Manual installation

If you prefer to install it separately:

bunx unrag@latest add extractor pdf-llm

Then register it in your unrag.config.ts and enable the assetProcessing.pdf.llmExtraction.enabled flag:

import { createPdfLlmExtractor } from "./lib/unrag/extractors/pdf-llm";

export const unrag = defineUnragConfig({
  // ...
  engine: {
    // ...
    extractors: [createPdfLlmExtractor()],
  },
} as const);

If the extractor is not installed and registered, PDF assets are skipped during ingestion and warnings are emitted.

How it works

  1. PDF bytes are sent to the configured LLM as a file attachment
  2. The LLM processes the document and extracts text content
  3. Extracted text is chunked using your configured chunking settings
  4. Each chunk is embedded and stored with metadata.extractor: "pdf:llm"
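For illustration, the request built in step 1 and the output limit applied to step 2 can be sketched as follows. This is a hypothetical sketch following the Vercel AI SDK message-part convention, not Unrag's actual internal code; the type and function names here are illustrative:

```typescript
// Illustrative shapes following the Vercel AI SDK message-part convention.
// These are not Unrag's exported types.
type TextPart = { type: "text"; text: string };
type FilePart = { type: "file"; data: Uint8Array; mediaType: string };
type UserMessage = { role: "user"; content: Array<TextPart | FilePart> };

// Build the user message: the extraction prompt plus the PDF bytes
// attached as a file part.
function buildExtractionMessage(pdfBytes: Uint8Array, prompt: string): UserMessage {
  return {
    role: "user",
    content: [
      { type: "text", text: prompt },
      { type: "file", data: pdfBytes, mediaType: "application/pdf" },
    ],
  };
}

// Enforce the configured maxOutputChars limit on the model's response.
function clampOutput(text: string, maxOutputChars: number): string {
  return text.slice(0, maxOutputChars);
}
```

The message would then be passed to a vision-capable model via the AI SDK, and the clamped response handed to the normal chunking pipeline.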

Configuration

Enable and configure the extractor in your unrag.config.ts:

export const unrag = defineUnragConfig({
  // ...
  engine: {
    // ...
    assetProcessing: {
      pdf: {
        llmExtraction: {
          enabled: true,
          model: "google/gemini-2.0-flash",
          prompt:
            "Extract all readable text from this PDF as faithfully as possible. Preserve structure with headings and lists when obvious. Output plain text or markdown only. Do not add commentary.",
          timeoutMs: 60_000,
          maxBytes: 15 * 1024 * 1024,
          maxOutputChars: 200_000,
        },
      },
    },
  },
} as const);

Configuration options

Prop           | Type
enabled        | boolean
model          | string
prompt         | string
timeoutMs      | number
maxBytes       | number
maxOutputChars | number
Supported models

The extractor uses the Vercel AI SDK and works with any model that supports file inputs:

Provider  | Model             | Notes
Google    | gemini-2.0-flash  | Recommended. Fast, good quality, handles large PDFs
Google    | gemini-1.5-pro    | Higher quality, slower, more expensive
Anthropic | claude-3-5-sonnet | Excellent quality, supports PDF attachments

OpenAI models (GPT-4o, etc.) don't currently support direct PDF file inputs in the AI SDK. Use Google or Anthropic models for PDF extraction.
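Switching providers only requires changing the model string. As a sketch, assuming the provider/model id format shown in the table above, an Anthropic setup might look like:

```typescript
export const unrag = defineUnragConfig({
  // ...
  engine: {
    // ...
    assetProcessing: {
      pdf: {
        llmExtraction: {
          enabled: true,
          // Assumed id format, mirroring the google/gemini-2.0-flash example.
          model: "anthropic/claude-3-5-sonnet",
        },
      },
    },
  },
} as const);
```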

Usage example

Ingesting a PDF from disk

import { createUnragEngine } from "@unrag/config";
import { readFile } from "node:fs/promises";

const engine = createUnragEngine();

const pdfBytes = await readFile("./documents/contract.pdf");

const result = await engine.ingest({
  sourceId: "contracts:2024-001",
  content: "", // No text content, just the PDF
  assets: [
    {
      assetId: "contract-pdf",
      kind: "pdf",
      data: {
        kind: "bytes",
        bytes: new Uint8Array(pdfBytes),
        mediaType: "application/pdf",
        filename: "contract.pdf",
      },
    },
  ],
});

console.log(`Extracted ${result.chunkCount} chunks from PDF`);

// Check for warnings (e.g., empty extraction)
if (result.warnings.length > 0) {
  console.warn("Warnings:", result.warnings);
}

Ingesting a PDF from URL

const result = await engine.ingest({
  sourceId: "reports:quarterly-q4",
  content: "Q4 2024 Financial Report", // Optional text summary
  assets: [
    {
      assetId: "q4-report",
      kind: "pdf",
      data: {
        kind: "url",
        url: "https://example.com/reports/q4-2024.pdf",
        mediaType: "application/pdf",
      },
      uri: "https://example.com/reports/q4-2024.pdf", // Stored in metadata
    },
  ],
});

Retrieving PDF content

PDF chunks are retrieved like any other content:

const result = await engine.retrieve({
  query: "What are the payment terms?",
  topK: 10,
});

for (const chunk of result.chunks) {
  const ref = getChunkAssetRef(chunk);
  
  if (ref?.extractor === "pdf:llm") {
    console.log("From PDF:", chunk.content);
    console.log("Asset ID:", ref.assetId);
  }
}

Handling extraction failures

When extraction fails or produces empty output, Unrag emits warnings:

const result = await engine.ingest({ ... });

for (const warning of result.warnings) {
  if (warning.code === "asset_skipped_pdf_empty_extraction") {
    console.warn(`PDF ${warning.assetId} produced no text (scanned/empty?)`);
  }
  if (warning.code === "asset_processing_error") {
    console.error(`PDF ${warning.assetId} failed: ${warning.message}`);
  }
}

Common issues

Warning code                              | Cause                | Solution
asset_skipped_pdf_llm_extraction_disabled | enabled: false       | Set enabled: true in config
asset_skipped_pdf_empty_extraction        | LLM returned no text | PDF may be image-only or corrupted
asset_processing_error                    | LLM call failed      | Check API key, model availability, file size

Cost considerations

LLM extraction incurs API costs for each PDF processed. To control costs:

  1. Pre-filter PDFs: Only ingest PDFs you actually need searchable
  2. Use efficient models: Gemini Flash is faster and cheaper than Pro
  3. Set size limits: Use maxBytes to skip oversized files
  4. Batch during off-peak: Run bulk ingestion when you can monitor costs
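Point 3 can also be applied client-side, skipping oversized files before they ever reach the LLM. A minimal sketch (the 15 MiB constant mirrors the example config above; use your own configured maxBytes):

```typescript
import { stat } from "node:fs/promises";

// Mirrors the maxBytes value from the example config (an assumption;
// substitute your own configured limit).
const MAX_BYTES = 15 * 1024 * 1024;

// Pure size check, reusable for bytes you already hold in memory.
function withinSizeLimit(sizeBytes: number, maxBytes: number = MAX_BYTES): boolean {
  return sizeBytes <= maxBytes;
}

// Check a file on disk before handing it to engine.ingest.
async function shouldIngestPdf(path: string): Promise<boolean> {
  const { size } = await stat(path);
  return withinSizeLimit(size);
}
```

Filtering locally avoids paying for an LLM call (or an ingestion warning) on a file the extractor would reject anyway.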

Customizing the prompt

The default prompt is designed for faithful text extraction. You can customize it for specific use cases:

// Extract structured data
prompt: "Extract all text from this PDF. For tables, output as markdown tables. For forms, output as key: value pairs."

// Focus on specific content
prompt: "Extract only the terms and conditions section from this contract PDF."

// Include metadata
prompt: "Extract the text and identify: document title, date, author (if visible). Format as YAML frontmatter followed by content."

Keep prompts deterministic. Avoid instructions like "summarize" or "explain", which produce inconsistent output across runs.
