# pdf:llm Extractor
Extract text from PDFs using a vision-capable LLM.
The pdf:llm extractor sends PDF files to a vision-capable LLM (like Google's Gemini) to extract readable text. The LLM "reads" the PDF visually and outputs plain text or markdown, which Unrag then chunks and embeds like normal text content.
## Installation
The easiest way to install this extractor is during setup:
```bash
bunx unrag@latest init --rich-media
```

Select `pdf-llm` from the list and the CLI handles everything: installation, registration, and enabling the right `assetProcessing` flags.
### Manual installation
If you prefer to install it separately:
```bash
bunx unrag@latest add extractor pdf-llm
```

Then register it in your `unrag.config.ts` and enable the `assetProcessing.pdf.llmExtraction.enabled` flag:

```ts
import { createPdfLlmExtractor } from "./lib/unrag/extractors/pdf-llm";

export const unrag = defineUnragConfig({
  // ...
  engine: {
    // ...
    extractors: [createPdfLlmExtractor()],
  },
} as const);
```

Without installing and registering the extractor, PDF assets will be skipped during ingestion (with warnings emitted).
## How it works

- PDF bytes are sent to the configured LLM as a file attachment
- The LLM processes the document and extracts text content
- Extracted text is chunked using your configured chunking settings
- Each chunk is embedded and stored with `metadata.extractor: "pdf:llm"`
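The steps above can be sketched in plain TypeScript. Everything here is an illustrative stand-in, not Unrag's actual internals: `llmExtract` fakes the model call and `chunkText` is a naive fixed-size chunker standing in for your configured chunking strategy.

```typescript
// Illustrative sketch of the pdf:llm pipeline. These helpers are
// hypothetical stand-ins, not Unrag's real internals.
type Chunk = { content: string; metadata: { extractor: string } };

// Stand-in for the LLM call: in Unrag this sends the PDF bytes to the model.
async function llmExtract(pdfBytes: Uint8Array): Promise<string> {
  return "Page 1 text...\nPage 2 text..."; // pretend the model read the PDF
}

// Naive fixed-size chunker standing in for the configured chunking settings.
function chunkText(text: string, size: number): string[] {
  const chunks: string[] = [];
  for (let i = 0; i < text.length; i += size) {
    chunks.push(text.slice(i, i + size));
  }
  return chunks;
}

async function processPdf(pdfBytes: Uint8Array): Promise<Chunk[]> {
  const text = await llmExtract(pdfBytes); // steps 1-2: LLM reads the PDF
  return chunkText(text, 16).map((content) => ({ // step 3: chunk
    content,
    metadata: { extractor: "pdf:llm" }, // step 4: tag before embedding
  }));
}
```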
## Configuration
Enable and configure the extractor in your unrag.config.ts:
```ts
export const unrag = defineUnragConfig({
  // ...
  engine: {
    // ...
    assetProcessing: {
      pdf: {
        llmExtraction: {
          enabled: true,
          model: "google/gemini-2.0-flash",
          prompt:
            "Extract all readable text from this PDF as faithfully as possible. Preserve structure with headings and lists when obvious. Output plain text or markdown only. Do not add commentary.",
          timeoutMs: 60_000,
          maxBytes: 15 * 1024 * 1024,
          maxOutputChars: 200_000,
        },
      },
    },
  },
} as const);
```

### Configuration options

| Prop | Type |
|---|---|
| `enabled` | `boolean` |
| `model` | `string` |
| `prompt` | `string` |
| `timeoutMs` | `number` |
| `maxBytes` | `number` |
| `maxOutputChars` | `number` |
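Unrag enforces `maxBytes` itself, but a local pre-flight check lets you fail fast (or route oversized files elsewhere) before an ingest call ever reaches the LLM. The helper below is hypothetical, not part of Unrag's API; it just mirrors the `maxBytes` value from the config above.

```typescript
// Hypothetical pre-flight check mirroring the maxBytes limit above,
// so oversized PDFs can be caught before ingestion.
const MAX_BYTES = 15 * 1024 * 1024; // same value as maxBytes in the config

function withinSizeLimit(pdfBytes: Uint8Array, maxBytes = MAX_BYTES): boolean {
  return pdfBytes.byteLength <= maxBytes;
}
```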
## Supported models

The extractor uses the Vercel AI SDK and works with any model that supports file inputs:

| Provider | Model | Notes |
|---|---|---|
| Google | `gemini-2.0-flash` | Recommended. Fast, good quality, handles large PDFs |
| Google | `gemini-1.5-pro` | Higher quality, slower, more expensive |
| Anthropic | `claude-3-5-sonnet` | Excellent quality, supports PDF attachments |
OpenAI models (GPT-4o, etc.) don't currently support direct PDF file inputs in the AI SDK. Use Google or Anthropic models for PDF extraction.
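If model ids follow the `provider/model` shape used in the config above, a small config-time guard can catch an unsupported provider early. This is an application-level sketch, not an Unrag or AI SDK feature:

```typescript
// Hypothetical guard: reject model ids whose provider can't accept PDF
// file inputs via the AI SDK. Per the table above, that means anything
// other than Google or Anthropic. Assumes "provider/model" id format.
const PDF_CAPABLE_PROVIDERS = ["google", "anthropic"];

function assertPdfCapable(modelId: string): void {
  const provider = modelId.split("/")[0];
  if (!PDF_CAPABLE_PROVIDERS.includes(provider)) {
    throw new Error(`Model ${modelId} does not support PDF file inputs`);
  }
}
```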
## Usage example

### Ingesting a PDF from disk

```ts
import { createUnragEngine } from "@unrag/config";
import { readFile } from "node:fs/promises";

const engine = createUnragEngine();

const pdfBytes = await readFile("./documents/contract.pdf");

const result = await engine.ingest({
  sourceId: "contracts:2024-001",
  content: "", // No text content, just the PDF
  assets: [
    {
      assetId: "contract-pdf",
      kind: "pdf",
      data: {
        kind: "bytes",
        bytes: new Uint8Array(pdfBytes),
        mediaType: "application/pdf",
        filename: "contract.pdf",
      },
    },
  ],
});

console.log(`Extracted ${result.chunkCount} chunks from PDF`);

// Check for warnings (e.g., empty extraction)
if (result.warnings.length > 0) {
  console.warn("Warnings:", result.warnings);
}
```

### Ingesting a PDF from URL
```ts
const result = await engine.ingest({
  sourceId: "reports:quarterly-q4",
  content: "Q4 2024 Financial Report", // Optional text summary
  assets: [
    {
      assetId: "q4-report",
      kind: "pdf",
      data: {
        kind: "url",
        url: "https://example.com/reports/q4-2024.pdf",
        mediaType: "application/pdf",
      },
      uri: "https://example.com/reports/q4-2024.pdf", // Stored in metadata
    },
  ],
});
```

### Retrieving PDF content
PDF chunks are retrieved like any other content:
```ts
const result = await engine.retrieve({
  query: "What are the payment terms?",
  topK: 10,
});

for (const chunk of result.chunks) {
  const ref = getChunkAssetRef(chunk);
  if (ref?.extractor === "pdf:llm") {
    console.log("From PDF:", chunk.content);
    console.log("Asset ID:", ref.assetId);
  }
}
```

## Handling extraction failures
When extraction fails or produces empty output, Unrag emits warnings:
```ts
const result = await engine.ingest({ ... });

for (const warning of result.warnings) {
  if (warning.code === "asset_skipped_pdf_empty_extraction") {
    console.warn(`PDF ${warning.assetId} produced no text (scanned/empty?)`);
  }
  if (warning.code === "asset_processing_error") {
    console.error(`PDF ${warning.assetId} failed: ${warning.message}`);
  }
}
```

### Common issues
| Warning code | Cause | Solution |
|---|---|---|
| `asset_skipped_pdf_llm_extraction_disabled` | `enabled: false` | Set `enabled: true` in config |
| `asset_skipped_pdf_empty_extraction` | LLM returned no text | PDF may be image-only or corrupted |
| `asset_processing_error` | LLM call failed | Check API key, model availability, file size |
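The table above can be folded into a small triage helper. The warning shape here (`code`, `assetId`) matches the earlier example, but the helper itself is hypothetical application code:

```typescript
// Hypothetical triage helper built from the warning codes in the table above.
type AssetWarning = { code: string; assetId?: string; message?: string };

const WARNING_HINTS: Record<string, string> = {
  asset_skipped_pdf_llm_extraction_disabled: "Set enabled: true in config",
  asset_skipped_pdf_empty_extraction: "PDF may be image-only or corrupted",
  asset_processing_error: "Check API key, model availability, file size",
};

// Map a warning to a human-readable hint for logs or dashboards.
function triage(warning: AssetWarning): string {
  const hint = WARNING_HINTS[warning.code] ?? "Unknown warning code";
  return `${warning.assetId ?? "unknown asset"}: ${hint}`;
}
```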
## Cost considerations
LLM extraction incurs API costs for each PDF processed. To control costs:
- Pre-filter PDFs: Only ingest PDFs you actually need searchable
- Use efficient models: Gemini Flash is faster and cheaper than Pro
- Set size limits: Use `maxBytes` to skip oversized files
- Batch during off-peak: Run bulk ingestion when you can monitor costs
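The first and third points can be combined into a pre-filter that runs before any ingest call, so unneeded or oversized PDFs never incur an LLM call. The `PdfFile` shape is assumed purely for this sketch:

```typescript
// Illustrative pre-filter: drop files that are over the size limit or not
// needed for search before ingestion, so they never reach the LLM.
// The PdfFile shape is an assumption for this sketch.
type PdfFile = { path: string; bytes: number; searchable: boolean };

function selectForIngestion(files: PdfFile[], maxBytes: number): PdfFile[] {
  return files.filter((f) => f.searchable && f.bytes <= maxBytes);
}
```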
## Customizing the prompt
The default prompt is designed for faithful text extraction. You can customize it for specific use cases:
```ts
// Extract structured data
prompt: "Extract all text from this PDF. For tables, output as markdown tables. For forms, output as key: value pairs."

// Focus on specific content
prompt: "Extract only the terms and conditions section from this contract PDF."

// Include metadata
prompt: "Extract the text and identify: document title, date, author (if visible). Format as YAML frontmatter followed by content."
```

Keep prompts deterministic. Avoid instructions like "summarize" or "explain", which produce inconsistent output across runs.
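If you maintain several prompt variants like these, it can help to keep them in one place and select one per document type. This is an application-level pattern, not an Unrag API; the names here are illustrative:

```typescript
// Sketch: keep prompt variants in one place and pick one per document type.
// Application-level pattern; not part of Unrag's API.
const PROMPTS = {
  default:
    "Extract all readable text from this PDF as faithfully as possible. Output plain text or markdown only. Do not add commentary.",
  contract:
    "Extract only the terms and conditions section from this contract PDF.",
  tabular:
    "Extract all text from this PDF. For tables, output as markdown tables.",
} as const;

type DocKind = keyof typeof PROMPTS;

// Fall back to the default prompt for unrecognized document kinds.
function promptFor(kind: string): string {
  return PROMPTS[(kind in PROMPTS ? kind : "default") as DocKind];
}
```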
