pdf:text-layer Extractor
Extract text from PDFs using the built-in text layer.
The pdf:text-layer extractor pulls text directly from a PDF's embedded text layer. When a document is "born digital"—created in Word, Google Docs, or any text-aware application—that text is stored in the PDF and can be extracted without any OCR or AI processing.
This is the fastest and cheapest PDF extraction method. There are no API calls, no external dependencies, and it runs entirely in JavaScript. The tradeoff is that it only works well for PDFs that actually have embedded text.
When text-layer works well
Text layer extraction is ideal for born-digital documents: reports, proposals, contracts, and articles produced by text-aware applications. These have clean text layers that extract quickly and accurately.
It doesn't work for scanned documents or image-based PDFs where pages are essentially photographs. There's no text layer to extract. Some PDFs have partial text layers—a watermark or header but not the body content. These produce sparse output.
Complex layouts can also cause problems. Multi-column documents, sidebars, and text boxes may extract in unexpected reading order. The text layer doesn't guarantee logical reading sequence.
Installation
The easiest way to install this extractor is during setup:
```bash
bunx unrag@latest init --rich-media
```

Select pdf-text-layer from the list and the CLI handles everything. This is the default PDF extractor selected by the CLI's rich-media preset.
Manual installation
```bash
bunx unrag@latest add extractor pdf-text-layer
```

Then register in your config and enable the assetProcessing.pdf.textLayer.enabled flag:
```ts
import { createPdfTextLayerExtractor } from "./lib/unrag/extractors/pdf-text-layer";

export const unrag = defineUnragConfig({
  // ...
  engine: {
    // ...
    extractors: [createPdfTextLayerExtractor()],
  },
} as const);
```

Configuration
```ts
export const unrag = defineUnragConfig({
  // ...
  engine: {
    // ...
    assetProcessing: {
      pdf: {
        textLayer: {
          enabled: true,
          maxBytes: 15 * 1024 * 1024,
          maxOutputChars: 200_000,
          minChars: 200,
        },
      },
    },
  },
} as const);
```

maxBytes sets the maximum PDF file size to process.
maxOutputChars truncates extracted text beyond this length.
minChars is the minimum character count to consider extraction successful. Below this threshold, the extractor reports failure so the next extractor in the chain can try. This is how you implement fallback to LLM extraction for scanned documents.
The minChars threshold
The minChars setting is key to fallback behavior. A scanned PDF might have a few dozen characters of embedded text—a watermark, page numbers, a header line. Without minChars, this would count as "successful" extraction, and you'd end up with a chunk containing just "CONFIDENTIAL" and some page numbers.
Setting minChars: 200 means the extractor only claims success when it finds meaningful content. Documents with sparse text layers fall through to the LLM extractor.
The right threshold depends on your documents. For typical business documents, 200 characters works well. If you're processing shorter documents (single-page forms, for example), you might lower it.
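As a sketch, a config tuned for single-page forms might lower the threshold like this (the value 50 is purely illustrative; pick something below the shortest document you expect to extract successfully):

```ts
export const unrag = defineUnragConfig({
  // ...
  engine: {
    // ...
    assetProcessing: {
      pdf: {
        textLayer: {
          enabled: true,
          // Illustrative value: short forms still count as successful extraction
          // instead of falling through to the LLM extractor.
          minChars: 50,
        },
      },
    },
  },
} as const);
```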
Usage with fallbacks
The most common pattern pairs pdf:text-layer with pdf:llm:
```ts
extractors: [
  createPdfTextLayerExtractor(), // Try text layer first
  createPdfLlmExtractor(), // Fall back to LLM
],
```

Digital PDFs extract instantly. Scanned or complex PDFs fall through to LLM. You get fast extraction when possible, quality when needed.
Usage example
```ts
import { readFile } from "node:fs/promises";

const pdfBytes = await readFile("./documents/quarterly-report.pdf");

await engine.ingest({
  sourceId: "reports:q3-2024",
  content: "Q3 2024 Quarterly Report",
  assets: [
    {
      assetId: "q3-report-pdf",
      kind: "pdf",
      data: {
        kind: "bytes",
        bytes: new Uint8Array(pdfBytes),
        mediaType: "application/pdf",
      },
    },
  ],
});
```

Troubleshooting
If extraction produces empty or garbled output, open the PDF in a reader and try to select and copy text. If you can't select text, the PDF doesn't have a text layer and needs LLM or OCR extraction.
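If you'd rather check programmatically, counting the characters in the embedded text layer tells you the same thing. The sketch below assumes pdfjs-dist (not necessarily what the extractor uses internally) and a placeholder file path:

```ts
import { readFile } from "node:fs/promises";
// Assumption: pdfjs-dist is installed; any pure-JS PDF text API works similarly.
import { getDocument } from "pdfjs-dist/legacy/build/pdf.mjs";

const bytes = new Uint8Array(await readFile("./documents/suspect.pdf")); // placeholder path

const doc = await getDocument({ data: bytes }).promise;
let chars = 0;
for (let pageNumber = 1; pageNumber <= doc.numPages; pageNumber++) {
  const page = await doc.getPage(pageNumber);
  const content = await page.getTextContent();
  // Only TextItem entries carry strings; marked-content entries do not.
  for (const item of content.items) {
    if ("str" in item) chars += item.str.length;
  }
}

console.log(`Embedded text characters: ${chars}`);
// A long document that reports only a handful of characters has no usable text layer.
```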
If text extracts but in wrong order (paragraphs jumbled, columns mixed), the PDF has a text layer but with non-linear reading order. Complex layouts benefit from LLM extraction, which "reads" the document visually and interprets structure.
If extraction succeeds but produces very little text despite a lengthy document, you may have hit a PDF with a partial text layer. Raise minChars so the extractor reports failure and the fallback kicks in.
