Asset Processing Reference

This page documents the types and configuration options for Unrag's asset processing system—how PDFs, images, and other media are handled during ingestion.

Skips and warnings (don’t miss content silently)

When assetProcessing.onUnsupportedAsset or assetProcessing.onError are set to "skip" (the defaults), Unrag will continue ingestion even if some assets cannot be processed.

To ensure you don’t silently miss content, engine.ingest() returns structured warnings:

const result = await engine.ingest(input);
if (result.warnings.length) {
  console.warn("unrag ingest warnings", result.warnings);
}

Common warning causes:

Assets skipped because extraction is disabled by config (e.g. audio.transcription.enabled: false)
PDFs skipped because all PDF strategies are disabled (e.g. pdf.llmExtraction.enabled: false and no other PDF extractor enabled)
Images skipped because no multimodal embedding provider is configured and there’s no caption text
Best-effort failures when onError: "skip" is set

For the full warning type, see IngestWarning.
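
As a rough sketch of acting on warnings instead of only logging them, the helpers below count warnings per code and optionally fail a run when any are present. The `WarningLike` shape (with `code` and `message` fields) and both helper names are assumptions made for illustration; consult IngestWarning for the real fields.

```typescript
// Assumed minimal warning shape for illustration; the real IngestWarning may differ.
type WarningLike = { code: string; message: string };

// Count warnings per code so they can be logged or sent to metrics.
function summarizeWarnings(warnings: WarningLike[]): Record<string, number> {
  const counts: Record<string, number> = {};
  for (const warning of warnings) {
    counts[warning.code] = (counts[warning.code] ?? 0) + 1;
  }
  return counts;
}

// Throw when any warning is present, for pipelines that must not drop content.
function assertNoWarnings(warnings: WarningLike[]): void {
  if (warnings.length > 0) {
    const summary = JSON.stringify(summarizeWarnings(warnings));
    throw new Error(`ingest produced ${warnings.length} warning(s): ${summary}`);
  }
}
```

With a helper like this, a critical pipeline could call `assertNoWarnings(result.warnings)` right after `engine.ingest()`, while bulk imports only record the summary.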

If you run Unrag on Next.js (especially Vercel serverless), prefer running extraction in background jobs with retries. See the Next.js Production Recipe.

AssetProcessingEvent

assetProcessing.hooks.onEvent receives a structured union type that you can log or pipe into metrics.

export type AssetProcessingEvent =
  | {
      type: "asset:start";
      sourceId: string;
      documentId: string;
      assetId: string;
      assetKind: AssetKind;
      assetUri?: string;
      assetMediaType?: string;
    }
  | ({ type: "asset:skipped"; sourceId: string; documentId: string } & IngestWarning)
  | {
      type: "extractor:start";
      sourceId: string;
      documentId: string;
      assetId: string;
      assetKind: AssetKind;
      extractor: string;
    }
  | {
      type: "extractor:success";
      sourceId: string;
      documentId: string;
      assetId: string;
      assetKind: AssetKind;
      extractor: string;
      durationMs: number;
      textItemCount: number;
    }
  | {
      type: "extractor:error";
      sourceId: string;
      documentId: string;
      assetId: string;
      assetKind: AssetKind;
      extractor: string;
      durationMs: number;
      errorMessage: string;
    };
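
One way to use these events is to aggregate per-extractor success counts, error counts, and total duration. The sketch below redeclares a simplified subset of the union so it is self-contained; `EventLike`, `ExtractorStats`, and `createExtractorStats` are hypothetical names, not part of Unrag.

```typescript
// Simplified stand-in for AssetProcessingEvent: only the fields this sketch reads.
type EventLike =
  | { type: "asset:start" | "asset:skipped" | "extractor:start" }
  | { type: "extractor:success"; extractor: string; durationMs: number; textItemCount: number }
  | { type: "extractor:error"; extractor: string; durationMs: number; errorMessage: string };

type ExtractorStats = { success: number; error: number; totalMs: number };

// Build an onEvent handler plus the map it writes its aggregates into.
function createExtractorStats() {
  const stats = new Map<string, ExtractorStats>();

  const onEvent = (event: EventLike): void => {
    // Only finished extractor runs carry a duration; ignore everything else.
    if (event.type !== "extractor:success" && event.type !== "extractor:error") return;

    const entry = stats.get(event.extractor) ?? { success: 0, error: 0, totalMs: 0 };
    if (event.type === "extractor:success") entry.success += 1;
    else entry.error += 1;
    entry.totalMs += event.durationMs;
    stats.set(event.extractor, entry);
  };

  return { onEvent, stats };
}
```

The returned `onEvent` could then be passed as assetProcessing.hooks.onEvent, and `stats` read after ingestion to report error rates per extractor.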

FetchConfig

Controls how Unrag fetches assets when they're provided as URLs. This applies to:

Extractors: PDFs, audio, video, and other files fetched before extraction
Image embedding: When using a multimodal embedding provider, image URLs are fetched server-side before being passed to the embedding model

This ensures URL-based assets go through consistent security policies (HTTPS-only, allowlist, timeouts) regardless of how they're processed.

type FetchConfig = {
  enabled: boolean;
  maxBytes: number;
  timeoutMs: number;
  allowedHosts?: string[];
  headers?: Record<string, string>;
};

For production deployments, always configure allowedHosts to prevent server-side request forgery (SSRF). Only allow hosts you trust, such as your CDN or Notion's asset domains.

Image embedding security: When multimodal embedding is enabled, image URLs are fetched server-side using these same settings—the URL is never passed directly to the embedding provider. This prevents internal/signed URLs from being leaked to third-party APIs.
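
To make the policy concrete, here is a minimal sketch of an HTTPS-only, allowlist-based URL check. It is not Unrag's implementation: `FetchPolicy`, `isUrlAllowed`, the exact-hostname matching, and the behavior of allowing every HTTPS host when no allowlist is set are all assumptions for illustration.

```typescript
// Subset of FetchConfig relevant to the host policy.
type FetchPolicy = { enabled: boolean; allowedHosts?: string[] };

// Hypothetical helper: true when a URL passes an HTTPS-only, allowlist-based policy.
function isUrlAllowed(rawUrl: string, policy: FetchPolicy): boolean {
  if (!policy.enabled) return false;

  let url: URL;
  try {
    url = new URL(rawUrl);
  } catch (err) {
    return false; // not a parseable absolute URL
  }

  if (url.protocol !== "https:") return false;

  // Assumption: without an allowlist every HTTPS host passes,
  // which is why production deployments should configure one.
  if (!policy.allowedHosts) return true;

  const host = url.hostname.toLowerCase();
  return policy.allowedHosts.some((allowed) => host === allowed.toLowerCase());
}
```

Under this sketch, a cloud metadata address such as `https://169.254.169.254/` is rejected as soon as `allowedHosts` lists only trusted hosts.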

PdfLlmExtractionConfig

Settings for extracting text from PDFs using an LLM.

type PdfLlmExtractionConfig = {
  enabled: boolean;
  model: string;
  prompt: string;
  timeoutMs: number;
  maxBytes: number;
  maxOutputChars: number;
};

PdfTextLayerConfig

Settings for extracting embedded text from PDFs (fast/cheap).

See pdf:text-layer.

type PdfTextLayerConfig = {
  enabled: boolean;
  maxBytes: number;
  maxOutputChars: number;
  minChars: number;
  maxPages?: number;
};

PdfOcrConfig

Settings for OCR-based PDF extraction (worker-only).

See pdf:ocr.

type PdfOcrConfig = {
  enabled: boolean;
  maxBytes: number;
  maxOutputChars: number;
  minChars: number;
  maxPages?: number;
  pdftoppmPath?: string;
  tesseractPath?: string;
  dpi?: number;
  lang?: string;
};

ImageOcrConfig

Settings for OCR-ing images into text chunks.

See image:ocr.

type ImageOcrConfig = {
  enabled: boolean;
  model: string;
  prompt: string;
  timeoutMs: number;
  maxBytes: number;
  maxOutputChars: number;
};

ImageCaptionLlmConfig

Settings for generating captions for images.

See image:caption-llm.

type ImageCaptionLlmConfig = {
  enabled: boolean;
  model: string;
  prompt: string;
  timeoutMs: number;
  maxBytes: number;
  maxOutputChars: number;
};

AudioTranscriptionConfig

Settings for transcribing audio.

See audio:transcribe.

type AudioTranscriptionConfig = {
  enabled: boolean;
  model: string;
  timeoutMs: number;
  maxBytes: number;
};

VideoTranscriptionConfig

Settings for transcribing video (audio track).

See video:transcribe.

type VideoTranscriptionConfig = {
  enabled: boolean;
  model: string;
  timeoutMs: number;
  maxBytes: number;
};

VideoFramesConfig

Settings for sampling video frames and extracting text per frame (worker-only).

See video:frames.

type VideoFramesConfig = {
  enabled: boolean;
  sampleFps: number;
  maxFrames: number;
  ffmpegPath?: string;
  maxBytes: number;
  model: string;
  prompt: string;
  timeoutMs: number;
  maxOutputChars: number;
};
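
For cost planning, it can help to estimate how many frames a video would yield. The sketch below assumes that frames are sampled at `sampleFps` across the whole video and then capped at `maxFrames`; that interaction and the `estimateFrameCount` helper are assumptions, not documented behavior.

```typescript
// Hypothetical estimate: frames sampled at sampleFps, capped at maxFrames.
function estimateFrameCount(
  durationSeconds: number,
  sampleFps: number,
  maxFrames: number,
): number {
  const sampled = Math.floor(durationSeconds * sampleFps);
  return Math.max(0, Math.min(sampled, maxFrames));
}

// With sampleFps = 0.2 (one frame every 5 seconds) and maxFrames = 50:
const oneMinute = estimateFrameCount(60, 0.2, 50); // 12 frames
const tenMinutes = estimateFrameCount(600, 0.2, 50); // capped at 50 frames
```

Under these assumptions the cap is reached at 250 seconds of video, so longer videos would not add further per-frame model calls.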

File*Config

Settings for extracting text from generic file attachments.

See File Extractors for format-specific behavior.

type FileTextConfig = { enabled: boolean; maxBytes: number; maxOutputChars: number; minChars: number };
type FileDocxConfig = { enabled: boolean; maxBytes: number; maxOutputChars: number; minChars: number };
type FilePptxConfig = { enabled: boolean; maxBytes: number; maxOutputChars: number; minChars: number };
type FileXlsxConfig = { enabled: boolean; maxBytes: number; maxOutputChars: number; minChars: number };

AssetInput

The type for individual assets passed to engine.ingest().

type AssetInput = {
  /** Stable identifier for this asset within the document. */
  assetId: string;

  /** The kind of asset. Determines how it's processed. */
  kind: AssetKind;

  /** The asset data—either a URL to fetch or raw bytes. */
  data: AssetData;

  /** Optional stable URI for display/debugging (not necessarily fetchable). */
  uri?: string;

  /** Caption, alt text, or other text representation of the asset. */
  text?: string;

  /** Additional metadata to attach to chunks created from this asset. */
  metadata?: Metadata;
};

AssetData

The data payload for an asset—either a URL or raw bytes.

type AssetData =
  | {
      kind: "url";
      url: string;
      headers?: Record<string, string>;
      mediaType?: string;
      filename?: string;
    }
  | {
      kind: "bytes";
      bytes: Uint8Array;
      mediaType: string;
      filename?: string;
    };

URL-based data is fetched during ingestion (subject to fetch config). This is common for connectors like Notion where assets are hosted externally. The Google Drive connector downloads files directly and uses bytes-based data instead.

Bytes-based data is used directly without fetching. Use this when you already have the file contents in memory.
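
The two variants can be illustrated with one asset of each kind. The types are copied from above so the sketch is self-contained; the URL, asset IDs, bytes, and the `describeAsset` helper are made up for the example, and `Record<string, unknown>` stands in for the Metadata type.

```typescript
type AssetKind = "image" | "pdf" | "audio" | "video" | "file";

type AssetData =
  | { kind: "url"; url: string; headers?: Record<string, string>; mediaType?: string; filename?: string }
  | { kind: "bytes"; bytes: Uint8Array; mediaType: string; filename?: string };

type AssetInput = {
  assetId: string;
  kind: AssetKind;
  data: AssetData;
  uri?: string;
  text?: string;
  metadata?: Record<string, unknown>; // stand-in for Metadata
};

// URL-based: fetched during ingestion, subject to the fetch config.
const pdfFromUrl: AssetInput = {
  assetId: "report-pdf",
  kind: "pdf",
  data: { kind: "url", url: "https://cdn.example.com/report.pdf", mediaType: "application/pdf" },
};

// Bytes-based: used directly; mediaType is required here.
const imageFromBytes: AssetInput = {
  assetId: "diagram",
  kind: "image",
  data: {
    kind: "bytes",
    bytes: new Uint8Array([0x89, 0x50, 0x4e, 0x47]), // placeholder bytes, not a full PNG
    mediaType: "image/png",
    filename: "diagram.png",
  },
  text: "Architecture diagram of the ingestion pipeline",
};

// Narrow on data.kind to reach the variant-specific fields.
function describeAsset(asset: AssetInput): string {
  const data = asset.data;
  return data.kind === "url"
    ? `${asset.kind} via ${data.url}`
    : `${asset.kind} (${data.bytes.byteLength} bytes, ${data.mediaType})`;
}
```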

AssetKind

The supported asset types:

type AssetKind = "image" | "pdf" | "audio" | "video" | "file";

Kind	v1 Support	Notes
`image`	Supported	Direct embedding (multimodal) or caption fallback
`pdf`	Supported	Requires at least one enabled PDF extractor (`pdf:text-layer`, `pdf:llm`, or `pdf:ocr`)
`audio`	Supported	Requires installing `audio-transcribe` and enabling `audio.transcription.enabled`
`video`	Supported	Requires installing `video-transcribe` and/or `video-frames` and enabling config
`file`	Supported	Requires installing file extractors and enabling config (text/docx/pptx/xlsx)

Per-ingest overrides

You can override asset processing settings per-ingest using DeepPartial<AssetProcessingConfig>:

await engine.ingest({
  sourceId: "important-doc",
  content: "...",
  assets: [...],
  assetProcessing: {
    // Only override what you need—everything else uses engine defaults
    pdf: {
      textLayer: {
        enabled: true,
      },
      llmExtraction: {
        enabled: true,
        maxBytes: 50 * 1024 * 1024, // Allow larger PDFs for this doc
      },
    },
    audio: {
      transcription: {
        enabled: true,
      },
    },
    onError: "fail", // Fail this ingest if any asset errors
  },
});

This is useful for:

Enabling extraction for specific high-value documents
Disabling extraction during bulk imports to save cost
Switching to strict mode (onError: "fail") for critical content
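
To picture how a partial override combines with engine defaults, the sketch below deep-merges plain objects: nested objects merge key by key, while primitives and arrays are replaced. It only illustrates the general DeepPartial idea; `mergeOverrides` is a hypothetical helper and Unrag's actual merge rules may differ.

```typescript
type PlainObject = Record<string, unknown>;

function isPlainObject(value: unknown): value is PlainObject {
  return typeof value === "object" && value !== null && !Array.isArray(value);
}

// Return a new object: defaults with overrides applied recursively.
function mergeOverrides<T extends PlainObject>(defaults: T, overrides: PlainObject): T {
  const result: PlainObject = { ...defaults };
  for (const [key, value] of Object.entries(overrides)) {
    if (value === undefined) continue; // an undefined override keeps the default
    const base = result[key];
    result[key] =
      isPlainObject(base) && isPlainObject(value) ? mergeOverrides(base, value) : value;
  }
  return result as T;
}

// Trimmed-down defaults for the example.
const defaults = {
  onError: "skip",
  pdf: { llmExtraction: { enabled: false, maxBytes: 15 * 1024 * 1024 } },
};

const merged = mergeOverrides(defaults, {
  onError: "fail",
  pdf: { llmExtraction: { enabled: true } },
});
// merged.pdf.llmExtraction is { enabled: true, maxBytes: 15728640 }
```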

Default configuration

The library defaults (cost-safe, conservative):

const DEFAULT_ASSET_PROCESSING: AssetProcessingConfig = {
  onUnsupportedAsset: "skip",
  onError: "skip",
  concurrency: 4,
  hooks: { onEvent: undefined },
  fetch: {
    enabled: true,
    maxBytes: 15 * 1024 * 1024,
    timeoutMs: 20_000,
    allowedHosts: undefined,
    headers: undefined,
  },
  pdf: {
    textLayer: {
      enabled: false,
      maxBytes: 15 * 1024 * 1024,
      maxOutputChars: 200_000,
      minChars: 200,
      maxPages: undefined,
    },
    llmExtraction: {
      enabled: false, // Off by default in library
      model: "google/gemini-2.0-flash",
      prompt: "Extract all readable text from this PDF...",
      timeoutMs: 60_000,
      maxBytes: 15 * 1024 * 1024,
      maxOutputChars: 200_000,
    },
    ocr: {
      enabled: false,
      maxBytes: 15 * 1024 * 1024,
      maxOutputChars: 200_000,
      minChars: 200,
      maxPages: undefined,
      pdftoppmPath: undefined,
      tesseractPath: undefined,
      dpi: 200,
      lang: "eng",
    },
  },
  image: {
    ocr: {
      enabled: false,
      model: "google/gemini-2.0-flash",
      prompt: "Extract all readable text from this image...",
      timeoutMs: 60_000,
      maxBytes: 10 * 1024 * 1024,
      maxOutputChars: 50_000,
    },
    captionLlm: {
      enabled: false,
      model: "google/gemini-2.0-flash",
      prompt: "Write a concise, information-dense caption for this image...",
      timeoutMs: 60_000,
      maxBytes: 10 * 1024 * 1024,
      maxOutputChars: 10_000,
    },
  },
  audio: {
    transcription: {
      enabled: false,
      model: "openai/whisper-1",
      timeoutMs: 120_000,
      maxBytes: 25 * 1024 * 1024,
    },
  },
  video: {
    transcription: {
      enabled: false,
      model: "openai/whisper-1",
      timeoutMs: 120_000,
      maxBytes: 50 * 1024 * 1024,
    },
    frames: {
      enabled: false,
      sampleFps: 0.2,
      maxFrames: 50,
      ffmpegPath: undefined,
      maxBytes: 50 * 1024 * 1024,
      model: "google/gemini-2.0-flash",
      prompt: "Extract all readable text from this video frame...",
      timeoutMs: 60_000,
      maxOutputChars: 50_000,
    },
  },
  file: {
    text: { enabled: false, maxBytes: 5 * 1024 * 1024, maxOutputChars: 200_000, minChars: 50 },
    docx: { enabled: false, maxBytes: 15 * 1024 * 1024, maxOutputChars: 200_000, minChars: 50 },
    pptx: { enabled: false, maxBytes: 30 * 1024 * 1024, maxOutputChars: 200_000, minChars: 50 },
    xlsx: { enabled: false, maxBytes: 30 * 1024 * 1024, maxOutputChars: 200_000, minChars: 50 },
  },
};

The generated unrag.config.ts sets pdf.llmExtraction.enabled to true by default, but you still need to install and register a PDF extractor module (like pdf-llm) for PDFs to actually be processed.
