
video:frames Extractor

Extract frames from videos and analyze them with a vision LLM.

The video:frames extractor samples frames from video files and sends them to a vision-capable LLM for analysis. Each frame produces a text description, and those descriptions become searchable chunks. This captures visual content that speech doesn't convey—slides in presentations, code on screen, diagrams in tutorials.

This is a worker-only extractor. It requires ffmpeg for frame extraction and makes LLM calls for each frame, which takes time and costs money. Use it when visual content genuinely matters for search.

When frame analysis helps

Frame extraction shines for videos where important information appears visually. A coding tutorial where the presenter says "now let's add the function" benefits from frame analysis that captures what function is actually shown. A slide presentation where the speaker summarizes verbally but the detail is on slides benefits from capturing slide content.

For talking-head videos, interviews, or podcasts with static imagery, frame analysis adds cost without value. The transcription extractor captures the spoken content, which is what matters.

Installation

bunx unrag@latest add extractor video-frames

Register in your config:

import { createVideoFramesExtractor } from "./lib/unrag/extractors/video-frames";

export const unrag = defineUnragConfig({
  // ...
  engine: {
    // ...
    extractors: [createVideoFramesExtractor()],
  },
} as const);

This extractor requires ffmpeg and makes vision LLM calls for each extracted frame. It's not suitable for serverless runtimes. Run it in a worker environment with native dependency support.

How it works

The extraction process has three stages. First, ffmpeg extracts frames from the video at configured intervals—say, one frame every 30 seconds. Each frame is saved as a temporary image file. Then, each frame is sent to a vision-capable LLM (Gemini by default) with a prompt asking for a description of what's shown. Finally, those descriptions become text chunks that flow through your normal embedding pipeline.
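To make stage one concrete, here is a minimal sketch of interval-based sampling with ffmpeg driven from Node. The function name, temp-directory layout, and defaults are illustrative, not the extractor's actual internals.

import { execFile } from "node:child_process";
import { promisify } from "node:util";
import { mkdtemp, readdir } from "node:fs/promises";
import { tmpdir } from "node:os";
import { join } from "node:path";

const run = promisify(execFile);

// Stage one: write one JPEG every `intervalSec` seconds to a temp directory.
// Stages two and three would send each image to the vision model and turn the
// resulting descriptions into chunks.
async function sampleFrames(videoPath: string, intervalSec = 30, ffmpegPath = "ffmpeg") {
  const dir = await mkdtemp(join(tmpdir(), "frames-"));
  await run(ffmpegPath, [
    "-i", videoPath,
    "-vf", `fps=1/${intervalSec}`, // one frame per interval
    join(dir, "frame_%04d.jpg"),
  ]);
  const files = await readdir(dir);
  return files.sort().map((file) => join(dir, file));
}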

The result is a set of chunks that describe the visual content at regular intervals through the video. A search for "error message on screen" can find the moment when an error was visible, even if the speaker didn't mention it.

Configuration

Configure via assetProcessing.video.frames:

export const unrag = defineUnragConfig({
  // ...
  engine: {
    // ...
    assetProcessing: {
      video: {
        frames: {
          enabled: true,
          model: "google/gemini-2.0-flash",
          intervalSec: 30,
          maxFrames: 60,
          maxBytes: 500 * 1024 * 1024,
          prompt: "Describe what is shown in this video frame. Focus on text, UI elements, diagrams, and any visible information. Be specific and factual.",
          ffmpegPath: "/usr/bin/ffmpeg",
        },
      },
    },
  },
} as const);

intervalSec controls how frequently to sample frames. Every 30 seconds works well for presentations and tutorials; for fast-paced content you might sample more often, and for slow-changing content less often.

maxFrames caps the total frames analyzed per video. At one frame per 30 seconds, a 60-frame limit covers 30 minutes of video. Once that limit is reached, the extractor simply stops sampling; it does not spread the frame budget evenly across the rest of the video.

prompt tells the LLM what to look for. The default prompt focuses on factual descriptions of visible content. Customize it for your content type—if you're processing code tutorials, ask specifically about code. If you're processing slide decks, focus on slide titles and bullet points.
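As an example, a prompt tuned for code tutorials might look like the fragment below. Only the frames block is shown, and the 15-second interval and prompt wording are suggestions rather than defaults.

// Fragment of assetProcessing.video.frames for code-heavy screen recordings.
frames: {
  enabled: true,
  model: "google/gemini-2.0-flash",
  intervalSec: 15, // screens change faster than slides
  prompt:
    "Describe the code and terminal output visible in this frame. " +
    "Quote file names, function signatures, and error messages exactly as shown.",
},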

Usage example

await engine.ingest({
  sourceId: "tutorials:react-basics",
  content: "Introduction to React Hooks",
  assets: [
    {
      assetId: "tutorial-video",
      kind: "video",
      data: {
        kind: "url",
        url: "https://storage.example.com/tutorials/react-hooks.mp4",
        mediaType: "video/mp4",
      },
    },
  ],
});

After processing, searches like "useEffect dependency array" might surface frames where that code pattern is visible, even if the instructor didn't use those exact words.

What gets stored

Each analyzed frame produces a chunk:

Field                        Content
chunk.content                LLM's description of the frame
chunk.metadata.assetKind     "video"
chunk.metadata.assetId       Your provided ID
chunk.metadata.extractor     "video:frames"
chunk.metadata.frameIndex    Which frame (0, 1, 2, ...)
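For reference, here is the same metadata shape as an illustrative TypeScript type; it is not an export of the library.

// Shape of the metadata attached to each frame chunk, per the table above.
type VideoFrameChunkMetadata = {
  assetKind: "video";
  assetId: string; // your provided ID
  extractor: "video:frames";
  frameIndex: number; // 0, 1, 2, ...
};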

Combining with transcription

For comprehensive video search, use both extractors:

extractors: [
  createVideoTranscribeExtractor(),
  createVideoFramesExtractor(),
],

With both enabled, a video produces chunks from the spoken transcript and chunks describing the visual content. A search query might match either—finding a video because the presenter said something or because something appeared on screen.

Cost and latency

Frame extraction is expensive. Each frame requires an LLM vision call, and a 30-minute video at 30-second intervals produces 60 frames. That's 60 LLM calls per video, plus the time to process them.
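A quick way to sanity-check the budget before ingesting a whole library is a back-of-the-envelope estimate like the sketch below. The per-call price is a placeholder assumption; substitute your model's actual image-input pricing.

// Rough estimate of vision calls and cost for one video.
function estimateFrameCost(
  durationSec: number,
  intervalSec = 30,
  maxFrames = 60,
  perCallUsd = 0.001, // placeholder rate, not real pricing
) {
  const frames = Math.min(Math.floor(durationSec / intervalSec), maxFrames);
  return { frames, estimatedUsd: frames * perCallUsd };
}

estimateFrameCost(30 * 60); // { frames: 60, estimatedUsd: 0.06 } at the placeholder rate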

For high-volume video processing, be thoughtful about which videos need frame analysis. Batch processing during off-hours helps manage costs. Using a faster, cheaper model (Gemini Flash over Pro) reduces per-call cost.

Worker setup

This extractor needs a worker environment with ffmpeg installed. A typical Docker setup:

FROM node:20-slim

# ffmpeg is required by the video:frames extractor
RUN apt-get update && apt-get install -y ffmpeg && rm -rf /var/lib/apt/lists/*

WORKDIR /app
COPY . .
RUN npm install

CMD ["node", "worker.js"]

The worker processes video extraction jobs from a queue, running outside your main application's request/response cycle. See the Next.js Production Recipe for queue-based patterns.
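The shape of such a worker, sketched with placeholder types: the queue client and the engine instance come from your own setup, and the property names on VideoJob are assumptions.

// Placeholder types; swap in your real queue client (BullMQ, pg-boss, SQS, ...)
// and the configured Unrag engine.
type VideoJob = { id: string; sourceId: string; title: string; url: string };

type Queue = {
  dequeue(): Promise<VideoJob | null>;
  markDone(id: string): Promise<void>;
};

type Engine = { ingest(input: unknown): Promise<unknown> };

async function runWorker(engine: Engine, queue: Queue) {
  while (true) {
    const job = await queue.dequeue();
    if (!job) {
      await new Promise((resolve) => setTimeout(resolve, 5_000)); // idle backoff
      continue;
    }
    await engine.ingest({
      sourceId: job.sourceId,
      content: job.title,
      assets: [
        {
          assetId: job.id,
          kind: "video",
          data: { kind: "url", url: job.url, mediaType: "video/mp4" },
        },
      ],
    });
    await queue.markDone(job.id);
  }
}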
