video:frames Extractor
Extract frames from videos and analyze them with a vision LLM.
The video:frames extractor samples frames from video files and sends them to a vision-capable LLM for analysis. Each frame produces a text description, and those descriptions become searchable chunks. This captures visual content that speech doesn't convey—slides in presentations, code on screen, diagrams in tutorials.
This is a worker-only extractor. It requires ffmpeg for frame extraction and makes LLM calls for each frame, which takes time and costs money. Use it when visual content genuinely matters for search.
When frame analysis helps
Frame extraction shines for videos where important information appears visually. A coding tutorial where the presenter says "now let's add the function" benefits from frame analysis that captures what function is actually shown. A slide presentation where the speaker summarizes verbally but the detail is on slides benefits from capturing slide content.
For talking-head videos, interviews, or podcasts with static imagery, frame analysis adds cost without value. The transcription extractor captures the spoken content, which is what matters.
Installation
bunx unrag@latest add extractor video-frames

Register in your config:
import { createVideoFramesExtractor } from "./lib/unrag/extractors/video-frames";
export const unrag = defineUnragConfig({
// ...
engine: {
// ...
extractors: [createVideoFramesExtractor()],
},
} as const);

This extractor requires ffmpeg and makes vision LLM calls for each extracted frame. It's not suitable for serverless runtimes. Run it in a worker environment with native dependency support.
How it works
The extraction process has three stages. First, ffmpeg extracts frames from the video at configured intervals—say, one frame every 30 seconds. Each frame is saved as a temporary image file. Then, each frame is sent to a vision-capable LLM (Gemini by default) with a prompt asking for a description of what's shown. Finally, those descriptions become text chunks that flow through your normal embedding pipeline.
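Conceptually, the pipeline resembles the sketch below. This is an illustration only, not the extractor's implementation: it assumes Node's child_process for invoking ffmpeg and the Vercel AI SDK with the Google provider for the vision calls, and it hard-codes the default interval and frame cap.

// Illustrative sketch of the three stages; the real extractor reads these values from config.
import { execFile } from "node:child_process";
import { promisify } from "node:util";
import { mkdtemp, readdir, readFile } from "node:fs/promises";
import { tmpdir } from "node:os";
import { join } from "node:path";
import { generateText } from "ai";
import { google } from "@ai-sdk/google";

const run = promisify(execFile);

async function describeFrames(videoPath: string, intervalSec = 30, maxFrames = 60) {
  // Stage 1: sample one frame every `intervalSec` seconds into a temp directory.
  const outDir = await mkdtemp(join(tmpdir(), "frames-"));
  await run("ffmpeg", [
    "-i", videoPath,
    "-vf", `fps=1/${intervalSec}`,
    "-frames:v", String(maxFrames),
    join(outDir, "frame_%04d.jpg"),
  ]);

  // Stage 2: ask a vision-capable model to describe each frame.
  const frames = (await readdir(outDir)).sort();
  const descriptions: { frameIndex: number; text: string }[] = [];
  for (const [frameIndex, file] of frames.entries()) {
    const { text } = await generateText({
      model: google("gemini-2.0-flash"),
      messages: [
        {
          role: "user",
          content: [
            { type: "text", text: "Describe what is shown in this video frame." },
            { type: "image", image: await readFile(join(outDir, file)) },
          ],
        },
      ],
    });
    descriptions.push({ frameIndex, text });
  }

  // Stage 3: each description becomes a text chunk for the normal embedding pipeline.
  return descriptions;
}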
The result is a set of chunks that describe the visual content at regular intervals through the video. A search for "error message on screen" can find the moment when an error was visible, even if the speaker didn't mention it.
Configuration
Configure via assetProcessing.video.frames:
export const unrag = defineUnragConfig({
// ...
engine: {
// ...
assetProcessing: {
video: {
frames: {
enabled: true,
model: "google/gemini-2.0-flash",
intervalSec: 30,
maxFrames: 60,
maxBytes: 500 * 1024 * 1024,
prompt: "Describe what is shown in this video frame. Focus on text, UI elements, diagrams, and any visible information. Be specific and factual.",
ffmpegPath: "/usr/bin/ffmpeg",
},
},
},
},
} as const);

intervalSec controls how frequently to sample frames. Every 30 seconds works well for presentations and tutorials. For fast-paced content, you might sample more frequently; for slow content, less often.
maxFrames caps the total frames analyzed per video. At one frame per 30 seconds, a 60-frame limit covers 30 minutes of video. Beyond that limit, the extractor stops sampling early rather than skipping frames throughout.
prompt tells the LLM what to look for. The default prompt focuses on factual descriptions of visible content. Customize it for your content type—if you're processing code tutorials, ask specifically about code. If you're processing slide decks, focus on slide titles and bullet points.
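For example, a deployment focused on code tutorials might override the prompt like this (only the frames block is shown; whether omitted keys fall back to their defaults is an assumption to verify against your config):

assetProcessing: {
  video: {
    frames: {
      enabled: true,
      prompt:
        "Describe any code visible in this frame: language, file and function names, " +
        "and any error messages or terminal output. Quote short snippets verbatim where legible.",
    },
  },
},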
Usage example
await engine.ingest({
sourceId: "tutorials:react-basics",
content: "Introduction to React Hooks",
assets: [
{
assetId: "tutorial-video",
kind: "video",
data: {
kind: "url",
url: "https://storage.example.com/tutorials/react-hooks.mp4",
mediaType: "video/mp4",
},
},
],
});

After processing, searches like "useEffect dependency array" might surface frames where that code pattern is visible, even if the instructor didn't use those exact words.
What gets stored
Each analyzed frame produces a chunk:
| Field | Content |
|---|---|
| chunk.content | LLM's description of the frame |
| chunk.metadata.assetKind | "video" |
| chunk.metadata.assetId | Your provided ID |
| chunk.metadata.extractor | "video:frames" |
| chunk.metadata.frameIndex | Which frame (0, 1, 2, ...) |
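As a concrete illustration (the field values here are invented), a stored frame chunk might look like:

const frameChunk = {
  content: "Slide titled 'useEffect pitfalls' with three bullets about stale closures and cleanup.",
  metadata: {
    assetKind: "video",
    assetId: "tutorial-video",
    extractor: "video:frames",
    frameIndex: 4,
  },
};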
Combining with transcription
For comprehensive video search, use both extractors:
extractors: [
createVideoTranscribeExtractor(),
createVideoFramesExtractor(),
],

With both enabled, a video produces chunks from the spoken transcript and chunks describing the visual content. A search query might match either—finding a video because the presenter said something or because something appeared on screen.
Cost and latency
Frame extraction is expensive. Each frame requires an LLM vision call, and a 30-minute video at 30-second intervals produces 60 frames. That's 60 LLM calls per video, plus the time to process them.
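As a rough sanity check before enabling it broadly, the number of vision calls per video is bounded by both the sampling interval and the frame cap; a quick estimate:

// Rough upper bound on vision LLM calls for one video.
function estimateFrameCalls(durationSec: number, intervalSec = 30, maxFrames = 60): number {
  return Math.min(Math.ceil(durationSec / intervalSec), maxFrames);
}

estimateFrameCalls(30 * 60); // 60 calls for a 30-minute video at the defaults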
For high-volume video processing, be thoughtful about which videos need frame analysis. Batch processing during off-hours helps manage costs. Using a faster, cheaper model (Gemini Flash over Pro) reduces per-call cost.
Worker setup
This extractor needs a worker environment with ffmpeg installed. A typical Docker setup:
FROM node:20-slim
RUN apt-get update && apt-get install -y ffmpeg && rm -rf /var/lib/apt/lists/*
WORKDIR /app
COPY . .
RUN npm install
CMD ["node", "worker.js"]

The worker processes video extraction jobs from a queue, running outside your main application's request/response cycle. See the Next.js Production Recipe for queue-based patterns.
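A minimal worker loop might look like the sketch below (TypeScript for clarity; compile or adapt it to the worker.js entrypoint above). The ./queue helpers and the engine import path are hypothetical placeholders for your own queue and unrag setup.

import { engine } from "./unrag"; // hypothetical module exporting your configured unrag engine
import { takeNextVideoJob, markDone } from "./queue"; // hypothetical queue helpers (Redis, SQS, pg-boss, ...)

async function runWorker(): Promise<void> {
  while (true) {
    const job = await takeNextVideoJob();
    if (!job) {
      // Nothing queued: wait a few seconds before polling again.
      await new Promise((resolve) => setTimeout(resolve, 5_000));
      continue;
    }
    await engine.ingest({
      sourceId: job.sourceId,
      content: job.title,
      assets: [
        {
          assetId: job.assetId,
          kind: "video",
          data: { kind: "url", url: job.videoUrl, mediaType: "video/mp4" },
        },
      ],
    });
    await markDone(job.id);
  }
}

runWorker();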
