# Video Extractors
Make video content searchable through audio transcription or frame analysis.
Video combines audio and visual information, and there are two fundamentally different approaches to making it searchable. You can extract the audio track and transcribe it, which captures what was said. Or you can sample frames and analyze them visually, which captures what was shown.
Most video search applications care primarily about speech—lectures, interviews, tutorials, meetings. For these, audio transcription is the right tool. Visual analysis matters more for videos where important information appears on screen: diagrams in presentations, code in tutorials, charts in data walkthroughs.
## Two approaches
| Extractor | How it works | Best for |
|---|---|---|
| `video:transcribe` | Extract audio track, transcribe with Whisper | Lectures, meetings, interviews |
| `video:frames` | Sample frames, analyze with vision LLM | Presentations, visual tutorials (worker-only) |
The `video:transcribe` extractor handles most use cases. It pulls the audio track from video files and runs it through the same Whisper-based transcription as audio files. The transcript becomes searchable text.
The `video:frames` extractor is more specialized. It's a worker-only pipeline that extracts frames at intervals and sends them to a vision-capable LLM for analysis. This captures visual content but requires more infrastructure and costs more per video.
## Choosing an approach
For the majority of videos, transcription is sufficient and significantly cheaper. Spoken content tends to be the primary information carrier in most video formats—the presenter explains what's on screen, the speaker describes their process, the interviewer asks questions.
Frame analysis adds value when visual content is genuinely different from spoken content. A screencast where the presenter says "and here you can see the error" benefits from frame analysis that captures what error is actually shown. A talking-head interview gains little from analyzing frames of someone's face.
You can also combine approaches. Run transcription for everything, and selectively run frame analysis on videos where visual content matters. The chunks from each extractor are tagged with their source, so retrieval results show which method surfaced each match.
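In practice, combining them means registering both extractors. A minimal sketch, assuming the video-frames factory mirrors the naming of the transcribe one (check the file generated by the add command for the exact export):

```ts
import { createVideoTranscribeExtractor } from "./lib/unrag/extractors/video-transcribe";
// Assumed name and path; installation of the video-frames extractor is covered below.
import { createVideoFramesExtractor } from "./lib/unrag/extractors/video-frames";

export const unrag = defineUnragConfig({
  // ...
  engine: {
    // ...
    extractors: [
      createVideoTranscribeExtractor(), // transcribe every video
      createVideoFramesExtractor(),     // frames run only where enabled (see Configuration)
    ],
  },
} as const);
```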
## Installation
The easiest way to install video extractors is during setup:
```bash
bunx unrag@latest init --rich-media
```

Select `video-transcribe` and/or `video-frames` from the list. If you've already run `init`, you can re-run with `--rich-media` to add video support.
### Manual installation
Install the extractor that fits your needs:
```bash
# For audio transcription (most common)
bunx unrag@latest add extractor video-transcribe

# For frame analysis (worker-only, requires ffmpeg)
bunx unrag@latest add extractor video-frames
```

Register in your config and enable the corresponding `assetProcessing.video.*` flags:
```ts
import { createVideoTranscribeExtractor } from "./lib/unrag/extractors/video-transcribe";

export const unrag = defineUnragConfig({
  // ...
  engine: {
    // ...
    extractors: [createVideoTranscribeExtractor()],
  },
} as const);
```

## Configuration
Video processing settings live under `assetProcessing.video`:
```ts
export const unrag = defineUnragConfig({
  // ...
  engine: {
    // ...
    assetProcessing: {
      video: {
        transcription: {
          enabled: true,
          maxDurationSec: 3600,
          maxBytes: 100 * 1024 * 1024,
        },
        frames: {
          enabled: false, // Worker-only
          intervalSec: 30,
          maxFrames: 60,
        },
      },
    },
  },
} as const);
```

Transcription settings mirror audio transcription: duration limits, file size caps, language hints. Frame extraction settings control how densely to sample the video and how many frames to analyze; with the defaults above, a frame every 30 seconds capped at 60 frames covers at most roughly the first 30 minutes of a longer video.
## Video file handling
Video files are large. A one-hour recording might be several gigabytes. The `maxBytes` setting prevents accidentally processing huge files, but you should also think about where video bytes come from.
For videos stored in cloud storage (S3, GCS), passing a signed URL lets Unrag fetch the file directly rather than loading it into your application's memory first. This is especially important in memory-constrained serverless environments.
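As an illustration, with S3 you might generate a short-lived presigned URL using the AWS SDK (the bucket and key below are placeholders, not part of Unrag):

```ts
import { S3Client, GetObjectCommand } from "@aws-sdk/client-s3";
import { getSignedUrl } from "@aws-sdk/s3-request-presigner";

const s3 = new S3Client({ region: "us-east-1" });

// Short-lived URL that Unrag can fetch directly from storage.
const presignedUrl = await getSignedUrl(
  s3,
  new GetObjectCommand({ Bucket: "training-videos", Key: "onboarding.mp4" }),
  { expiresIn: 3600 }, // seconds
);
```

The ingest call can then reference that URL: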
```ts
await engine.ingest({
  sourceId: "training:onboarding-video",
  content: "New employee onboarding",
  assets: [
    {
      assetId: "onboarding-mp4",
      kind: "video",
      data: {
        kind: "url",
        url: presignedUrl, // Fetch directly from storage
        mediaType: "video/mp4",
      },
    },
  ],
});
```

## Worker considerations
Video processing takes time. Transcribing a 30-minute video might take a minute or more. Frame extraction with LLM analysis takes even longer. For serverless environments, this usually means background jobs.
The `video:frames` extractor has a harder requirement: it needs `ffmpeg` installed, which isn't available in standard serverless runtimes. This extractor only makes sense in a worker environment with native binary support.
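One way to structure this is to enqueue a job from the request path and run the ingest in a worker. The sketch below assumes a generic job queue; `enqueue` and the job wiring are placeholders for whatever queue you use (Unrag does not ship one), and `engine` is your configured Unrag engine:

```ts
// Placeholders: wire these to your own queue and your Unrag engine instance.
declare function enqueue(job: string, payload: { sourceId: string; videoUrl: string }): Promise<void>;
declare const engine: { ingest(input: unknown): Promise<unknown> };

// Called from the upload path: respond quickly, defer the slow work.
export async function onVideoUploaded(sourceId: string, videoUrl: string) {
  await enqueue("ingest-video", { sourceId, videoUrl });
}

// Runs in the worker, where long-running jobs (and ffmpeg, if frames are enabled) are available.
export async function ingestVideoJob({ sourceId, videoUrl }: { sourceId: string; videoUrl: string }) {
  await engine.ingest({
    sourceId,
    content: "Uploaded video",
    assets: [
      {
        assetId: `${sourceId}-video`,
        kind: "video",
        data: { kind: "url", url: videoUrl, mediaType: "video/mp4" },
      },
    ],
  });
}
```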
See the Next.js Production Recipe for patterns around background video processing.
