audio:transcribe Extractor
Transcribe audio files using Whisper and embed the transcript.
The audio:transcribe extractor converts speech to text using OpenAI's Whisper model. Audio bytes are sent to the Whisper API, which returns a text transcript. That transcript then flows through your normal chunking and embedding pipeline, making the spoken content searchable.
Whisper handles a wide range of audio: podcasts, meeting recordings, phone calls, voice memos. It's robust to background noise and can transcribe multiple languages. For most audio content, you can send it through without preprocessing.
Installation
```bash
bunx unrag@latest add extractor audio-transcribe
```

Then register the extractor:

```ts
import { createAudioTranscribeExtractor } from "./lib/unrag/extractors/audio-transcribe";
export const unrag = defineUnragConfig({
// ...
engine: {
// ...
extractors: [createAudioTranscribeExtractor()],
},
} as const);
```

How it works
When an audio asset arrives, the extractor uploads the audio bytes to the Whisper API. Whisper processes the entire file and returns a transcript. That transcript becomes the text content that gets chunked and embedded.
The transcript inherits the asset's metadata, so chunks from audio files are tagged with metadata.extractor: "audio:transcribe". This lets you distinguish spoken content from written content in your retrieval results.
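For example, you could keep only spoken content when post-filtering retrieval results. The sketch below assumes a retrieval call that returns chunks carrying the metadata fields described above; the method name (engine.retrieve) and result shape are illustrative, not a guaranteed API:

```ts
// Illustrative sketch: the retrieval method name and result shape depend on
// your unrag setup; only the metadata fields are documented in this guide.
const results = await engine.retrieve({ query: "what did we decide about budgets?" });

// Keep chunks that came from transcribed audio rather than written documents.
const spokenChunks = results.chunks.filter(
  (chunk) => chunk.metadata?.extractor === "audio:transcribe",
);
```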
Configuration
Enable and configure transcription in your engine config:

```ts
export const unrag = defineUnragConfig({
// ...
engine: {
// ...
assetProcessing: {
audio: {
transcription: {
enabled: true,
model: "whisper-1",
maxDurationSec: 3600,
maxBytes: 25 * 1024 * 1024,
language: "en",
},
},
},
},
} as const);
```

model specifies the Whisper model. Currently whisper-1 is the only option through OpenAI's API.
maxDurationSec sets the maximum audio duration to process. Whisper can handle long recordings, but cost and latency scale with duration, so 3600 seconds (one hour) is a reasonable default. Longer files are skipped with a warning.
maxBytes caps the file size. Whisper accepts up to 25MB per request.
language hints at the audio's language. If omitted, Whisper auto-detects. Specifying the language can improve accuracy for non-English content.
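A pre-flight size check can catch oversized files before they reach the extractor, since Whisper rejects requests over 25MB. This is a sketch using Node's fs module, with MAX_AUDIO_BYTES mirroring the maxBytes value configured above:

```ts
import { stat } from "node:fs/promises";

// Sketch: check the file against the 25MB Whisper limit before ingesting.
// MAX_AUDIO_BYTES mirrors the maxBytes value configured above.
const MAX_AUDIO_BYTES = 25 * 1024 * 1024;

const { size } = await stat("./recordings/product-review.mp3");
if (size > MAX_AUDIO_BYTES) {
  console.warn(`Audio is ${size} bytes; compress or split it before ingesting.`);
}
```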
Supported formats
Whisper accepts most common audio formats: mp3, mp4, m4a, wav, webm, and more. The file's media type should be set correctly in the asset data, but Whisper is generally good at detecting format from the bytes.
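If you're picking files up from disk or a feed, a small extension-to-media-type lookup keeps the mediaType field accurate. This helper is illustrative, not something unrag provides:

```ts
import { extname } from "node:path";

// Illustrative helper: map common audio extensions to media types.
// Extend the table for whatever formats your sources use.
const AUDIO_MEDIA_TYPES: Record<string, string> = {
  ".mp3": "audio/mpeg",
  ".mp4": "audio/mp4",
  ".m4a": "audio/mp4",
  ".wav": "audio/wav",
  ".webm": "audio/webm",
};

function audioMediaType(filePath: string): string | undefined {
  return AUDIO_MEDIA_TYPES[extname(filePath).toLowerCase()];
}
```

Pass the result as the asset's mediaType; if the lookup misses, fall back to a sensible default or skip the file.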
Usage example
Here's a typical flow for ingesting a meeting recording:

```ts
import { readFile } from "node:fs/promises";
const meetingAudio = await readFile("./recordings/product-review.mp3");
const result = await engine.ingest({
sourceId: "meetings:product-review-jan",
content: "January product review meeting",
assets: [
{
assetId: "meeting-audio",
kind: "audio",
data: {
kind: "bytes",
bytes: new Uint8Array(meetingAudio),
mediaType: "audio/mpeg",
},
},
],
});
console.log(`Created ${result.chunkCount} chunks from transcript`);
```

For audio from URLs (maybe from a podcast RSS feed or cloud storage):

```ts
await engine.ingest({
sourceId: "podcast:episode-42",
content: "Episode 42: The Future of AI",
assets: [
{
assetId: "episode-42-audio",
kind: "audio",
data: {
kind: "url",
url: "https://cdn.example.com/podcasts/ep42.mp3",
mediaType: "audio/mpeg",
},
},
],
});
```

What gets stored
Each chunk from the transcript includes:
| Field | Content |
|---|---|
| chunk.content | Portion of the transcript text |
| chunk.metadata.assetKind | "audio" |
| chunk.metadata.assetId | Your provided ID |
| chunk.metadata.extractor | "audio:transcribe" |
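Expressed as a TypeScript shape (illustrative only; the library's exported types may differ), a transcript chunk looks roughly like this:

```ts
// Illustrative shape of a transcript chunk; not the library's exported types.
interface TranscriptChunk {
  content: string; // a portion of the transcript text
  metadata: {
    assetKind: "audio";
    assetId: string; // the assetId you provided at ingestion
    extractor: "audio:transcribe";
    [key: string]: unknown; // plus metadata inherited from the asset
  };
}
```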
Cost and latency
Whisper pricing is based on audio duration, not transcript length. A one-minute file costs the same whether the speaker talks fast or slow. Check OpenAI's pricing page for current rates.
Transcription latency scales roughly linearly with duration. A five-minute recording might take 10-20 seconds to transcribe. For long recordings in production systems, consider using background jobs rather than blocking request handlers.
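Because cost tracks duration, a back-of-the-envelope estimate only needs the recording length and the current per-minute rate. The rate below is a placeholder; look up the real number on OpenAI's pricing page:

```ts
// Sketch: estimate transcription cost from audio duration alone.
// The per-minute rate is a placeholder; check OpenAI's pricing page.
function estimateTranscriptionCost(durationSec: number, ratePerMinute: number): number {
  return (durationSec / 60) * ratePerMinute;
}

// A one-hour meeting at a hypothetical $0.01/min:
// estimateTranscriptionCost(3600, 0.01) -> 0.6 dollars
```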
Handling long recordings
Very long recordings (multi-hour meetings, full podcast episodes) can strain serverless timeouts and produce very long transcripts. A few strategies help:
- Split before ingestion if possible. If your source provides chapter markers or natural break points, ingest them as separate assets (see the sketch after this list). This also makes retrieval results more precise: finding "the part about budgets" is easier when chapters are separate.
- Accept the processing time for batch jobs. If you're ingesting a library of recordings overnight, long transcription times may be acceptable.
- Use a worker environment for processing. The Next.js Production Recipe covers setting up background jobs for heavy extraction work.
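Here's a rough sketch of the first strategy. It assumes your source exposes chapter boundaries and that you've already cut the audio into per-chapter files with your own tooling (e.g. ffmpeg); the chapter list, file layout, and per-chapter sourceId scheme are assumptions, not part of unrag:

```ts
import { readFile } from "node:fs/promises";

// Sketch: ingest each pre-split chapter as its own asset so retrieval can
// point at "the part about budgets" rather than a multi-hour transcript.
// `chapters` and the file paths are assumptions about your source data.
const chapters = [
  { id: "intro", file: "./recordings/ep42/01-intro.mp3", title: "Intro" },
  { id: "budgets", file: "./recordings/ep42/02-budgets.mp3", title: "Budgets" },
];

for (const chapter of chapters) {
  const bytes = await readFile(chapter.file);
  await engine.ingest({
    sourceId: `podcast:episode-42:${chapter.id}`,
    content: `Episode 42: ${chapter.title}`,
    assets: [
      {
        assetId: `episode-42-${chapter.id}`,
        kind: "audio",
        data: { kind: "bytes", bytes: new Uint8Array(bytes), mediaType: "audio/mpeg" },
      },
    ],
  });
}
```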
Troubleshooting
If transcription produces empty or poor results, check that the audio is clear enough for speech recognition. Very noisy recordings, heavy accents, or multiple overlapping speakers can challenge Whisper. You might need to preprocess the audio (noise reduction, speaker separation) before ingestion.
If processing times out, the audio may exceed your configured limits or your infrastructure's timeout. On serverless platforms, consider a lower maxDurationSec or moving transcription to background processing.
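If you want to catch over-long files before they hit the extractor, you can probe duration locally. This sketch uses the third-party music-metadata package, which is an assumption about your stack, not something unrag ships:

```ts
import { readFile } from "node:fs/promises";
import { parseBuffer } from "music-metadata";

// Sketch: reject audio longer than the configured maxDurationSec before ingesting.
// MAX_DURATION_SEC mirrors the config above.
const MAX_DURATION_SEC = 3600;

const bytes = await readFile("./recordings/all-hands.mp3");
const { format } = await parseBuffer(bytes, "audio/mpeg");

if ((format.duration ?? 0) > MAX_DURATION_SEC) {
  console.warn("Recording exceeds maxDurationSec; split it or raise the limit.");
}
```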
