# Audio Extractors
Make audio content searchable through transcription.
Audio files—podcasts, meeting recordings, voice memos—contain valuable information that's invisible to text search. The solution is transcription: convert speech to text, then chunk and embed that text like any other content.
Unrag's audio extraction sends audio files to a speech-to-text service (OpenAI's Whisper by default) and processes the resulting transcript through your normal ingestion pipeline. The text becomes fully searchable, and queries about topics discussed in the recording surface the relevant segments.
## The transcription approach
Audio extraction is fundamentally a text extraction problem. Unlike images, where you might embed visual features directly, audio search works through text proxies. Someone searching "quarterly revenue" finds the meeting recording because those words appear in the transcript.
This means transcription quality matters a lot. A good transcript captures not just words but speaker changes, pauses, and context. A poor one produces garbled text that won't match queries. The default Whisper-based extractor handles most audio well, but very noisy recordings or heavy accents may need preprocessing or a specialized model.
| Extractor | How it works | Best for |
|---|---|---|
| audio:transcribe | Whisper-based transcription | Meetings, podcasts, voice recordings |
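If a recording is on the noisy end and the transcript comes back garbled, a light preprocessing pass before ingestion is one option. A minimal sketch, assuming `ffmpeg` is installed on the machine doing the ingestion; the filter choices below are illustrative starting points, not tuned values:

```ts
import { execFile } from "node:child_process";
import { promisify } from "node:util";

const run = promisify(execFile);

// Downmix to mono, resample to 16 kHz, and normalize loudness before
// handing the file to the transcription extractor. Adjust (or drop)
// filters to suit your recordings.
await run("ffmpeg", [
  "-y",
  "-i", "./recordings/noisy-call.mp3",
  "-ac", "1",        // mono
  "-ar", "16000",    // 16 kHz sample rate
  "-af", "loudnorm", // EBU R128 loudness normalization
  "./recordings/noisy-call.cleaned.mp3",
]);
```

Ingest the cleaned file's bytes exactly as in the usage example below. Skip this step when recordings already transcribe well; it mainly helps with quiet or uneven audio and keeps file sizes down.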
## Installation
The easiest way to install audio transcription is during setup:
```bash
bunx unrag@latest init --rich-media
```

Select `audio-transcribe` from the list and the CLI handles everything. If you've already run `init`, you can re-run it with `--rich-media` to add audio support.
### Manual installation
```bash
bunx unrag@latest add extractor audio-transcribe
```

Register the extractor in your config and enable `assetProcessing.audio.transcription.enabled`:
```ts
import { createAudioTranscribeExtractor } from "./lib/unrag/extractors/audio-transcribe";

export const unrag = defineUnragConfig({
  // ...
  engine: {
    // ...
    extractors: [createAudioTranscribeExtractor()],
  },
} as const);
```

## Configuration
Audio transcription is controlled via `assetProcessing.audio.transcription`:
```ts
export const unrag = defineUnragConfig({
  // ...
  engine: {
    // ...
    assetProcessing: {
      audio: {
        transcription: {
          enabled: true,
          model: "whisper-1",
          maxDurationSec: 3600,
          maxBytes: 25 * 1024 * 1024,
        },
      },
    },
  },
} as const);
```

The `maxDurationSec` setting is important for cost control. Long recordings produce long transcripts, which means more chunks and more embedding calls. For a two-hour meeting, you might want to split the audio or accept the higher processing cost.
## Usage example
```ts
import { readFile } from "node:fs/promises";

const audioBytes = await readFile("./recordings/team-standup.mp3");

const result = await engine.ingest({
  sourceId: "meetings:standup-2024-01-15",
  content: "Daily standup meeting",
  assets: [
    {
      assetId: "standup-audio",
      kind: "audio",
      data: {
        kind: "bytes",
        bytes: new Uint8Array(audioBytes),
        mediaType: "audio/mpeg",
      },
    },
  ],
});
```

After ingestion, queries like "what did the team decide about the API deadline" can surface chunks from the meeting transcript.
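If a recording runs past the configured `maxDurationSec`, one option mentioned earlier is to split it and ingest the pieces as separate assets. A minimal sketch, assuming the segments already exist on disk (for example, cut ahead of time with ffmpeg's segment muxer); the directory layout, file names, and source ID are illustrative:

```ts
import { readdir, readFile } from "node:fs/promises";
import { join } from "node:path";

// Illustrative: segments cut ahead of time, e.g.
//   ffmpeg -i all-hands.mp3 -f segment -segment_time 1800 -c copy part-%03d.mp3
const segmentDir = "./recordings/all-hands-segments";
const segmentFiles = (await readdir(segmentDir))
  .filter((name) => name.endsWith(".mp3"))
  .sort();

await engine.ingest({
  sourceId: "meetings:all-hands-2024-01-15",
  content: "Monthly all-hands meeting, split into 30-minute segments",
  assets: await Promise.all(
    segmentFiles.map(async (name, index) => ({
      assetId: `all-hands-part-${index}`,
      kind: "audio" as const,
      data: {
        kind: "bytes" as const,
        bytes: new Uint8Array(await readFile(join(segmentDir, name))),
        mediaType: "audio/mpeg",
      },
    })),
  ),
});
```

Passing several assets in one call is an inference from the `assets` array shape above; ingesting each segment under its own `sourceId` is another option if you'd rather treat them as separate documents.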
## Worker considerations
Audio transcription can be slow—processing a one-hour recording takes meaningful time. For serverless environments with strict timeouts, consider running transcription in a background job. The Next.js Production Recipe covers patterns for handling long-running extractions.
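One common shape, shown as a minimal sketch below: accept the upload quickly, persist the file, and let a background worker call `engine.ingest` with no request timeout hanging over it. The storage and queue helpers (`saveToStorage`, `loadFromStorage`, `enqueue`) are placeholders for whatever you actually use (S3 or blob storage, BullMQ, Inngest, a cron-polled table, and so on):

```ts
// Placeholders for your storage and queue layers -- swap in your own.
declare function saveToStorage(file: { name: string; bytes: Uint8Array }): Promise<string>;
declare function loadFromStorage(path: string): Promise<Uint8Array>;
declare function enqueue(job: string, payload: unknown): Promise<void>;

// Request handler: store the file and defer the heavy work.
export async function handleUpload(file: { name: string; bytes: Uint8Array }) {
  const storedPath = await saveToStorage(file);
  await enqueue("transcribe-audio", {
    storedPath,
    sourceId: `uploads:${file.name}`,
  });
  return { status: "queued" };
}

// Worker: runs outside the request/response cycle, so a one-hour
// recording can transcribe at its own pace. `engine` comes from your
// unrag config, as in the usage example above.
export async function transcribeAudioJob(payload: {
  storedPath: string;
  sourceId: string;
}) {
  const bytes = await loadFromStorage(payload.storedPath);
  await engine.ingest({
    sourceId: payload.sourceId,
    content: "Uploaded audio recording",
    assets: [
      {
        assetId: `${payload.sourceId}:audio`,
        kind: "audio",
        data: {
          kind: "bytes",
          bytes,
          // In practice, carry the real media type through the job payload.
          mediaType: "audio/mpeg",
        },
      },
    ],
  });
}
```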
