file:pptx Extractor

The file:pptx extractor reads PowerPoint presentations and extracts their text content. It walks through each slide, pulling text from titles, body content, tables, and text boxes. Speaker notes can be included or excluded depending on your needs. The extracted text becomes searchable content.

Presentations often contain concentrated information—the key points of a project, the summary of a quarter's results, the steps of a process. Making them searchable helps surface this content when it matters.

Installation

bunx unrag@latest add extractor file-pptx

import { createFilePptxExtractor } from "./lib/unrag/extractors/file-pptx";

export const unrag = defineUnragConfig({
  // ...
  engine: {
    // ...
    extractors: [createFilePptxExtractor()],
  },
} as const);

What gets extracted

The extractor processes slides in order, extracting text from each content element: titles, subtitles, bullet points, paragraphs, table cells, and text boxes. The output preserves slide boundaries, so chunking tends to produce segments aligned with individual slides.

Speaker notes are extracted by default. Notes often hold valuable context: what the presenter planned to say aloud but did not put on the slide itself. If your organization's notes contain internal commentary you don't want searchable, you can disable note extraction with the includeNotes option.

What isn't extracted: images, charts, diagrams, and animations. Presentations often rely heavily on visual content, which this extractor doesn't capture. For slide decks where diagrams and charts carry the information, consider exporting to PDF and using the pdf:llm extractor for visual analysis.

Configuration

export const unrag = defineUnragConfig({
  // ...
  engine: {
    // ...
    assetProcessing: {
      file: {
        pptx: {
          enabled: true,
          maxBytes: 100 * 1024 * 1024,
          maxOutputChars: 500_000,
          includeNotes: true,
        },
      },
    },
  },
} as const);

maxBytes caps the size of the input file, in bytes (100 * 1024 * 1024, or 100 MiB, in the configuration above). Presentations with embedded videos or many high-resolution images can grow large.

maxOutputChars caps the length of the extracted text, in characters; output from very long presentations is truncated at this limit. Most slide decks stay well under it.

includeNotes controls whether speaker notes are extracted. Set to false if notes contain content that shouldn't be searchable.
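If you want to detect oversized decks before handing them to the engine, for example to log or skip them, a pre-flight check along these lines may help. This is a minimal sketch: checkPptxSize and MAX_PPTX_BYTES are hypothetical names, not part of unrag, and treating the limit as inclusive is an assumption. The source does not say how the extractor itself handles a file that exceeds maxBytes.

```typescript
// Hypothetical pre-flight helper; not part of unrag's API.
// Mirrors the maxBytes value from the configuration above.
const MAX_PPTX_BYTES = 100 * 1024 * 1024; // 104,857,600 bytes (100 MiB)

type SizeCheck = { ok: true } | { ok: false; reason: string };

// Returns ok when the deck fits within the limit.
// Assumption: a file exactly at the limit is accepted.
function checkPptxSize(
  bytes: Uint8Array,
  maxBytes: number = MAX_PPTX_BYTES,
): SizeCheck {
  if (bytes.byteLength > maxBytes) {
    return {
      ok: false,
      reason: `deck is ${bytes.byteLength} bytes, limit is ${maxBytes}`,
    };
  }
  return { ok: true };
}
```

Calling checkPptxSize on the bytes you read from disk, before engine.ingest, lets you decide what to do with a deck that would otherwise hit the limit.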

Usage example

import { readFile } from "node:fs/promises";

const pptxBytes = await readFile("./presentations/roadmap.pptx");

await engine.ingest({
  sourceId: "planning:roadmap-2024",
  content: "2024 Product Roadmap",
  assets: [
    {
      assetId: "roadmap-slides",
      kind: "file",
      data: {
        kind: "bytes",
        bytes: new Uint8Array(pptxBytes),
        mediaType: "application/vnd.openxmlformats-officedocument.presentationml.presentation",
        filename: "roadmap.pptx",
      },
    },
  ],
});

The output structure

The extracted text maintains slide organization. A typical extraction looks like:

--- Slide 1 ---
Q4 Planning
Engineering Roadmap 2024

--- Slide 2 ---
Key Priorities
Launch v2.0 API
Mobile app beta
Performance improvements

Notes: Remember to discuss the timeline shift

This structure helps the chunker create meaningful segments. A search result from a presentation points to specific slides, not just somewhere in the deck.
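To illustrate how the slide markers make per-slide segmentation straightforward, here is a minimal sketch that splits extracted text on them. It assumes the marker format is exactly the "--- Slide N ---" line shown in the sample output; splitSlides and SlideSegment are hypothetical names for illustration and are not how unrag's chunker is implemented.

```typescript
interface SlideSegment {
  slide: number; // slide number parsed from the marker line
  text: string; // everything between this marker and the next, trimmed
}

// Splits extracted text into one segment per slide.
// Assumption: markers look like "--- Slide 3 ---" on their own line.
function splitSlides(extracted: string): SlideSegment[] {
  const marker = /^--- Slide (\d+) ---$/;
  const segments: SlideSegment[] = [];
  let current: { slide: number; lines: string[] } | null = null;

  for (const line of extracted.split(/\r?\n/)) {
    const match = marker.exec(line.trim());
    if (match) {
      // A new marker closes the previous slide, if any.
      if (current) {
        segments.push({
          slide: current.slide,
          text: current.lines.join("\n").trim(),
        });
      }
      current = { slide: Number(match[1]), lines: [] };
    } else if (current) {
      current.lines.push(line);
    }
    // Lines before the first marker are ignored.
  }

  if (current) {
    segments.push({
      slide: current.slide,
      text: current.lines.join("\n").trim(),
    });
  }
  return segments;
}
```

Applied to the sample output above, this yields two segments, with the speaker note kept as part of the second slide's text.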

Speaker notes

Speaker notes often contain the "why" behind the slides—explanations the presenter planned to give, context that didn't fit on the slide, reminders about points to emphasize. Including them in extraction makes this context searchable.

But notes can also contain things you don't want searchable: private reminders, draft ideas, commentary about the audience. Review a sample of your organization's presentations before deciding on the includeNotes setting.
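When reviewing a sample of decks, it can help to pull out only the note lines so you can skim them. The sketch below does that on already-extracted text. It assumes notes appear on a line starting with "Notes:" and slides are delimited by "--- Slide N ---", as in the sample output; multi-line notes would need a different approach. collectNotes and NoteLine are hypothetical names, not part of unrag.

```typescript
interface NoteLine {
  slide: number | null; // null if the note appears before any slide marker
  note: string; // note text with the "Notes:" prefix removed
}

// Collects speaker-note lines from extracted text for manual review.
// Assumption: each note is a single line starting with "Notes:".
function collectNotes(extracted: string): NoteLine[] {
  const marker = /^--- Slide (\d+) ---$/;
  const prefix = "Notes:";
  const notes: NoteLine[] = [];
  let slide: number | null = null;

  for (const raw of extracted.split(/\r?\n/)) {
    const line = raw.trim();
    const match = marker.exec(line);
    if (match) {
      slide = Number(match[1]); // track which slide we are in
    } else if (line.startsWith(prefix)) {
      notes.push({ slide, note: line.slice(prefix.length).trim() });
    }
  }
  return notes;
}
```

If the collected notes look like private reminders or audience commentary, that is a signal to set includeNotes to false.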

Visual content limitations

Presentations lean heavily on visual communication. Diagrams, charts, photos, and icons carry meaning that text extraction misses entirely. A slide showing a process flowchart yields only its text labels; the relationships between the steps are lost.

For presentations where visual content matters, consider combining two passes: ingest the .pptx file so the text that is there becomes searchable, and separately process a PDF export of the same deck through the pdf:llm extractor to capture visual information.
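The combined approach described above could be sketched as a single ingest call carrying both files. The helper below only builds the assets array, using the asset shape from the usage example. buildDeckAssets and the asset ID scheme are hypothetical, and kind: "pdf" for the second asset is an assumption about how PDF assets are declared (the source only shows kind: "file"); check it against your installed version. It also assumes the pdf:llm extractor is installed and enabled.

```typescript
type DeckAsset = {
  assetId: string;
  kind: "file" | "pdf"; // "pdf" is an assumption; the source only shows "file"
  data: {
    kind: "bytes";
    bytes: Uint8Array;
    mediaType: string;
    filename: string;
  };
};

const PPTX_MEDIA_TYPE =
  "application/vnd.openxmlformats-officedocument.presentationml.presentation";

// Builds one asset for the original deck and one for its PDF export.
function buildDeckAssets(
  baseName: string,
  pptxBytes: Uint8Array,
  pdfBytes: Uint8Array,
): DeckAsset[] {
  return [
    {
      assetId: `${baseName}-slides`,
      kind: "file",
      data: {
        kind: "bytes",
        bytes: pptxBytes,
        mediaType: PPTX_MEDIA_TYPE,
        filename: `${baseName}.pptx`,
      },
    },
    {
      assetId: `${baseName}-slides-pdf`,
      kind: "pdf",
      data: {
        kind: "bytes",
        bytes: pdfBytes,
        mediaType: "application/pdf",
        filename: `${baseName}.pdf`,
      },
    },
  ];
}

// Usage, with engine set up as in the usage example:
// await engine.ingest({
//   sourceId: "planning:roadmap-2024",
//   content: "2024 Product Roadmap",
//   assets: buildDeckAssets("roadmap", pptxBytes, pdfBytes),
// });
```

Keeping both assets under one sourceId means text hits and visual-analysis hits point back to the same source.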