# Notion with Rich Media

Sync Notion pages that include PDFs, images, and other embeds.
Notion pages often include more than just text—embedded PDFs, images, file attachments, audio clips. This example shows how to sync Notion content with full rich media support.
## What you'll build
A sync script that:
- Connects to your Notion workspace
- Syncs pages with all their embedded media
- Extracts text from PDFs using an LLM
- Embeds images (or their captions)
- Makes everything searchable
## Prerequisites
- A Notion integration with access to your pages
- Unrag set up with the Notion connector (`npx unrag@latest add connector notion`)
- A database adapter configured
## Configuration
First, ensure your config enables PDF extraction:
```ts
// unrag.config.ts
import { defineUnragConfig } from "./lib/unrag/core";

export const unrag = defineUnragConfig({
  defaults: {
    chunking: { chunkSize: 200, chunkOverlap: 40 },
    retrieval: { topK: 8 },
  },
  embedding: {
    provider: "ai",
    config: {
      type: "text", // or "multimodal" for direct image embedding
      model: "openai/text-embedding-3-small",
    },
  },
  engine: {
    assetProcessing: {
      onUnsupportedAsset: "skip", // Skip audio/video for now
      onError: "skip", // Don't fail on individual asset errors
      pdf: {
        llmExtraction: {
          enabled: true,
          model: "google/gemini-2.0-flash",
        },
      },
      fetch: {
        allowedHosts: [
          // Notion's asset CDN
          "prod-files-secure.s3.us-west-2.amazonaws.com",
        ],
      },
    },
  },
} as const);

// ... createUnragEngine setup
```

## The sync script
```ts
// scripts/sync-notion.ts
import { createUnragEngine } from "../unrag.config";
import { notionConnector } from "./lib/unrag/connectors/notion";

async function main() {
  const engine = createUnragEngine();
  console.log("Syncing Notion pages with rich media...\n");

  const stream = notionConnector.streamPages({
    token: process.env.NOTION_TOKEN!,
    pageIds: [
      // Your page IDs
      "abc123...",
      "def456...",
    ],
  });

  let pageCount = 0;

  const result = await engine.runConnectorStream({
    stream,
    onEvent: (event) => {
      if (event.type === "progress" && event.message === "page:success") {
        pageCount++;
        console.log(`✓ Synced ${event.sourceId}`);
      }
      if (event.type === "warning") {
        console.warn(`⚠ [${event.code}] ${event.message}`);
      }
    },
  });

  console.log(`\n✓ Synced ${result.upserts} pages`);
}

main().catch(console.error);
```

## What gets extracted
The Notion connector automatically extracts these block types as assets:
| Notion Block | Asset Kind | Handling |
|---|---|---|
| Image | image | Direct embedding or caption |
| PDF | pdf | LLM extraction |
| File | file | Skipped in v1 |
| Audio | audio | Skipped in v1 |
| Video | video | Skipped in v1 |
For each asset, the connector captures:
- URL: The Notion CDN URL for the file
- Caption: Any caption text below the block
- Block ID: Used as the stable `assetId`
- Media type: Inferred from the block or URL
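Concretely, an extracted asset might look like the following sketch. The interface and field names here are illustrative assumptions based on the list above, not the connector's exact types:

```typescript
// Illustrative shape of an extracted asset (assumed field names,
// not the connector's exact types).
interface NotionAsset {
  assetId: string; // the Notion block ID (stable across syncs)
  kind: "image" | "pdf" | "file" | "audio" | "video";
  url: string; // the Notion CDN URL for the file
  caption?: string; // caption text below the block, if any
  mediaType?: string; // inferred from the block or URL
}

// A hypothetical PDF block as the connector might capture it:
const example: NotionAsset = {
  assetId: "block_1a2b3c",
  kind: "pdf",
  url: "https://prod-files-secure.s3.us-west-2.amazonaws.com/workspace/report.pdf",
  caption: "Q3 financial report",
  mediaType: "application/pdf",
};
```

Because `assetId` is derived from the block ID, re-syncing the same page updates assets in place instead of duplicating them.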
## Enabling multimodal for images
If you want images embedded directly (not just captions), switch to a multimodal embedding model:
```ts
export const unrag = defineUnragConfig({
  // ... rest of config
  embedding: {
    provider: "ai",
    config: {
      type: "multimodal",
      model: "cohere/embed-v4.0",
    },
  },
} as const);
```

Now a query like "architecture diagram" can find actual diagrams, not just pages that mention them.
## Handling sync errors
The streaming model makes error handling natural. Warnings are emitted as events rather than throwing exceptions:
```ts
const warnings: Array<{ code: string; message: string }> = [];

const stream = notionConnector.streamPages({
  token: process.env.NOTION_TOKEN!,
  pageIds,
});

const result = await engine.runConnectorStream({
  stream,
  onEvent: (event) => {
    if (event.type === "warning") {
      warnings.push({ code: event.code, message: event.message });
    }
  },
});

if (warnings.length > 0) {
  console.error("Sync completed with warnings:");
  for (const w of warnings) {
    console.error(`  [${w.code}] ${w.message}`);
  }
}
```

With `onError: "skip"` in your asset processing config, individual asset failures won't stop the sync. Check your logs for skipped assets.
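If a large workspace produces many repeated warnings, a compact per-code summary can be easier to scan than one line per warning. This is plain application code layered on top of the collected array, not an Unrag API:

```typescript
// Group collected warnings by code so repeated issues show up as counts
// instead of hundreds of individual lines.
function summarizeWarnings(
  warnings: Array<{ code: string; message: string }>,
): Map<string, number> {
  const counts = new Map<string, number>();
  for (const w of warnings) {
    counts.set(w.code, (counts.get(w.code) ?? 0) + 1);
  }
  return counts;
}

// Example with hypothetical warning codes:
const summary = summarizeWarnings([
  { code: "asset:skipped", message: "audio block skipped" },
  { code: "asset:skipped", message: "video block skipped" },
  { code: "asset:fetch-failed", message: "403 from CDN" },
]);
// summary.get("asset:skipped") === 2
```

Printing the summary at the end of a sync run gives a quick signal of whether failures are isolated or systematic (for example, a CDN host missing from `allowedHosts`).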
## Querying the results
After syncing, query across all content types:
```ts
const engine = createUnragEngine();

// This finds content from text, PDFs, and images
const result = await engine.retrieve({
  query: "quarterly revenue projections",
  topK: 10,
});

for (const chunk of result.chunks) {
  const source = chunk.metadata.extractor || "text";
  console.log(`[${source}] ${chunk.content.slice(0, 100)}...`);
}
```

## Incremental sync with checkpoints
For production, persist checkpoints so interrupted syncs can resume:
```ts
import { createUnragEngine } from "../unrag.config";
import { notionConnector } from "./lib/unrag/connectors/notion";

// loadCheckpoint/saveCheckpoint are your own persistence helpers
async function incrementalSync(tenantId: string, pageIds: string[]) {
  const engine = createUnragEngine();
  const lastCheckpoint = await loadCheckpoint(tenantId);

  const stream = notionConnector.streamPages({
    token: process.env.NOTION_TOKEN!,
    pageIds,
    checkpoint: lastCheckpoint,
  });

  const result = await engine.runConnectorStream({
    stream,
    onCheckpoint: async (checkpoint) => {
      await saveCheckpoint(tenantId, checkpoint);
    },
    onEvent: (event) => {
      if (event.type === "progress" && event.message === "page:success") {
        console.log(`✓ ${event.sourceId}`);
      }
    },
  });

  console.log(`Synced ${result.upserts} pages`);
  return result;
}
```

If the sync times out, the next invocation picks up exactly where it left off.
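The `loadCheckpoint` and `saveCheckpoint` helpers are yours to implement. A minimal file-based sketch, assuming checkpoints are JSON-serializable (a production setup would more likely use your database):

```typescript
import { promises as fs } from "node:fs";
import * as path from "node:path";

// Minimal file-based checkpoint store: one JSON file per tenant.
const CHECKPOINT_DIR = ".checkpoints";

async function saveCheckpoint(tenantId: string, checkpoint: unknown): Promise<void> {
  await fs.mkdir(CHECKPOINT_DIR, { recursive: true });
  const file = path.join(CHECKPOINT_DIR, `${tenantId}.json`);
  await fs.writeFile(file, JSON.stringify(checkpoint), "utf8");
}

async function loadCheckpoint(tenantId: string): Promise<unknown | undefined> {
  const file = path.join(CHECKPOINT_DIR, `${tenantId}.json`);
  try {
    return JSON.parse(await fs.readFile(file, "utf8"));
  } catch {
    return undefined; // no checkpoint yet — start from the beginning
  }
}
```

Returning `undefined` when no checkpoint exists lets the same code path handle both first-time and resumed syncs.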
## Cost optimization
PDF extraction uses LLM API calls. For large workspaces:
- Sync text first: Disable PDF extraction initially, then enable it for important pages
- Filter by page properties: Only extract from "Published" or "Important" pages
- Set size limits: Skip very large PDFs with `maxBytes`
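A size cap might look like the fragment below. The placement of `maxBytes` under `pdf` is an assumption for illustration; check your generated config types for the exact key:

```ts
// Assumed shape — skip PDFs larger than ~10 MB before LLM extraction.
engine: {
  assetProcessing: {
    pdf: {
      maxBytes: 10 * 1024 * 1024,
      llmExtraction: { enabled: true, model: "google/gemini-2.0-flash" },
    },
  },
},
```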
```ts
// Import path assumed — use wherever the connector helpers live in your setup
import { createNotionClient, loadNotionPageDocument } from "./lib/unrag/connectors/notion";

// `pages` is the result of your own Notion database query

// Selective extraction based on page properties
const importantPageIds = pages
  .filter((p) => p.properties.Status === "Published")
  .map((p) => p.id);

const stream = notionConnector.streamPages({
  token: process.env.NOTION_TOKEN!,
  pageIds: importantPageIds,
  // PDF extraction enabled in config
});
await engine.runConnectorStream({ stream });

// Bulk sync for other pages without PDF extraction
const otherPageIds = pages
  .filter((p) => p.properties.Status !== "Published")
  .map((p) => p.id);

// Use loadNotionPageDocument for fine-grained control
for (const pageId of otherPageIds) {
  const doc = await loadNotionPageDocument({
    notion: createNotionClient({ token: process.env.NOTION_TOKEN! }),
    pageIdOrUrl: pageId,
  });

  await engine.ingest({
    sourceId: doc.sourceId,
    content: doc.content,
    assets: doc.assets,
    metadata: doc.metadata,
    assetProcessing: {
      pdf: { llmExtraction: { enabled: false } },
    },
  });
}
```