
Notion with Rich Media

Sync Notion pages that include PDFs, images, and other embeds.

Notion pages often include more than just text: embedded PDFs, images, file attachments, and audio clips. This example shows how to sync Notion content with full rich media support.

What you'll build

A sync script that:

  1. Connects to your Notion workspace
  2. Syncs pages with all their embedded media
  3. Extracts text from PDFs using an LLM
  4. Embeds images (or their captions)
  5. Makes everything searchable

Prerequisites

  • A Notion integration with access to your pages
  • Unrag set up with the Notion connector (npx unrag@latest add connector notion)
  • A database adapter configured

Configuration

First, ensure your config enables PDF extraction:

// unrag.config.ts
import { defineUnragConfig } from "./lib/unrag/core";

export const unrag = defineUnragConfig({
  defaults: {
    chunking: { chunkSize: 200, chunkOverlap: 40 },
    retrieval: { topK: 8 },
  },
  embedding: {
    provider: "ai",
    config: {
      type: "text", // or "multimodal" for direct image embedding
      model: "openai/text-embedding-3-small",
    },
  },
  engine: {
    assetProcessing: {
      onUnsupportedAsset: "skip", // Skip audio/video for now
      onError: "skip", // Don't fail on individual asset errors
      pdf: {
        llmExtraction: {
          enabled: true,
          model: "google/gemini-2.0-flash",
        },
      },
      fetch: {
        allowedHosts: [
          // Notion's asset CDN
          "prod-files-secure.s3.us-west-2.amazonaws.com",
        ],
      },
    },
  },
} as const);

// ... createUnragEngine setup

The sync script

// scripts/sync-notion.ts
import { createUnragEngine } from "../unrag.config";
import { notionConnector } from "./lib/unrag/connectors/notion";

async function main() {
  const engine = createUnragEngine();

  console.log("Syncing Notion pages with rich media...\n");

  const stream = notionConnector.streamPages({
    token: process.env.NOTION_TOKEN!,
    pageIds: [
      // Your page IDs
      "abc123...",
      "def456...",
    ],
  });

  let pageCount = 0;

  const result = await engine.runConnectorStream({
    stream,
    onEvent: (event) => {
      if (event.type === "progress" && event.message === "page:success") {
        pageCount++;
        console.log(`✓ Synced ${event.sourceId}`);
      }
      if (event.type === "warning") {
        console.warn(`⚠ [${event.code}] ${event.message}`);
      }
    },
  });

  console.log(`\n✓ Synced ${result.upserts} pages`);
}

main().catch(console.error);

What gets extracted

The Notion connector automatically extracts these block types as assets:

| Notion block | Asset kind | Handling |
| ------------ | ---------- | -------- |
| Image | image | Direct embedding or caption |
| PDF | pdf | LLM extraction |
| File | file | Skipped in v1 |
| Audio | audio | Skipped in v1 |
| Video | video | Skipped in v1 |

For each asset, the connector captures:

  • URL: The Notion CDN URL for the file
  • Caption: Any caption text below the block
  • Block ID: Used as the stable assetId
  • Media type: Inferred from the block or URL
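Put together, the captured fields look roughly like the sketch below. The type and function names here are illustrative assumptions based on the bullets above, not the connector's actual exported types:

```typescript
// Sketch of the per-asset record described above. Field and type
// names are assumptions for illustration, not the connector's API.
type AssetKind = "image" | "pdf" | "file" | "audio" | "video";

interface NotionAsset {
  assetId: string;  // the Notion block ID (stable across syncs)
  url: string;      // Notion CDN URL for the file
  caption?: string; // caption text below the block, if any
  kind: AssetKind;  // inferred from the block type or URL
}

// A simple fallback for inferring the kind from a URL's file extension,
// mirroring the "inferred from the block or URL" behavior noted above.
function inferKind(url: string): AssetKind {
  const ext = new URL(url).pathname.split(".").pop()?.toLowerCase() ?? "";
  if (ext === "pdf") return "pdf";
  if (["png", "jpg", "jpeg", "gif", "webp"].includes(ext)) return "image";
  if (["mp3", "wav", "m4a"].includes(ext)) return "audio";
  if (["mp4", "mov", "webm"].includes(ext)) return "video";
  return "file";
}
```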

Enabling multimodal for images

If you want images embedded directly (not just captions), switch to a multimodal embedding model:

export const unrag = defineUnragConfig({
  // ... rest of config
  embedding: {
    provider: "ai",
    config: {
      type: "multimodal",
      model: "cohere/embed-v4.0",
    },
  },
} as const);

Now a query like "architecture diagram" can find actual diagrams, not just pages that mention them.

Handling sync errors

The streaming model makes error handling natural. Warnings are emitted as events rather than throwing exceptions:

const warnings: Array<{ code: string; message: string }> = [];

const stream = notionConnector.streamPages({
  token: process.env.NOTION_TOKEN!,
  pageIds,
});

const result = await engine.runConnectorStream({
  stream,
  onEvent: (event) => {
    if (event.type === "warning") {
      warnings.push({ code: event.code, message: event.message });
    }
  },
});

if (warnings.length > 0) {
  console.error("Sync completed with warnings:");
  for (const w of warnings) {
    console.error(`  [${w.code}] ${w.message}`);
  }
}

With onError: "skip" in your asset processing config, individual asset failures won't stop the sync. Check your logs for skipped assets.

Querying the results

After syncing, query across all content types:

const engine = createUnragEngine();

// This finds content from text, PDFs, and images
const result = await engine.retrieve({
  query: "quarterly revenue projections",
  topK: 10,
});

for (const chunk of result.chunks) {
  const source = chunk.metadata.extractor || "text";
  console.log(`[${source}] ${chunk.content.slice(0, 100)}...`);
}
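If you want a quick sense of the mix, a small helper can tally results by source type, reusing the `metadata.extractor` convention from the loop above (a sketch; only the `extractor` field is taken from this page):

```typescript
// Tally retrieved chunks by source type, using the metadata.extractor
// field shown above ("text" when absent). The RetrievedChunk shape is
// a minimal stand-in for the engine's actual chunk type.
interface RetrievedChunk {
  content: string;
  metadata: { extractor?: string };
}

function countBySource(chunks: RetrievedChunk[]): Record<string, number> {
  const counts: Record<string, number> = {};
  for (const chunk of chunks) {
    const source = chunk.metadata.extractor ?? "text";
    counts[source] = (counts[source] ?? 0) + 1;
  }
  return counts;
}
```

Calling `countBySource(result.chunks)` after a retrieve gives a breakdown such as `{ text: 6, pdf: 3, image: 1 }`, which is handy when checking that PDF and image content actually made it into the index.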

Incremental sync with checkpoints

For production, persist checkpoints so interrupted syncs can resume:

import { createUnragEngine } from "../unrag.config";
import { notionConnector } from "./lib/unrag/connectors/notion";

async function incrementalSync(tenantId: string, pageIds: string[]) {
  const engine = createUnragEngine();
  const lastCheckpoint = await loadCheckpoint(tenantId);

  const stream = notionConnector.streamPages({
    token: process.env.NOTION_TOKEN!,
    pageIds,
    checkpoint: lastCheckpoint,
  });

  const result = await engine.runConnectorStream({
    stream,
    onCheckpoint: async (checkpoint) => {
      await saveCheckpoint(tenantId, checkpoint);
    },
    onEvent: (event) => {
      if (event.type === "progress" && event.message === "page:success") {
        console.log(`${event.sourceId}`);
      }
    },
  });

  console.log(`Synced ${result.upserts} pages`);
  return result;
}

If the sync times out, the next invocation picks up exactly where it left off.
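The `loadCheckpoint` / `saveCheckpoint` helpers above are left to you. A minimal file-based sketch is below; it works for a single worker, but in production you would back this with your database instead:

```typescript
import { mkdir, readFile, writeFile } from "node:fs/promises";
import { join } from "node:path";

// Minimal file-based store backing the loadCheckpoint / saveCheckpoint
// calls above: one JSON file per tenant. Swap for a database table in
// production, where multiple workers may share checkpoints.
const CHECKPOINT_DIR = ".checkpoints";

async function saveCheckpoint(tenantId: string, checkpoint: unknown): Promise<void> {
  await mkdir(CHECKPOINT_DIR, { recursive: true });
  await writeFile(
    join(CHECKPOINT_DIR, `${tenantId}.json`),
    JSON.stringify(checkpoint),
  );
}

async function loadCheckpoint(tenantId: string): Promise<unknown | undefined> {
  try {
    const raw = await readFile(join(CHECKPOINT_DIR, `${tenantId}.json`), "utf8");
    return JSON.parse(raw);
  } catch {
    return undefined; // first run: no checkpoint yet
  }
}
```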

Cost optimization

PDF extraction uses LLM API calls. For large workspaces:

  1. Sync text first: Disable PDF extraction initially, then enable for important pages
  2. Filter by page properties: Only extract from "Published" or "Important" pages
  3. Set size limits: Skip very large PDFs with maxBytes
// Selective extraction based on page properties
const importantPageIds = pages
  .filter((p) => p.properties.Status === "Published")
  .map((p) => p.id);

const stream = notionConnector.streamPages({
  token: process.env.NOTION_TOKEN!,
  pageIds: importantPageIds,
  // PDF extraction enabled in config
});

await engine.runConnectorStream({ stream });

// Bulk sync for other pages without PDF extraction
const otherPageIds = pages
  .filter((p) => p.properties.Status !== "Published")
  .map((p) => p.id);

// Use loadNotionPageDocument for fine-grained control
// (import loadNotionPageDocument and createNotionClient from your
// Notion connector module, alongside notionConnector)
for (const pageId of otherPageIds) {
  const doc = await loadNotionPageDocument({
    notion: createNotionClient({ token: process.env.NOTION_TOKEN! }),
    pageIdOrUrl: pageId,
  });

  await engine.ingest({
    sourceId: doc.sourceId,
    content: doc.content,
    assets: doc.assets,
    metadata: doc.metadata,
    assetProcessing: {
      pdf: { llmExtraction: { enabled: false } },
    },
  });
}
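For point 3, a size cap might look like the fragment below. Treat the placement of maxBytes as an assumption to verify against your Unrag version's asset-processing options:

```typescript
// Sketch: cap fetched asset size so very large PDFs are skipped
// rather than extracted. The exact config key location for maxBytes
// is an assumption here; check your version's asset-processing docs.
export const unrag = defineUnragConfig({
  // ... rest of config
  engine: {
    assetProcessing: {
      onError: "skip",
      fetch: {
        maxBytes: 10 * 1024 * 1024, // skip assets larger than 10 MB
      },
    },
  },
} as const);
```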
