# Notion with Rich Media

Sync Notion pages that include PDFs, images, and other embeds.
Notion pages often include more than just text—embedded PDFs, images, file attachments, audio clips. This example shows how to sync Notion content with full rich media support.
## What you'll build
A sync script that:
- Connects to your Notion workspace
- Syncs pages with all their embedded media
- Extracts text from PDFs using an LLM
- Embeds images (or their captions)
- Makes everything searchable
## Prerequisites
- A Notion integration with access to your pages
- Unrag set up with the Notion connector (`npx unrag@latest add connector notion`)
- A database adapter configured
## Configuration
First, ensure your config enables PDF extraction:
```ts
// unrag.config.ts
import { defineUnragConfig } from "./lib/unrag/core";

export const unrag = defineUnragConfig({
  defaults: {
    chunking: { chunkSize: 200, chunkOverlap: 40 },
    retrieval: { topK: 8 },
  },
  embedding: {
    provider: "ai",
    config: {
      type: "text", // or "multimodal" for direct image embedding
      model: "openai/text-embedding-3-small",
    },
  },
  engine: {
    assetProcessing: {
      onUnsupportedAsset: "skip", // Skip audio/video for now
      onError: "skip", // Don't fail on individual asset errors
      pdf: {
        llmExtraction: {
          enabled: true,
          model: "google/gemini-2.0-flash",
        },
      },
      fetch: {
        allowedHosts: [
          // Notion's asset CDN
          "prod-files-secure.s3.us-west-2.amazonaws.com",
        ],
      },
    },
  },
} as const);

// ... createUnragEngine setup
```

## The sync script
```ts
// scripts/sync-notion.ts
import { createUnragEngine } from "../unrag.config";
import { notionConnector } from "./lib/unrag/connectors/notion";

async function main() {
  const engine = createUnragEngine();
  console.log("Syncing Notion pages with rich media...\n");

  const stream = notionConnector.streamPages({
    token: process.env.NOTION_TOKEN!,
    pageIds: [
      // Your page IDs
      "abc123...",
      "def456...",
    ],
  });

  let pageCount = 0;

  const result = await engine.runConnectorStream({
    stream,
    onEvent: (event) => {
      if (event.type === "progress" && event.message === "page:success") {
        pageCount++;
        console.log(`✓ Synced ${event.sourceId}`);
      }
      if (event.type === "warning") {
        console.warn(`⚠ [${event.code}] ${event.message}`);
      }
    },
  });

  console.log(`\n✓ Synced ${result.upserts} pages`);
}

main().catch(console.error);
```

## What gets extracted
The Notion connector automatically extracts these block types as assets:
| Notion Block | Asset Kind | Handling |
|---|---|---|
| Image | image | Direct embedding or caption |
| PDF | pdf | LLM extraction |
| File | file | Skipped in v1 |
| Audio | audio | Skipped in v1 |
| Video | video | Skipped in v1 |
For each asset, the connector captures:
- URL: The Notion CDN URL for the file
- Caption: Any caption text below the block
- Block ID: Used as the stable `assetId`
- Media type: Inferred from the block or URL
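Concretely, an extracted asset might look like the following sketch. The interface and field names here are illustrative assumptions based on the list above, not the connector's exact types:

```typescript
// Illustrative shape of an extracted asset (assumed field names,
// not the connector's exact types).
interface NotionAsset {
  assetId: string; // the Notion block ID (stable across syncs)
  kind: "image" | "pdf" | "file" | "audio" | "video";
  url: string; // the Notion CDN URL for the file
  caption?: string; // caption text below the block, if any
  mediaType?: string; // inferred from the block or URL
}

// A hypothetical PDF block as the connector might capture it:
const example: NotionAsset = {
  assetId: "block_1a2b3c",
  kind: "pdf",
  url: "https://prod-files-secure.s3.us-west-2.amazonaws.com/workspace/report.pdf",
  caption: "Q3 financial report",
  mediaType: "application/pdf",
};
```

Because `assetId` is derived from the block ID, re-syncing the same page updates assets in place instead of duplicating them.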
## Enabling multimodal for images
If you want images embedded directly (not just captions), switch to a multimodal embedding model:
```ts
export const unrag = defineUnragConfig({
  // ... rest of config
  embedding: {
    provider: "ai",
    config: {
      type: "multimodal",
      model: "cohere/embed-v4.0",
    },
  },
} as const);
```

Now a query like "architecture diagram" can find actual diagrams, not just pages that mention them.
## Handling sync errors
The streaming model makes error handling natural. Warnings are emitted as events rather than throwing exceptions:
```ts
const warnings: Array<{ code: string; message: string }> = [];

const stream = notionConnector.streamPages({
  token: process.env.NOTION_TOKEN!,
  pageIds,
});

const result = await engine.runConnectorStream({
  stream,
  onEvent: (event) => {
    if (event.type === "warning") {
      warnings.push({ code: event.code, message: event.message });
    }
  },
});

if (warnings.length > 0) {
  console.error("Sync completed with warnings:");
  for (const w of warnings) {
    console.error(`  [${w.code}] ${w.message}`);
  }
}
```

With `onError: "skip"` in your asset processing config, individual asset failures won't stop the sync. Check your logs for skipped assets.
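If a large workspace produces many repeated warnings, a compact per-code summary can be easier to scan than one line per warning. This is plain application code layered on top of the collected array, not an Unrag API:

```typescript
// Group collected warnings by code so repeated issues show up as counts
// instead of hundreds of individual lines.
function summarizeWarnings(
  warnings: Array<{ code: string; message: string }>,
): Map<string, number> {
  const counts = new Map<string, number>();
  for (const w of warnings) {
    counts.set(w.code, (counts.get(w.code) ?? 0) + 1);
  }
  return counts;
}

// Example with hypothetical warning codes:
const summary = summarizeWarnings([
  { code: "asset:skipped", message: "audio block skipped" },
  { code: "asset:skipped", message: "video block skipped" },
  { code: "asset:fetch-failed", message: "403 from CDN" },
]);
// summary.get("asset:skipped") === 2
```

Printing the summary at the end of a sync run gives a quick signal of whether failures are isolated or systematic (for example, a CDN host missing from `allowedHosts`).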
## Querying the results
After syncing, query across all content types:
```ts
const engine = createUnragEngine();

// This finds content from text, PDFs, and images
const result = await engine.retrieve({
  query: "quarterly revenue projections",
  topK: 10,
});

for (const chunk of result.chunks) {
  const source = chunk.metadata.extractor || "text";
  console.log(`[${source}] ${chunk.content.slice(0, 100)}...`);
}
```

## Incremental sync with checkpoints
For production, persist checkpoints so interrupted syncs can resume:
```ts
import { createUnragEngine } from "../unrag.config";
import { notionConnector } from "./lib/unrag/connectors/notion";

// loadCheckpoint/saveCheckpoint are your own persistence helpers
async function incrementalSync(tenantId: string, pageIds: string[]) {
  const engine = createUnragEngine();
  const lastCheckpoint = await loadCheckpoint(tenantId);

  const stream = notionConnector.streamPages({
    token: process.env.NOTION_TOKEN!,
    pageIds,
    checkpoint: lastCheckpoint,
  });

  const result = await engine.runConnectorStream({
    stream,
    onCheckpoint: async (checkpoint) => {
      await saveCheckpoint(tenantId, checkpoint);
    },
    onEvent: (event) => {
      if (event.type === "progress" && event.message === "page:success") {
        console.log(`✓ ${event.sourceId}`);
      }
    },
  });

  console.log(`Synced ${result.upserts} pages`);
  return result;
}
```

If the sync times out, the next invocation picks up exactly where it left off.
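The `loadCheckpoint` and `saveCheckpoint` helpers are yours to implement. A minimal file-based sketch, assuming checkpoints are JSON-serializable (a production setup would more likely use your database):

```typescript
import { promises as fs } from "node:fs";
import * as path from "node:path";

// Minimal file-based checkpoint store: one JSON file per tenant.
const CHECKPOINT_DIR = ".checkpoints";

async function saveCheckpoint(tenantId: string, checkpoint: unknown): Promise<void> {
  await fs.mkdir(CHECKPOINT_DIR, { recursive: true });
  const file = path.join(CHECKPOINT_DIR, `${tenantId}.json`);
  await fs.writeFile(file, JSON.stringify(checkpoint), "utf8");
}

async function loadCheckpoint(tenantId: string): Promise<unknown | undefined> {
  const file = path.join(CHECKPOINT_DIR, `${tenantId}.json`);
  try {
    return JSON.parse(await fs.readFile(file, "utf8"));
  } catch {
    return undefined; // no checkpoint yet — start from the beginning
  }
}
```

Returning `undefined` when no checkpoint exists lets the same code path handle both first-time and resumed syncs.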
## Cost optimization
PDF extraction uses LLM API calls. For large workspaces:
- Sync text first: Disable PDF extraction initially, then enable it for important pages
- Filter by page properties: Only extract from "Published" or "Important" pages
- Set size limits: Skip very large PDFs with `maxBytes`
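A size cap might look like the fragment below. The placement of `maxBytes` under `pdf` is an assumption for illustration; check your generated config types for the exact key:

```ts
// Assumed shape — skip PDFs larger than ~10 MB before LLM extraction.
engine: {
  assetProcessing: {
    pdf: {
      maxBytes: 10 * 1024 * 1024,
      llmExtraction: { enabled: true, model: "google/gemini-2.0-flash" },
    },
  },
},
```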
```ts
// Import path assumed — use wherever the connector helpers live in your setup
import { createNotionClient, loadNotionPageDocument } from "./lib/unrag/connectors/notion";

// `pages` is the result of your own Notion database query

// Selective extraction based on page properties
const importantPageIds = pages
  .filter((p) => p.properties.Status === "Published")
  .map((p) => p.id);

const stream = notionConnector.streamPages({
  token: process.env.NOTION_TOKEN!,
  pageIds: importantPageIds,
  // PDF extraction enabled in config
});
await engine.runConnectorStream({ stream });

// Bulk sync for other pages without PDF extraction
const otherPageIds = pages
  .filter((p) => p.properties.Status !== "Published")
  .map((p) => p.id);

// Use loadNotionPageDocument for fine-grained control
for (const pageId of otherPageIds) {
  const doc = await loadNotionPageDocument({
    notion: createNotionClient({ token: process.env.NOTION_TOKEN! }),
    pageIdOrUrl: pageId,
  });

  await engine.ingest({
    sourceId: doc.sourceId,
    content: doc.content,
    assets: doc.assets,
    metadata: doc.metadata,
    assetProcessing: {
      pdf: { llmExtraction: { enabled: false } },
    },
  });
}
```