Best Practices
Production patterns for Google Drive ingestion with Unrag.
The Google Drive connector is designed to be safe and idempotent, but production deployments benefit from a few patterns that handle edge cases gracefully.
Choose the right sync mode
The connector offers two sync modes, and choosing the right one affects both complexity and behavior.
Explicit file sync (streamFiles) is the simpler model. You maintain a list of file IDs and sync those specific files. This works well when users select files through a UI, when you have a curated knowledge base, or when you need predictable, fine-grained control.
Folder sync (streamFolder) is more powerful but requires checkpoint persistence. The connector uses Google's Changes API to track what's new since the last sync, so you get incremental updates without re-processing unchanged files. This works well when users connect a "knowledge base folder" and expect new files to sync automatically.
If you're unsure, start with explicit file sync. It's easier to reason about and doesn't require checkpoint persistence. You can always add folder sync later when the need arises.
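For orientation, the two entry points look like this at the call site. This is only a sketch: `auth`, `fileIds`, `folderId`, and the checkpoint handling are placeholders, and the full options for each mode are covered in the sections below.

```ts
// Explicit file sync: you decide exactly which files are processed.
const fileStream = googleDriveConnector.streamFiles({ auth, fileIds });
await engine.runConnectorStream({ stream: fileStream });

// Folder sync: incremental via the Changes API; persist the checkpoint between runs.
const folderStream = googleDriveConnector.streamFolder({
  auth,
  folderId,
  checkpoint: previousCheckpoint, // assumed undefined/omitted on the very first run
});
await engine.runConnectorStream({
  stream: folderStream,
  onCheckpoint: async (checkpoint) => {
    // persist the checkpoint for the next run (see "Persist checkpoints religiously")
  },
});
```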
Choose the right authentication model
The authentication model you choose affects both security and operational complexity.
OAuth2 is the right choice when users connect their own accounts. Each user controls which files they share, and you store their refresh tokens. This model is more work to implement (you need the OAuth consent flow), but it's the only option for consumer apps or situations where you can't guarantee access to files upfront.
Service accounts with explicit sharing work well for internal tools or small-scale deployments. Share files and folders with the service account's email, and it can access them directly. This is simple but doesn't scale if you need access to many files across different owners.
Service accounts with domain-wide delegation (DWD) are the most powerful option for organizations on Google Workspace. The service account can impersonate any user in your domain, accessing their files without explicit sharing. This is ideal for org-wide knowledge bases or backups. The tradeoff is setup complexity—DWD requires Workspace admin configuration and is only available for Workspace accounts, not consumer Gmail.
If you're building for an organization with Workspace, start with DWD. It's more work upfront but eliminates the "someone forgot to share the folder" failure mode.
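On the credential side, DWD boils down to impersonating a specific Workspace user. The sketch below uses google-auth-library's JWT client to illustrate the idea; the email, key, scope, and impersonated user are placeholders, and how you hand these credentials to the connector depends on its service-account auth options, which aren't covered here.

```ts
import { JWT } from "google-auth-library";

// Sketch only: a JWT client that impersonates a Workspace user via
// domain-wide delegation. All field values are placeholders.
const delegatedAuth = new JWT({
  email: "ingest-bot@your-project.iam.gserviceaccount.com", // service account email
  key: process.env.GDRIVE_SA_PRIVATE_KEY, // private key from the JSON credentials
  scopes: ["https://www.googleapis.com/auth/drive.readonly"],
  subject: "alice@your-domain.com", // the Workspace user to impersonate
});
```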
Handle token refresh gracefully
OAuth refresh tokens can expire or be revoked. When a user's token stops working, your sync will fail with 401 errors. Build your application to handle this:
```ts
try {
  const stream = googleDriveConnector.streamFiles({
    auth: { kind: "oauth", clientId, clientSecret, redirectUri, refreshToken },
    fileIds,
  });
  await engine.runConnectorStream({ stream });
} catch (err) {
  if (isTokenExpiredError(err)) {
    // Prompt user to re-authenticate
    await markUserNeedsReauth(userId);
    return { success: false, reason: "auth_expired" };
  }
  throw err;
}
```

Service account credentials don't expire in the same way, but the JSON key can be rotated or deleted. If you rotate keys, make sure your deployment picks up the new credentials.
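The isTokenExpiredError check above is left to you, and the right test depends on the error shape your HTTP/auth stack throws. Google's token endpoint typically responds with a 401 status or an invalid_grant error when a refresh token has expired or been revoked, so a rough sketch might look like this:

```ts
// Rough sketch: adapt the checks to the error shape your client library surfaces.
function isTokenExpiredError(err: unknown): boolean {
  const e = err as { code?: number | string; status?: number; message?: string };
  const status = typeof e?.code === "number" ? e.code : e?.status;
  const message = String(e?.message ?? "");
  return status === 401 || /invalid_grant|invalid_token/i.test(message);
}
```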
Use namespace prefixes for multi-tenant apps
If your application serves multiple tenants, use sourceIdPrefix to partition content:
```ts
const stream = googleDriveConnector.streamFiles({
  auth,
  fileIds,
  sourceIdPrefix: `tenant:${tenantId}:`,
});
await engine.runConnectorStream({ stream });
```

This makes retrieval scoping simple and prevents accidental cross-tenant data leakage. When a tenant disconnects or deletes their account, you can cleanly wipe their content:
```ts
await engine.delete({ sourceIdPrefix: `tenant:${tenantId}:` });
```

Persist checkpoints religiously
For folder sync, checkpoints are essential. They contain the Changes API page token that enables incremental updates. Without a checkpoint, every sync processes all files from scratch.
```ts
const stream = googleDriveConnector.streamFolder({
  auth,
  folderId,
  checkpoint: await loadCheckpoint(tenantId),
});

await engine.runConnectorStream({
  stream,
  onCheckpoint: async (checkpoint) => {
    await saveCheckpoint(tenantId, checkpoint);
  },
});
```

Store checkpoints in your database, keyed by tenant or sync job. The checkpoint is a small JSON object, so storage overhead is minimal.
For explicit file sync, checkpoints are less critical—they just track progress through the file list—but they're still valuable for large syncs or serverless environments with timeout limits.
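The loadCheckpoint and saveCheckpoint helpers above are yours to implement; any durable store works. A minimal sketch using Redis via ioredis, with an illustrative key scheme:

```ts
import Redis from "ioredis";

const redis = new Redis(process.env.REDIS_URL!);

// Checkpoints are small JSON objects, so a simple key-value layout is enough.
async function loadCheckpoint(tenantId: string) {
  const raw = await redis.get(`gdrive:checkpoint:${tenantId}`);
  return raw ? JSON.parse(raw) : undefined;
}

async function saveCheckpoint(tenantId: string, checkpoint: unknown) {
  await redis.set(`gdrive:checkpoint:${tenantId}`, JSON.stringify(checkpoint));
}
```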
Enable deleteOnRemoved for folder sync
When using folder sync, consider enabling deleteOnRemoved to keep your index in sync with reality:
```ts
const stream = googleDriveConnector.streamFolder({
  auth,
  folderId,
  options: {
    recursive: true,
    deleteOnRemoved: true,
  },
  checkpoint,
});

await engine.runConnectorStream({ stream, onCheckpoint: saveCheckpoint });
```

With this option, the connector emits delete events when:
- A file is deleted from Drive
- A file is moved to trash
- A file is moved out of the synced folder
This keeps your search index accurate without manual cleanup. If you don't enable it, removed files will remain in your index until you delete them manually.
Set appropriate file size limits
The default maxBytesPerFile is 15MB, which is reasonable for most documents. If you're ingesting large PDFs or media files, you might want to increase it:
```ts
const stream = googleDriveConnector.streamFiles({
  auth,
  fileIds,
  options: {
    maxBytesPerFile: 50 * 1024 * 1024, // 50MB
  },
});
await engine.runConnectorStream({ stream });
```

But be thoughtful about this. Large files take longer to download, cost more to process (especially for LLM extraction), and may produce many chunks. Consider whether you actually need to ingest huge files, or whether a size limit that skips them is the right behavior.
Run sync in background jobs
For production deployments, don't run sync in request handlers. File downloads, exports, and ingestion can be slow, and you don't want to block user-facing requests or risk timeouts.
Instead, run sync from background jobs: cron scripts, BullMQ workers, Inngest functions, QStash schedules, or similar. This gives you:
- Retries: If a sync fails partway through, you can retry without losing progress
- Observability: Job runners typically provide logging, metrics, and alerting
- Rate limit handling: You can add delays between files to avoid hitting Google's rate limits
- Timeout safety: Background jobs can run longer than HTTP request timeouts
See the Next.js Production Recipe for patterns that work well on Vercel.
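As one concrete shape, here is a sketch of a BullMQ worker that runs folder sync. The queue name, job payload, Redis connection, and loadAuthForTenant helper are illustrative; the checkpoint helpers are the ones from the previous section.

```ts
import { Worker } from "bullmq";

// Sketch: each job syncs one tenant's folder. Retries, concurrency, and
// scheduling are handled by BullMQ rather than your request handlers.
const worker = new Worker(
  "drive-sync",
  async (job) => {
    const { tenantId, folderId } = job.data as { tenantId: string; folderId: string };

    const stream = googleDriveConnector.streamFolder({
      auth: await loadAuthForTenant(tenantId), // hypothetical credential lookup
      folderId,
      checkpoint: await loadCheckpoint(tenantId),
    });

    await engine.runConnectorStream({
      stream,
      onCheckpoint: async (checkpoint) => {
        await saveCheckpoint(tenantId, checkpoint);
      },
    });
  },
  { connection: { host: "127.0.0.1", port: 6379 } }
);
```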
Batch large file lists for explicit sync
If you're syncing hundreds or thousands of explicit file IDs, batch your calls and add pauses to stay within Google's rate limits:
```ts
const BATCH_SIZE = 20;
const PAUSE_MS = 2000;

for (let i = 0; i < allFileIds.length; i += BATCH_SIZE) {
  const batch = allFileIds.slice(i, i + BATCH_SIZE);

  const stream = googleDriveConnector.streamFiles({
    auth,
    fileIds: batch,
  });
  await engine.runConnectorStream({ stream });

  if (i + BATCH_SIZE < allFileIds.length) {
    await new Promise((r) => setTimeout(r, PAUSE_MS));
  }
}
```

Google Drive has quota limits that vary by API and account type. For most use cases, 20 files with a 2-second pause between batches keeps you safely under the limits.
For folder sync, batching isn't necessary—the Changes API handles pagination internally, and the connector processes one file at a time.
Use onEvent for observability
The streaming model makes it easy to log exactly what's happening during a sync:
```ts
await engine.runConnectorStream({
  stream,
  onEvent: (event) => {
    if (event.type === "progress") {
      console.log(`[${event.current}/${event.total}] ${event.message}`);
    }
    if (event.type === "warning") {
      console.warn(`Warning: [${event.code}] ${event.message}`);
    }
    if (event.type === "delete") {
      console.log(`Deleted: ${event.input.sourceId}`);
    }
  },
});
```

Forward these events to your logging/monitoring system to catch issues early.
Monitor for asset processing warnings
The connector downloads files and emits them as assets for the engine to process. If asset processing fails or is disabled for certain file types, you might silently miss content.
Always check the ingest result for warnings when using the lower-level API:
```ts
const result = await engine.ingest({
  sourceId: doc.sourceId,
  content: doc.content,
  assets: doc.assets,
  metadata: doc.metadata,
});

if (result.warnings.length > 0) {
  console.warn("Asset processing warnings:", result.warnings);
  // Forward to your monitoring system
}
```

The high-level streaming API handles this internally during ingest, but you can use onEvent to observe skipped files:
```ts
await engine.runConnectorStream({
  stream,
  onEvent: (event) => {
    if (event.type === "warning" && event.code === "file_skipped") {
      console.warn(`Skipped ${event.data?.fileId}: ${event.message}`);
    }
  },
});
```

Test with a small scope first
Before running sync on your full file list or folder:
- For explicit file sync: Test with 2-3 files to verify authentication works, file types are handled as expected, and chunking produces reasonable results.
- For folder sync: Test with a folder containing a handful of files. Run it twice: once to establish the baseline, then once to verify that incremental updates work (a two-pass sketch follows below).
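A minimal two-pass check for the folder case, using only the calls shown earlier; testFolderId is assumed to point at a small throwaway folder.

```ts
// Pass 1 establishes the baseline and a checkpoint; pass 2 should only pick up
// changes made after pass 1 (ideally none for an untouched folder).
let checkpoint: any;

for (const pass of ["baseline", "incremental"]) {
  const stream = googleDriveConnector.streamFolder({
    auth,
    folderId: testFolderId,
    checkpoint,
  });

  await engine.runConnectorStream({
    stream,
    onCheckpoint: async (cp) => {
      checkpoint = cp;
    },
    onEvent: (event) => {
      if (event.type === "progress") {
        console.log(`[${pass}] ${event.message}`);
      }
    },
  });
}
```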
Once the small test works, scale up gradually. This catches configuration issues early before you've spent time and API credits on a large sync.
Consider shared drives carefully
If you're syncing from shared drives (Team Drives), you need to pass additional options:
```ts
const stream = googleDriveConnector.streamFolder({
  auth,
  folderId: sharedDriveFolderId,
  options: {
    driveId: sharedDriveId, // Required for shared drives
    supportsAllDrives: true,
    includeItemsFromAllDrives: true,
  },
  checkpoint,
});
```

The driveId is typically the ID of the shared drive itself. You can find it in the Drive URL when viewing the shared drive root. Without it, the Changes API won't return items from shared drives.
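If you'd rather look the ID up programmatically, the Drive API's drives.list endpoint returns the shared drives visible to the authenticated principal. A sketch with googleapis; the drive name is illustrative, and authClient is assumed to be an authorized client with a Drive scope.

```ts
import { google } from "googleapis";

// Sketch: list shared drives and pick out the one to sync.
const drive = google.drive({ version: "v3", auth: authClient });
const { data } = await drive.drives.list({ pageSize: 100 });
const target = data.drives?.find((d) => d.name === "Engineering Wiki"); // illustrative name
console.log(target?.id); // use this as options.driveId
```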
