Database Setup
Set up Postgres with pgvector and create the tables UnRAG needs.
UnRAG stores everything in Postgres using the pgvector extension. This means you can run vector similarity search alongside your regular application data, use your existing backup and replication setup, and avoid adding another system to your infrastructure.
Why Postgres with pgvector?
Most teams already run Postgres, or can easily provision it from their cloud provider. Adding pgvector is usually just enabling an extension—there's no separate service to deploy, no additional credentials to manage, and no data to sync between systems.
pgvector gives you proper vector indexing and similarity search using standard SQL. You can join your embeddings with your application tables, use transactions that span both, and rely on Postgres's mature tooling for migrations, backups, and monitoring.
Enabling pgvector
Before creating tables, you need to enable the pgvector extension in your database:
create extension if not exists vector;

How you run this depends on your database provider:
Local Postgres or self-hosted: Run the SQL directly, or add it to your migration files. You may need to install the pgvector extension first—on Ubuntu, that's apt install postgresql-15-pgvector (adjust for your Postgres version).
Neon: pgvector is pre-installed. Just run create extension if not exists vector; and you're ready.
Supabase: pgvector is available in the dashboard under Database → Extensions. Enable it there, or run the SQL in the SQL Editor.
AWS RDS: pgvector is available on RDS for PostgreSQL 15.2+. You enable it with the same create extension command.
Azure Database for PostgreSQL: Enable pgvector through the Azure portal under Server parameters, then run the extension command.
Google Cloud SQL: pgvector is supported on Cloud SQL for PostgreSQL. Enable it via the Google Cloud Console or gcloud CLI.
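Whichever provider you use, a quick way to confirm the extension is active is to query Postgres's extension catalog. This is a generic Postgres check, not an UnRAG command:

-- Should return one row with the installed pgvector version
select extname, extversion
from pg_extension
where extname = 'vector';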
Creating the schema
UnRAG expects three tables. Here's the schema with explanations:
create table documents (
id uuid primary key,
source_id text not null,
content text not null,
metadata jsonb,
created_at timestamp default now()
);

The documents table represents logical documents you've ingested. The source_id is your identifier for the document—something like docs:getting-started or kb:article-123. When you re-ingest content with the same source_id, UnRAG updates the existing document. The content column stores the full original text (useful for debugging and re-chunking), and metadata holds any JSON you passed during ingestion.
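For illustration only (UnRAG's ingest pipeline normally writes these rows for you), here is roughly what a documents row might look like. The source_id and metadata values are made-up examples:

-- gen_random_uuid() is built into Postgres 13+; older versions need pgcrypto
insert into documents (id, source_id, content, metadata)
values (
  gen_random_uuid(),
  'docs:getting-started',
  'Full original text of the getting-started page...',
  '{"title": "Getting Started", "url": "/docs/getting-started"}'::jsonb
);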
create table chunks (
id uuid primary key,
document_id uuid not null references documents(id) on delete cascade,
source_id text not null,
idx integer not null,
content text not null,
token_count integer not null,
metadata jsonb,
created_at timestamp default now()
);

The chunks table holds the individual pieces that documents are split into. Each chunk belongs to a document (document_id), has an index showing its position in the original document (idx), contains the chunk's text (content), and records an approximate token count. The source_id is denormalized here for faster filtering during retrieval.
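As a sketch of how the denormalized source_id and the idx column get used, here is a plain SQL query (not part of UnRAG's API) that lists a document's chunks in their original order; the source_id value is a placeholder:

-- Inspect a document's chunks in ingestion order
select idx, token_count, left(content, 80) as preview
from chunks
where source_id = 'docs:getting-started'
order by idx;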
create table embeddings (
chunk_id uuid primary key references chunks(id) on delete cascade,
embedding vector,
embedding_dimension integer,
created_at timestamp default now()
);

The embeddings table stores the vector representation of each chunk. The embedding column uses pgvector's vector type. We store embedding_dimension alongside each row so you can detect mismatches if you switch embedding models.
The cascade deletes mean that when you delete a document, its chunks and embeddings are automatically cleaned up.
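To make the relationship between the three tables concrete, here is a sketch of a cosine-similarity retrieval query against this schema. It isn't necessarily the exact query UnRAG issues, but it shows the shape of retrieval: $1 stands for a query embedding with the same dimension as your stored vectors, and the source_id filter is optional:

-- Top 5 chunks most similar to a query embedding ($1), by cosine distance
select c.source_id, c.idx, c.content,
       1 - (e.embedding <=> $1) as similarity
from embeddings e
join chunks c on c.id = e.chunk_id
where c.source_id = 'docs:getting-started'
order by e.embedding <=> $1
limit 5;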
Recommended indexes
For production workloads, add indexes to speed up common queries:
-- Index for filtering by source_id during retrieval
create index if not exists chunks_source_id_idx on chunks(source_id);
create index if not exists documents_source_id_idx on documents(source_id);

If you're running similarity searches frequently, consider adding a vector index:
-- HNSW index for faster approximate nearest neighbor search
create index if not exists embeddings_hnsw_idx
on embeddings using hnsw (embedding vector_cosine_ops);

The HNSW index trades some accuracy for significantly faster queries on large datasets. For small to medium datasets (under 100,000 chunks), the default sequential scan is often fast enough. Note that pgvector can only build an index on a vector column with a fixed dimension; because the schema above declares embedding as a dimensionless vector, you'll need to either declare a dimension on the column or create the index over a cast expression, as sketched below.
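Here is one way to handle that, assuming 1536-dimension embeddings (adjust the dimension to your embedding model); the index name and the ef_search value are only examples:

-- Expression index with a cast, since the embedding column itself has no fixed dimension
create index if not exists embeddings_hnsw_cast_idx
  on embeddings using hnsw ((embedding::vector(1536)) vector_cosine_ops);

-- Queries need the same cast to use this index, e.g.:
--   order by embedding::vector(1536) <=> $1

-- Optional per-session tuning: higher ef_search improves recall at the cost of speed (pgvector's default is 40)
set hnsw.ef_search = 100;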
Applying migrations
How you apply this schema depends on your workflow:
Raw SQL files: Save the schema to a file and run it with psql:
psql $DATABASE_URL -f schema.sql

Drizzle migrations: If you're using the Drizzle adapter, you can import UnRAG's schema into your Drizzle configuration and generate migrations from there. See the Drizzle adapter documentation for details.
Prisma migrations: Create a migration with prisma migrate dev --create-only, paste the SQL into the generated file, then apply it. Prisma doesn't natively support the vector type, but raw SQL migrations work fine.
Other migration tools: Flyway, Liquibase, golang-migrate, and similar tools all support running raw SQL. Add the schema as a versioned migration file.
Environment configuration
UnRAG reads your database connection from the DATABASE_URL environment variable:
# Standard Postgres connection string
DATABASE_URL="postgresql://user:password@localhost:5432/mydb"
# Neon (with SSL required)
DATABASE_URL="postgresql://user:password@ep-xxx.us-east-1.aws.neon.tech/mydb?sslmode=require"
# Supabase (direct connection)
DATABASE_URL="postgresql://postgres:password@db.xxx.supabase.co:5432/postgres"
# AWS RDS
DATABASE_URL="postgresql://user:password@mydb.xxx.us-east-1.rds.amazonaws.com:5432/mydb"Make sure your connection string includes SSL parameters if your provider requires them. For serverless environments, use connection pooling endpoints when available (Neon and Supabase both offer these).