Agentic Chunking
LLM-powered chunking optimized for maximum retrieval quality.
Most chunking methods focus on one goal: split documents into reasonable pieces. They respect structure (markdown chunking), identify topic boundaries (semantic chunking), or keep code units intact (code chunking). These are sensible approaches, and they work well.
Agentic chunking takes a different perspective. Instead of asking "where should we split?", it asks "what chunks would best serve retrieval?" The LLM considers how users might query this content and structures chunks to maximize the chance of returning useful results. It's optimization for the end goal, not just for the splitting process.
What makes agentic chunking different
The distinction between semantic and agentic chunking is subtle but important.
Semantic chunking finds natural boundaries. It looks at content and asks: where do topics shift? Where do ideas complete? The goal is coherence—chunks that are internally consistent and don't awkwardly split mid-thought.
Agentic chunking optimizes for retrieval. It looks at content and asks: what would users search for? What chunks would best answer their questions? The goal is queryability—chunks that are likely to match user intent and provide useful answers.
Consider documentation about a software feature:
```text
The export feature allows users to download their data in multiple formats.
CSV exports include all fields by default, while JSON exports use a nested
structure that mirrors the API response format.

To export data, navigate to Settings > Data > Export. Select the format and
date range, then click "Generate Export". Large exports may take several
minutes to process. You'll receive an email when the export is ready.

Export files are available for download for 7 days. After that, you'll need
to generate a new export. Enterprise customers can configure automatic
scheduled exports via the API.
```
Semantic chunking might produce two chunks: one about export formats, one combining the how-to instructions with availability information (since those paragraphs flow into each other).
Agentic chunking might produce three chunks, each optimized for a different query pattern:
- "What export formats are available?" → format descriptions
- "How do I export my data?" → step-by-step instructions
- "How long are exports available?" → availability and enterprise features
The agentic chunker anticipates what users will ask and structures content to match.
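To make that concrete, here is a rough picture of what those three chunks could look like once stored. The shape is purely illustrative (the field names are hypothetical, not Unrag's chunk schema), but it shows how each chunk lines up with a query pattern.
```ts
// Purely illustrative: hypothetical field names, not Unrag's chunk schema.
const exportDocChunks = [
  {
    likelyQueries: ["What export formats are available?"],
    text: "The export feature allows users to download their data in multiple formats. CSV exports include all fields by default, while JSON exports use a nested structure that mirrors the API response format.",
  },
  {
    likelyQueries: ["How do I export my data?"],
    text: 'To export data, navigate to Settings > Data > Export. Select the format and date range, then click "Generate Export". Large exports may take several minutes to process. You\'ll receive an email when the export is ready.',
  },
  {
    likelyQueries: ["How long are exports available?"],
    text: "Export files are available for download for 7 days. After that, you'll need to generate a new export. Enterprise customers can configure automatic scheduled exports via the API.",
  },
];
```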
Installation
```bash
bunx unrag add chunker:agentic
```
This installs the agentic chunker plugin. It uses the AI SDK, which should already be present if you're using Unrag's embedding features.
Configuration
Enable agentic chunking in your unrag.config.ts:
```ts
export default defineUnragConfig({
  chunking: {
    method: "agentic",
    options: {
      chunkSize: 512,
      chunkOverlap: 50,
      model: "gpt-4o", // optional: specify LLM model
    },
  },
  // ...
});
```
When agentic chunking is worth it
Agentic chunking is the most expensive option in Unrag's chunking toolkit. Every document requires an LLM call. For large corpuses, costs add up quickly. So when is it worth it?
High-value content where retrieval quality directly impacts business outcomes. If poor search results mean lost customers, confused users, or missed sales, the cost of better chunking is easy to justify.
Customer-facing search and support systems. Users have low tolerance for irrelevant results. When someone searches your help center, they expect the first result to answer their question. Agentic chunking maximizes that likelihood.
Content that will be queried frequently. If a document will be searched thousands of times, spending an extra $0.03 to chunk it optimally has high ROI. The upfront cost is amortized across many retrievals.
Complex, nuanced documents. Legal contracts, medical protocols, financial regulations—content where precision matters and the difference between a good and mediocre chunk could have real consequences.
When you've tried other chunkers and retrieval quality isn't good enough. Agentic chunking is a tool to reach for when simpler approaches fall short, not a default starting point.
When to use simpler alternatives
Bulk ingestion of large corpuses. If you're indexing 100,000 documents, agentic chunking at $0.03/document is $3,000. Consider using semantic or recursive chunking for the bulk, and reserve agentic chunking for the most important content (a sketch of that hybrid approach follows at the end of this section).
Structured content. Markdown with clear headings, code with function boundaries—these have explicit structure that rule-based chunkers handle well. The LLM's intelligence is less valuable when structure is already clear.
Latency-sensitive pipelines. Each agentic chunking call takes 2-5 seconds (LLM inference time). For real-time or near-real-time ingestion, this may be unacceptable.
Experimental or rapidly-changing content. If you're iterating on content that will be replaced soon, the premium for optimal chunking has less value.
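One way to apply the hybrid approach mentioned above is to route documents by expected value at ingest time. The sketch below is application-level glue, not an Unrag feature: the Doc shape, the view-count threshold, and the two ingest callbacks are hypothetical stand-ins for pipelines configured with different chunking methods.
```ts
// Hybrid routing sketch: only high-value documents pay the agentic premium.
// The shapes and thresholds here are hypothetical; adapt them to your corpus.
type Doc = { id: string; body: string; monthlyViews: number };

async function ingestCorpus(
  docs: Doc[],
  ingestAgentic: (doc: Doc) => Promise<void>, // pipeline configured with method: "agentic"
  ingestBulk: (doc: Doc) => Promise<void>, // pipeline configured with a cheaper method
) {
  for (const doc of docs) {
    // Frequently-read documents amortize the chunking cost across many retrievals.
    if (doc.monthlyViews > 1_000) {
      await ingestAgentic(doc);
    } else {
      await ingestBulk(doc);
    }
  }
}
```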
Cost breakdown
Agentic chunking costs vary by model and document size. Here are rough estimates for a 10,000-token document:
| Model | Input Cost | Output Cost | Total |
|---|---|---|---|
| gpt-4o-mini | ~$0.002 | ~$0.001 | ~$0.003 |
| gpt-4o | ~$0.025 | ~$0.010 | ~$0.035 |
| claude-3.5-sonnet | ~$0.030 | ~$0.015 | ~$0.045 |
| claude-3-opus | ~$0.150 | ~$0.075 | ~$0.225 |
The input cost dominates because the entire document is sent to the LLM. Output is relatively small—just boundary markers or restructured text.
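To budget a large run, a back-of-the-envelope estimate is enough. The sketch below reuses the rough per-10,000-token figures from the table above; verify them against current provider pricing before committing to a corpus-wide job.
```ts
// Rough per-document figures for a 10,000-token document, taken from the
// table above. Treat them as estimates, not quoted prices.
const costPer10kTokens = {
  "gpt-4o-mini": 0.003,
  "gpt-4o": 0.035,
} as const;

function estimateChunkingCost(
  documentCount: number,
  avgTokensPerDocument: number,
  model: keyof typeof costPer10kTokens,
): number {
  const perDocument = costPer10kTokens[model] * (avgTokensPerDocument / 10_000);
  return documentCount * perDocument;
}

// 100,000 documents averaging 10,000 tokens each with gpt-4o-mini ≈ $300.
console.log(estimateChunkingCost(100_000, 10_000, "gpt-4o-mini"));
```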
Practical recommendations:
Start with gpt-4o-mini. It's remarkably capable for chunking tasks and costs an order of magnitude less than larger models. The chunking prompt is straightforward; you don't need a flagship model's full reasoning capacity for most content.
Use gpt-4o for complex content. If your documents have subtle nuances, multiple interleaved topics, or require sophisticated judgment about what users might search for, the extra capability helps.
Reserve claude-3-opus or equivalent for critical content. Legal documents, compliance materials, content where getting it wrong has real consequences—these justify the premium.
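In the configuration shown earlier, picking a cheaper model is a one-line change to the model option:
```ts
export default defineUnragConfig({
  chunking: {
    method: "agentic",
    options: {
      chunkSize: 512,
      chunkOverlap: 50,
      model: "gpt-4o-mini", // start cheap; move up only when quality demands it
    },
  },
  // ...
});
```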
How the LLM is prompted
The agentic chunker sends your content to the LLM with instructions focused on retrieval optimization:
- Consider what queries users might ask about this content
- Group information that would answer the same query together
- Keep related context together even if it spans formatting boundaries
- Never split mid-explanation or mid-example
- Create chunks that would be useful standalone search results
The LLM returns either explicit boundary markers or restructured content. Unrag then enforces token limits, adds overlap, and produces the final chunks.
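For intuition, the sketch below shows the general shape of such a call using the AI SDK. It is not Unrag's actual prompt or parsing logic, and the <<<CHUNK>>> marker is an arbitrary choice for the example.
```ts
// Sketch only: illustrates the kind of retrieval-focused prompt described
// above, not Unrag's actual implementation.
import { generateText } from "ai";
import { openai } from "@ai-sdk/openai";

async function proposeChunks(document: string): Promise<string[]> {
  const { text } = await generateText({
    model: openai("gpt-4o-mini"),
    system:
      "Split the document into retrieval-optimized chunks. Consider what queries " +
      "users might ask, group information that answers the same query, never split " +
      "mid-explanation or mid-example, and make each chunk a useful standalone result. " +
      "Insert the marker <<<CHUNK>>> between chunks and change nothing else.",
    prompt: document,
  });

  // Unrag would still enforce token limits and add overlap after this step.
  return text.split("<<<CHUNK>>>").map((piece) => piece.trim());
}
```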
Fallback behavior
LLM calls can fail. Rate limits, network issues, malformed responses—these happen, especially at scale. The agentic chunker handles failures gracefully.
When the LLM call fails, the chunker falls back to sentence-based splitting. Your document still gets chunked and ingested, just without the retrieval optimization. The ingest result includes a warning you can monitor:
```ts
const result = await engine.ingest({ sourceId, content });

for (const warning of result.warnings) {
  if (warning.code === "agentic_fallback") {
    // Log for later retry
    await logForRetry(sourceId, warning.message);
  }
}
```
For critical content, you might want to catch fallbacks and retry during off-peak hours or with a different model.
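A minimal sketch of that retry pass might look like the following, assuming you persisted the failed sourceIds and their content somewhere durable (the retryQueue helpers are hypothetical); verify in your own setup that re-ingesting the same sourceId replaces the earlier chunks.
```ts
// Off-peak retry sketch. `retryQueue` is a hypothetical store of documents
// that previously fell back to sentence-based splitting.
async function retryFallbacks() {
  const pending = await retryQueue.list(); // [{ sourceId, content }, ...]
  for (const { sourceId, content } of pending) {
    const result = await engine.ingest({ sourceId, content });
    const fellBackAgain = result.warnings.some(
      (w) => w.code === "agentic_fallback",
    );
    if (!fellBackAgain) {
      await retryQueue.remove(sourceId); // agentic chunking succeeded this time
    }
  }
}
```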
Practical example
Here's how you might use agentic chunking for a knowledge base where search quality is critical:
```ts
import { createUnragEngine } from "@unrag/config";

const engine = createUnragEngine();

async function ingestSupportArticle(article: Article) {
  const result = await engine.ingest({
    sourceId: `support:${article.id}`,
    content: article.body,
    metadata: {
      title: article.title,
      category: article.category,
      lastUpdated: article.updatedAt,
    },
  });

  // Track chunking quality for monitoring
  if (result.warnings.some(w => w.code === "agentic_fallback")) {
    console.warn(`Agentic chunking failed for ${article.id}, using fallback`);
    await metrics.increment("chunking.agentic_fallback");
  }

  console.log(`Chunked "${article.title}" into ${result.chunkCount} chunks`);
  return result;
}
```
The agentic chunker analyzes each article, considering how support users might search for help, and creates chunks optimized to match those queries.
Agentic vs semantic: a comparison
Both methods use LLMs with similar costs. Here's how they differ:
| Aspect | Semantic | Agentic |
|---|---|---|
| Goal | Find natural boundaries | Optimize for retrieval |
| Question asked | "Where do topics change?" | "What would users search for?" |
| Chunk characteristic | Coherent | Queryable |
| Best for | Narrative content | Search-critical content |
| Predictability | More predictable | Less predictable |
For most LLM-chunking use cases, start with semantic. It produces reliable, coherent chunks at the same cost. Switch to agentic when you've identified that retrieval quality is the specific bottleneck you need to address.
Monitoring and iteration
Agentic chunking's value shows up in retrieval quality, not ingestion metrics. After implementing it, monitor:
- Relevance scores — Are retrieved chunks more relevant to queries?
- User feedback — Are search results solving user problems?
- Click-through rates — Do users find what they need faster?
- Escalation rates — (For support) Are users escalating less after searching?
If metrics improve, the investment is paying off. If they don't, the content might already be well-suited to simpler chunking, or the retrieval problem lies elsewhere (embedding model, reranking, query formulation).
