On chunking long-form documents — and why structure beats size

Pick up an Indian tax statute — the kind we spend our days indexing — and look at a single section. It is not prose. It is a small, precise machine.

A typical section has a numbered head clause, two or three sub-clauses, an explanation block, and a "Provided that..." proviso carving out an exception only meaningful in the context of the head. Sometimes a rate table follows. Sometimes an amendment marker substitutes a phrase ("twenty per cent" for "fifteen per cent") by reference to a later Act. None of these are decorative. Each changes what the section means.

/ A statute section, abridged 74. (1) Where any tax has not been paid by reason of fraud or wilful misstatement, the proper officer shall serve a notice requiring the person to show cause why he should not pay the amount along with interest and a penalty equivalent to the tax. (2) The notice shall issue at least six months prior to the time limit specified in sub-section (10) for issuance of order. Provided that where any person served with a notice under sub-section (1) pays the tax with interest and a penalty equivalent to fifteen per cent within thirty days, all proceedings in respect of the said notice shall be deemed to be concluded. Explanation. — For the purposes of this section, "suppression" shall mean non-declaration of facts which a taxable person is required to declare in the return.

Now imagine a fixed-window splitter walking through that text every 1000 characters. The proviso lands in a different chunk from the section it modifies. The explanation gets stranded from the term it defines. A retriever that finds the proviso cannot tell you what it's a proviso to. A retriever that finds the head clause cannot tell you the carve-out exists.

This is the failure mode that taught us, on document one, that we could not reuse anyone's off-the-shelf chunker.

Why generic chunkers fail

The popular open-source chunkers — recursive-character, fixed-window, sentence-aware — share one assumption: the document is prose. They look for paragraph breaks, then sentences, then words, then characters, and slice when the running window hits a target size.

For a blog post, fine. Prose tolerates being cut anywhere. For a statute it is the wrong frame entirely. The document is not paragraphs — it is a hierarchy. Section 74 contains sub-section (2) contains a proviso contains an explanation. The hierarchy is the meaning. A splitter that ignores it is splitting the meaning.

The first chunker we tried sliced a real statute with a generic prose window. Almost every chunk came out with no section reference attached. That's a system that retrieves, but doesn't cite. In this domain, retrieval without citation is no system at all.

You cannot tune a generic chunker into a structure-aware one. We tried. Each parameter we tuned shifted the failure mode without removing it. The mismatch is structural, not numerical.

Understand first, slice second

The principle that finally worked is unglamorous and short.

Two passes over every document. The first pass is a small language-model call that reads the document and produces a per-document chunking plan — what kind of document this is, where its boundaries are, which tables stay atomic, which provisos attach to which parents. The second pass is a boring, deterministic splitter that consumes that plan and emits chunks.

The intelligence is concentrated where it has to be — figuring out the shape of an unfamiliar document — and excluded from where it shouldn't be: cutting the bytes. We do not want a model deciding, on the fly, where a chunk boundary goes; that is not reproducible. We do want a small model deciding what kind of document this is and what its sectioning rules are.

The output is a flat list of chunks, each carrying its parent hierarchy as metadata. A proviso chunk knows it lives under section 74(1). A table-row chunk knows it lives under the rate schedule of a specific notification. A retriever can filter by that metadata; a citation can render from it. The chunk is no longer a naked thousand characters — it is a span that knows where it came from.

The categories of bug we caught

Even with the right principle, the chunker is a system, not a function. Over a year of iteration we caught bugs in three recurring shapes — the same shapes we then noticed in every public off-the-shelf chunker we evaluated. They are what you get whenever the slicer doesn't know about structure.

Orphaned provisos

The "Provided that..." clause gets sliced off from the section it modifies. The retriever pulls back the proviso, but the proviso says "the said notice" and the said notice is in a different chunk. The answer reads confidently and cites a fragment that doesn't make sense without its parent.
Mid-table splits

Tariff schedules, rate tables, classification tables — these are atomic. A row in isolation is meaningless; the column headers live three rows up. A naive splitter cuts mid-table and returns half a table; the answer model invents the column headers. Tables must stay atomic, even when that breaks the size budget.
Tail-duplicates

The unglamorous one. A sliding-window loop can, at the end of a span, emit one final chunk whose range is fully contained inside the previous chunk. Same text, second copy in the index. You don't notice until your reranker is fighting two copies of the same passage for the same query.

None of these are clever bugs. They are the bugs you get whenever you slice without understanding. They are the reason "just split your documents into 1000-character chunks and embed them" is a lie when the documents aren't blog posts.

What this means for any regulated corpus

We learned this on tax law because that's what we ship. But the principle isn't specific to tax law — it is specific to a class of document. Anywhere a sentence's meaning depends on the clause above it, you have the same problem.

Contracts have it: "subject to clause 12.4" in clause 8 is an instruction to read clause 12.4 too. Regulatory filings have it: a 10-K's risk factors reference financial statements pages later. Technical manuals have it: a "WARNING: complete step 4 first" callout is meaningless without step 4. Insurance policies, drug labels, accounting standards, court judgments — the list goes on. The common feature is not the subject matter but the structural discipline: the document was written under rules that say what attaches to what.

The corollary: if you are building retrieval over any of these corpora, the chunker is not a commodity layer to outsource to whatever the popular framework ships with. It is the layer that decides whether your retrieval can ever work. Treat it that way.

The cost of skipping the discipline

A bad chunker does not fail loudly. It fails silently. The system still returns answers. Retrieval logs look healthy. Latency is fine. Every dashboard is green. But somewhere in the index a meaningful share of your chunks are misshaped, and a meaningful share of your queries are quietly failing — returning the second-best chunk because the best was orphaned, or a fragment whose meaning is incomplete, or the wrong half of a table.

The chunker puts a ceiling on every other layer above it. You can buy a better embedding model, a better reranker, fine-tune the answer model. None of it gets you above the ceiling a bad chunker has set, because the right chunk simply isn't in the index in a usable shape.

For a marketing chatbot, a ceiling is annoying. For a system used to inform billable advice, it's the difference between a tool practitioners build a workflow around and a tool they have to second-guess every time. We hold ourselves to a high bar on retrieval quality not because the bar is fashionable, but because anything below it gets quietly ignored by the people we built the system for.

Chunking well is unfashionable. It has no beautiful demo. It is invisible when it works and invisible when it fails. But it decides whether the rest of the system has anything to stand on. Skip it and you have built a beautiful pipeline on top of an index that doesn't know what it contains.

If you're staring at your own corpus — contracts, filings, manuals, anything regulated — and wondering whether your chunker is the layer that's quietly capping you, that's the kind of conversation our 30-min calls are for. Book one.