Best Chunking Practices

Anand Nerurkar
Jan 9

1. Chunk by Semantic Boundaries (NOT fixed size only)

Split by sections, headings, paragraphs, or logical units.
Avoid cutting a sentence or concept in half.
Works best with docs, tech specs, policies, manuals.

Why: Retrieval returns more accurate and meaningful context when each chunk holds one complete idea.
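As a rough illustration of splitting on logical units, here is a minimal sketch that treats blank lines as paragraph boundaries. The function name `split_semantic` is hypothetical, and real documents usually need heading-aware rules on top of this.

```python
import re


def split_semantic(text: str) -> list[str]:
    """Split plain text into paragraph-level units on blank lines."""
    # A blank line (optionally containing spaces) marks a paragraph boundary,
    # so no sentence is ever cut in half by this step.
    parts = re.split(r"\n\s*\n", text)
    return [part.strip() for part in parts if part.strip()]


doc = "Scope\n\nThis policy covers all staff.\n\n\nIt is reviewed yearly."
units = split_semantic(doc)
```

Each returned unit is a whole paragraph or heading line, which can then be grouped or size-checked in a later step.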

2. Use Hybrid Chunking (Semantic + Token Length)

First split semantically.
THEN enforce a max token limit (e.g., 300–500 tokens).
Drop chunks that are too small (e.g., <50 tokens).

Ideal size: ➡️ 300–500 tokens (roughly 1,200–2,000 characters of English text, at about 4 characters per token)

Why: This keeps chunks meaningful but also LLM-friendly.
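The hybrid approach could be sketched as follows. This is an illustration under stated assumptions, not a reference implementation: `count_tokens` approximates tokens with whitespace-separated words (swap in your model's tokenizer), and `pack_sentences` and `hybrid_chunks` are hypothetical helper names.

```python
import re


def count_tokens(text: str) -> int:
    # Assumption: whitespace-separated words stand in for model tokens.
    return len(text.split())


def pack_sentences(unit: str, max_tokens: int) -> list[str]:
    """Greedily pack whole sentences into pieces of at most max_tokens."""
    sentences = re.split(r"(?<=[.!?])\s+", unit)
    pieces, current = [], ""
    for sentence in sentences:
        candidate = f"{current} {sentence}".strip()
        if current and count_tokens(candidate) > max_tokens:
            pieces.append(current)
            current = sentence
        else:
            current = candidate
    if current:
        pieces.append(current)
    # Note: a single sentence longer than max_tokens is kept whole.
    return pieces


def hybrid_chunks(text: str, max_tokens: int = 400, min_tokens: int = 50) -> list[str]:
    chunks = []
    # Step 1: semantic split on blank lines (paragraphs).
    for unit in re.split(r"\n\s*\n", text):
        unit = unit.strip()
        if unit:
            # Step 2: enforce the token ceiling without cutting sentences.
            chunks.extend(pack_sentences(unit, max_tokens))
    # Step 3: drop chunks that are too small to be useful.
    return [c for c in chunks if count_tokens(c) >= min_tokens]
```

Dropping small chunks discards text, so in practice you may prefer to merge them into a neighbouring chunk instead.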

3. Overlap Moderately

Add 10–20% overlap so context doesn’t break.
Example:
- Chunk size: 400 tokens
- Overlap: 40–80 tokens (10–20% of 400)

Why: Reduces hallucination caused by missing context between chunks.
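A sliding window is one simple way to produce overlapping chunks. The sketch below works on an already tokenized list. The name `chunk_with_overlap` is hypothetical, and the defaults of 400 and 40 are just the 10% case from the example above.

```python
def chunk_with_overlap(tokens: list, chunk_size: int = 400, overlap: int = 40) -> list[list]:
    """Return windows of chunk_size tokens, each sharing `overlap` tokens with the previous one."""
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be >= 0 and smaller than chunk_size")
    step = chunk_size - overlap  # how far the window advances each time
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + chunk_size])
        if start + chunk_size >= len(tokens):
            break  # the last window already reaches the end of the text
    return chunks


windows = chunk_with_overlap(list(range(10)), chunk_size=4, overlap=1)
```

With `chunk_size=4` and `overlap=1`, each window starts on the last token of the previous one, so a sentence that straddles a boundary appears in both chunks.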

4. Use Metadata Everywhere

Attach metadata like:

Title
Section
Page number
Document type
Source URL/date
Version
Author

Why: Metadata improves retrieval ranking and grounding.
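One way to carry these fields alongside the text is a small record type. The `Chunk` class and its field names below are illustrative assumptions. Adapt them to whatever schema your vector store expects.

```python
from dataclasses import dataclass, asdict
from typing import Optional


@dataclass
class Chunk:
    text: str
    title: str
    section: str
    page: Optional[int] = None
    doc_type: str = ""
    source_url: str = ""
    date: str = ""
    version: str = ""
    author: str = ""

    def to_record(self) -> dict:
        """Separate the text to embed from the metadata used for filtering and citation."""
        metadata = asdict(self)
        text = metadata.pop("text")
        return {"text": text, "metadata": metadata}


record = Chunk(
    text="Refunds are issued within 14 days.",
    title="Refund Policy",
    section="2. Timelines",
    page=3,
    doc_type="policy",
).to_record()
```

Keeping metadata separate from the embedded text lets the retriever filter by fields such as document type or version without changing the embedding.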

5. Keep Each Chunk as a Standalone Answer

Every chunk should:

Answer at least one question independently
Maintain complete meaning
Avoid starting/ending mid-topic unless unavoidable

6. Avoid Oversized Chunks

❌ Do NOT create 1,000–2,000 token chunks by default.

Models perform poorly because:

Irrelevant content gets mixed in
Ranking gets weaker
Retrieval cost increases

7. Avoid Undersized Chunks

❌ Small chunks (<100 tokens)

Lose meaning
Increase noise
Increase embedding cost

8. Deduplicate & Clean Before Chunking

Always pre-process:

Remove headers/footers repeating on every page
Normalize formatting
Remove special characters
Collapse empty lines
Fix broken sentences
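The pre-processing steps above might look like this for text extracted page by page. It is a heuristic sketch: `clean_pages` is a hypothetical name, "special characters" is read narrowly as control characters, and "broken sentences" is handled only for words hyphenated across a line break.

```python
import re
from collections import Counter


def clean_pages(pages: list[str]) -> str:
    page_lines = [[line.strip() for line in page.splitlines()] for page in pages]

    # Lines that appear on every page are treated as repeating headers/footers.
    counts = Counter(line for lines in page_lines for line in set(lines) if line)
    repeated = {line for line, n in counts.items() if len(pages) > 1 and n == len(pages)}
    kept = [line for lines in page_lines for line in lines if line not in repeated]
    text = "\n".join(kept)

    # Remove control characters (keeps newlines and tabs).
    text = re.sub(r"[\x00-\x08\x0b\x0c\x0e-\x1f\x7f]", "", text)
    # Normalize runs of spaces/tabs to a single space.
    text = re.sub(r"[^\S\n]+", " ", text)
    # Rejoin words hyphenated across a line break (may also join real compounds).
    text = re.sub(r"(\w)-\n(\w)", r"\1\2", text)
    # Collapse three or more newlines into one blank line.
    text = re.sub(r"\n{3,}", "\n\n", text)
    return text.strip()


pages = [
    "ACME Policy\nData is en-\ncrypted.\n\n\n\nPage footer",
    "ACME Policy\nKeys rotate   yearly.\nPage footer",
]
cleaned = clean_pages(pages)
```

Running the cleaner before chunking keeps repeated header text out of every chunk and stops it from skewing similarity scores.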

9. Chunk Differently by Content Type

Documents (policies, PDFs) → 300–500 token semantic chunks

Code → split by functions/classes

Tables → preserve whole rows or logical segments

JSON/XML → chunk by node/block

Presentations → chunk slide-wise

Chat logs → chunk by session threads
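For the code case, Python's standard `ast` module can split a source file into one chunk per top-level function or class. This sketch only covers Python input. The name `chunk_python_source` is hypothetical, and other languages would need their own parser.

```python
import ast


def chunk_python_source(source: str) -> list[str]:
    """Return one chunk per top-level function or class definition."""
    tree = ast.parse(source)
    chunks = []
    for node in tree.body:
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef, ast.ClassDef)):
            # get_source_segment returns the exact text of the definition.
            segment = ast.get_source_segment(source, node)
            if segment:
                chunks.append(segment)
    return chunks


source = (
    "import os\n\n"
    "def a():\n    return 1\n\n"
    "class B:\n    def m(self):\n        return 2\n"
)
code_chunks = chunk_python_source(source)
```

Module-level statements such as imports are skipped here. You may want to prepend them to each chunk as context.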

10. Evaluate Chunk Quality

Test with three questions:

Can the chunk be retrieved correctly for a query?
Does the chunk carry complete meaning?
Would splitting further lose semantic clarity?

If YES to all → good chunk.
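The first question can be checked automatically with a small set of query/expected-chunk pairs. The sketch below uses a toy word-overlap retriever as a stand-in for a real embedding search. `lexical_retrieve`, `hit_rate`, and the sample data are all illustrative assumptions.

```python
def lexical_retrieve(query: str, chunks: dict[str, str], k: int = 1) -> list[str]:
    """Toy retriever: rank chunks by how many query words they share."""
    query_words = set(query.lower().split())
    ranked = sorted(
        chunks.items(),
        key=lambda item: len(query_words & set(item[1].lower().split())),
        reverse=True,
    )
    return [chunk_id for chunk_id, _ in ranked[:k]]


def hit_rate(cases: list[tuple[str, str]], chunks: dict[str, str], k: int = 1) -> float:
    """Fraction of queries whose expected chunk appears in the top k results."""
    hits = sum(expected in lexical_retrieve(query, chunks, k) for query, expected in cases)
    return hits / len(cases)


chunks = {
    "c1": "Refunds are issued within 14 days of purchase.",
    "c2": "Passwords must be rotated every 90 days.",
}
cases = [
    ("when are refunds issued", "c1"),
    ("how often are passwords rotated", "c2"),
]
score = hit_rate(cases, chunks, k=1)
```

A falling hit rate after changing chunk size or overlap is a quick signal that the new settings hurt retrieval. The other two questions still need human review.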

💡 Recommended Configuration (Enterprise Standard)

Semantic split first
Max tokens: 350–450
Overlap: 50 tokens
Metadata: full document context
Hybrid chunking for PDFs
Special logic for tables/code
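These settings could be captured in a single config object so every pipeline stage reads the same values. `ChunkingConfig` and its field names are hypothetical, and 400 is simply the midpoint of the 350–450 range above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ChunkingConfig:
    semantic_split_first: bool = True
    max_tokens: int = 400            # midpoint of the 350–450 range
    overlap_tokens: int = 50
    attach_full_metadata: bool = True
    hybrid_for_pdfs: bool = True
    special_handlers: tuple[str, ...] = ("table", "code")

    def __post_init__(self) -> None:
        # An overlap as large as the chunk would never advance through the text.
        if not 0 <= self.overlap_tokens < self.max_tokens:
            raise ValueError("overlap_tokens must be >= 0 and below max_tokens")


config = ChunkingConfig()
```

Freezing the dataclass prevents a stage from silently changing the limits mid-run.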