Best Chunking Practices
- Anand Nerurkar
- Jan 9
- 2 min read
1. Chunk by Semantic Boundaries (NOT fixed size only)
Split by sections, headings, paragraphs, or logical units.
Avoid cutting a sentence or concept in half.
Works best with docs, tech specs, policies, manuals.
Why: The retriever returns more accurate and meaningful context for the model.
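To make the idea concrete, here is a minimal sketch of splitting on headings and paragraphs. It assumes Markdown-style `#` headings and blank-line paragraph breaks, which the original does not specify; the function name `split_by_headings` is made up for illustration.

```python
import re

# Assumption: headings look like Markdown ("# Title", "## Subsection").
HEADING = re.compile(r"^#{1,6}\s+(.*)$")

def split_by_headings(text: str) -> list[dict]:
    """Group paragraphs under the heading that precedes them."""
    sections = []
    current = {"heading": None, "paragraphs": []}
    # Blank lines mark paragraph boundaries.
    for block in re.split(r"\n\s*\n", text.strip()):
        block = block.strip()
        if not block:
            continue
        first_line, _, rest = block.partition("\n")
        match = HEADING.match(first_line)
        if match:
            # A heading closes the previous section and opens a new one.
            if current["heading"] is not None or current["paragraphs"]:
                sections.append(current)
            current = {"heading": match.group(1), "paragraphs": []}
            if rest.strip():
                current["paragraphs"].append(rest.strip())
        else:
            current["paragraphs"].append(block)
    if current["heading"] is not None or current["paragraphs"]:
        sections.append(current)
    return sections
```

Each returned section keeps its paragraphs whole, so no sentence is cut in half at this stage.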
2. Use Hybrid Chunking (Semantic + Token Length)
First split semantically.
THEN enforce a max token limit (e.g., 300–500 tokens).
Drop chunks that are too small (e.g., <50 tokens), or merge them into a neighboring chunk if they still carry content.
Ideal size: ➡️ 300–500 tokens (roughly 1,200–2,000 characters of English text, at about 4 characters per token)
Why: This keeps chunks meaningful while still fitting comfortably in the model's context window.
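The hybrid approach could be sketched roughly as follows. This is an illustration under stated assumptions, not a reference implementation: `count_tokens` and `hybrid_chunks` are hypothetical names, and whitespace-separated words stand in for real tokens (in practice you would count with your embedding model's tokenizer).

```python
import re

def count_tokens(text: str) -> int:
    # Assumption: word count approximates the model's token count.
    return len(text.split())

def hybrid_chunks(units: list[str], max_tokens: int = 400,
                  min_tokens: int = 50) -> list[str]:
    """Pack semantic units into chunks that respect a token cap."""
    # Step 1: break any unit above the cap at sentence boundaries.
    # (A single sentence longer than the cap is left as-is here.)
    pieces = []
    for unit in units:
        if count_tokens(unit) <= max_tokens:
            pieces.append(unit)
        else:
            pieces.extend(re.split(r"(?<=[.!?])\s+", unit))
    # Step 2: greedily pack pieces until the next one would exceed the cap.
    chunks, current = [], ""
    for piece in pieces:
        candidate = f"{current} {piece}".strip()
        if current and count_tokens(candidate) > max_tokens:
            chunks.append(current)
            current = piece
        else:
            current = candidate
    if current:
        chunks.append(current)
    # Step 3: drop chunks below the minimum size
    # (merging them into a neighbor is an alternative).
    return [c for c in chunks if count_tokens(c) >= min_tokens]
```

The `units` argument is the output of whatever semantic split you ran first (paragraphs, sections, and so on).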
3. Overlap Moderately
Add 10–20% overlap so context doesn’t break.
Example:
Chunk size: 400 tokens
Overlap: 40–80 tokens (10–20% of 400)
Why: Reduces hallucination caused by missing context between chunks.
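A sliding window is one simple way to get this overlap. The sketch below assumes the text is already tokenized into a list; `sliding_chunks` is an illustrative name, and the defaults (400 tokens, 60 overlap) are just one point inside the range given above.

```python
def sliding_chunks(tokens: list[str], size: int = 400,
                   overlap: int = 60) -> list[list[str]]:
    """Cut a token list into windows that share `overlap` tokens."""
    if not 0 <= overlap < size:
        raise ValueError("overlap must be non-negative and smaller than size")
    step = size - overlap  # how far each window advances
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + size])
        if start + size >= len(tokens):
            break  # the last window already reaches the end
    return chunks
```

With these defaults each window starts 340 tokens after the previous one, so the last 60 tokens of one chunk are repeated as the first 60 of the next.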
4. Use Metadata Everywhere
Attach metadata like:
Title
Section
Page number
Document type
Source URL/date
Version
Author
Why: Metadata improves retrieval ranking and grounding, because results can be filtered by source or version and answers can cite where they came from.
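One possible shape for this, as a sketch: store each chunk with a metadata dictionary and filter on it at query time. The `Chunk` class, the helper names, and the field values in the usage example are all hypothetical.

```python
from dataclasses import dataclass, field

@dataclass
class Chunk:
    text: str
    metadata: dict = field(default_factory=dict)

def attach_metadata(texts: list[str], **doc_metadata) -> list[Chunk]:
    """Copy document-level metadata onto every chunk and add its position."""
    return [
        Chunk(text=t, metadata={**doc_metadata, "chunk_index": i})
        for i, t in enumerate(texts)
    ]

def filter_chunks(chunks: list[Chunk], **required) -> list[Chunk]:
    """Keep only chunks whose metadata matches every required key/value."""
    return [
        c for c in chunks
        if all(c.metadata.get(k) == v for k, v in required.items())
    ]

# Usage with made-up values:
chunks = attach_metadata(
    ["Section one text", "Section two text"],
    title="Leave Policy", document_type="policy", version="2.1",
)
current_only = filter_chunks(chunks, version="2.1")
```

Most vector stores accept a metadata payload alongside each embedding, so the same dictionary can usually be passed through unchanged.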
5. Keep Each Chunk as a Standalone Answer
Every chunk should:
Answer at least one question independently
Maintain complete meaning
Avoid starting/ending mid-topic unless unavoidable
6. Avoid Oversized Chunks
❌ Do NOT create 1,000–2,000 token chunks by default.
Models perform poorly with oversized chunks because:
Irrelevant content gets mixed in
Ranking gets weaker
Retrieval cost increases
7. Avoid Undersized Chunks
❌ Small chunks (<100 tokens):
Lose meaning
Increase noise
Increase embedding and storage overhead (more vectors per document)
8. Deduplicate & Clean Before Chunking
Always pre-process:
Remove headers/footers repeating on every page
Normalize formatting
Remove stray special characters (control characters, encoding debris)
Collapse empty lines
Fix broken sentences
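The pre-processing steps above can be sketched as a single cleaning pass. This is a rough illustration that assumes the document arrives as a list of per-page strings; `clean_pages` and the 60% repeat threshold are assumptions, not values from the original.

```python
import re
from collections import Counter

def clean_pages(pages: list[str], repeat_threshold: float = 0.6) -> str:
    """Strip repeated headers/footers and normalize whitespace."""
    # Count on how many pages each distinct line appears.
    page_counts = Counter()
    for page in pages:
        page_counts.update({ln.strip() for ln in page.splitlines() if ln.strip()})
    # Lines present on most pages are treated as headers/footers.
    repeated = {
        line for line, n in page_counts.items()
        if len(pages) > 1 and n / len(pages) >= repeat_threshold
    }
    kept = [
        ln.strip()
        for page in pages
        for ln in page.splitlines()
        if ln.strip() not in repeated
    ]
    text = "\n".join(kept)
    text = re.sub(r"[\x00-\x08\x0b\x0c\x0e-\x1f]", "", text)  # control characters
    text = re.sub(r"[^\S\n]+", " ", text)         # runs of spaces/tabs
    # Rejoin words hyphenated across a line break. Caveat: this also
    # joins genuinely hyphenated words that happen to wrap.
    text = re.sub(r"(\w)-\n(\w)", r"\1\2", text)
    text = re.sub(r"\n{3,}", "\n\n", text)        # collapse empty lines
    return text.strip()
```

Headers that change per page (for example "Page 3 of 10") are not caught by exact-match counting and would need a pattern of their own.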
9. Chunk Differently by Content Type
Documents (policies, PDFs) → 300–500 token semantic chunks
Code → split by functions/classes
Tables → preserve whole rows or logical segments
JSON/XML → chunk by node/block
Presentations → chunk slide-wise
Chat logs → chunk by session threads
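As an illustration of type-specific logic, the sketch below handles two of the cases above: Python source split by top-level functions and classes, and JSON split by top-level node. The function names are hypothetical, and other languages or formats would need their own parsers.

```python
import ast
import json

def chunk_python_source(source: str) -> list[str]:
    """Return one chunk per top-level function or class."""
    tree = ast.parse(source)
    chunks = []
    for node in tree.body:
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef, ast.ClassDef)):
            # get_source_segment returns the exact source text of the node
            # (decorator lines are not included).
            chunks.append(ast.get_source_segment(source, node))
    return chunks

def chunk_json(document: str) -> list[str]:
    """Return one chunk per top-level key (object) or item (array)."""
    data = json.loads(document)
    if isinstance(data, dict):
        return [json.dumps({key: value}) for key, value in data.items()]
    if isinstance(data, list):
        return [json.dumps(item) for item in data]
    return [json.dumps(data)]
```

Splitting at these boundaries keeps each function, class, or JSON node intact instead of cutting it at an arbitrary token count.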
10. Evaluate Chunk Quality
Test with three questions:
Can the chunk be retrieved correctly for a query?
Does the chunk carry complete meaning?
Would splitting further lose semantic clarity?
If YES to all → good chunk.
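The first question can be checked automatically with a small labeled set of queries. The sketch below measures how often the expected chunk lands in the top k results. The word-overlap `score` function is a toy stand-in for a real retriever (normally embedding similarity), and all names and sample texts are made up for illustration.

```python
def score(query: str, chunk: str) -> float:
    # Toy relevance score: fraction of query words found in the chunk.
    # Assumption: a real pipeline would use embedding similarity instead.
    query_words = set(query.lower().split())
    chunk_words = set(chunk.lower().split())
    return len(query_words & chunk_words) / len(query_words) if query_words else 0.0

def hit_rate_at_k(chunks: list[str],
                  labeled_queries: list[tuple[str, int]],
                  k: int = 3) -> float:
    """Share of queries whose expected chunk index is in the top k."""
    if not labeled_queries:
        return 0.0
    hits = 0
    for query, expected_index in labeled_queries:
        ranked = sorted(range(len(chunks)),
                        key=lambda i: score(query, chunks[i]),
                        reverse=True)
        if expected_index in ranked[:k]:
            hits += 1
    return hits / len(labeled_queries)

# Usage with made-up chunks and queries:
sample_chunks = [
    "Employees accrue 20 days of annual leave per year.",
    "Expense claims must be filed within 30 days.",
    "Remote work requires manager approval.",
]
sample_queries = [("annual leave days", 0), ("expense claims", 1),
                  ("manager approval remote", 2)]
rate = hit_rate_at_k(sample_chunks, sample_queries, k=1)
```

A falling hit rate after changing chunk size or overlap is a signal that the new settings hurt retrieval.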
💡 Recommended Configuration (Enterprise Standard)
Semantic split first
Max tokens: 350–450
Overlap: 50 tokens
Metadata: full document context
Hybrid chunking for PDFs
Special logic for tables/code
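The settings above could be captured in a small config object so they are validated in one place. This is only a sketch: `ChunkingConfig` and its field names are hypothetical, the 400-token default is the midpoint of the 350–450 range, and `min_tokens` reuses the <50 threshold from tip 2.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class ChunkingConfig:
    semantic_split_first: bool = True
    max_tokens: int = 400        # midpoint of the 350-450 range
    overlap_tokens: int = 50
    min_tokens: int = 50         # assumption: threshold taken from tip 2

    def __post_init__(self):
        # Reject settings where windows could never advance.
        if not 0 <= self.overlap_tokens < self.max_tokens:
            raise ValueError("overlap_tokens must be in [0, max_tokens)")
        if self.min_tokens > self.max_tokens:
            raise ValueError("min_tokens cannot exceed max_tokens")

    @property
    def overlap_ratio(self) -> float:
        return self.overlap_tokens / self.max_tokens

config = ChunkingConfig()
```

With the defaults the overlap ratio is 12.5%, inside the 10–20% band recommended earlier.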