Best Chunking Practices

  • Writer: Anand Nerurkar
  • Jan 9
  • 2 min read


1. Chunk by Semantic Boundaries (NOT fixed size only)

  • Split by sections, headings, paragraphs, or logical units.

  • Avoid cutting a sentence or concept in half.

  • Works best with docs, tech specs, policies, manuals.

Why: Models retrieve more accurate and meaningful context.
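A minimal sketch of semantic splitting in Python, using only the standard library. The heading heuristic here (a short single line without terminal punctuation) is an assumption for illustration, not a rule from the article; production pipelines usually rely on real document structure (Markdown headings, PDF bookmarks, HTML tags) instead.

```python
import re

def split_semantic(text: str) -> list[str]:
    """Split a document into paragraph-level chunks, keeping each
    heading attached to the paragraph that follows it."""
    # Paragraphs are separated by one or more blank lines.
    paragraphs = [p.strip() for p in re.split(r"\n\s*\n", text) if p.strip()]
    chunks = []
    pending_heading = ""
    for p in paragraphs:
        # Heuristic: a short, single-line piece with no terminal
        # punctuation is treated as a heading for the next paragraph.
        if "\n" not in p and len(p) < 80 and not p.endswith("."):
            pending_heading = p
            continue
        chunks.append(f"{pending_heading}\n{p}".strip() if pending_heading else p)
        pending_heading = ""
    if pending_heading:
        chunks.append(pending_heading)
    return chunks
```

Because the heading travels with its paragraph, no chunk starts mid-concept.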

2. Use Hybrid Chunking (Semantic + Token Length)

  • First split semantically.

  • THEN enforce a max token limit (e.g., 300–500 tokens).

  • Drop chunks that are too small (e.g., <50 tokens).

Ideal size: ➡️ 300–500 tokens (or 800–1,200 characters)

Why: This keeps chunks meaningful and LLM-friendly.
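The two-pass hybrid approach can be sketched as a post-processing step on the semantic chunks. Note the whitespace-word count is only a rough stand-in for real tokens; a tokenizer such as tiktoken would be used in practice.

```python
def enforce_token_limit(chunks: list[str], max_tokens: int = 400,
                        min_tokens: int = 50) -> list[str]:
    """Second pass of hybrid chunking: re-split semantic chunks that
    exceed max_tokens and drop fragments below min_tokens.
    Whitespace words approximate tokens here."""
    result = []
    for chunk in chunks:
        words = chunk.split()
        if len(words) < min_tokens:
            continue  # too small to carry meaning on its own
        # Slice oversized chunks into windows of at most max_tokens.
        for i in range(0, len(words), max_tokens):
            result.append(" ".join(words[i:i + max_tokens]))
    return result
```

A production version would merge an undersized tail slice back into the previous window rather than emit it as-is.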

3. Overlap Moderately

  • Add 10–20% overlap so context doesn’t break.

  • Example:

    • Chunk size: 400 tokens

    • Overlap: 40–60 tokens

Why: Reduces hallucinations caused by missing context between chunks.
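The overlap scheme above is a sliding window whose step is the chunk size minus the overlap. A minimal sketch, operating on a pre-tokenized list:

```python
def chunk_with_overlap(tokens: list[str], chunk_size: int = 400,
                       overlap: int = 50) -> list[list[str]]:
    """Slide a window of chunk_size tokens across the sequence,
    advancing by (chunk_size - overlap) so that adjacent chunks
    share `overlap` tokens of context."""
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + chunk_size])
        if start + chunk_size >= len(tokens):
            break  # the final window already covers the tail
    return chunks
```

With chunk_size=400 and overlap=50, each chunk repeats the last 50 tokens of its predecessor, so a sentence falling on a boundary still appears whole in at least one chunk.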

4. Use Metadata Everywhere

Attach metadata like:

  • Title

  • Section

  • Page number

  • Document type

  • Source URL/date

  • Version

  • Author

Why: Metadata improves retrieval ranking and grounding.
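One simple way to carry this metadata is to pair each chunk's text with a dictionary of the fields listed above. The `Chunk` type and field names below are illustrative; vector stores typically accept an equivalent metadata payload per record.

```python
from dataclasses import dataclass, field

@dataclass
class Chunk:
    text: str
    metadata: dict = field(default_factory=dict)

def make_chunk(text: str, *, title: str, section: str, page: int,
               doc_type: str, source: str, version: str,
               author: str) -> Chunk:
    """Attach the full document context to a chunk so the retriever
    can filter and rank on it."""
    return Chunk(text, {
        "title": title,
        "section": section,
        "page": page,
        "document_type": doc_type,
        "source": source,
        "version": version,
        "author": author,
    })
```

At query time the same fields support filtered retrieval (e.g., only `document_type == "policy"`) and let answers cite their source.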

5. Keep Each Chunk as a Standalone Answer

Every chunk should:

  • Answer at least one question independently

  • Maintain complete meaning

  • Avoid starting/ending mid-topic unless unavoidable

6. Avoid Oversized Chunks

❌ Do NOT create 1,000–2,000-token chunks by default.

Models perform poorly with them because:

  • Irrelevant content gets mixed

  • Ranking gets weak

  • Retrieval cost increases

7. Avoid Undersized Chunks

❌ Small chunks (<100 tokens)

  • Lose meaning

  • Increase noise

  • Increase embedding cost

8. Deduplicate & Clean Before Chunking

Always pre-process:

  • Remove headers/footers repeating on every page

  • Normalize formatting

  • Remove special characters

  • Collapse empty lines

  • Fix broken sentences
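The pre-processing steps above can be sketched as a single cleaning pass. The header/footer heuristic (drop any line that repeats on every page) is an assumption for illustration; real extractors also use position on the page.

```python
import re

def clean_page_text(pages: list[str]) -> str:
    """Pre-process extracted page text: strip lines repeated on every
    page (headers/footers), drop blank lines, and re-join sentences
    broken across line breaks."""
    # A line appearing on every page is treated as a header/footer.
    line_sets = [set(p.splitlines()) for p in pages]
    repeated = set.intersection(*line_sets) if pages else set()
    kept = []
    for page in pages:
        for line in page.splitlines():
            if line.strip() and line not in repeated:
                kept.append(line.strip())
    text = "\n".join(kept)
    # Re-join a sentence broken mid-line: a newline not preceded by
    # terminal punctuation, followed by a lowercase continuation.
    text = re.sub(r"(?<![.!?:])\n(?=[a-z])", " ", text)
    return text
```

Running this before chunking keeps boilerplate out of the embeddings and prevents half-sentences from landing in separate chunks.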

9. Chunk Differently by Content Type

Documents (policies, PDFs) → 300–500 token semantic chunks

Code → split by functions/classes

Tables → preserve whole rows or logical segments

JSON/XML → chunk by node/block

Presentations → chunk slide-wise

Chat logs → chunk by session threads
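The content-type routing above can be sketched as a simple dispatcher. Each splitter here is a deliberately simplified stand-in (e.g., regex on `def `/`class ` for code); real pipelines use format-aware parsers.

```python
import re

def chunk_by_type(content: str, content_type: str) -> list[str]:
    """Route content to a type-appropriate splitter, per the list above.
    Splitters are simplified stand-ins for illustration."""
    if content_type == "code":
        # Split on top-level function/class definitions.
        parts = re.split(r"\n(?=def |class )", content)
        return [p for p in parts if p.strip()]
    if content_type == "table":
        # Preserve whole rows; never split a row mid-cell.
        return [row for row in content.splitlines() if row.strip()]
    if content_type == "chat":
        # One chunk per session thread, separated by blank lines.
        return [t.strip() for t in content.split("\n\n") if t.strip()]
    # Default: document-style paragraph splitting.
    return [p.strip() for p in content.split("\n\n") if p.strip()]
```

In practice the `content_type` tag would come from the ingestion pipeline (file extension, MIME type, or a classifier).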

10. Evaluate Chunk Quality

Test with three questions:

  1. Can the chunk be retrieved correctly for a query?

  2. Does the chunk carry complete meaning?

  3. Would splitting further lose semantic clarity?

If YES to all → good chunk.

💡 Recommended Configuration (Enterprise Standard)

  • Semantic split first

  • Max tokens: 350–450

  • Overlap: 50 tokens

  • Metadata: full document context

  • Hybrid chunking for PDFs

  • Special logic for tables/code
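These defaults can be captured in a small config object that the chunking pipeline reads. The class and field names are illustrative, not from any particular library.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class ChunkingConfig:
    """Enterprise-standard chunking defaults recommended above."""
    semantic_split_first: bool = True
    max_tokens: int = 400          # within the 350–450 band
    overlap_tokens: int = 50
    attach_full_metadata: bool = True
    hybrid_for_pdfs: bool = True
    special_tables_and_code: bool = True

DEFAULT_CONFIG = ChunkingConfig()
```

Freezing the dataclass keeps the configuration immutable, so every document in a corpus is chunked with identical parameters.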
