RAG
- Anand Nerurkar
- Nov 25
1. RAG PIPELINES — Enterprise-Grade Reference Architecture
RAG in enterprise = 4 layers
Ingestion & Preprocessing
Indexing & Storage
Retrieval & Ranking
Generation & Guardrails
1.1 RAG Pipeline — End-to-End Architecture
A. Document Ingestion Layer
OCR (AWS Textract / Azure Form Recognizer / Tesseract)
PII masking (Rule-based + ML-based)
Document classification (SVM/BERT/LLMs)
Chunking (semantic-aware: sentences, headings)
Normalization (clean, dedupe, flatten PDFs)
B. Embedding + Indexing Layer
Embedding model selection:
Instruction-based embeddings (OpenAI text-embedding-3-large)
Domain fine-tuned embeddings (finance, AML, onboarding)
Metadata:
doc_id
version
policy_type
validity_date
regulatory_flag
Vector DB choices:
Postgres + pgvector (regulated BFSI)
Pinecone
Weaviate
Milvus
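A minimal sketch of this layer, assuming the OpenAI embeddings API and a pgvector table; the table name policy_chunks and its columns mirror the metadata fields above but are otherwise illustrative:

```python
# Minimal sketch: embed one chunk and store it with policy metadata in pgvector.
# Assumes a table:
#   CREATE TABLE policy_chunks (doc_id text, version text, policy_type text,
#     validity_date date, regulatory_flag bool, content text, embedding vector(3072));
from datetime import date
from openai import OpenAI
import psycopg

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def index_chunk(conn, chunk: str, meta: dict) -> None:
    emb = client.embeddings.create(
        model="text-embedding-3-large", input=[chunk]
    ).data[0].embedding
    vec_literal = "[" + ",".join(str(x) for x in emb) + "]"  # pgvector text format
    conn.execute(
        """INSERT INTO policy_chunks
           (doc_id, version, policy_type, validity_date, regulatory_flag, content, embedding)
           VALUES (%s, %s, %s, %s, %s, %s, %s::vector)""",
        (meta["doc_id"], meta["version"], meta["policy_type"],
         meta["validity_date"], meta["regulatory_flag"], chunk, vec_literal),
    )

with psycopg.connect("dbname=rag") as conn:
    index_chunk(conn, "LTV for salaried customers must not exceed 80%.",
                {"doc_id": "LEND-POL-12", "version": "v7", "policy_type": "lending",
                 "validity_date": date(2025, 1, 1), "regulatory_flag": True})
```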
C. Retrieval Layer
Hybrid Retrieval
Vector search
BM25
Dense + Sparse Fusion
Re-ranking
Cross-encoder (e.g., bge-reranker)
LLM re-ranker (costly but accurate)
Retrieval Filters
Recency filter (updated policy only)
Version filter
Tenant filter (ICICI vs HDFC)
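As a sketch of dense + sparse fusion, a common choice is Reciprocal Rank Fusion (RRF) over the two ranked lists; the inputs here are assumed to be chunk ids returned by BM25 and by the vector index, best first:

```python
# Minimal sketch: fuse BM25 and vector-search rankings with Reciprocal Rank Fusion.
from collections import defaultdict

def rrf_fuse(bm25_hits: list[str], vector_hits: list[str], k: int = 60) -> list[str]:
    scores: dict[str, float] = defaultdict(float)
    for hits in (bm25_hits, vector_hits):
        for rank, chunk_id in enumerate(hits, start=1):
            scores[chunk_id] += 1.0 / (k + rank)   # standard RRF weighting
    return sorted(scores, key=scores.get, reverse=True)

fused = rrf_fuse(["c3", "c1", "c7"], ["c1", "c9", "c3"])
print(fused)   # chunks ranked high in both lists float to the top, e.g. ['c1', 'c3', ...]
```

The fused list would then go to the cross-encoder or LLM re-ranker before generation.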
D. Generation Layer
Response synthesis
Policy-grounded LLM
Answer reliability scoring
Hallucination detection
Proximity score threshold
Coverage test
Answer consistency check
🟩 1.2 RAG Key Design Principles
1. Vector Consistency
Invalidate & rebuild embeddings when policy version changes
Maintain index freshness SLA (e.g., 5 minutes after update)
2. Retrieval Safety
“Grounded-Only Mode” → if the best retrieval score falls below the similarity threshold, respond:
“Answer not found in policy.”
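A minimal sketch of Grounded-Only Mode, where the threshold value and the shape of the retrieval results are assumptions:

```python
# Minimal sketch: refuse to answer when retrieval confidence is too low.
SIMILARITY_THRESHOLD = 0.75  # illustrative value; tune per corpus and embedding model

def grounded_answer(question: str, retrieved: list[dict], generate) -> str:
    """`retrieved` items are assumed to look like {"text": str, "score": float}."""
    strong = [c for c in retrieved if c["score"] >= SIMILARITY_THRESHOLD]
    if not strong:
        return "Answer not found in policy."
    context = "\n\n".join(c["text"] for c in strong)
    return generate(
        f"Answer strictly from the policy excerpts below.\n\n{context}\n\nQ: {question}"
    )
```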
3. Observability
Log: retrieval candidates, final chunks, hallucination score
2. MULTI-AGENT WORKFLOWS — Enterprise Architecture Blueprint
AI systems are shifting from one large model to multiple smaller agents, each with a defined role.
2.1 Multi-Agent Types
1. Orchestrator Agent
Top-level planner
Breaks tasks into sub-tasks
Decides which agent handles each step
Ensures compliance + governance
2. Specialist Agents
Domain expert agent (e.g., Lending Policy Agent)
KYC agent
Risk decision agent
Fraud scoring agent
Tech architecture agent
SQL/data extraction agent
Code generation agent
3. Tool Agents
OCR agent
Vector DB agent
Search agent
API caller agent
ETL/data prep agent
4. Guardrail & Safety Agents
Policy compliance checker
PII auditor
Hallucination detector
Version consistency checker
2.2 Multi-Agent Workflow – Example (Digital Lending)
User: "Tell me whether this candidate is eligible for loan."
Step-by-step Flow
Orchestrator Agent
Detects need for: OCR, vector DB retrieval, risk scoring
Creates workflow plan
OCR Agent
Extracts text from KYC PDF
Data Extraction Agent
Extracts name, PAN, salary, employment type
Policy Retrieval Agent (RAG)
Retrieves lending criteria from vector DB
Credit Score Agent
Calls score service
Risk Decision Agent
Combines OCR + data + rules + policy + risk models
Compliance Agent
Ensures decision is policy-grounded
Response Generator Agent
Produces the final explanation
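The flow above can be sketched as an orchestrator that walks a planned sequence of agents over a shared state; every agent name and return value below is illustrative:

```python
# Minimal sketch: an orchestrator dispatching a fixed lending workflow to agents.
# Each agent is modelled as a callable taking and returning a shared state dict.
from typing import Callable

State = dict
Agent = Callable[[State], State]

def orchestrate(state: State, plan: list[tuple[str, Agent]]) -> State:
    for step_name, agent in plan:
        state = agent(state)                                    # each agent enriches the state
        state.setdefault("audit_trail", []).append(step_name)   # kept for compliance review
    return state

# Illustrative agents (real ones would call OCR, the vector DB, score services, etc.)
def ocr_agent(s):        s["kyc_text"] = "...extracted text..."; return s
def extract_agent(s):    s["applicant"] = {"pan": "ABCDE1234F", "salary": 85000}; return s
def policy_rag_agent(s): s["policy_chunks"] = ["Min salary 25k", "Max LTV 80%"]; return s
def risk_agent(s):       s["decision"] = "APPROVE"; return s
def compliance_agent(s): s["policy_grounded"] = True; return s

result = orchestrate({"kyc_pdf": "kyc.pdf"}, [
    ("ocr", ocr_agent), ("extract", extract_agent),
    ("policy_rag", policy_rag_agent), ("risk", risk_agent),
    ("compliance", compliance_agent),
])
print(result["decision"], result["audit_trail"])
```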
2.3 Multi-Agent Patterns
A. ReAct (Reason → Act → Observe → Refine)
Use when tasks need iterative reasoning (see the loop sketch after this list).
B. Hierarchical Agents
One “boss”, many “workers”.
C. Swarm (Autonomous Collaboration)
Agents message each other to refine outputs.
D. Toolformer Pattern
LLM chooses tools dynamically.
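A minimal sketch of the ReAct loop from pattern A, where the `llm` and `tools` interfaces are assumptions:

```python
# Minimal sketch of a ReAct loop: the model alternates Thought -> Action -> Observation
# until it emits a final answer. `llm(prompt)` is assumed to return a dict with keys
# "thought", optional "final_answer", "action", and "action_input".
def react_loop(question: str, llm, tools: dict, max_steps: int = 5) -> str:
    transcript = f"Question: {question}\n"
    for _ in range(max_steps):
        step = llm(transcript + "Thought:")           # reason about the next move
        transcript += f"Thought: {step['thought']}\n"
        if step.get("final_answer"):
            return step["final_answer"]
        observation = tools[step["action"]](step["action_input"])   # act
        transcript += (f"Action: {step['action']}[{step['action_input']}]\n"
                       f"Observation: {observation}\n")              # observe, then refine
    return "Stopped: maximum reasoning depth reached."
```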
2.4 Multi-Agent Guardrails
Task deduplication
Loop detection
Maximum depth per agent
Cross-agent memory
Structured communication (“thought”, “action”, “observation”)
Hallucination scoring per agent
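A minimal sketch of two of these guardrails (idempotent task keys for deduplication plus a per-agent depth limit); the key format and limits are illustrative:

```python
# Minimal sketch: idempotent task keys for deduplication plus a per-agent depth limit.
import hashlib

MAX_DEPTH_PER_AGENT = 3            # illustrative limit
_seen_tasks: set[str] = set()
_agent_depth: dict[str, int] = {}

def task_key(agent: str, payload: str) -> str:
    return hashlib.sha256(f"{agent}:{payload}".encode()).hexdigest()

def should_run(agent: str, payload: str) -> bool:
    key = task_key(agent, payload)
    if key in _seen_tasks:                                    # duplicate task -> skip
        return False
    if _agent_depth.get(agent, 0) >= MAX_DEPTH_PER_AGENT:     # loop / runaway agent
        return False
    _seen_tasks.add(key)
    _agent_depth[agent] = _agent_depth.get(agent, 0) + 1
    return True
```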
3. MODEL EVALUATION HARNESS — LLMOps Architecture
A Model Evaluation Harness ensures your models are: ✔ reliable ✔ accurate ✔ grounded ✔ safe ✔ robust
This is mandatory for BFSI workloads such as lending, onboarding, and fraud.
3.1 Types of Evaluation
1. Functional Evaluation
Correctness
Completeness
Clarity
2. Groundedness Evaluation
Based only on retrieved context
Compute:
Faithfulness
Relevance
Coverage
3. Safety Evaluation
Bias testing
PII protection
Regulatory compliance
4. Adversarial / Red Team Testing
Injection attacks
Prompt jailbreaks
Policy override attempts
Refusal testing
5. Latency Evaluation
Time to retrieval
Time to first token
End-to-end latency
3.2 Evaluation Harness Architecture
User Query
│
▼
LLM Pipeline Under Test
│
├──> Capture Retrieval Chunks
├──> Capture Model Output
└──> Capture Thought/Reasoning (hidden)
│
▼
Evaluation Runner
│
├── Functional Tests
├── Groundedness Tests
├── Safety Tests
├── Red Team Tests
└── Regression Tests
│
▼
Metrics & Dashboard
3.3 Evaluation Metrics (Enterprise)
Functional
Answer correctness
Completeness score
Answer length deviation
Groundedness
Chunk coverage (%)
Faithfulness score
Retrieval relevance
Safety
PII leakage
Offensive content
Regulatory-compliance score
Red Team
Jailbreak resistance
Prompt-injection susceptibility
Performance
TTFT
Tokens used
Cost per query
3.4 Harness Outputs
Pass/Fail summary
Detailed failure cases
Explainability report
Policy grounding heat-map
Regression drift chart
Model version comparison
3.5 When to Run Evaluation Harness
Before deployment
Before policy change
After embedding refresh
Daily scheduled run
Before customer demo
Before releasing new agent
“We built a context-integrity microservice to solve three enterprise problems with LLMs: token explosion, context drift, and untrusted retrievals. The service stores a canonical session state (session_id, step_id, policy and embedding versions, active chunk pointers) in a lightweight hot store (Redis) with durable snapshots in Postgres. We roll older conversation turns into compact rolling summaries using a summarizer worker, so the model gets only the essential state plus the last 3–5 messages.
For RAG we enforce metadata filters (tenant, policy_version, embedding_version) and a minimum similarity threshold so the model can only base answers on verified chunks. We detect semantic drift by comparing prompt embeddings with the last-context embedding — if similarity falls below 0.65 we rehydrate state and re-run retrievals.
To prevent loops and duplicate work we use idempotent task keys and strict tool response schemas; automatic retries are limited to a single retry flagged by the tool. Finally, a Model Evaluation Harness captures retrieval candidates and model outputs for functional, groundedness, safety, and adversarial testing, enabling regression detection and compliance reporting. This design achieves robust, auditable, and low-cost LLM operations for regulated financial workflows.”
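The drift check described above reduces to a cosine-similarity comparison; the 0.65 threshold comes from the design, the rest is illustrative:

```python
# Minimal sketch: detect semantic drift between the new prompt and the last context.
import math

DRIFT_THRESHOLD = 0.65   # from the design above

def cosine(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

def should_rehydrate(prompt_emb: list[float], last_context_emb: list[float]) -> bool:
    """Return True when session state should be rehydrated and retrieval re-run."""
    return cosine(prompt_emb, last_context_emb) < DRIFT_THRESHOLD
```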
✅ What type of chunking is best? (Short Answer)
Hybrid Semantic + Recursive chunking is currently the most reliable and production-proven approach for 95% of enterprise RAG workloads.
But the best chunking depends on:
Document type (policies, contracts, logs, code, emails)
Downstream task (search vs QA vs reasoning vs extraction)
LLM size/window
Retrieval architecture (RAG vs RAG-Fusion vs ColBERT)
✅ Top 7 Chunking Strategies (w/ When to Use Each)
1. Fixed-size chunking (e.g., 500–1000 tokens)
How it works: break text by token count.
Pros: simple, stable performance, a solid baseline.
Cons: may split semantic units; needs overlap (see the sketch after this list).
Use when:
Logs, emails, transcripts
High-volume ingestion
Simpler RAG Q&A
Perfect for fallback chunker
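A minimal sketch of fixed-size chunking with overlap, assuming the tiktoken tokenizer; with 20–30% overlap this is also the sliding-window variant described in strategy 4 below:

```python
# Minimal sketch: fixed-size token chunking with overlap.
import tiktoken

def fixed_size_chunks(text: str, size: int = 800, overlap: int = 200) -> list[str]:
    enc = tiktoken.get_encoding("cl100k_base")
    tokens = enc.encode(text)
    chunks, start = [], 0
    while start < len(tokens):
        chunks.append(enc.decode(tokens[start:start + size]))
        start += size - overlap        # slide the window forward, keeping some overlap
    return chunks
```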
2. Semantic / Embedding-based chunking
How it works: split text based on semantic boundaries (where embedding similarity drops).
Pros: preserves meaning; fewer hallucinations.
Cons: compute-heavy at ingestion.
Use when:
Policies, legal docs, contracts, standards
Banking circulars, RBI guidelines
Documents with irregular structure
Knowledge retrieval with high accuracy requirements
3. Recursive Hierarchical Chunking (RHC) — recommended default
How it works:
Split by large structural boundaries (H1, H2, sections)
If too large, split by paragraphs
If still large, split by sentences
Only last fallback: fixed tokens
Pros:
Follows document structure
High answer accuracy
Low hallucination rates
Best for long PDF/policies
Use when:
PDFs with hierarchy
Long-form documents (policies, manuals, SOPs)
Multi-agent RAG workflows
Banking/insurance policy ingestion
This is the industry standard (OpenAI Cookbook, LangChain, LlamaIndex).
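Since the libraries above ship this pattern, a minimal sketch using LangChain's RecursiveCharacterTextSplitter (parameter values are illustrative):

```python
# Minimal sketch: recursive hierarchical chunking via LangChain's text splitter.
from langchain_text_splitters import RecursiveCharacterTextSplitter

policy_document_text = "..."   # replace with the loaded policy/manual text

splitter = RecursiveCharacterTextSplitter(
    chunk_size=1000,
    chunk_overlap=150,
    separators=["\n\n", "\n", ". ", " ", ""],   # sections -> paragraphs -> sentences -> fallback
)
chunks = splitter.split_text(policy_document_text)
```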
4. Sliding Window / Overlap Chunking (20–30% overlap)
How it works: each chunk overlaps with the previous/next chunk.
Pros: preserves cross-sentence context, improves QA grounding.
Cons: more storage + compute.
Use when:
High-stakes QA (compliance, legal, contracts)
When answers depend on contextual flow
Multi-sentence reasoning tasks
5. Semantic Graph Chunking (advanced)
How it works: create nodes from paragraphs; edges based on semantic coherence.
Pros: amazing for cross-referencing, multi-hop QA.
Cons: expensive, complex.
Use when:
Multi-hop reasoning
Large knowledge graphs
Enterprise search at scale
Similar ideas appear in DeepMind’s RETRO and Microsoft’s GraphRAG.
6. Layout-aware Chunking (for PDFs, forms, tables)
How it works: preserves spatial structure (x/y coordinates) using OCR metadata.
Pros: best for complex PDFs.
Cons: requires an OCR toolchain.
Use when:
Bank statements
Insurance forms
Invoices
PDF with tables and footnotes
In production GenAI systems, layout-aware chunking is a must for forms.
7. Code-aware chunking
How it works: split at logical boundaries (classes, functions, imports).
Use when:
Code assistants
Internal engineering knowledge-bases
✅ What Are Embeddings?
Embeddings are numerical representations of text, images, documents, or objects that capture their meaning, context, and relationships — encoded as high-dimensional vectors.
Example: “Loan eligibility” → [0.234, -0.554, 0.192, ...] (a 1536-D vector)
Two concepts that “mean similar things” end up near each other in vector space.
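A toy sketch of “near each other in vector space”, using 3-D stand-ins for real 1536-D embeddings:

```python
# Minimal sketch: "near in vector space" as a nearest-neighbour lookup.
import numpy as np

corpus = {
    "loan eligibility rules":   np.array([0.23, -0.55, 0.19]),  # toy 3-D stand-ins
    "maximum LTV for salaried": np.array([0.25, -0.50, 0.21]),  # for real 1536-D vectors
    "cafeteria lunch menu":     np.array([-0.70, 0.10, 0.60]),
}
query = np.array([0.24, -0.53, 0.20])   # embedding of "is this customer loan-eligible?"

def best_match(query_vec, corpus):
    sims = {text: float(query_vec @ vec / (np.linalg.norm(query_vec) * np.linalg.norm(vec)))
            for text, vec in corpus.items()}
    return max(sims, key=sims.get), sims

print(best_match(query, corpus))   # the loan-related texts score far above the lunch menu
```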
📌 Why Enterprises Use Embeddings (Simple → Deep)
⭐ 1. Semantic Search (RAG)
Instead of keyword search, embeddings let you search by meaning.
Query:“maximum LTV allowed for salaried customers”
Vector search retrieves the correct policy rule even if exact words differ.
Used in:
Lending policies
KYC rules
RBI circulars
SOPs
Operational checklists
MF/Insurance compliance
⭐ 2. Document Understanding at Scale
Embeddings let enterprises convert large PDFs, emails, contracts, KYC docs into searchable numeric vectors.
Works across:
Policies
SOPs
Process documents
Product guidelines
Risk frameworks
Training materials
⭐ 3. Multi-Agent Systems Need Embeddings to Share Knowledge
Agents use embedding-based retrieval to store and fetch:
decisions
constraints
conversation context
memory states
policies
customer profiles
Without embeddings → agents forget context or hallucinate.
⭐ 4. Grounding LLMs → Reduce Hallucination by 60–80%
LLMs hallucinate because they rely on general training knowledge. Enterprises want factual answers based on private documents (policies, rules).
Embeddings let you:
Store your documents in vector DB
Retrieve the exact chunks relevant to the question
Feed back into LLM
Get grounded, policy-correct output
This is the core of RAG (Retrieval-Augmented Generation).
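A minimal sketch of that loop, assuming the OpenAI Python client and a hypothetical `vector_search` helper over the vector DB (model names are examples):

```python
# Minimal sketch of the RAG loop: embed the question, retrieve chunks, generate grounded output.
from openai import OpenAI

client = OpenAI()

def answer_from_policy(question: str, vector_search) -> str:
    """`vector_search(embedding, top_k)` is a hypothetical helper over the vector DB."""
    q_emb = client.embeddings.create(
        model="text-embedding-3-small", input=[question]).data[0].embedding
    chunks = vector_search(q_emb, top_k=5)                 # retrieve relevant policy chunks
    context = "\n\n".join(chunks)
    completion = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[
            {"role": "system",
             "content": "Answer only from the policy excerpts provided. "
                        "If the answer is not present, say so."},
            {"role": "user", "content": f"Policy excerpts:\n{context}\n\nQuestion: {question}"},
        ],
    )
    return completion.choices[0].message.content
```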
⭐ 5. Matching, Classification & Clustering
Embeddings allow systems to identify:
Similar customers
Similar claims
Similar credit behaviors
Similar transactions (fraud)
Similar disputes
Duplicate documents
Similar complaints
This reduces operational workload by 40–60%.
⭐ 6. Risk Analytics & Fraud Detection
Embedding-based ML detects patterns better than rule-based systems.
Examples:
Similar fraud patterns across accounts
Similar unusual income flows
Similar document tampering signals
Embeddings allow you to detect latent risk, not just explicit rules.
⭐ 7. Personalization & Recommendations
Enterprises use embeddings for:
Personal finance advice
Mutual fund recommendations
Insurance riders
Fraud dispute actions
Ticket routing
All done through similarity.
⭐ 8. Cross-Document Reasoning in Lending & KYC
To evaluate a loan, an agent must “connect”:
KYC identity
Income stability
Bank patterns
Lending policy
Product rules
Exceptions
Embeddings allow the system to:
✔ fetch the right policy ✔ understand the user profile ✔ apply relevant rules ✔ justify reasoning
📌 Why Are Embeddings Better Than Keyword Search?
| Feature | Keyword Search | Embeddings |
| --- | --- | --- |
| Understands meaning | ❌ no | ✅ yes |
| Handles synonyms | ❌ no | ✅ yes |
| Understands context | ❌ no | ✅ yes |
| Finds related policies | ❌ poor | ✅ excellent |
| Multi-language | ❌ no | ✅ yes |
| Fuzzy matching | ❌ manual | ✅ built-in |
| Cross-document reasoning | ❌ difficult | ✅ natural |
📌 Where Embeddings Fit in Enterprise Architecture
Input → Chunking → Embedding → Vector DB → Retrieval → LLM → Output
Works with:
Spring AI
LangChain4j
Azure AI Search
Pinecone
Qdrant
pgvector
Weaviate
Embeddings are the foundation of enterprise GenAI.
🔥 Short Executive Summary for Interviews
“Embeddings convert enterprise documents into numeric vectors that capture meaning, not keywords. This enables semantic search, policy reasoning, multi-agent coordination, and factual RAG. It dramatically reduces hallucination, improves accuracy, and allows enterprises to use LLMs safely on private data.”