RAG/
- Anand Nerurkar
- Nov 25
1. RAG PIPELINES — Enterprise-Grade Reference Architecture
RAG in enterprise = 4 layers
Ingestion & Preprocessing
Indexing & Storage
Retrieval & Ranking
Generation & Guardrails
1.1 RAG Pipeline — End-to-End Architecture
A. Document Ingestion Layer
OCR (AWS Textract / Azure Form Recognizer / Tesseract)
PII masking (Rule-based + ML-based)
Document classification (SVM/BERT/LLMs)
Chunking (semantic-aware: sentences, headings)
Normalization (clean, dedupe, flatten PDFs)
B. Embedding + Indexing Layer
Embedding model selection:
Instruction-based embeddings (OpenAI text-embedding-3-large)
Domain fine-tuned embeddings (finance, AML, onboarding)
Metadata:
doc_id
version
policy_type
validity_date
regulatory_flag
Vector DB choices:
Postgres + pgvector (regulated BFSI)
Pinecone
Weaviate
Milvus
C. Retrieval Layer
Hybrid Retrieval
Vector search
BM25
Dense + Sparse Fusion
Re-ranking
Cross-encoder (e.g., bge-reranker)
LLM re-ranker (costly but accurate)
Retrieval Filters
Recency filter (updated policy only)
Version filter
Tenant filter (ICICI vs HDFC); a retrieval sketch follows after this outline
D. Generation Layer
Response synthesis
Policy-grounded LLM
Answer reliability scoring
Hallucination detection
Proximity score threshold
Coverage test
Answer consistency check
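To make the retrieval layer concrete, here is a minimal sketch of metadata-filtered vector retrieval against Postgres + pgvector using Spring's JdbcTemplate. The table and column names (policy_chunks, embedding, tenant_id, policy_version, valid_to) and the SQL shape are illustrative assumptions, not part of the reference architecture itself.

import java.util.List;
import java.util.Map;
import org.springframework.jdbc.core.JdbcTemplate;

public class PolicyRetriever {

    private final JdbcTemplate jdbc;

    public PolicyRetriever(JdbcTemplate jdbc) {
        this.jdbc = jdbc;
    }

    public List<Map<String, Object>> retrieve(String tenantId, String policyVersion,
                                              float[] queryEmbedding, int topK) {
        // <=> is pgvector's cosine-distance operator; lower distance means more similar.
        String sql = """
                SELECT doc_id, chunk_text, 1 - (embedding <=> ?::vector) AS similarity
                FROM policy_chunks
                WHERE tenant_id = ?
                  AND policy_version = ?
                  AND valid_to >= now()
                ORDER BY embedding <=> ?::vector
                LIMIT ?
                """;
        String vectorLiteral = toVectorLiteral(queryEmbedding);
        return jdbc.queryForList(sql, vectorLiteral, tenantId, policyVersion, vectorLiteral, topK);
    }

    private String toVectorLiteral(float[] v) {
        // pgvector accepts a '[x1,x2,...]' text literal cast to ::vector.
        StringBuilder sb = new StringBuilder("[");
        for (int i = 0; i < v.length; i++) {
            if (i > 0) sb.append(',');
            sb.append(v[i]);
        }
        return sb.append(']').toString();
    }
}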
🟩 1.2 RAG Key Design Principles
1. Vector Consistency
Invalidate & rebuild embeddings when policy version changes
Maintain index freshness SLA (e.g., 5 minutes after update)
2. Retrieval Safety
“Grounded-Only Mode”: if the best retrieval similarity falls below the threshold, respond with
“Answer not found in policy.” (see the sketch at the end of this section)
3. Observability
Log: retrieval candidates, final chunks, hallucination score
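A minimal sketch of the “Grounded-Only Mode” check from principle 2, assuming retrieval results carry a similarity score; the 0.75 threshold is an illustrative value to be tuned per embedding model.

import java.util.List;
import java.util.Optional;

public class GroundedOnlyGuard {

    private static final double SIMILARITY_THRESHOLD = 0.75; // illustrative; tune per embedding model

    // Returns a refusal message when no retrieved chunk clears the threshold, empty otherwise.
    public Optional<String> check(List<ScoredChunk> retrieved) {
        boolean grounded = retrieved.stream()
                .anyMatch(c -> c.similarity() >= SIMILARITY_THRESHOLD);
        return grounded ? Optional.empty() : Optional.of("Answer not found in policy.");
    }

    public record ScoredChunk(String docId, String text, double similarity) {}
}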
2. MULTI-AGENT WORKFLOWS — Enterprise Architecture Blueprint
AI Systems are shifting from one large model → multiple small agents, each with a role.
2.1 Multi-Agent Types
1. Orchestrator Agent
Top-level planner
Breaks tasks into sub-tasks
Decides which agent handles each step
Ensures compliance + governance
2. Specialist Agents
Domain expert agent (e.g., Lending Policy Agent)
KYC agent
Risk decision agent
Fraud scoring agent
Tech architecture agent
SQL/data extraction agent
Code generation agent
3. Tool Agents
OCR agent
Vector DB agent
Search agent
API caller agent
ETL/data prep agent
4. Guardrail & Safety Agents
Policy compliance checker
PII auditor
Hallucination detector
Version consistency checker
2.2 Multi-Agent Workflow – Example (Digital Lending)
User: "Tell me whether this candidate is eligible for loan."
Step-by-step Flow
Orchestrator Agent
Detects need for: OCR, vector DB retrieval, risk scoring
Creates workflow plan
OCR Agent
Extracts text from KYC PDF
Data Extraction Agent
Extracts name, PAN, salary, employment type
Policy Retrieval Agent (RAG)
Retrieves lending criteria from vector DB
Credit Score Agent
Calls score service
Risk Decision Agent
Combines OCR + data + rules + policy + risk models
Compliance Agent
Ensures decision is policy-grounded
Response Generator Agent
Produces the final explanation
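One possible way to represent the orchestrator's plan for this flow as data; the agent and task names simply mirror the steps above and are illustrative assumptions, not a fixed API.

import java.util.List;

record WorkflowStep(String agentName, String task) {}

class LendingWorkflowPlan {
    // The orchestrator's plan, mirroring the step-by-step flow above.
    static List<WorkflowStep> plan() {
        return List.of(
                new WorkflowStep("OcrAgent", "Extract text from the KYC PDF"),
                new WorkflowStep("DataExtractionAgent", "Extract name, PAN, salary, employment type"),
                new WorkflowStep("PolicyRetrievalAgent", "Retrieve lending criteria from the vector DB"),
                new WorkflowStep("CreditScoreAgent", "Call the credit score service"),
                new WorkflowStep("RiskDecisionAgent", "Combine data, rules, policy and risk models"),
                new WorkflowStep("ComplianceAgent", "Verify the decision is policy-grounded"),
                new WorkflowStep("ResponseGeneratorAgent", "Produce the final explanation")
        );
    }
}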
2.3 Multi-Agent Patterns
A. ReAct (Reason → Act → Observe → Refine)
Use when tasks need iterative reasoning.
B. Hierarchical Agents
One “boss”, many “workers”.
C. Swarm (Autonomous Collaboration)
Agents message each other to refine outputs.
D. Toolformer Pattern
LLM chooses tools dynamically.
2.4 Multi-Agent Guardrails
Task deduplication
Loop detection
Maximum depth per agent
Cross-agent memory
Structured communication (“thought”, “action”, “observation”)
Hallucination scoring per agent
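A minimal sketch of two of these guardrails, maximum depth per agent and loop detection via idempotent task keys; all names and the depth limit are illustrative assumptions.

import java.util.HashSet;
import java.util.Set;

public class AgentGuardrails {

    private static final int MAX_DEPTH = 5;          // maximum delegation depth per request
    private final Set<String> seenTaskKeys = new HashSet<>();

    // Returns false if the task would exceed the depth budget or repeat earlier work.
    public boolean allow(String agentName, String taskKey, int depth) {
        if (depth > MAX_DEPTH) {
            return false;                            // depth guard
        }
        // Loop/duplicate guard: the same agent asking for the same task key again is rejected.
        return seenTaskKeys.add(agentName + "::" + taskKey);
    }
}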
3. MODEL EVALUATION HARNESS — LLMOps Architecture
A Model Evaluation Harness ensures your models are ✔ reliable, ✔ accurate, ✔ grounded, ✔ safe, and ✔ robust.
This is mandatory for BFSI, lending, onboarding, fraud.
3.1 Types of Evaluation
1. Functional Evaluation
Correctness
Completeness
Clarity
2. Groundedness Evaluation
Based only on retrieved context
Compute:
Faithfulness (see the sketch after this list)
Relevance
Coverage
3. Safety Evaluation
Bias testing
PII protection
Regulatory compliance
4. Adversarial / Red Team Testing
Injection attacks
Prompt jailbreaks
Policy override attempts
Refusal testing
5. Latency Evaluation
Time to retrieval
Time to first token
End-to-end latency
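A minimal sketch of the faithfulness check referenced under groundedness: the share of answer sentences supported by at least one retrieved chunk. The embed() function is a placeholder for your embedding client, and the 0.7 support threshold is an assumption.

import java.util.List;
import java.util.function.Function;

public class GroundednessEvaluator {

    private final Function<String, float[]> embed;   // plug in your embedding client here
    private static final double SUPPORT_THRESHOLD = 0.7;

    public GroundednessEvaluator(Function<String, float[]> embed) {
        this.embed = embed;
    }

    // Fraction of answer sentences that are similar enough to at least one retrieved chunk.
    public double faithfulness(String answer, List<String> retrievedChunks) {
        String[] sentences = answer.split("(?<=[.!?])\\s+");
        List<float[]> chunkVecs = retrievedChunks.stream().map(embed).toList();
        long supported = 0;
        for (String sentence : sentences) {
            float[] s = embed.apply(sentence);
            boolean ok = chunkVecs.stream().anyMatch(c -> cosine(s, c) >= SUPPORT_THRESHOLD);
            if (ok) supported++;
        }
        return sentences.length == 0 ? 0.0 : (double) supported / sentences.length;
    }

    static double cosine(float[] a, float[] b) {
        double dot = 0, na = 0, nb = 0;
        for (int i = 0; i < a.length; i++) {
            dot += a[i] * b[i];
            na += a[i] * a[i];
            nb += b[i] * b[i];
        }
        return dot / (Math.sqrt(na) * Math.sqrt(nb) + 1e-9);
    }
}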
3.2 Evaluation Harness Architecture
User Query
│
▼
LLM Pipeline Under Test
│
├──> Capture Retrieval Chunks
├──> Capture Model Output
└──> Capture Thought/Reasoning (hidden)
│
▼
Evaluation Runner
│
├── Functional Tests
├── Groundedness Tests
├── Safety Tests
├── Red Team Tests
└── Regression Tests
│
▼
Metrics & Dashboard
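A skeleton of what the Evaluation Runner above could look like in code; every type name here is an assumption for illustration, not a specific harness framework.

import java.util.List;
import java.util.function.Function;
import java.util.function.Predicate;

record EvalCase(String id, String query, String expectedBehaviour) {}
record EvalResult(String caseId, boolean passed, String detail) {}

class EvaluationRunner {

    // Run every case through the pipeline under test and score its output with the checks.
    List<EvalResult> run(List<EvalCase> cases,
                         Function<EvalCase, String> pipelineUnderTest,
                         List<Predicate<String>> checks) {
        return cases.stream()
                .map(c -> {
                    String output = pipelineUnderTest.apply(c);
                    boolean passed = checks.stream().allMatch(chk -> chk.test(output));
                    return new EvalResult(c.id(), passed, passed ? "ok" : "one or more checks failed");
                })
                .toList();
    }
}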
3.3 Evaluation Metrics (Enterprise)
Functional
Answer correctness
Completeness score
Answer length deviation
Groundedness
Chunk coverage (%)
Faithfulness score
Retrieval relevance
Safety
PII leakage
Offensive content
Regulatory-compliance score
Red Team
Jailbreak resistance
Prompt-injection susceptibility
Performance
TTFT
Tokens used
Cost per query
3.4 Harness Outputs
Pass/Fail summary
Detailed failure cases
Explainability report
Policy grounding heat-map
Regression drift chart
Model version comparison
3.5 When to Run Evaluation Harness
Before deployment
Before policy change
After embedding refresh
Daily scheduled run
Before customer demo
Before releasing new agent
“We built a context-integrity microservice to solve three enterprise problems with LLMs: token explosion, context drift, and untrusted retrievals. The service stores a canonical session state (session_id, step_id, policy & embedding versions, active chunk pointers) in a lightweight hot store (Redis) with durable snapshots in Postgres. We roll older conversation into compact rolling summaries using a summarizer worker so the model gets only the essential state plus the last 3–5 messages.
For RAG we enforce metadata filters (tenant, policy_version, embedding_version) and a minimum similarity threshold so the model can only base answers on verified chunks. We detect semantic drift by comparing prompt embeddings with the last-context embedding — if similarity falls below 0.65 we rehydrate state and re-run retrievals.
To prevent loops and duplicate work we use idempotent task keys and strict tool response schemas; automatic retries are limited to a single retry flagged by the tool. Finally, a Model Evaluation Harness captures retrieval candidates and model outputs for functional, groundedness, safety, and adversarial testing, enabling regression detection and compliance reporting. This design achieves robust, auditable, and low-cost LLM operations for regulated financial workflows.”
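A minimal sketch of the semantic-drift check described above, using the 0.65 similarity threshold mentioned in the text; the cosine implementation is generic and any embedding client can supply the vectors.

public class DriftDetector {

    private static final double DRIFT_THRESHOLD = 0.65;

    // True when the incoming prompt has drifted away from the last served context.
    public boolean hasDrifted(float[] promptEmbedding, float[] lastContextEmbedding) {
        return cosine(promptEmbedding, lastContextEmbedding) < DRIFT_THRESHOLD;
    }

    private static double cosine(float[] a, float[] b) {
        double dot = 0, na = 0, nb = 0;
        for (int i = 0; i < a.length; i++) {
            dot += a[i] * b[i];
            na += a[i] * a[i];
            nb += b[i] * b[i];
        }
        return dot / (Math.sqrt(na) * Math.sqrt(nb) + 1e-9);
    }
}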
✅ What type of chunking is best? (Short Answer)
Hybrid Semantic + Recursive chunking is currently the most reliable and production-proven approach for 95% of enterprise RAG workloads.
But the best chunking depends on:
Document type (policies, contracts, logs, code, emails)
Downstream task (search vs QA vs reasoning vs extraction)
LLM size/window
Retrieval architecture (RAG vs RAG-Fusion vs ColBERT)
✅ Top 7 Chunking Strategies (w/ When to Use Each)
1. Fixed-size chunking (e.g., 500–1000 tokens)
How it works: break text by token count.
Pros: simple, stable performance, baseline.
Cons: may split semantic units; needs overlap.
Use when:
Logs, emails, transcripts
High-volume ingestion
Simpler RAG Q&A
Works well as a fallback chunker (see the sketch below)
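A minimal fixed-size chunker with overlap, as referenced above; it counts whitespace-separated words instead of model tokens for simplicity, so swap in a real tokenizer for token-accurate chunks.

import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class FixedSizeChunker {

    // chunkSize and overlap are measured in words here for simplicity.
    public List<String> chunk(String text, int chunkSize, int overlap) {
        String[] words = text.split("\\s+");
        List<String> chunks = new ArrayList<>();
        int step = Math.max(1, chunkSize - overlap);   // each step leaves `overlap` words shared
        for (int start = 0; start < words.length; start += step) {
            int end = Math.min(words.length, start + chunkSize);
            chunks.add(String.join(" ", Arrays.copyOfRange(words, start, end)));
            if (end == words.length) break;            // last chunk reached
        }
        return chunks;
    }
}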
2. Semantic / Embedding-based chunking
How it works: split text based on semantic boundaries (embedding similarity drops).
Pros: preserves meaning; fewer hallucinations.
Cons: compute-heavy on ingestion.
Use when:
Policies, legal docs, contracts, standards
Banking circulars, RBI guidelines
Documents with irregular structure
Knowledge retrieval with high accuracy requirements
3. Recursive Hierarchical Chunking (RHC) — recommended default
How it works:
Split by large structural boundaries (H1, H2, sections)
If too large, split by paragraphs
If still large, split by sentences
Only last fallback: fixed tokens
Pros:
Follows document structure
High answer accuracy
Low hallucination rates
Best for long PDF/policies
Use when:
PDFs with hierarchy
Long-form documents (policies, manuals, SOPs)
Multi-agent RAG workflows
Banking/insurance policy ingestion
This is the industry standard (OpenAI Cookbook, LangChain, LlamaIndex); a sketch follows below.
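A simplified sketch of recursive hierarchical chunking: try coarse structural splits first and fall back to finer ones only when a piece is still over budget. The separators and the word-based size check are simplifying assumptions.

import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class RecursiveChunker {

    // Ordered from coarse to fine: section breaks, paragraphs, sentences.
    private static final String[] SEPARATORS = {"\n\n\n", "\n\n", "(?<=[.!?])\\s+"};

    public List<String> chunk(String text, int maxWords) {
        return split(text, 0, maxWords);
    }

    private List<String> split(String text, int level, int maxWords) {
        List<String> out = new ArrayList<>();
        if (text.isBlank()) {
            return out;
        }
        if (wordCount(text) <= maxWords) {
            out.add(text.strip());
            return out;
        }
        if (level >= SEPARATORS.length) {
            // Last resort: hard split by word count (fixed-size fallback).
            String[] words = text.split("\\s+");
            for (int i = 0; i < words.length; i += maxWords) {
                int end = Math.min(words.length, i + maxWords);
                out.add(String.join(" ", Arrays.copyOfRange(words, i, end)));
            }
            return out;
        }
        // Try the current separator; recurse with a finer one for any oversized piece.
        for (String piece : text.split(SEPARATORS[level])) {
            out.addAll(split(piece, level + 1, maxWords));
        }
        return out;
    }

    private static int wordCount(String s) {
        return s.trim().split("\\s+").length;
    }
}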
4. Sliding Window / Overlap Chunking (20–30% overlap)
How it works: each chunk overlaps with previous/next.
Pros: preserves cross-sentence context, improves QA grounding.
Cons: more storage + compute.
Use when:
High-stakes QA (compliance, legal, contracts)
When answers depend on contextual flow
Multi-sentence reasoning tasks
5. Semantic Graph Chunking (advanced)
How it works: create nodes from paragraphs; edges based on semantic coherence.
Pros: amazing for cross-referencing, multi-hop QA.
Cons: expensive, complex.
Use when:
Multi-hop reasoning
Large knowledge graphs
Enterprise search at scale
Related approaches include DeepMind’s RETRO and Microsoft’s GraphRAG.
6. Layout-aware Chunking (for PDFs, forms, tables)
How it works: preserves spatial structure (x/y coordinates) using OCR metadata.
Pros: best for complex PDFs.
Cons: requires OCR toolchain.
Use when:
Bank statements
Insurance forms
Invoices
PDF with tables and footnotes
In GenAI production, layout-aware chunking is a must for forms.
7. Code-aware chunking
How it works: split at logical boundaries (classes, functions, imports).
Use when:
Code assistants
Internal engineering knowledge-bases
✅ What Are Embeddings?
Embeddings are numerical representations of text, images, documents, or objects that capture their meaning, context, and relationships — encoded as high-dimensional vectors.
Example: “Loan eligibility” → [0.234, -0.554, 0.192, ...] (1536-D vector)
Two concepts that “mean similar things” end up near each other in vector space.
📌 Why Enterprises Use Embeddings (Simple → Deep)
⭐ 1. Semantic Search (RAG)
Instead of keyword search, embeddings let you search by meaning.
Query: “maximum LTV allowed for salaried customers”
Vector search retrieves the correct policy rule even if exact words differ.
Used in:
Lending policies
KYC rules
RBI circulars
SOPs
Operational checklists
MF/Insurance compliance
⭐ 2. Document Understanding at Scale
Embeddings let enterprises convert large PDFs, emails, contracts, KYC docs into searchable numeric vectors.
Works across:
Policies
SOPs
Process documents
Product guidelines
Risk frameworks
Training materials
⭐ 3. Multi-Agent Systems Need Embeddings to Share Knowledge
Agents use embedding-based retrieval to store and fetch:
decisions
constraints
conversation context
memory states
policies
customer profiles
Without embeddings → agents forget context or hallucinate.
⭐ 4. Grounding LLMs → Reduce Hallucination by 60–80%
LLMs hallucinate because they rely on general training knowledge. Enterprises want factual answers based on private documents (policies, rules).
Embeddings let you:
Store your documents in vector DB
Retrieve the exact chunks relevant to the question
Feed back into LLM
Get grounded, policy-correct output
This is the core of RAG (Retrieval-Augmented Generation).
⭐ 5. Matching, Classification & Clustering
Embeddings allow systems to identify:
Similar customers
Similar claims
Similar credit behaviors
Similar transactions (fraud)
Similar disputes
Duplicate documents
Similar complaints
This reduces operational workload by 40–60%.
⭐ 6. Risk Analytics & Fraud Detection
Embedding-based ML detects patterns better than rule-based systems.
Examples:
Similar fraud patterns across accounts
Similar unusual income flows
Similar document tampering signals
Embeddings allow you to detect latent risk, not just explicit rules.
⭐ 7. Personalization & Recommendations
Enterprises use embeddings for:
Personal finance advice
Mutual fund recommendations
Insurance riders
Fraud dispute actions
Ticket routing
All done through similarity.
⭐ 8. Cross-Document Reasoning in Lending & KYC
To evaluate a loan, an agent must “connect”:
KYC identity
Income stability
Bank patterns
Lending policy
Product rules
Exceptions
Embeddings allow the system to:
✔ fetch the right policy
✔ understand user profile
✔ apply relevant rules
✔ justify reasoning
📌 Why Are Embeddings Better Than Keyword Search?
| Feature | Keyword Search | Embeddings |
| --- | --- | --- |
| Understand meaning | ❌ no | ✅ yes |
| Handle synonyms | ❌ no | ✅ yes |
| Understand context | ❌ no | ✅ yes |
| Find related policies | ❌ poor | ✅ excellent |
| Multi-language | ❌ no | ✅ yes |
| Fuzzy matching | ❌ manual | ✅ built-in |
| Cross-document reasoning | ❌ difficult | ✅ natural |
📌 Where Embeddings Fit in Enterprise Architecture
Input → Chunking → Embedding → Vector DB → Retrieval → LLM → Output
Works with:
Spring AI
LangChain4j
Azure AI Search
Pinecone
Qdrant
pgvector
Weaviate
Embeddings are the foundation of enterprise GenAI.
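To show where each component sits, here is a hedged sketch of the pipeline above as composable stages; the interface names are illustrative assumptions, not the API of Spring AI, LangChain4j, or any specific framework.

import java.util.List;

interface Chunker     { List<String> chunk(String document); }
interface Embedder    { float[] embed(String text); }
interface VectorStore {
    void upsert(String docId, String chunk, float[] vector);
    List<String> search(float[] queryVector, int topK);
}
interface Llm         { String generate(String prompt); }

class RagPipeline {
    private final Chunker chunker;
    private final Embedder embedder;
    private final VectorStore store;
    private final Llm llm;

    RagPipeline(Chunker c, Embedder e, VectorStore v, Llm l) {
        this.chunker = c; this.embedder = e; this.store = v; this.llm = l;
    }

    // Ingestion path: Input → Chunking → Embedding → Vector DB
    void ingest(String docId, String document) {
        for (String chunk : chunker.chunk(document)) {
            store.upsert(docId, chunk, embedder.embed(chunk));
        }
    }

    // Query path: Retrieval → LLM → Output
    String answer(String question) {
        List<String> context = store.search(embedder.embed(question), 5);
        return llm.generate("Answer only from this context:\n"
                + String.join("\n", context) + "\n\nQ: " + question);
    }
}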
🔥 Short Executive Summary for Interviews
“Embeddings convert enterprise documents into numeric vectors that capture meaning, not keywords. This enables semantic search, policy reasoning, multi-agent coordination, and factual RAG. It dramatically reduces hallucination, improves accuracy, and allows enterprises to use LLMs safely on private data.”
✅ 1. Per-Request Logs (Prompt + Config + Outputs)
You should log the following for every LLM request:
| Category | What to Capture |
| --- | --- |
| Inputs | Prompt, system message, user message, retrieved chunks |
| Model config | temperature, top_p, max_tokens, model name |
| Outputs | generated tokens, token count, finish reason |
| Latency | model latency, total latency |
| Errors | model errors, timeouts |
Spring Boot Logging Interceptor Example
import java.util.Map;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;
import org.springframework.stereotype.Component;

@Component
public class LLMRequestLogger {

    private static final Logger log = LoggerFactory.getLogger(LLMRequestLogger.class);

    public void logLLMRequest(String prompt,
                              Map<String, Object> modelConfig,
                              String response,
                              long latencyMs) {
        // One structured event per LLM call; response length is in characters,
        // not tokens; swap in a tokenizer count if exact token usage is needed.
        log.info("LLM_REQUEST: {}", Map.of(
                "prompt", prompt,
                "model_config", modelConfig,
                "response_chars", response.length(),
                "latency_ms", latencyMs
        ));
    }
}
✅ 2. Orchestration Traces (Multi-agent Flow Logging)
For multi-agent systems (Spring AI + Agents), you must track:
| What to Trace | Example |
| --- | --- |
| Which agent executed | RouterAgent → KYCValidatorAgent |
| Which tool was invoked | PAN_OCR_SERVICE, CREDIT_SCORE_API |
| Tool input & output | Input PAN image, output structured JSON |
| Success / failure | Tool call failed → retry |
| Agent reasoning | (Store safely with filtered thoughts) |
Trace Event Model
public record AgentTraceEvent(
        String agentName,
        String toolName,
        Object toolInput,
        Object toolOutput,
        long startTime,
        long endTime,
        boolean success
) {}
Trace Logger
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;
import org.springframework.stereotype.Component;

@Component
public class AgentTraceLogger {

    private static final Logger log = LoggerFactory.getLogger(AgentTraceLogger.class);

    // Emit one structured trace event per agent/tool invocation.
    public void logTrace(AgentTraceEvent event) {
        log.info("AGENT_TRACE_EVENT: {}", event);
    }
}
✅ 3. Vector DB Access Logs (pgvector)
You must log:
| Field | Description |
| --- | --- |
| Embedding model | e.g., text-embedding-3-large |
| Query vector hash | (don’t log entire vector) |
| Similarity score | min / max / threshold |
| Document ID | which chunk retrieved |
| Metadata | section, policy version, chunk index |
Spring Boot pgvector Query Logging
// Assumes a class-level SLF4J logger and a VectorResult record exposing docId(), score(), metadata().
public void logVectorQuery(String queryText, List<VectorResult> results) {
    log.info("VECTOR_SEARCH: {}", Map.of(
            "query", queryText,
            "results", results.stream().map(r -> Map.of(
                    "doc_id", r.docId(),
                    "similarity", r.score(),
                    "metadata", r.metadata()
            )).toList()
    ));
}
✅ 4. Business Metrics (Prometheus / Micrometer)
You should expose metrics like:
Operational Metrics
| Metric | Insight |
| --- | --- |
| llm.retries.count | how often model fails |
| llm.steps.avg | number of agent steps per request |
| llm.latency.histogram | p50, p95, p99 latency |
| vector.search.time | retrieval bottlenecks |
| tool.failures | unhealthy external APIs |
Business Metrics
| Metric | Insight |
| --- | --- |
| % escalation to human | how often automation fails |
| % auto-approved KYC | automation success |
| S2 transaction errors | stability |
| Per-tenant SLA | SaaS platform health |
Micrometer Example
import java.util.concurrent.TimeUnit;
import io.micrometer.core.instrument.MeterRegistry;
import org.springframework.beans.factory.annotation.Autowired;

@Autowired
MeterRegistry registry;

public void recordMetrics(long llmLatencyMs, boolean escalated) {
    // Record end-to-end LLM latency and count escalations to a human.
    registry.timer("llm.latency").record(llmLatencyMs, TimeUnit.MILLISECONDS);
    if (escalated) {
        registry.counter("llm.escalations").increment();
    }
}
✅ Putting It Together: End-to-End Trace for a Request
REQUEST ID: 87u2-a91b
├─ User Prompt Logged
├─ Retrieved 4 chunks from pgvector
│ ├─ doc_id: policy-2024-12-chunk-11, score: 0.91
│ ├─ doc_id: policy-2024-12-chunk-12, score: 0.89
├─ LLM Request Started
│ ├─ model: gpt-4.1
│ ├─ temp: 0.2, top_p: 0.9
├─ Agent Router → “KYCValidationAgent”
├─ Tool Call: PAN_OCR_SERVICE
│ ├─ success=true
│ ├─ latency=123 ms
├─ Agent decides → “CreditCheckAgent”
├─ LLM Response Logged (tokens=412, latency=1.8s)
└─ Business Metrics Updated
├─ steps=3
├─ escalated=false
AI / GenAI Capability Building — Complete Prompt Cookbook
1. System Prompts (Persona, Governance, Identity)
1.1 AI Engineering Coach (Your Role)
You are an AI Engineering Coach helping enterprise teams build GenAI, ML, and automation capability. You teach best practices, enforce design governance, and mentor engineers with clarity and structure.
1.2 Enterprise Architect (GenAI-first)
You are an Enterprise Architect specializing in GenAI, RAG pipelines, secure microservices, and cloud-native design (AWS/Azure/GCP). Provide architecture-first answers.
1.3 Banking/Financial Domain Expert AI
You are a BFSI Domain AI with expertise in lending, KYC, onboarding, fraud detection, and regulatory compliance (RBI, SEBI). Always map answers to business outcomes.
1.4 LLMOps Expert
You are an LLMOps Architect ensuring pipelines for RAG, evaluation, red teaming, observability, and cost governance.
1.5 Senior Tech Delivery Manager
You are an Engineering Manager optimizing delivery, sprint metrics, governance KPIs, and team productivity.
2. Instruction Prompts (Task Prompts)
2.1 Explain architecture
Explain the architecture in 5 bullet points tailored for a CTO.
2.2 Convert business requirement → architecture flow
Convert this business problem into a modern cloud architecture.
2.3 Write governance checklist
Create a delivery + architecture governance checklist for this program.
2.4 Summarize long document
Summarize this 50-page policy into a 7-point executive brief.
2.5 Generate high-level APIs
Generate API definitions for each microservice in the journey.
3. Zero-Shot Prompts (Simple Q&A)
3.1 Describe RAG
Explain RAG to a non-technical stakeholder.
3.2 Describe embeddings
What are embeddings and why do enterprises use them?
3.3 Describe multi-agent architecture
Explain multi-agent workflows for financial automation.
3.4 Explain vector store
Explain vector DB in simple language.
3.5 Describe a microservice
Explain this microservice responsibility in one paragraph.
4. One-Shot Prompts
4.1 One example given → generate next
Here is one KYC workflow. Generate another with a slightly different scenario.
4.2 Jira story one-shot
Here is one Jira story. Generate a similar story for a different service.
4.3 Design pattern one-shot
Here is a circuit breaker pattern example. Generate a bulkhead pattern summary.
5. Few-Shot Prompts (Your strongest category)
5.1 Create structured architecture outputs
Using these 3 examples, create the same structure for a new use case.
5.2 Generate interview answers
Follow the style and structure of these sample interview answers.
5.3 Engineering standards
Follow these examples and create coding standards for backend + AI.
5.4 SOP generation
Follow these SOP examples and produce a new SOP for LLM evaluation.
5.5 Capability-building playbooks
Follow these examples to create a playbook for ML capability building.
6. Chain-of-Thought Prompts (Deep reasoning)
6.1 Architecture decision
Think step-by-step and evaluate all architecture options before concluding.
6.2 Root-cause analysis
Think step-by-step and identify the root cause of this failure.
6.3 Technical gap analysis
Think in steps and identify missing capabilities in the engineering team.
6.4 Policy conflict detection
Think critically and detect contradictions in this policy document.
6.5 Risk reasoning
Identify risks step-by-step across tech, people, delivery, and compliance.
7. Deliberate Prompts (Multi-solution thinking)
7.1 Solution comparison
Generate 3 possible solutions, compare them, and recommend one.
7.2 Architecture trade-off
Generate 3 architecture patterns and compare using scalability, cost, and complexity.
7.3 RAG approach comparison
Generate 3 RAG architecture variants and choose the best for BFSI.
7.4 LLMOps pipeline variants
Generate 3 LLMOps designs and compare operational trade-offs.
7.5 Decision rationalization
Generate multiple options and justify the selected one.
8. RAG Prompts (Retrieval + Reasoning)
8.1 Policy-based answering
Using only the context provided from the lending policy, answer the question.
8.2 Policy-version difference
Given old and new policies, summarize the differences relevant for engineers.
8.3 Strict grounding
Answer strictly based on the policy chunks — no assumptions.
8.4 Multi-document synthesis
Using the retrieved chunks from both RBI and internal SOP, synthesize a unified answer.
8.5 Compliance guard
Respond only if the answer is grounded in the vector DB; otherwise say “Not found in policy.”
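A sketch of how a strictly grounded RAG prompt (8.1 / 8.3 / 8.5) could be assembled in code; the template wording is an assumption, not a standard.

import java.util.List;

public class RagPromptBuilder {

    // Inline the retrieved chunks as context and forbid answers outside that context.
    public String build(String question, List<String> retrievedChunks) {
        String context = String.join("\n---\n", retrievedChunks);
        return """
                Answer strictly based on the policy context below. Do not use outside knowledge.
                If the context does not contain the answer, reply exactly: "Not found in policy."

                CONTEXT:
                %s

                QUESTION:
                %s
                """.formatted(context, question);
    }
}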
9. Tool-Calling Prompts
9.1 OCR
If image/document is uploaded, call ocr_service.extract_text.
9.2 Vector DB update
If policy text changes, call vector_db.update_embeddings.
9.3 Knowledge search
If user asks domain question, call search.policies.
9.4 ML Model Execution
When numerical values provided, call risk_model.predict.
9.5 File generation
When asked for docs, call generate_pdf or generate_excel.
10. Self-Consistency Prompts
10.1 Majority voting
Generate 5 independent analyses and pick the most consistent answer.
10.2 Risk scoring
Give 3 independent risk scores and return the median.
10.3 Reasoning validation
Generate 3 explanations, compare them, and keep the best.
10.4 Edge-case detection
Generate 4 edge cases and pick the one with highest risk.
10.5 Coding fix validation
Generate 3 fixes and return the one with the least complexity.
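A minimal sketch of self-consistency by majority vote: sample the same prompt n times and return the most frequent normalized answer. The sampleOnce supplier is a placeholder for your LLM call.

import java.util.List;
import java.util.Map;
import java.util.function.Supplier;
import java.util.stream.Collectors;
import java.util.stream.IntStream;

public class SelfConsistency {

    // sampleOnce wraps one LLM call with the same prompt (ideally at temperature > 0).
    public String majorityVote(Supplier<String> sampleOnce, int n) {
        List<String> answers = IntStream.range(0, n)
                .mapToObj(i -> sampleOnce.get().trim().toLowerCase())  // normalize before voting
                .toList();
        return answers.stream()
                .collect(Collectors.groupingBy(a -> a, Collectors.counting()))
                .entrySet().stream()
                .max(Map.Entry.comparingByValue())
                .map(Map.Entry::getKey)
                .orElseThrow();
    }
}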
11. ReAct Prompts
11.1 Reason + decide + tool call
Think what is required. If missing data, call search. If complete, answer.
11.2 Multi-step planning
Break the task into steps and call tools for each step.
11.3 Open-ended Q&A with tools
Use chain-of-thought privately and only show final answers/tool calls.
11.4 Agent workflow
Plan → act → observe → refine.
11.5 Verification loops
After answering, validate correctness via an internal check.
12. Output-Constrained / Structured Prompts
12.1 JSON only
Respond only in JSON matching this schema.
12.2 YAML config
Generate Kubernetes YAML for this service.
12.3 API spec
Generate OpenAPI spec from the description.
12.4 BPMN
Generate BPMN XML for this workflow.
12.5 Code-only
Give only code—no explanation.
🎯 BONUS: Best Practices for Prompt Selection (Your Context)
| Use Case | Best Prompt Types |
| --- | --- |
| AI capability building | System + Few-shot + Deliberate |
| Banking domain QA | RAG + Output-constrained |
| Architecture proposals | Chain-of-Thought + Deliberate |
| ML/AI governance | Instruction + Structured |
| Document automation | Tool-calling + ReAct |
| Multi-agent flows | ReAct + System prompt |
| Policy change automation | RAG + Tool-calling |
✅ 12 Types of Prompts — When to Use Each One
1. System Prompt (Persona / Rules / Identity Prompt)
Purpose: Controls the model’s role, behavior, tone, constraints, and boundaries.
Use when:
You need consistent behavior across a long interaction
Defining a persona (e.g., “You are an enterprise architect”)
Enforcing rules (never disclose… always respond with JSON…)
Example:
You are an Enterprise Architect specializing in GenAI capability building. Always respond concisely in bullet points.
2. Instruction Prompt (Task Prompt)
Purpose: Tells the model what to do.
Use when:
You need the model to perform a specific action
Summaries, classification, coding, testing
Example:
Explain the architecture in 7 bullet points.
3. Zero-Shot Prompt
Purpose: No examples given — model must infer the pattern.
Use when:
Task is simple
You want to avoid bias from examples
Generic Q&A, rewriting, explanations
Example:
Explain vector embeddings to a junior engineer.
4. One-Shot Prompt
Purpose: One example given.
Use when:
You want to show the expected output format
Not overfitting the model with many examples
Example:
Here is one example of a Jira story… create another similar story.
5. Few-Shot Prompt
Purpose: Provide 2–10 examples to teach a pattern.
Use when:
You need consistent structure
Output format is crucial (JSON, templates, policies)
Your task is domain-specific and the model must follow your style
Example:
Provide 3 examples of complaint → classification → resolution.
6. Chain-of-Thought Prompt
Purpose: Make the model reason step-by-step.
Use when:
Tasks require reasoning
Architecture decisions
Multi-step calculations
Scenario analysis
Example:
Think step-by-step and evaluate each architecture option before giving the final answer.
7. Deliberate Prompt (Multi-Thinking Prompt)
Purpose: Ask the model to form multiple candidate answers and pick the best.
Use when:
You want reliability, deeper reasoning
Prevent hallucinations
High-stakes decisions (architecture, legal, banking)
Example:
Generate 3 possible solutions, compare them, then produce the final recommended design.
8. Retrieval-Augmented Prompt (RAG Prompt)
Purpose: Insert retrieved chunks from vector DB.
Use when:
Query involves proprietary knowledge
Policy or SOP backed by documents
Context grounding is required
Example:
Using the context below from the lending policy… answer the user query.
9. Tool Calling Prompt
Purpose: Ask the model to select and call tools.
Use when:
Orchestration of external APIs
Multi-agent workflows
Database calls, calculations, document processes
Example:
Use ocr_service.extract_text when the user uploads a document…
10. Self-Consistency Prompt
Purpose: Sample the model multiple times → majority voting.
Use when:
Highly ambiguous tasks
You want accuracy boost
Mathematical or logical tasks
Example:
Generate 5 solutions independently and choose the most consistent answer.
11. ReAct Prompt (Reason + Act)
Purpose: Model reasons → decides → calls tools → continues.
Use when:
Reasoning and acting must be combined
Planning tasks
Agentic workflows (multi-step)
Example:
Think what you need next. If data missing, call search. If complete, answer.
12. Output-Constrained Prompt
Purpose: Force the LLM to output only a specific format.
Use when:
Integrating with downstream systems
JSON-only
YAML config
Code generation
Example:
Respond only with valid JSON matching this schema…
🎯 Quick Decision Table — When to Use Which Prompt
| Situation | Use Prompt Type |
| --- | --- |
| You want consistent persona | System Prompt |
| You want a model to perform a task | Instruction |
| You need reliability & deep reasoning | Chain-of-thought / Deliberate |
| You need examples | Few-shot |
| You want JSON output for API | Output-constrained |
| Using enterprise documents | RAG Prompt |
| Using tools / agents | Tool-calling / ReAct |
| Need accuracy boost | Self-consistency |
| Minimal input | Zero-shot |