Model Tiering: AI Cost Economics
- Anand Nerurkar
- Dec 18, 2025
- 6 min read
Updated: Mar 3
🧠 What is Model Tiering in GenAI?
Model tiering is an architectural strategy where multiple AI models of different sizes, costs, and capabilities are used together, and each request is routed to the most cost-effective model that can meet the requirement.
Not every query needs the most powerful (and expensive) model.
🎯 Why Model Tiering is Critical (Especially in BFSI)
Without tiering:
Every request hits a large LLM
Costs explode
Latency increases
Risk surface grows
With tiering:
60–75% of traffic handled by small models
Large models used only for complex cases
Predictable cost + better SLA
🏗️ Typical Model Tiers (Enterprise Reality)
| Tier | Model Type | Usage |
| --- | --- | --- |
| Tier-0 | Rules / retrieval / templates | FAQs, static answers |
| Tier-1 | Small / distilled LLMs | Summarization, classification |
| Tier-2 | Medium LLMs | RAG, reasoning, analysis |
| Tier-3 | Large / premium LLMs | Complex reasoning, edge cases |
🔀 How Do You Decide Which Tier to Use?
You decide based on 4 dimensions:
1️⃣ Task Complexity
| Task | Tier |
| --- | --- |
| Keyword lookup / FAQ | Tier-0 |
| Simple summarization | Tier-1 |
| Policy Q&A (RAG) | Tier-2 |
| Multi-step reasoning | Tier-3 |
2️⃣ Risk & Compliance Sensitivity
| Risk Level | Tier |
| --- | --- |
| Low (internal ops) | Tier-1 / Tier-2 |
| Medium (customer-facing) | Tier-2 |
| High (credit, compliance) | Tier-2 + human |
| Critical decisions | Human only |
In BFSI, GenAI supports decisions — it does not make them.
3️⃣ Latency & SLA
| SLA | Tier |
| --- | --- |
| <300 ms | Tier-0 / Tier-1 |
| <800 ms | Tier-2 |
| Async allowed | Tier-3 |
4️⃣ Cost Envelope
| Cost Target | Tier |
| --- | --- |
| <₹1 per inference | Tier-1 |
| ₹1–₹3 | Tier-2 |
| ₹5+ | Tier-3 |
🧭 Routing Logic (Enterprise Pattern)
Request →
Complexity Check →
Risk Classification →
SLA Requirement →
Budget Check →
Model Tier Selection →
Fallback / Escalation
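The routing pipeline above can be sketched as a single selection function. This is a minimal illustration, not a production router; the complexity scores, risk labels, and per-tier cost figures are assumptions taken loosely from the tables in this post.

```python
# Illustrative tier router: complexity, risk, SLA, and budget gates.
# Thresholds and labels are assumptions for the sketch, not production values.

def select_tier(complexity: int, risk: str, sla_ms: int, budget_inr: float) -> str:
    """Pick the cheapest tier that satisfies every constraint."""
    # Critical decisions never reach a model at all.
    if risk == "critical":
        return "human"
    # Start from the tier implied by task complexity (0..3).
    tier = min(complexity, 3)
    # High-risk flows cap out at Tier-2 with a human in the loop.
    if risk == "high":
        tier = min(tier, 2)
    # Tight SLAs rule out the larger, slower tiers.
    if sla_ms < 300:
        tier = min(tier, 1)
    elif sla_ms < 800:
        tier = min(tier, 2)
    # Budget gate: fall back a tier if the cost envelope is exceeded.
    tier_cost = {0: 0.1, 1: 1.0, 2: 3.0, 3: 5.0}  # assumed ₹ per inference
    while tier > 0 and tier_cost[tier] > budget_inr:
        tier -= 1
    return f"Tier-{tier}" + (" + human review" if risk == "high" else "")

print(select_tier(complexity=3, risk="low", sla_ms=5000, budget_inr=10))  # Tier-3
print(select_tier(complexity=3, risk="high", sla_ms=700, budget_inr=3))   # Tier-2 + human review
print(select_tier(complexity=1, risk="low", sla_ms=200, budget_inr=1))    # Tier-1
```

Note the ordering: risk is checked before cost, so a budget constraint can downgrade the tier but never bypass a human-review requirement.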
📊 Realistic Banking Distribution (What Sounds Real)
| Tier | Traffic % |
| --- | --- |
| Tier-0 | 10–15% |
| Tier-1 | 45–55% |
| Tier-2 | 25–30% |
| Tier-3 | 5–10% |
If someone says “most traffic goes to GPT-4”, they haven’t scaled GenAI.
💰 Impact of Model Tiering (Real Numbers)
| Metric | Before | After |
| --- | --- | --- |
| Cost / inference | ₹3.8 | ₹1.9 |
| Monthly AI spend | ₹5 Cr | ₹2.8 Cr |
| P95 latency | 900 ms | 480 ms |
| SLA breaches | Frequent | Rare |
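The after-tiering cost is just a weighted average over the traffic distribution. As a sanity check, here is the arithmetic with assumed per-tier costs (mid-points of the cost envelope; Tier-3 taken as ₹8 since the envelope is open-ended at ₹5+):

```python
# Blended cost per inference = sum(traffic share x per-tier cost).
# Per-tier costs are illustrative assumptions, not measured figures.
traffic = {"Tier-0": 0.125, "Tier-1": 0.50, "Tier-2": 0.275, "Tier-3": 0.075}
cost    = {"Tier-0": 0.0,   "Tier-1": 1.0,  "Tier-2": 3.0,   "Tier-3": 8.0}

blended = sum(traffic[t] * cost[t] for t in traffic)
print(f"Blended cost: ₹{blended:.2f} per inference")  # close to the ₹1.9 figure above
```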
🎤 Summary
“Model tiering is an architectural approach where we route requests to different AI models based on complexity, risk, SLA, and cost. Simple tasks go to small models or even rules, while only complex, high-value cases reach large LLMs. In production, 60–70% of our traffic was handled by Tier-1 models, 25–30% by Tier-2, and less than 10% by large models. This reduced cost per inference by ~40% while improving latency and maintaining compliance.”
🏦 1️⃣ Embedding Model Selection Policy
This governs how you choose the model used for semantic retrieval (RAG, search, clustering).
🔹 Policy Objective
Ensure high-recall, deterministic, and compliant semantic retrieval of enterprise documents without autonomous decision-making.
🔹 A. Functional Selection Criteria
1️⃣ Retrieval Accuracy (Primary Criterion)
Must benchmark on internal gold dataset.
Minimum thresholds:
Recall@5 ≥ 90%
Recall@10 ≥ 95%
MRR ≥ 0.70
If below threshold → model rejected.
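Recall@K and MRR are straightforward to compute over an internal gold dataset. A minimal sketch, assuming each query has ranked chunk IDs from the retriever and exactly one ground-truth chunk:

```python
# Recall@K: fraction of queries whose gold chunk appears in the top-K results.
# MRR: mean of 1/rank of the gold chunk (0 when it is not retrieved at all).

def recall_at_k(results: dict, gold: dict, k: int) -> float:
    hits = sum(1 for q, chunks in results.items() if gold[q] in chunks[:k])
    return hits / len(gold)

def mrr(results: dict, gold: dict) -> float:
    total = 0.0
    for q, chunks in results.items():
        if gold[q] in chunks:
            total += 1.0 / (chunks.index(gold[q]) + 1)
    return total / len(gold)

# Toy benchmark: ranked chunk IDs per query, one gold chunk each.
results = {"q1": ["c7", "c2", "c9"], "q2": ["c4", "c1"], "q3": ["c5", "c8", "c3"]}
gold    = {"q1": "c2", "q2": "c9", "q3": "c5"}

print(recall_at_k(results, gold, k=5))  # 2 of 3 gold chunks retrieved in top-5
print(mrr(results, gold))               # (1/2 + 0 + 1) / 3 = 0.5
```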
2️⃣ Domain Adaptability
Model must:
Handle financial terminology
Recognize synonyms (LTV vs Funding cap)
Work with regulatory language
Support multilingual if required (e.g., English + Hindi)
3️⃣ Chunk Compatibility
Model must:
Perform well with 500–800 token chunks
Preserve semantic similarity for clause-level retrieval
Support heading-based segmentation
🔹 B. Technical Criteria
4️⃣ Deterministic Output
Embedding must be stable:
Same text → same vector.
No randomness allowed.
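Determinism can be verified with a simple repeat-and-compare check in CI. In this sketch, `embed` is a placeholder for whatever embedding client is actually in use; the fake embedder at the bottom exists only to make the example runnable:

```python
import hashlib

def embedding_fingerprint(vector: list[float]) -> str:
    """Stable fingerprint of an embedding, for determinism checks."""
    raw = ",".join(f"{x:.8f}" for x in vector).encode()
    return hashlib.sha256(raw).hexdigest()

def assert_deterministic(embed, text: str, runs: int = 3) -> None:
    """Fail if repeated calls on the same text yield different vectors."""
    fingerprints = {embedding_fingerprint(embed(text)) for _ in range(runs)}
    if len(fingerprints) != 1:
        raise AssertionError(f"Non-deterministic embedding for: {text!r}")

# Stand-in embedder (trivially deterministic) just to exercise the check.
fake_embed = lambda t: [float(len(t)), float(sum(map(ord, t)))]
assert_deterministic(fake_embed, "LTV cap for home loans")
```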
5️⃣ Deployment Compatibility
Depending on data classification:
| Classification | Deployment Rule |
| --- | --- |
| Public | Cloud allowed |
| Internal | VPC only |
| Confidential | On-prem only |
6️⃣ Vector Dimension Efficiency
Evaluate:
Dimensional size (e.g., 768 vs 1024 vs 1536)
Storage impact
Retrieval latency
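The storage side of this trade-off is simple arithmetic: float32 vectors cost 4 bytes per dimension per chunk. A quick sizing sketch (the 5M-chunk corpus is an assumed figure for illustration):

```python
# Storage for float32 vectors: dimensions x 4 bytes x number of chunks.
def index_size_gb(dims: int, num_chunks: int, bytes_per_value: int = 4) -> float:
    return dims * bytes_per_value * num_chunks / 1e9

for dims in (768, 1024, 1536):
    print(f"{dims} dims, 5M chunks: {index_size_gb(dims, 5_000_000):.1f} GB")
```

Doubling dimensions doubles both storage and the per-query distance computation, so the larger vector is only worth it if it measurably lifts Recall@K.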
🔹 C. Risk & Governance Criteria
7️⃣ No Training on Bank Data
Model must:
Not retain enterprise data
Not fine-tune externally unless approved
8️⃣ Version Locking
Model version must be fixed
Re-embedding required upon upgrade
Change requires governance approval
🔹 D. Approved Embedding Model Categories
| Category | Example Use |
| --- | --- |
| Open-source (on-prem) | BGE-M3, e5-large |
| Managed enterprise SaaS | Azure text-embedding-3-large |
| Lightweight edge | all-mpnet-base-v2 |

Final selection must follow the benchmarking report.
🤖 2️⃣ SLM / LLM Model Selection Policy
This governs the generation layer.
🔹 Policy Objective
Ensure safe, explainable, and controlled generation aligned to enterprise and regulatory standards.
🔹 A. Use-Case Based Model Class
1️⃣ Retrieval-Augmented Answering (Policy Q&A)
Preferred:
SLM (Phi-3, Llama 3 8B, Mistral 7B)
Why?
Lower hallucination risk
Faster inference
Controlled cost
2️⃣ Complex Reasoning / Analysis
Use:
Larger LLM (GPT-4 class or Llama 3 70B)
Use only when:
Multi-step reasoning needed
Cross-policy comparison required
Summarization across documents
3️⃣ Drafting / Communication
Use:
Larger LLM for drafting emails, summaries
Not used for decision support.
🔹 B. Hallucination Risk Policy
Model must:
Operate in RAG mode
Never answer outside retrieved context
Return “Not found in policy” if no match
Similarity threshold must be enforced.
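The hallucination-control rules above reduce to a guard in front of the generator. A minimal sketch, assuming the retriever returns `(chunk, similarity)` pairs; the 0.75 threshold and prompt wording are illustrative:

```python
# RAG guardrail: answer only from retrieved context above a similarity floor.
SIMILARITY_THRESHOLD = 0.75  # assumed value; tune against the gold dataset

def answer_from_policy(query: str, retrieved: list[tuple[str, float]], generate) -> str:
    """retrieved: (chunk_text, similarity) pairs, highest similarity first."""
    context = [chunk for chunk, score in retrieved if score >= SIMILARITY_THRESHOLD]
    if not context:
        # Short-circuit: the model is never called without grounded context.
        return "Not found in policy"
    prompt = ("Answer ONLY from the context below. If the answer is not "
              "present, say 'Not found in policy'.\n\n"
              + "\n---\n".join(context)
              + f"\n\nQuestion: {query}")
    return generate(prompt)

# Below-threshold retrieval never reaches the model.
print(answer_from_policy("What is the LTV cap?",
                         [("irrelevant text", 0.42)],
                         generate=lambda p: "should never run"))
```

The important property is that the refusal path is enforced in code, before generation, rather than relying on the prompt alone.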
🔹 C. Data Residency & Privacy
If input contains:
Customer data
Financial details
PII
Then:
Only on-prem or VPC model allowed
No public API usage
🔹 D. Explainability Requirement
Generation model must:
Provide clause citation
Provide source metadata
Log prompt + response
Black-box autonomous generation not allowed.
🔹 E. Latency & Cost Governance
Define acceptable SLA:
Policy Q&A → < 2 sec
Internal agent support → < 3 sec
Choose an SLM when the SLA is critical.
📊 3️⃣ Model Selection Based on Use Case
Here is the practical matrix you can put in an architecture document:
🏦 Enterprise Knowledge Hub
| Use Case | Embedding Model | SLM / LLM | Reason |
| --- | --- | --- | --- |
| Credit policy lookup | High recall (BGE-M3) | SLM (Llama 3 8B) | RAG, deterministic |
| SOP search | Medium model OK | No LLM (semantic search only) | No generation needed |
| Clause comparison | High recall | Larger LLM | Multi-doc reasoning |
| Regulatory circular diff | High recall | Larger LLM | Analytical summarization |
| Internal chatbot | Balanced | SLM | Cost + control |
| Decision automation | High recall | Rule engine + SLM assist | Avoid autonomous AI |
🧠 Strategic Enterprise Principle
Embedding model = Retrieval accuracy
SLM/LLM = Language reasoning
Never couple their selection.
Evaluate independently.
🛡️ Governance Rule (Very Important)
Before production approval:
Embedding benchmark report attached
Generation hallucination test report attached
Adversarial testing performed
Risk classification documented
Monitoring plan approved
Only then is the model production-ready.
🎯
“We maintain separate selection policies for embedding and generation layers. Embeddings are chosen based on Recall@K and deterministic behavior. SLM/LLM selection is based on reasoning complexity, hallucination risk, and data residency requirements.”
🏦 ENTERPRISE AI GOVERNANCE POLICY
(For Knowledge Hub – Policy & SOP Retrieval Platform)
1️⃣ Purpose
This policy defines governance standards for the selection, validation, deployment, and monitoring of AI models (Embedding Models and SLM/LLM) used in the Bank’s Enterprise Knowledge Hub platform.
The objective is to ensure:
Regulatory compliance
Controlled hallucination risk
Explainability and auditability
Data privacy protection
Measurable performance
2️⃣ Scope
This policy applies to:
Embedding models used for semantic retrieval
Small Language Models (SLMs)
Large Language Models (LLMs)
Vector databases
Retrieval-Augmented Generation (RAG) systems
Internal AI-powered assistants accessing policy documents
This policy does NOT permit autonomous credit decisioning.
3️⃣ Model Classification Framework
| Model Type | Function | Risk Category |
| --- | --- | --- |
| Embedding Model | Semantic retrieval | Low–Moderate |
| SLM (≤ 8B params) | Context-bound generation | Moderate |
| Large LLM (> 30B) | Advanced reasoning | Moderate–High |
| Autonomous Agentic AI | Decision support | High (Restricted) |
4️⃣ Embedding Model Selection Policy
4.1 Functional Requirements
Embedding models must:
Achieve Recall@5 ≥ 90%
Achieve Recall@10 ≥ 95%
Achieve MRR ≥ 0.70
Support financial terminology
Preserve clause-level semantics
4.2 Technical Requirements
Deterministic output (same input → same vector)
Document chunk compatibility (500–800 tokens)
Support metadata indexing
Scalable vector dimension management
Deployment compliant with data classification
4.3 Data Residency Rules
| Data Sensitivity | Deployment Rule |
| --- | --- |
| Public | Cloud allowed |
| Internal | Private VPC |
| Confidential / Regulatory | On-Prem only |
4.4 Model Change Management
Model version must be locked in registry
Re-embedding required upon model upgrade
Change approval required from:
Enterprise Architecture
Information Security
Model Risk Team
5️⃣ SLM / LLM Selection Policy
5.1 Use-Case-Based Model Allocation
A. Policy Q&A (RAG Only)
Approved:
Small Language Models (≤ 8B parameters)
Reason:
Reduced hallucination risk
Lower cost
Faster response
Better operational control
B. Complex Policy Analysis
Approved:
Larger LLM under controlled environment
Requires:
Explicit approval
Additional hallucination testing
Legal/compliance review
5.2 Hallucination Control Requirements
Generation models must:
Operate only with retrieved context
Return “Not available in policy” when no match
Enforce similarity threshold before answering
Provide clause citation in response
5.3 Prohibited Use Cases
Without Board-level approval:
Autonomous credit decisions
Risk scoring
Underwriting replacement
Regulatory interpretation without citation
6️⃣ Validation & Benchmarking Framework
6.1 Gold Dataset Requirement
Minimum:
200 domain queries
Verified ground-truth clauses
Coverage across policy categories
6.2 Mandatory Metrics
Recall@5
Recall@10
Mean Reciprocal Rank (MRR)
Retrieval latency
Benchmark results must be documented before production.
6.3 CI/CD Integration
Automated evaluation during deployment
Threshold validation before release
Drift detection monitoring monthly
7️⃣ Monitoring & Ongoing Oversight
7.1 Production Monitoring
Log:
Query
Retrieved chunks
Similarity score
Model version
Response
7.2 Drift Monitoring
Monthly:
Sample 50–100 queries
Manual validation
Recalculate Recall@5
If performance drops >5% → investigation required.
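The drift gate itself is a one-line comparison against the approved baseline; the subtlety is agreeing on what "5%" means. This sketch treats it as percentage points of Recall@5, which is an assumption:

```python
# Monthly drift check: recompute Recall@5 on the sampled queries and flag
# drops of more than 5 percentage points against the approved baseline.
def drift_alert(baseline_recall: float, current_recall: float,
                tolerance: float = 0.05) -> bool:
    """True when retrieval quality degraded beyond the allowed tolerance."""
    return (baseline_recall - current_recall) > tolerance

print(drift_alert(0.92, 0.90))  # 2-point dip, within tolerance -> False
print(drift_alert(0.92, 0.84))  # 8-point drop -> True, investigate
```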
8️⃣ Explainability & Auditability
Every AI response must include:
Policy Name
Clause Number
Version
Effective Date
All interactions retained for minimum regulatory retention period.
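The citation fields and retention requirement suggest a fixed audit-record schema attached to every response. A sketch of such a record; the field names and values are illustrative assumptions, not a mandated format:

```python
import datetime
import json

# Illustrative audit record mirroring the required citation metadata above.
response_record = {
    "answer": "<generated answer text>",
    "citation": {
        "policy_name": "Retail Credit Policy",   # example value
        "clause_number": "4.2.1",                # example value
        "version": "v12",                        # example value
        "effective_date": "2025-04-01",          # example value
    },
    "model_version": "slm-8b-2025-03",           # locked registry version
    "logged_at": datetime.datetime.now(datetime.timezone.utc).isoformat(),
}
print(json.dumps(response_record, ensure_ascii=False, indent=2))
```

Persisting this record per interaction gives audit both the provenance (which clause, which version) and the reproducibility anchor (which model version) in one place.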
9️⃣ Risk Assessment & Controls
| Risk | Mitigation |
| --- | --- |
| Retrieval miss | High Recall threshold |
| Hallucination | RAG enforcement |
| Model drift | Scheduled validation |
| Version conflict | Immutable policy storage |
| Data leakage | On-prem deployment controls |
🔟 Governance Structure
Oversight Committee:
CIO (Chair)
CRO
Head of Compliance
Enterprise Architect
Model Risk Officer
Information Security Lead
Approval required before:
New model introduction
Major version upgrade
Use-case expansion
🏛️ Alignment with RBI Expectations
This framework aligns with:
Model Risk Governance principles
Explainability requirements
Audit trail requirements
Data localization norms
Controlled AI adoption guidelines
Key design principle:
AI system provides assisted intelligence, not autonomous decision-making.