Model Tiering: AI Cost Economics

  • Writer: Anand Nerurkar
  • Dec 18, 2025
  • 6 min read

Updated: Mar 3

🧠 What is Model Tiering in GenAI?

Model tiering is an architectural strategy where multiple AI models of different sizes, costs, and capabilities are used together, and each request is routed to the most cost-effective model that can meet the requirement.

Not every query needs the most powerful (and expensive) model.

🎯 Why Model Tiering is Critical (Especially in BFSI)

Without tiering:

  • Every request hits a large LLM

  • Costs explode

  • Latency increases

  • Risk surface grows

With tiering:

  • 60–75% of traffic handled by small models

  • Large models used only for complex cases

  • Predictable cost + better SLA

🏗️ Typical Model Tiers (Enterprise Reality)

| Tier | Model Type | Usage |
|------|------------|-------|
| Tier-0 | Rules / retrieval / templates | FAQs, static answers |
| Tier-1 | Small / distilled LLMs | Summarization, classification |
| Tier-2 | Medium LLMs | RAG, reasoning, analysis |
| Tier-3 | Large / premium LLMs | Complex reasoning, edge cases |

🔀 How Do You Decide Which Tier to Use?

You decide based on 4 dimensions:

1️⃣ Task Complexity

| Task | Tier |
|------|------|
| Keyword lookup / FAQ | Tier-0 |
| Simple summarization | Tier-1 |
| Policy Q&A (RAG) | Tier-2 |
| Multi-step reasoning | Tier-3 |

2️⃣ Risk & Compliance Sensitivity

| Risk Level | Tier |
|------------|------|
| Low (internal ops) | Tier-1 / Tier-2 |
| Medium (customer-facing) | Tier-2 |
| High (credit, compliance) | Tier-2 + human |
| Critical decisions | Human only |

In BFSI, GenAI supports decisions — it does not make them.

3️⃣ Latency & SLA

| SLA | Tier |
|-----|------|
| <300 ms | Tier-0 / Tier-1 |
| <800 ms | Tier-2 |
| Async allowed | Tier-3 |

4️⃣ Cost Envelope

| Cost Target | Tier |
|-------------|------|
| <₹1 per inference | Tier-1 |
| ₹1–₹3 | Tier-2 |
| ₹5+ | Tier-3 |

🧭 Routing Logic (Enterprise Pattern)

Request →
  Complexity Check →
  Risk Classification →
  SLA Requirement →
  Budget Check →
  Model Tier Selection →
  Fallback / Escalation
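The routing flow above can be sketched as a small policy function. This is a minimal sketch: the tier numbers, thresholds, and function signature below are illustrative assumptions mirroring the four dimensions discussed earlier, not a specific product's API.

```python
# Illustrative tier router; all thresholds and tier names are assumptions.
def select_tier(complexity: int, risk: str, sla_ms: int, budget_inr: float) -> str:
    """Route a request to the cheapest tier that satisfies all constraints."""
    # Critical decisions never reach a model at all.
    if risk == "critical":
        return "human-only"
    # Start from task complexity (0 = FAQ lookup ... 3 = multi-step reasoning).
    tier = min(complexity, 3)
    # High-risk requests are pinned to at least Tier-2 for audit-grade handling.
    if risk == "high":
        tier = max(tier, 2)
    # Tight SLAs cap the tier: large models are too slow for sub-300 ms paths.
    if sla_ms < 300:
        tier = min(tier, 1)
    elif sla_ms < 800:
        tier = min(tier, 2)
    # Budget cap: per-inference spend limits the tier ceiling.
    if budget_inr < 1:
        tier = min(tier, 1)
    elif budget_inr < 5:
        tier = min(tier, 2)
    return f"Tier-{tier}"
```

In a real system, conflicting constraints (e.g., a high-risk request with a sub-300 ms SLA) would route to the fallback/escalation step rather than silently downgrading.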

📊 Realistic Banking Traffic Distribution

| Tier | Traffic % |
|------|-----------|
| Tier-0 | 10–15% |
| Tier-1 | 45–55% |
| Tier-2 | 25–30% |
| Tier-3 | 5–10% |

If someone says “most traffic goes to GPT-4”, they haven’t scaled GenAI.

💰 Impact of Model Tiering (Real Numbers)

| Metric | Before | After |
|--------|--------|-------|
| Cost / inference | ₹3.8 | ₹1.9 |
| Monthly AI spend | ₹5 Cr | ₹2.8 Cr |
| P95 latency | 900 ms | 480 ms |
| SLA breaches | Frequent | Rare |
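The before/after economics can be sanity-checked with a simple blended-cost model. The per-tier unit costs and traffic mix below are illustrative assumptions chosen to roughly reproduce the numbers above, not measured figures.

```python
# Illustrative blended cost-per-inference model (costs in INR are assumptions).
def blended_cost(mix: dict, unit_cost: dict) -> float:
    """Traffic-weighted average cost per inference."""
    assert abs(sum(mix.values()) - 1.0) < 1e-9, "traffic shares must sum to 1"
    return sum(share * unit_cost[tier] for tier, share in mix.items())

# Before tiering: everything hits a large model.
before = blended_cost({"tier3": 1.0}, {"tier3": 3.8})

# After tiering: most traffic is absorbed by the cheap tiers.
unit_cost = {"tier0": 0.05, "tier1": 0.8, "tier2": 2.5, "tier3": 7.0}
mix = {"tier0": 0.12, "tier1": 0.50, "tier2": 0.28, "tier3": 0.10}
after = blended_cost(mix, unit_cost)
```

With this mix, the blended cost lands near ₹1.8 per inference even though the Tier-3 unit cost is far higher than before, because only 10% of traffic ever reaches it.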

🎤 Summary

“Model tiering is an architectural approach where we route requests to different AI models based on complexity, risk, SLA, and cost. Simple tasks go to small models or even rules, while only complex, high-value cases reach large LLMs. In production, 60–70% of our traffic was handled by Tier-1 models, 25–30% by Tier-2, and less than 10% by large models. This reduced cost per inference by ~40% while improving latency and maintaining compliance.”

🏦 1️⃣ Embedding Model Selection Policy

This governs how you choose the model used for semantic retrieval (RAG, search, clustering).

🔹 Policy Objective

Ensure high-recall, deterministic, and compliant semantic retrieval of enterprise documents without autonomous decision-making.

🔹 A. Functional Selection Criteria

1️⃣ Retrieval Accuracy (Primary Criterion)

Must benchmark on internal gold dataset.

Minimum thresholds:

  • Recall@5 ≥ 90%

  • Recall@10 ≥ 95%

  • MRR ≥ 0.70

If below threshold → model rejected.
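These thresholds are cheap to compute against a gold dataset. A minimal sketch (the data structures and clause ids are illustrative):

```python
# Each evaluation item: (gold_clause_id, ranked_result_ids from the retriever).
def recall_at_k(results: list, k: int) -> float:
    """Fraction of queries whose gold clause appears in the top-k results."""
    hits = sum(1 for gold, ranked in results if gold in ranked[:k])
    return hits / len(results)

def mrr(results: list) -> float:
    """Mean reciprocal rank of the gold clause (contributes 0 if not retrieved)."""
    total = 0.0
    for gold, ranked in results:
        if gold in ranked:
            total += 1.0 / (ranked.index(gold) + 1)
    return total / len(results)

sample = [
    ("c1", ["c1", "c9", "c4"]),   # gold at rank 1
    ("c2", ["c7", "c2", "c5"]),   # gold at rank 2
    ("c3", ["c8", "c6", "c0"]),   # miss
]
```

Running this over the full 200-query gold set gives the numbers that go into the benchmarking report.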

2️⃣ Domain Adaptability

Model must:

  • Handle financial terminology

  • Recognize synonyms (LTV vs Funding cap)

  • Work with regulatory language

  • Support multilingual if required (e.g., English + Hindi)

3️⃣ Chunk Compatibility

Model must:

  • Perform well with 500–800 token chunks

  • Preserve semantic similarity for clause-level retrieval

  • Support heading-based segmentation
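Heading-based segmentation with a token budget can be sketched in a few lines. This is a simplification: it uses a whitespace word count as a crude token proxy, whereas a real pipeline would use the embedding model's own tokenizer.

```python
import re

def chunk_by_headings(text: str, max_tokens: int = 800) -> list:
    """Split on markdown-style headings, then pack sections up to the budget."""
    # Zero-width split keeps each heading attached to its body (Python 3.7+).
    sections = re.split(r"(?m)^(?=#+\s)", text)
    chunks, current, count = [], [], 0
    for section in filter(None, sections):
        tokens = len(section.split())  # crude token proxy, not a real tokenizer
        if current and count + tokens > max_tokens:
            chunks.append("".join(current))
            current, count = [], 0
        current.append(section)
        count += tokens
    if current:
        chunks.append("".join(current))
    return chunks
```

A production chunker would also handle oversized single sections (splitting them on paragraphs) and attach heading metadata to each chunk for citation.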

🔹 B. Technical Criteria

4️⃣ Deterministic Output

Embedding must be stable:

Same text → same vector. No randomness allowed.

5️⃣ Deployment Compatibility

Depending on data classification:

| Classification | Deployment Rule |
|----------------|-----------------|
| Public | Cloud allowed |
| Internal | VPC only |
| Confidential | On-prem only |

6️⃣ Vector Dimension Efficiency

Evaluate:

  • Dimensional size (e.g., 768 vs 1024 vs 1536)

  • Storage impact

  • Retrieval latency
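The storage side of this trade-off is simple arithmetic. A back-of-envelope sketch, assuming float32 vectors (4 bytes per component) and ignoring index overhead:

```python
def index_size_gb(num_vectors: int, dim: int, bytes_per_float: int = 4) -> float:
    """Raw vector storage in GiB, excluding index structures and metadata."""
    return num_vectors * dim * bytes_per_float / 1024**3

# Example: 10 million chunks at the common dimension choices.
sizes = {dim: index_size_gb(10_000_000, dim) for dim in (768, 1024, 1536)}
```

At 10M chunks, moving from 768 to 1536 dimensions roughly doubles raw storage (about 29 GiB vs 57 GiB), and similarity computation scales linearly with dimension, so the larger vector must earn its keep in recall.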

🔹 C. Risk & Governance Criteria

7️⃣ No Training on Bank Data

Model must:

  • Not retain enterprise data

  • Not fine-tune externally unless approved

8️⃣ Version Locking

  • Model version must be fixed

  • Re-embedding required upon upgrade

  • Change requires governance approval

🔹 D. Approved Embedding Model Categories

| Category | Example Models |
|----------|----------------|
| Open-source (on-prem) | BGE-M3, e5-large |
| Managed enterprise SaaS | Azure text-embedding-3-large |
| Lightweight edge | all-mpnet-base-v2 |

Final selection must follow the benchmarking report.

🤖 2️⃣ SLM / LLM Model Selection Policy

This governs the generation layer.

🔹 Policy Objective

Ensure safe, explainable, and controlled generation aligned to enterprise and regulatory standards.

🔹 A. Use-Case Based Model Class

1️⃣ Retrieval-Augmented Answering (Policy Q&A)

Preferred:

  • SLM (Phi-3, Llama 3 8B, Mistral 7B)

Why?

  • Lower hallucination risk

  • Faster inference

  • Controlled cost

2️⃣ Complex Reasoning / Analysis

Use:

  • Larger LLM (GPT-4 class or Llama 3 70B)

Use only when:

  • Multi-step reasoning needed

  • Cross-policy comparison required

  • Summarization across documents

3️⃣ Drafting / Communication

Use:

  • Larger LLM for drafting emails, summaries

Not used for decision support.

🔹 B. Hallucination Risk Policy

Model must:

  • Operate in RAG mode

  • Never answer outside retrieved context

  • Return “Not found in policy” if no match

Similarity threshold must be enforced.
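The guardrail above can be sketched as a thin wrapper around retrieval and generation. The threshold value, refusal string, and function names here are illustrative assumptions, not a specific framework's API.

```python
SIMILARITY_THRESHOLD = 0.75          # illustrative; tune against the gold set
REFUSAL = "Not found in policy."

def answer(query: str, retriever, generator) -> str:
    """Answer strictly from retrieved context, refusing on low confidence."""
    hits = retriever(query)  # expected: list of (chunk_text, similarity_score)
    context = [text for text, score in hits if score >= SIMILARITY_THRESHOLD]
    if not context:
        return REFUSAL       # never let the model answer outside the corpus
    prompt = (
        "Answer ONLY from the context below. If the answer is not present, "
        f"reply exactly '{REFUSAL}'\n\nContext:\n" + "\n---\n".join(context)
        + f"\n\nQuestion: {query}"
    )
    return generator(prompt)
```

The key property is that the refusal path is enforced in code before the model is ever called, so a weak retrieval result can never be papered over by a fluent hallucination.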

🔹 C. Data Residency & Privacy

If input contains:

  • Customer data

  • Financial details

  • PII

Then:

  • Only on-prem or VPC model allowed

  • No public API usage

🔹 D. Explainability Requirement

Generation model must:

  • Provide clause citation

  • Provide source metadata

  • Log prompt + response

Black-box autonomous generation not allowed.
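These requirements translate naturally into a structured audit record written on every generation call. The field names below are illustrative assumptions, not a mandated schema.

```python
import datetime
import hashlib
import json

def audit_record(prompt: str, response: str, citation: dict, model_version: str) -> str:
    """Serialize one generation call as an auditable JSON log line."""
    record = {
        "timestamp": datetime.datetime.now(datetime.timezone.utc).isoformat(),
        "model_version": model_version,
        "citation": citation,  # e.g. policy name, clause number, version
        # Hash lets auditors verify the prompt wasn't altered after logging.
        "prompt_sha256": hashlib.sha256(prompt.encode()).hexdigest(),
        "prompt": prompt,
        "response": response,
    }
    return json.dumps(record)
```

Each line can be shipped to immutable storage for the regulatory retention period, giving auditors the prompt, the response, and the clause it cited.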

🔹 E. Latency & Cost Governance

Define acceptable SLA:

  • Policy Q&A → < 2 sec

  • Internal agent support → < 3 sec

Choose SLM if SLA critical.

📊 3️⃣ Model Selection Based on Use Case

Here is a practical matrix you can put in an architecture document:

🏦 Enterprise Knowledge Hub

| Use Case | Embedding Model | SLM / LLM | Reason |
|----------|-----------------|-----------|--------|
| Credit policy lookup | High recall (BGE-M3) | SLM (Llama 3 8B) | RAG, deterministic |
| SOP search | Medium model OK | No LLM (semantic search only) | No generation needed |
| Clause comparison | High recall | Larger LLM | Multi-doc reasoning |
| Regulatory circular diff | High recall | Larger LLM | Analytical summarization |
| Internal chatbot | Balanced | SLM | Cost + control |
| Decision automation | High recall | Rule engine + SLM assist | Avoid autonomous AI |

🧠 Strategic Enterprise Principle

Embedding model = Retrieval accuracy
SLM/LLM = Language reasoning

Never couple their selection.

Evaluate independently.

🛡️ Governance Rule (Very Important)

Before production approval:

  1. Embedding benchmark report attached

  2. Generation hallucination test report attached

  3. Adversarial testing performed

  4. Risk classification documented

  5. Monitoring plan approved

Only then is a model production-ready.
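This checklist is easy to enforce as an automated gate in the approval workflow. The artifact names below are illustrative assumptions, not a mandated naming standard.

```python
# Required sign-off artifacts before production approval (names are illustrative).
REQUIRED_ARTIFACTS = {
    "embedding_benchmark_report",
    "hallucination_test_report",
    "adversarial_test_results",
    "risk_classification",
    "monitoring_plan_approval",
}

def production_ready(submitted: set) -> tuple:
    """Return (approved, missing_artifacts) for a release candidate."""
    missing = sorted(REQUIRED_ARTIFACTS - submitted)
    return (not missing, missing)
```

Wiring this into the deployment pipeline turns the governance rule from a document into a hard gate that cannot be skipped under delivery pressure.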

🎯 Key Takeaway

“We maintain separate selection policies for embedding and generation layers. Embeddings are chosen based on Recall@K and deterministic behavior. SLM/LLM selection is based on reasoning complexity, hallucination risk, and data residency requirements.”

🏦 ENTERPRISE AI GOVERNANCE POLICY

(For Knowledge Hub – Policy & SOP Retrieval Platform)

1️⃣ Purpose

This policy defines governance standards for the selection, validation, deployment, and monitoring of AI models (Embedding Models and SLM/LLM) used in the Bank’s Enterprise Knowledge Hub platform.

The objective is to ensure:

  • Regulatory compliance

  • Controlled hallucination risk

  • Explainability and auditability

  • Data privacy protection

  • Measurable performance

2️⃣ Scope

This policy applies to:

  • Embedding models used for semantic retrieval

  • Small Language Models (SLMs)

  • Large Language Models (LLMs)

  • Vector databases

  • Retrieval-Augmented Generation (RAG) systems

  • Internal AI-powered assistants accessing policy documents

This policy does NOT permit autonomous credit decisioning.

3️⃣ Model Classification Framework

| Model Type | Function | Risk Category |
|------------|----------|---------------|
| Embedding Model | Semantic retrieval | Low–Moderate |
| SLM (≤ 8B params) | Context-bound generation | Moderate |
| Large LLM (> 30B) | Advanced reasoning | Moderate–High |
| Autonomous Agentic AI | Decision support | High (Restricted) |

4️⃣ Embedding Model Selection Policy

4.1 Functional Requirements

Embedding models must:

  • Achieve Recall@5 ≥ 90%

  • Achieve Recall@10 ≥ 95%

  • Achieve MRR ≥ 0.70

  • Support financial terminology

  • Preserve clause-level semantics

4.2 Technical Requirements

  • Deterministic output (same input → same vector)

  • Document chunk compatibility (500–800 tokens)

  • Support metadata indexing

  • Scalable vector dimension management

  • Deployment compliant with data classification

4.3 Data Residency Rules

| Data Sensitivity | Deployment Rule |
|------------------|-----------------|
| Public | Cloud allowed |
| Internal | Private VPC |
| Confidential / Regulatory | On-prem only |
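These residency rules can be enforced in code at deployment time rather than left to manual review. A minimal sketch (classification labels and target names are assumptions):

```python
# Each data class maps to its permitted deployment targets; anything
# unlisted is denied by default.
RESIDENCY_RULES = {
    "public": {"cloud", "vpc", "on_prem"},
    "internal": {"vpc", "on_prem"},
    "confidential": {"on_prem"},
}

def deployment_allowed(classification: str, target: str) -> bool:
    """Deny-by-default check of a model deployment against data residency rules."""
    return target in RESIDENCY_RULES.get(classification, set())
```

Deny-by-default matters here: a misspelled or novel classification blocks deployment instead of quietly defaulting to the cloud.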

4.4 Model Change Management

  • Model version must be locked in registry

  • Re-embedding required upon model upgrade

  • Change approval required from:

    • Enterprise Architecture

    • Information Security

    • Model Risk Team

5️⃣ SLM / LLM Selection Policy

5.1 Use-Case-Based Model Allocation

A. Policy Q&A (RAG Only)

Approved:

  • Small Language Models (≤ 8B parameters)

Reason:

  • Reduced hallucination risk

  • Lower cost

  • Faster response

  • Better operational control

B. Complex Policy Analysis

Approved:

  • Larger LLM under controlled environment

Requires:

  • Explicit approval

  • Additional hallucination testing

  • Legal/compliance review

5.2 Hallucination Control Requirements

Generation models must:

  • Operate only with retrieved context

  • Return “Not available in policy” when no match

  • Enforce similarity threshold before answering

  • Provide clause citation in response

5.3 Prohibited Use Cases

Without Board-level approval:

  • Autonomous credit decisions

  • Risk scoring

  • Underwriting replacement

  • Regulatory interpretation without citation

6️⃣ Validation & Benchmarking Framework

6.1 Gold Dataset Requirement

Minimum:

  • 200 domain queries

  • Verified ground-truth clauses

  • Coverage across policy categories

6.2 Mandatory Metrics

  • Recall@5

  • Recall@10

  • Mean Reciprocal Rank (MRR)

  • Retrieval latency

Benchmark results must be documented before production.

6.3 CI/CD Integration

  • Automated evaluation during deployment

  • Threshold validation before release

  • Drift detection monitoring monthly
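The threshold-validation step can be sketched as a release gate the pipeline calls after the automated evaluation run. Thresholds mirror the policy text above; the metrics dictionary is assumed to come from the benchmark job.

```python
# Policy thresholds from the embedding selection criteria.
THRESHOLDS = {"recall@5": 0.90, "recall@10": 0.95, "mrr": 0.70}

def release_gate(metrics: dict) -> tuple:
    """Return (passed, failures) for a candidate model's benchmark results."""
    failures = [
        f"{name}: {metrics.get(name, 0.0):.2f} < {minimum:.2f}"
        for name, minimum in THRESHOLDS.items()
        if metrics.get(name, 0.0) < minimum
    ]
    return (not failures, failures)
```

A missing metric counts as a failure (it defaults to 0.0), so an evaluation job that silently skips a metric cannot pass the gate.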

7️⃣ Monitoring & Ongoing Oversight

7.1 Production Monitoring

Log:

  • Query

  • Retrieved chunks

  • Similarity score

  • Model version

  • Response

7.2 Drift Monitoring

Monthly:

  • Sample 50–100 queries

  • Manual validation

  • Recalculate Recall@5

If performance drops >5% → investigation required.
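The monthly check reduces to comparing the recomputed Recall@5 against the approved baseline. This sketch interprets the 5% trigger as percentage points, which is one reasonable reading of the policy:

```python
def drift_alert(baseline_recall: float, current_recall: float,
                tolerance: float = 0.05) -> bool:
    """True when sampled retrieval quality degrades beyond the allowed tolerance."""
    return (baseline_recall - current_recall) > tolerance
```

When the alert fires, the investigation typically checks for corpus changes (new policy versions not yet re-embedded) before suspecting the model itself.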

8️⃣ Explainability & Auditability

Every AI response must include:

  • Policy Name

  • Clause Number

  • Version

  • Effective Date

All interactions retained for minimum regulatory retention period.

9️⃣ Risk Assessment & Controls

| Risk | Mitigation |
|------|------------|
| Retrieval miss | High recall threshold |
| Hallucination | RAG enforcement |
| Model drift | Scheduled validation |
| Version conflict | Immutable policy storage |
| Data leakage | On-prem deployment controls |

🔟 Governance Structure

Oversight Committee:

  • CIO (Chair)

  • CRO

  • Head of Compliance

  • Enterprise Architect

  • Model Risk Officer

  • Information Security Lead

Approval required before:

  • New model introduction

  • Major version upgrade

  • Use-case expansion

🏛️ Alignment with RBI Expectations

This framework aligns with:

  • Model Risk Governance principles

  • Explainability requirements

  • Audit trail requirements

  • Data localization norms

  • Controlled AI adoption guidelines

Key design principle:

AI system provides assisted intelligence, not autonomous decision-making.

 
 
 
