Best AI Learning Lesson
- Anand Nerurkar
- Nov 26, 2025
- 9 min read
Model Drift, Context Drift
====
✅ 1. What is Context Drift?
Context Drift happens when a Large Language Model (LLM) or AI agent loses track of the conversation state, task state, or prior facts during a multi-step interaction.
Where it occurs
Multi-step agents
Complex workflows (KYC → Risk → Underwriting)
Long conversations
RAG-based interactions
Multi-agent orchestration
Symptoms
The agent forgets earlier instructions or contradicts previous steps.
The agent starts answering based on stale or missing information.
Different agents in the pipeline lose shared state.
The agent re-asks the same question or misidentifies the user/task.
Tool results come back, but the LLM ignores them.
Root Causes
Token Limit Truncation: earlier conversation turns get dropped when the prompt becomes too large.
Weak State Management: the agent relies only on the LLM's memory instead of a structured session memory.
Bad Retrieval: RAG retrieves the wrong documents due to low similarity or wrong embeddings.
Lost/Overridden Variables: the orchestrator does not pass required metadata (sessionId, customerId, etc.).
No Grounding Layer: the AI rewrites context incorrectly when summarizing or condensing memory.
How to Prevent It
Use a state store (Redis/Postgres) for canonical memory.
Use rolling summaries for long interactions (see the sketch after this list).
Apply minimum similarity thresholds for RAG retrieval.
Implement context rehydration on every step.
Keep prompts stable and versioned.
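A minimal sketch of the rolling-summary idea, assuming a `summarize` callable that wraps your own LLM client (names here are illustrative, not a specific framework API):

```python
from typing import Callable, List, Tuple

MAX_RECENT_TURNS = 6  # assumed window size; tune to your token budget

def build_context(
    summary: str,
    recent_turns: List[str],
    new_turn: str,
    summarize: Callable[[str], str],  # thin wrapper over your LLM client
) -> Tuple[str, List[str]]:
    """Keep a short window of verbatim turns; fold older ones into a rolling summary."""
    recent_turns = recent_turns + [new_turn]
    if len(recent_turns) > MAX_RECENT_TURNS:
        overflow = recent_turns[:-MAX_RECENT_TURNS]      # oldest turns to compress
        recent_turns = recent_turns[-MAX_RECENT_TURNS:]  # turns kept verbatim
        summary = summarize(
            "Update this running summary with the new turns.\n"
            f"Summary so far:\n{summary}\n\nNew turns:\n" + "\n".join(overflow)
        )
    return summary, recent_turns
```

The prompt for the next step is then built from the summary plus the recent turns, so earlier facts are never silently truncated.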
✅ 2. What is Model Drift?
Model Drift is when an AI/ML model’s predictions become less accurate over time because the real-world data has changed from the training data distribution.
This applies to ML models (fraud, scoring, segmentation) as well as fine-tuned LLMs.
Symptoms
Increasing error rate or decreasing accuracy.
Fraud model missing new fraud patterns.
Credit scoring model rejecting good customers or approving risky ones.
Chatbot/LLM suddenly giving less relevant answers.
RAG retrieval mismatches because embeddings no longer match the vector DB.
Root Causes
Data Drift: input data distribution changes (e.g., new PAN formats, new invoice layouts, new fraud vectors).
Concept Drift: the relationship between input and output changes (e.g., what indicates a “risky borrower” changes after new regulations).
Feature Drift: some features stop being useful or lose predictive power.
Model Aging: the embedding model or LLM version becomes outdated as the external world evolves.
Operational Changes: upstream systems modify schemas, fields, or patterns.
How to Prevent It
Continuous monitoring of:
Prediction confidence
Drift metrics (PSI, KS, AUC, recall, F1); a PSI sketch follows below
Feature distribution
Periodic retraining (daily, weekly, monthly depending on domain).
Automate an MLOps pipeline for:
Data validation
Drift detection
Auto-retraining
A/B testing
Canary deploys
Maintain benchmark datasets for regression testing.
Maintain lineage & versioning for models, data, embeddings.
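As a concrete example of the drift metrics above, here is a minimal PSI (Population Stability Index) sketch in NumPy; the 10-bin choice and the rule-of-thumb thresholds are common conventions, not fixed requirements:

```python
import numpy as np

def psi(expected: np.ndarray, actual: np.ndarray, bins: int = 10, eps: float = 1e-6) -> float:
    """Population Stability Index between a training (expected) and live (actual) feature."""
    # Bin edges come from the expected (training) distribution.
    edges = np.histogram_bin_edges(expected, bins=bins)
    expected_pct = np.histogram(expected, bins=edges)[0] / len(expected)
    actual_pct = np.histogram(actual, bins=edges)[0] / len(actual)
    # Clip empty bins to avoid division by zero / log(0).
    expected_pct = np.clip(expected_pct, eps, None)
    actual_pct = np.clip(actual_pct, eps, None)
    return float(np.sum((actual_pct - expected_pct) * np.log(actual_pct / expected_pct)))

# Rule of thumb: < 0.1 stable, 0.1–0.25 moderate drift, > 0.25 significant drift.
```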
🆚 Context Drift vs Model Drift (Quick Comparison)
| Aspect | Context Drift | Model Drift |
| --- | --- | --- |
| Where it occurs | LLM conversations, agents, RAG, orchestration | ML models, fine-tuned LLMs |
| What changes | The conversation or agent memory | The underlying data distribution |
| Symptom | LLM loses track of meaning | Model accuracy deteriorates |
| Cause | Context mismanagement or token limits | Changing world, new behavior patterns |
| Fix | Better memory, retrieval, orchestration | Monitoring, retraining, versioning |
| Detected by | Agent loops, inconsistent answers | Drift dashboards, performance decline |
🎯
Context Drift → “Agent forgets the task.”
Model Drift → “Model forgets the world.”
Minimum Similarity and Context Rehydration on Every Step
===
✅ 1. What Is Minimum Similarity (in RAG)?
When a user query comes in, the system performs vector search to find the most relevant documents.
Each document returns a similarity score (typically cosine similarity) → between 0 and 1.
Example:
0.92 = Highly relevant
0.75 = Maybe relevant
0.52 = Low relevance
0.20 = Not relevant
If you don’t set a minimum similarity threshold, RAG may return wrong or unrelated documents. This leads to context drift, hallucination, and wrong answers.
✅ Minimum Similarity Threshold
You define a cut-off score like:
0.80+ → Only return if highly relevant
0.70–0.80 → Return with caution (fallback context)
<0.70 → Don’t return any document, use fallback response
Why it matters
Without minimum similarity:
LLM receives irrelevant documents
It uses incorrect context
It “drifts” into the wrong answer
Agents produce incorrect actions
Example for BFSI
If the user asks: “What is the maximum LRS limit for outward remittance?”
But vector search returns an unrelated RBI circular (low similarity, like 0.45), the LLM will answer incorrectly.
Setting similarity threshold protects you from wrong grounding.
🔧 Recommended Thresholds
| Domain | Threshold |
| --- | --- |
| BFSI / Regulatory | 0.85 |
| Technical KB | 0.80 |
| General FAQ | 0.75 |
High-risk domains require higher thresholds.
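As a minimal sketch of enforcing such a threshold after vector search (the structure of `hits` is an assumption about whatever your vector store returns):

```python
# Hypothetical post-retrieval filter; `hits` is assumed to be a list of
# (document, cosine_similarity) pairs returned by your vector store.
MIN_SIMILARITY = 0.85  # BFSI / regulatory domain

def filter_hits(hits, threshold=MIN_SIMILARITY):
    relevant = [(doc, score) for doc, score in hits if score >= threshold]
    if not relevant:
        # Nothing crosses the bar: answer from a safe fallback instead of weak context.
        return None
    return relevant
```

When `filter_hits` returns None, the agent should use a fallback response rather than grounding on weak context.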
🎯
“Minimum similarity ensures that RAG only returns documents that are truly relevant. Anything below the threshold is ignored to avoid hallucinations and context drift.”
✅ 2. What Is Context Rehydration on Every Step?
In multi-step agent workflows, the LLM can’t remember the entire conversation because:
Token limits
Summaries lose details
The agent works across many tools and calls
State gets fragmented
Context Rehydration means:
At every step of the agent or workflow, we rebuild the full state of the conversation/task from a canonical memory store, instead of relying only on the LLM’s temporary memory.
The principle:
“Never trust the LLM to remember — always rehydrate the context.”
🧠 How it works (step-by-step)
Step 1 — Store structured state
Store state in:
Redis
Postgres
MongoDB
Vector DB with metadata
Agent memory store
Example:
{
"customerId": "12345",
"loanAmount": "15L",
"riskScore": "Moderate",
"documents": ["PAN.pdf", "Aadhaar.pdf"],
"lastAction": "RiskAgentCompleted"
}
Step 2 — Every time an agent runs
Before calling the model → pull fresh state from the memory store.
Step 3 — Rebuild prompt with accurate context
Inject the complete state into the agent’s input:
You are the Loan Evaluation Agent.
Here is the complete state:
- Customer ID: 12345
- Risk Score: Moderate
- Last step completed: RiskAgent
- Required documents: PAN, Aadhaar
- Missing documents: Bank statement
Step 4 — Agent produces next step
No forgetting
No drift
No contradiction
Step 5 — Save updated state back
After each decision, update the state and save it.
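A minimal sketch of this rehydrate → act → persist loop using Redis as the canonical store; the key naming and the `run_agent` callable are illustrative assumptions:

```python
import json
import redis

r = redis.Redis(host="localhost", port=6379, decode_responses=True)

def run_step(session_id: str, run_agent) -> dict:
    """Rehydrate state, let the agent act on it, persist the updated state."""
    key = f"session:{session_id}"
    raw = r.get(key)                     # Step 2: pull fresh state before the model call
    state = json.loads(raw) if raw else {}
    new_state = run_agent(state)         # Steps 3-4: the agent rebuilds its prompt from `state`
    r.set(key, json.dumps(new_state))    # Step 5: save updated state back
    return new_state
```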
🎯 Why Context Rehydration Prevents Drift
Without rehydration:
Agent forgets previous values
Agents contradict each other
Workflow loops (infinite loops)
Wrong decisions based on stale memory
With rehydration:
Deterministic
Repeatable
Auditable
Stable
No drift
🏦 Example in Digital Lending
Underwriting Agent receives fresh state:
All documents extracted
CIBIL score = 780
Fraud score = Low
Loan ask = 15 lakh
Income verified = Yes
This ensures:
accurate risk decision
no confusion between customers
consistent underwriting
🎯
“Context rehydration means the agent never depends on temporary model memory. On every step, it re-pulls the entire state from an authoritative store and reconstructs the full context before reasoning. This prevents loops, drift, and inconsistencies in multi-agent workflows.”
FIXING CONTEXT & DATA DRIFT IN A RAG SYSTEM
🎯 Business Use Case
We built a Loan Policy Q&A RAG system for the operations team:
Source: RBI circulars + internal lending policy PDFs
Users: Credit officers & branch ops
Stack:
Ingestion → Chunking → Embeddings → PGVector
Query → Hybrid Search → Re-ranking → LLM (GPT)
🚨 PRODUCTION ISSUE OBSERVED (DRIFT SYMPTOMS)
After 2–3 months in production:
Users reported:
“System is giving outdated answers”
“Prepayment charges look wrong”
Evaluation logs showed:
High semantic similarity
But business-incorrect answers
Ground truth:
Policy document was updated
But old embeddings were still being retrieved
This is a classic case of:
Knowledge Drift + Embedding Drift + Metadata Drift
✅ STEP-BY-STEP HOW I DIAGNOSED THE DRIFT
1️⃣ Detected via Monitoring
We had:
Query logs
Top-k retrieved chunks
Final answer
Human feedback flag (thumbs down)
Pattern observed:
Same chunk IDs repeatedly retrieved
Despite the document version being updated in SharePoint
2️⃣ Root Cause Analysis
We found 3 concrete causes:
| Drift Type | Cause |
| --- | --- |
| Data Drift | New PDF uploaded but not re-ingested |
| Embedding Drift | Old document chunks still in vector DB |
| Metadata Drift | Document version field not updated |
So the system was:
✅ Semantically correct
❌ Temporally incorrect
✅ PRODUCTION FIX IMPLEMENTED (HARD GUARANTEES)
✅ 1. Document Versioning at Ingestion
Every document stored as:
doc_id
doc_version
doc_hash (SHA-256)
upload_timestamp
source_system
Now the version becomes part of the retrieval filter.
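A minimal sketch of building that ingestion-time record (field names mirror the list above; the `is_active` flag is an assumption used by the later retrieval filter):

```python
import hashlib
from datetime import datetime, timezone

def doc_metadata(doc_id: str, doc_version: str, content: bytes, source_system: str) -> dict:
    """Build the versioned metadata record stored alongside every ingested document."""
    return {
        "doc_id": doc_id,
        "doc_version": doc_version,
        "doc_hash": hashlib.sha256(content).hexdigest(),  # SHA-256 of the raw document
        "upload_timestamp": datetime.now(timezone.utc).isoformat(),
        "source_system": source_system,
        "is_active": True,  # flipped to False when a newer version supersedes it
    }
```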
✅ 2. Delta Re-Embedding Pipeline (Not Full Rebuild)
Instead of re-embedding everything:
For each incoming document:
Compute hash of each chunk
Compare with previous chunk hashes
Only:
New chunks → Embed
Modified chunks → Re-embed
Deleted chunks → Soft delete vectors
This reduced:
✅ Cost by ~70%
✅ Ingestion time by ~60%
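A minimal sketch of the chunk-level delta classification; `previous_hashes` is assumed to be the chunk_id → SHA-256 map stored alongside the vectors:

```python
import hashlib

def plan_delta(chunks: dict, previous_hashes: dict):
    """Classify chunks as new / modified / deleted based on content hashes."""
    to_embed, to_reembed = [], []
    for chunk_id, text in chunks.items():
        h = hashlib.sha256(text.encode("utf-8")).hexdigest()
        if chunk_id not in previous_hashes:
            to_embed.append(chunk_id)        # new chunk -> embed
        elif previous_hashes[chunk_id] != h:
            to_reembed.append(chunk_id)      # modified chunk -> re-embed
    # chunks that disappeared from the new document -> soft delete their vectors
    to_soft_delete = [cid for cid in previous_hashes if cid not in chunks]
    return to_embed, to_reembed, to_soft_delete
```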
✅ 3. Hard Metadata Filters During Retrieval
At query time, we enforced:
doc_version = latest
is_active = true
business_unit = lending
So even if old vectors exist, they are never retrieved.
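A minimal sketch of what such a filtered query could look like against PGVector; the table and column names are assumptions about the schema, and `<=>` is pgvector's cosine-distance operator:

```python
# Assumed schema: chunks(doc_id, doc_version, is_active, business_unit, content, embedding vector).
# Run with psycopg or any Postgres driver; doc_version values are assumed to sort correctly.
RETRIEVE_SQL = """
SELECT c.content,
       1 - (c.embedding <=> %(query_vec)s) AS similarity
FROM chunks c
WHERE c.is_active = true
  AND c.business_unit = 'lending'
  AND c.doc_version = (SELECT max(doc_version) FROM chunks c2 WHERE c2.doc_id = c.doc_id)
ORDER BY c.embedding <=> %(query_vec)s
LIMIT 5;
"""
```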
✅ 4. Similarity + Time Decay Scoring
Final relevance score:
final_score = (semantic_score * 0.8) + (recency_score * 0.2)
This ensures:
Newer policies naturally rank higher
Even if semantics are similar
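A minimal sketch of one way to compute the recency component, assuming an exponential decay with a configurable half-life and a timezone-aware upload timestamp:

```python
from datetime import datetime, timezone

HALF_LIFE_DAYS = 90  # assumed; regulatory content may need a shorter half-life

def recency_score(upload_timestamp: datetime) -> float:
    """1.0 for brand-new documents, decaying toward 0 as they age."""
    age_days = (datetime.now(timezone.utc) - upload_timestamp).days
    return 0.5 ** (age_days / HALF_LIFE_DAYS)

def final_score(semantic_score: float, upload_timestamp: datetime) -> float:
    return semantic_score * 0.8 + recency_score(upload_timestamp) * 0.2
```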
✅ 5. Post-Answer Drift Validation
After the LLM generates the answer:
We run a claim-to-chunk verification.
If a claim is not supported by the latest document version, we reject the answer and respond:
“This policy has been recently updated. Please refer to the latest document.”
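A minimal sketch of the claim-to-chunk check using embedding similarity; the `embed` callable and the 0.80 support threshold are assumptions, and an NLI or LLM-as-judge verifier can be swapped in:

```python
import numpy as np

SUPPORT_THRESHOLD = 0.80  # assumed cut-off for "claim is grounded in a chunk"

def unsupported_claims(answer_sentences, chunk_texts, embed):
    """Return answer sentences whose best cosine similarity to any retrieved chunk is too low."""
    chunk_vecs = np.array([embed(c) for c in chunk_texts])
    chunk_vecs /= np.linalg.norm(chunk_vecs, axis=1, keepdims=True)
    flagged = []
    for sentence in answer_sentences:
        v = np.asarray(embed(sentence))
        v = v / np.linalg.norm(v)
        if float(np.max(chunk_vecs @ v)) < SUPPORT_THRESHOLD:
            flagged.append(sentence)
    return flagged  # non-empty -> reject the answer and show the "policy updated" fallback
```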
✅ 6. Automated Drift Detection Job (Nightly)
A scheduled job checks:
Vector DB vs Document Store
Missing re-embeddings
Version mismatches
Orphan vectors
And raises:
Slack alerts
Jira tickets automatically
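A minimal sketch of the reconciliation logic behind that job; `doc_store` and `vector_store` are assumed doc_id → metadata maps pulled from each system, and `alert` stands in for the Slack/Jira integration:

```python
def audit_drift(doc_store: dict, vector_store: dict, alert) -> None:
    """Compare the document store against the vector DB and raise alerts on mismatches."""
    for doc_id, meta in doc_store.items():
        vec = vector_store.get(doc_id)
        if vec is None:
            alert(f"Missing re-embedding: {doc_id} has no vectors")      # never ingested
        elif vec["doc_version"] != meta["doc_version"]:
            alert(f"Version mismatch: {doc_id} vectors at {vec['doc_version']}, "
                  f"document at {meta['doc_version']}")                  # stale embeddings
    for doc_id in vector_store:
        if doc_id not in doc_store:
            alert(f"Orphan vectors: {doc_id} no longer exists in the document store")
```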
✅ FINAL PRODUCTION RESULT
| Metric | Before | After |
| --- | --- | --- |
| Policy answer accuracy | 71% | 96% |
| User complaints | High | Near zero |
| Hallucination rate | 18% | < 3% |
| Re-ingestion cost | High | Optimized |
✅
“In production, we observed RAG drift when updated loan policy documents were not reflected in answers. I diagnosed this as data drift and embedding drift caused by missing version control. We fixed it by introducing document versioning, delta re-embedding using chunk hashes, strict metadata filters during retrieval, and recency-weighted scoring. We also added a nightly drift audit job and post-answer claim validation. This reduced hallucinations below 3% and restored business trust.”
✅
“We fixed RAG drift by enforcing document versioning, delta re-embedding, recency-aware ranking, and post-answer verification — turning a failing production system into a trusted enterprise AI platform.”
✅ RAG CONTEXT DRIFT — DEBUGGING CHECKLIST
(Context drift = right model, wrong or outdated knowledge)
1️⃣ DOCUMENT & INGESTION CHECK
First confirm whether the source truth changed:
☐ Has the source document changed? (PDF, policy, KB, DB)
☐ Is there a new version or amendment?
☐ Was ingestion pipeline triggered after the change?
☐ Is last_ingested_timestamp ≥ last_updated_timestamp?
✅ If NO, root cause = data pipeline failure
2️⃣ CHUNKING DRIFT CHECK
Poor chunking causes semantic distortion:
☐ Are chunk boundaries still valid after doc update?
☐ Any headers merged with wrong sections?
☐ Chunk size increased beyond embedding sweet spot (512–1024 tokens)?
☐ Overlap missing → context break?
✅ If YES, root cause = structural context drift
3️⃣ EMBEDDING DRIFT CHECK
Embedding mismatch leads to silent failure:
☐ Was embedding model changed?
☐ Are old and new versions mixed?
☐ Same tokenizer used for ingestion & query?
☐ Any silent fallback to another embedding model?
✅ If YES, root cause = vector space corruption
4️⃣ VECTOR STORE INTEGRITY
Retrieve-level corruption:
☐ Are deleted docs still returning?
☐ Is soft-delete respected?
☐ Are duplicate vectors present?
☐ Metadata filters applied correctly?
✅ If NO, root cause = retrieval contamination
5️⃣ RETRIEVAL QUALITY CHECK
Validate actual retrieval, not just generation:
☐ Are top-5 chunks actually relevant on manual review?
☐ Similarity scores abnormally high for wrong content?
☐ Too many low-score chunks crossing threshold?
✅ If YES, root cause = retriever ranking drift
6️⃣ RERANKER HEALTH CHECK
Cross-encoder or reranker failure:
☐ Is reranker enabled?
☐ Model updated silently?
☐ Latency spike causing timeout fallback?
☐ Reranker scores uniformly flat?
✅ If YES, root cause = precision layer collapse
7️⃣ PROMPT CONTEXT INTEGRITY
Sometimes drift is prompt, not data:
☐ Is full context injected?
☐ Any truncation due to token limits?
☐ Instruction order changed?
☐ Context accidentally summarized before injection?
✅ If YES, root cause = prompt injection loss
8️⃣ POST-ANSWER VERIFICATION
Grounding failure detection:
☐ Does each generated claim map to a retrieved chunk?
☐ Any sentence unsupported by source?
☐ Is citation missing?
✅ If YES, root cause = grounding enforcement failure
✅ MODEL DRIFT — DEBUGGING CHECKLIST
(Model drift = same data, different behavior)
1️⃣ MODEL VERSION CHANGE
First check if the brain changed:
☐ Was the LLM upgraded?
☐ Silent version switch by provider?
☐ Context window reduced?
☐ Safety layer tightened?
✅ If YES, root cause = provider-side model drift
2️⃣ PROMPT REGRESSION CHECK
Most model drift is actually prompt drift:
☐ Any change in system prompts?
☐ Added new guardrails?
☐ Tool calling schema changed?
☐ Role order changed?
✅ If YES, root cause = instruction drift
3️⃣ TEMPERATURE & SAMPLING
Behavior shifts from decoding params:
☐ Temperature increase?
☐ Top-p changed?
☐ Frequency penalty modified?
✅ If YES, root cause = generation randomness drift
4️⃣ TOOL CALLING & AGENT BEHAVIOR
Agent loops often start here:
☐ Tool response format changed?
☐ Schema mismatch?
☐ Failure retried without state?
☐ STOP condition removed?
✅ If YES, root cause = agent execution drift
5️⃣ EVALUATION SCORE REGRESSION
Objective detection:
☐ Drop in exact match?
☐ Increase in ungrounded claims?
☐ Spike in refusal / “I don’t know”?
☐ Human feedback trend worsened?
✅ If YES, root cause = behavioral performance drift
✅
First I confirm whether the source data changed. Then I validate embedding and retrieval consistency. Next I check reranker and prompt injection integrity. If data is correct, I validate if the LLM or decoding parameters changed. Finally, I verify agent tool execution and stop conditions.
✅
Context Drift → “My knowledge is outdated”
Model Drift → “My reasoning style changed”
“When we see wrong answers in production, I first classify it as either context drift or model drift. For context drift, I trace ingestion freshness, chunking integrity, embedding model consistency, vector store contamination, retrieval ranking, and prompt injection completeness. For model drift, I check LLM version changes, prompt regressions, decoding parameters, tool execution stability, and evaluation score drops. This layered checklist lets us isolate drift in minutes instead of days.”