Best AI Learning Lesson
- Anand Nerurkar
- Nov 26, 2025
- 9 min read
Model Drift, Context Drift
====
✅ 1. What is Context Drift?
Context Drift happens when a Large Language Model (LLM) or AI agent loses track of the conversation state, task state, or prior facts during a multi-step interaction.
Where it occurs
Multi-step agents
Complex workflows (KYC → Risk → Underwriting)
Long conversations
RAG-based interactions
Multi-agent orchestration
Symptoms
The agent forgets earlier instructions or contradicts previous steps.
The agent starts answering based on stale or missing information.
Different agents in the pipeline lose shared state.
The agent re-asks the same question or misidentifies the user/task.
Tool results come back, but the LLM ignores them.
Root Causes
Token Limit Truncation: earlier conversation turns get dropped when the prompt becomes too large.
Weak State Management: the agent relies only on the LLM's memory instead of a structured session memory.
Bad Retrieval: RAG retrieves the wrong documents due to low similarity or wrong embeddings.
Lost/Overridden Variables: the orchestrator does not pass required metadata (sessionId, customerId, etc.).
No Grounding Layer: the AI rewrites context incorrectly when summarizing or condensing memory.
How to Prevent It
Use a state store (Redis/Postgres) for canonical memory.
Use rolling summaries for long interactions (see the sketch after this list).
Apply minimum similarity thresholds for RAG retrieval.
Implement context rehydration on every step.
Keep prompts stable and versioned.
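A minimal sketch of the rolling-summary idea, assuming a `summarize` callable that wraps your own LLM client (names here are illustrative, not a specific framework API):

```python
from typing import Callable, List, Tuple

MAX_RECENT_TURNS = 6  # assumed window size; tune to your token budget

def build_context(
    summary: str,
    recent_turns: List[str],
    new_turn: str,
    summarize: Callable[[str], str],  # thin wrapper over your LLM client
) -> Tuple[str, List[str]]:
    """Keep a short window of verbatim turns; fold older ones into a rolling summary."""
    recent_turns = recent_turns + [new_turn]
    if len(recent_turns) > MAX_RECENT_TURNS:
        overflow = recent_turns[:-MAX_RECENT_TURNS]      # oldest turns to compress
        recent_turns = recent_turns[-MAX_RECENT_TURNS:]  # turns kept verbatim
        summary = summarize(
            "Update this running summary with the new turns.\n"
            f"Summary so far:\n{summary}\n\nNew turns:\n" + "\n".join(overflow)
        )
    return summary, recent_turns
```

The prompt for the next step is then built from the summary plus the recent turns, so earlier facts are never silently truncated.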
✅ 2. What is Model Drift?
Model Drift is when an AI/ML model’s predictions become less accurate over time because the real-world data has changed from the training data distribution.
This applies to ML models (fraud, scoring, segmentation) as well as fine-tuned LLMs.
Symptoms
Increasing error rate or decreasing accuracy.
Fraud model missing new fraud patterns.
Credit scoring model rejecting good customers or approving risky ones.
Chatbot/LLM suddenly giving less relevant answers.
RAG retrieval mismatches because embeddings no longer match the vector DB.
Root Causes
Data Drift: input data distribution changes (e.g., new PAN formats, new invoice layouts, new fraud vectors).
Concept Drift: the relationship between input and output changes (e.g., what indicates a “risky borrower” changes after new regulations).
Feature Drift: some features stop being useful or lose predictive power.
Model Aging: the embedding model or LLM version becomes outdated as the external world evolves.
Operational Changes: upstream systems modify schemas, fields, or patterns.
How to Prevent It
Continuous monitoring of:
Prediction confidence
Drift metrics (PSI, KS, AUC, recall, F1); a PSI sketch follows below
Feature distribution
Periodic retraining (daily, weekly, monthly depending on domain).
Automate an MLOps pipeline for:
Data validation
Drift detection
Auto-retraining
A/B testing
Canary deploys
Maintain benchmark datasets for regression testing.
Maintain lineage & versioning for models, data, embeddings.
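As a concrete example of the drift metrics above, here is a minimal PSI (Population Stability Index) sketch in NumPy; the 10-bin choice and the rule-of-thumb thresholds are common conventions, not fixed requirements:

```python
import numpy as np

def psi(expected: np.ndarray, actual: np.ndarray, bins: int = 10, eps: float = 1e-6) -> float:
    """Population Stability Index between a training (expected) and live (actual) feature."""
    # Bin edges come from the expected (training) distribution.
    edges = np.histogram_bin_edges(expected, bins=bins)
    expected_pct = np.histogram(expected, bins=edges)[0] / len(expected)
    actual_pct = np.histogram(actual, bins=edges)[0] / len(actual)
    # Clip empty bins to avoid division by zero / log(0).
    expected_pct = np.clip(expected_pct, eps, None)
    actual_pct = np.clip(actual_pct, eps, None)
    return float(np.sum((actual_pct - expected_pct) * np.log(actual_pct / expected_pct)))

# Rule of thumb: < 0.1 stable, 0.1–0.25 moderate drift, > 0.25 significant drift.
```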
🆚 Context Drift vs Model Drift (Quick Comparison)
| Aspect | Context Drift | Model Drift |
| --- | --- | --- |
| Where it occurs | LLM conversations, agents, RAG, orchestration | ML models, fine-tuned LLMs |
| What changes | The conversation or agent memory | The underlying data distribution |
| Symptom | LLM loses track of meaning | Model accuracy deteriorates |
| Cause | Context mismanagement or token limits | Changing world, new behavior patterns |
| Fix | Better memory, retrieval, orchestration | Monitoring, retraining, versioning |
| Detected by | Agent loops, inconsistent answers | Drift dashboards, performance decline |
🎯
Context Drift → “Agent forgets the task.”
Model Drift → “Model forgets the world.”
Minimum Similarity and Context Rehydration on Every Step
===
✅ 1. What Is Minimum Similarity (in RAG)?
When a user query comes in, the system performs vector search to find the most relevant documents.
Each document returns a similarity score (typically cosine similarity) → between 0 and 1.
Example:
0.92 = Highly relevant
0.75 = Maybe relevant
0.52 = Low relevance
0.20 = Not relevant
If you don’t set a minimum similarity threshold, RAG may return wrong or unrelated documents. This leads to context drift, hallucination, and wrong answers.
✅ Minimum Similarity Threshold
You define a cut-off score like:
0.80+ → Only return if highly relevant
0.70–0.80 → Return with caution (fallback context)
<0.70 → Don’t return any document, use fallback response
Why it matters
Without minimum similarity:
LLM receives irrelevant documents
It uses incorrect context
It “drifts” into the wrong answer
Agents produce incorrect actions
Example for BFSI
If the user asks: “What is the maximum LRS limit for outward remittance?”
But vector search returns an unrelated RBI circular (low similarity, like 0.45), the LLM will answer incorrectly.
Setting similarity threshold protects you from wrong grounding.
🔧 Recommended Thresholds
| Domain | Threshold |
| --- | --- |
| BFSI / Regulatory | 0.85 |
| Technical KB | 0.80 |
| General FAQ | 0.75 |
High-risk domains require higher thresholds.
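As a minimal sketch of enforcing such a threshold after vector search (the structure of `hits` is an assumption about whatever your vector store returns):

```python
# Hypothetical post-retrieval filter; `hits` is assumed to be a list of
# (document, cosine_similarity) pairs returned by your vector store.
MIN_SIMILARITY = 0.85  # BFSI / regulatory domain

def filter_hits(hits, threshold=MIN_SIMILARITY):
    relevant = [(doc, score) for doc, score in hits if score >= threshold]
    if not relevant:
        # Nothing crosses the bar: answer from a safe fallback instead of weak context.
        return None
    return relevant
```

When `filter_hits` returns None, the agent should use a fallback response rather than grounding on weak context.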
🎯
“Minimum similarity ensures that RAG only returns documents that are truly relevant. Anything below the threshold is ignored to avoid hallucinations and context drift.”
✅ 2. What Is Context Rehydration on Every Step?
In multi-step agent workflows, the LLM can’t remember the entire conversation because:
Token limits
Summaries lose details
The agent works across many tools and calls
State gets fragmented
Context Rehydration means:
At every step of the agent or workflow, we rebuild the full state of the conversation/task from a canonical memory store, instead of relying only on the LLM’s temporary memory.
The principle:
“Never trust the LLM to remember — always rehydrate the context.”
🧠 How it works (step-by-step)
Step 1 — Store structured state
Store state in:
Redis
Postgres
MongoDB
Vector DB with metadata
Agent memory store
Example:
{
"customerId": "12345",
"loanAmount": "15L",
"riskScore": "Moderate",
"documents": ["PAN.pdf", "Aadhaar.pdf"],
"lastAction": "RiskAgentCompleted"
}
Step 2 — Every time an agent runs
Before calling the model → pull fresh state from the memory store.
Step 3 — Rebuild prompt with accurate context
Inject the complete state into the agent’s input:
You are the Loan Evaluation Agent.
Here is the complete state:
- Customer ID: 12345
- Risk Score: Moderate
- Last step completed: RiskAgent
- Required documents: PAN, Aadhaar
- Missing documents: Bank statement
Step 4 — Agent produces next step
No forgetting
No drift
No contradiction
Step 5 — Save updated state back
After each decision, update the state and save it.
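A minimal sketch of this rehydrate → act → persist loop using Redis as the canonical store; the key naming and the `run_agent` callable are illustrative assumptions:

```python
import json
import redis

r = redis.Redis(host="localhost", port=6379, decode_responses=True)

def run_step(session_id: str, run_agent) -> dict:
    """Rehydrate state, let the agent act on it, persist the updated state."""
    key = f"session:{session_id}"
    raw = r.get(key)                     # Step 2: pull fresh state before the model call
    state = json.loads(raw) if raw else {}
    new_state = run_agent(state)         # Steps 3-4: the agent rebuilds its prompt from `state`
    r.set(key, json.dumps(new_state))    # Step 5: save updated state back
    return new_state
```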
🎯 Why Context Rehydration Prevents Drift
Without rehydration:
Agent forgets previous values
Agents contradict each other
Workflow loops (infinite loops)
Wrong decisions based on stale memory
With rehydration:
Deterministic
Repeatable
Auditable
Stable
No drift
🏦 Example in Digital Lending
Underwriting Agent receives fresh state:
All documents extracted
CIBIL score = 780
Fraud score = Low
Loan ask = 15 lakh
Income verified = Yes
This ensures:
accurate risk decision
no confusion between customers
consistent underwriting
🎯
“Context rehydration means the agent never depends on temporary model memory. On every step, it re-pulls the entire state from an authoritative store and reconstructs the full context before reasoning. This prevents loops, drift, and inconsistencies in multi-agent workflows.”
FIXING CONTEXT & DATA DRIFT IN A RAG SYSTEM
🎯 Business Use Case
We built a Loan Policy Q&A RAG system for the operations team:
Source: RBI circulars + internal lending policy PDFs
Users: Credit officers & branch ops
Stack:
Ingestion → Chunking → Embeddings → PGVector
Query → Hybrid Search → Re-ranking → LLM (GPT)
🚨 PRODUCTION ISSUE OBSERVED (DRIFT SYMPTOMS)
After 2–3 months in production:
Users reported:
“System is giving outdated answers”
“Prepayment charges look wrong”
Evaluation logs showed:
High semantic similarity
But business-incorrect answers
Ground truth:
Policy document was updated
But old embeddings were still being retrieved
This is a classic case of:
Knowledge Drift + Embedding Drift + Metadata Drift
✅ STEP-BY-STEP HOW I DIAGNOSED THE DRIFT
1️⃣ Detected via Monitoring
We had:
Query logs
Top-k retrieved chunks
Final answer
Human feedback flag (thumbs down)
Pattern observed:
Same chunk IDs repeatedly retrieved
Despite the document version being updated in SharePoint
2️⃣ Root Cause Analysis
We found 3 concrete causes:
| Drift Type | Cause |
| --- | --- |
| Data Drift | New PDF uploaded but not re-ingested |
| Embedding Drift | Old document chunks still in vector DB |
| Metadata Drift | Document version field not updated |
So the system was:
✅ Semantically correct
❌ Temporally incorrect
✅ PRODUCTION FIX IMPLEMENTED (HARD GUARANTEES)
✅ 1. Document Versioning at Ingestion
Every document stored as:
doc_id
doc_version
doc_hash (SHA-256)
upload_timestamp
source_system
Now the version becomes part of the retrieval filter.
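A minimal sketch of building that ingestion-time record (field names mirror the list above; the `is_active` flag is an assumption used by the later retrieval filter):

```python
import hashlib
from datetime import datetime, timezone

def doc_metadata(doc_id: str, doc_version: str, content: bytes, source_system: str) -> dict:
    """Build the versioned metadata record stored alongside every ingested document."""
    return {
        "doc_id": doc_id,
        "doc_version": doc_version,
        "doc_hash": hashlib.sha256(content).hexdigest(),  # SHA-256 of the raw document
        "upload_timestamp": datetime.now(timezone.utc).isoformat(),
        "source_system": source_system,
        "is_active": True,  # flipped to False when a newer version supersedes it
    }
```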
✅ 2. Delta Re-Embedding Pipeline (Not Full Rebuild)
Instead of re-embedding everything:
For each incoming document:
Compute hash of each chunk
Compare with previous chunk hashes
Only:
New chunks → Embed
Modified chunks → Re-embed
Deleted chunks → Soft delete vectors
This reduced:
✅ Cost by ~70%
✅ Ingestion time by ~60%
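A minimal sketch of the chunk-level delta classification; `previous_hashes` is assumed to be the chunk_id → SHA-256 map stored alongside the vectors:

```python
import hashlib

def plan_delta(chunks: dict, previous_hashes: dict):
    """Classify chunks as new / modified / deleted based on content hashes."""
    to_embed, to_reembed = [], []
    for chunk_id, text in chunks.items():
        h = hashlib.sha256(text.encode("utf-8")).hexdigest()
        if chunk_id not in previous_hashes:
            to_embed.append(chunk_id)        # new chunk -> embed
        elif previous_hashes[chunk_id] != h:
            to_reembed.append(chunk_id)      # modified chunk -> re-embed
    # chunks that disappeared from the new document -> soft delete their vectors
    to_soft_delete = [cid for cid in previous_hashes if cid not in chunks]
    return to_embed, to_reembed, to_soft_delete
```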
✅ 3. Hard Metadata Filters During Retrieval
At query time, we enforced:
doc_version = latest
is_active = true
business_unit = lending
So even if old vectors exist, they are never retrieved.
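A minimal sketch of what such a filtered query could look like against PGVector; the table and column names are assumptions about the schema, and `<=>` is pgvector's cosine-distance operator:

```python
# Assumed schema: chunks(doc_id, doc_version, is_active, business_unit, content, embedding vector).
# Run with psycopg or any Postgres driver; doc_version values are assumed to sort correctly.
RETRIEVE_SQL = """
SELECT c.content,
       1 - (c.embedding <=> %(query_vec)s) AS similarity
FROM chunks c
WHERE c.is_active = true
  AND c.business_unit = 'lending'
  AND c.doc_version = (SELECT max(doc_version) FROM chunks c2 WHERE c2.doc_id = c.doc_id)
ORDER BY c.embedding <=> %(query_vec)s
LIMIT 5;
"""
```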
✅ 4. Similarity + Time Decay Scoring
Final relevance score:
final_score = (semantic_score * 0.8) + (recency_score * 0.2)
This ensures:
Newer policies naturally rank higher
Even if semantics are similar
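A minimal sketch of one way to compute the recency component, assuming an exponential decay with a configurable half-life and a timezone-aware upload timestamp:

```python
from datetime import datetime, timezone

HALF_LIFE_DAYS = 90  # assumed; regulatory content may need a shorter half-life

def recency_score(upload_timestamp: datetime) -> float:
    """1.0 for brand-new documents, decaying toward 0 as they age."""
    age_days = (datetime.now(timezone.utc) - upload_timestamp).days
    return 0.5 ** (age_days / HALF_LIFE_DAYS)

def final_score(semantic_score: float, upload_timestamp: datetime) -> float:
    return semantic_score * 0.8 + recency_score(upload_timestamp) * 0.2
```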
✅ 5. Post-Answer Drift Validation
After the LLM generates the answer:
We run a claim-to-chunk verification.
If a claim is not supported by the latest document version, we reject the answer and respond:
“This policy has been recently updated. Please refer to the latest document.”
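A minimal sketch of the claim-to-chunk check using embedding similarity; the `embed` callable and the 0.80 support threshold are assumptions, and an NLI or LLM-as-judge verifier can be swapped in:

```python
import numpy as np

SUPPORT_THRESHOLD = 0.80  # assumed cut-off for "claim is grounded in a chunk"

def unsupported_claims(answer_sentences, chunk_texts, embed):
    """Return answer sentences whose best cosine similarity to any retrieved chunk is too low."""
    chunk_vecs = np.array([embed(c) for c in chunk_texts])
    chunk_vecs /= np.linalg.norm(chunk_vecs, axis=1, keepdims=True)
    flagged = []
    for sentence in answer_sentences:
        v = np.asarray(embed(sentence))
        v = v / np.linalg.norm(v)
        if float(np.max(chunk_vecs @ v)) < SUPPORT_THRESHOLD:
            flagged.append(sentence)
    return flagged  # non-empty -> reject the answer and show the "policy updated" fallback
```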
✅ 6. Automated Drift Detection Job (Nightly)
A scheduled job checks:
Vector DB vs Document Store
Missing re-embeddings
Version mismatches
Orphan vectors
And raises:
Slack alerts
Jira tickets automatically
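A minimal sketch of the reconciliation logic behind that job; `doc_store` and `vector_store` are assumed doc_id → metadata maps pulled from each system, and `alert` stands in for the Slack/Jira integration:

```python
def audit_drift(doc_store: dict, vector_store: dict, alert) -> None:
    """Compare the document store against the vector DB and raise alerts on mismatches."""
    for doc_id, meta in doc_store.items():
        vec = vector_store.get(doc_id)
        if vec is None:
            alert(f"Missing re-embedding: {doc_id} has no vectors")      # never ingested
        elif vec["doc_version"] != meta["doc_version"]:
            alert(f"Version mismatch: {doc_id} vectors at {vec['doc_version']}, "
                  f"document at {meta['doc_version']}")                  # stale embeddings
    for doc_id in vector_store:
        if doc_id not in doc_store:
            alert(f"Orphan vectors: {doc_id} no longer exists in the document store")
```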
✅ FINAL PRODUCTION RESULT
| Metric | Before | After |
| --- | --- | --- |
| Policy answer accuracy | 71% | 96% |
| User complaints | High | Near zero |
| Hallucination rate | 18% | < 3% |
| Re-ingestion cost | High | Optimized |
✅
“In production, we observed RAG drift when updated loan policy documents were not reflected in answers. I diagnosed this as data drift and embedding drift caused by missing version control. We fixed it by introducing document versioning, delta re-embedding using chunk hashes, strict metadata filters during retrieval, and recency-weighted scoring. We also added a nightly drift audit job and post-answer claim validation. This reduced hallucinations below 3% and restored business trust.”
✅
“We fixed RAG drift by enforcing document versioning, delta re-embedding, recency-aware ranking, and post-answer verification — turning a failing production system into a trusted enterprise AI platform.”
✅ RAG CONTEXT DRIFT — DEBUGGING CHECKLIST
(Context drift = right model, wrong or outdated knowledge)
1️⃣ DOCUMENT & INGESTION CHECK
First confirm whether the source truth changed:
☐ Has the source document changed? (PDF, policy, KB, DB)
☐ Is there a new version or amendment?
☐ Was ingestion pipeline triggered after the change?
☐ Is last_ingested_timestamp ≥ last_updated_timestamp?
✅ If NO, root cause = data pipeline failure
2️⃣ CHUNKING DRIFT CHECK
Poor chunking causes semantic distortion:
☐ Are chunk boundaries still valid after doc update?
☐ Any headers merged with wrong sections?
☐ Chunk size increased beyond embedding sweet spot (512–1024 tokens)?
☐ Overlap missing → context break?
✅ If YES, root cause = structural context drift
3️⃣ EMBEDDING DRIFT CHECK
Embedding mismatch leads to silent failure:
☐ Was embedding model changed?
☐ Are old and new versions mixed?
☐ Same tokenizer used for ingestion & query?
☐ Any silent fallback to another embedding model?
✅ If YES, root cause = vector space corruption
4️⃣ VECTOR STORE INTEGRITY
Retrieve-level corruption:
☐ Are deleted docs still returning?
☐ Is soft-delete respected?
☐ Are duplicate vectors present?
☐ Metadata filters applied correctly?
✅ If NO, root cause = retrieval contamination
5️⃣ RETRIEVAL QUALITY CHECK
Validate actual retrieval, not just generation:
☐ Are top-5 chunks actually relevant on manual review?
☐ Similarity scores abnormally high for wrong content?
☐ Too many low-score chunks crossing threshold?
✅ If YES, root cause = retriever ranking drift
6️⃣ RERANKER HEALTH CHECK
Cross-encoder or reranker failure:
☐ Is reranker enabled?
☐ Model updated silently?
☐ Latency spike causing timeout fallback?
☐ Reranker scores uniformly flat?
✅ If YES, root cause = precision layer collapse
7️⃣ PROMPT CONTEXT INTEGRITY
Sometimes drift is prompt, not data:
☐ Is full context injected?
☐ Any truncation due to token limits?
☐ Instruction order changed?
☐ Context accidentally summarized before injection?
✅ If YES, root cause = prompt injection loss
8️⃣ POST-ANSWER VERIFICATION
Grounding failure detection:
☐ Does each generated claim map to a retrieved chunk?
☐ Any sentence unsupported by source?
☐ Is citation missing?
✅ If YES, root cause = grounding enforcement failure
✅ MODEL DRIFT — DEBUGGING CHECKLIST
(Model drift = same data, different behavior)
1️⃣ MODEL VERSION CHANGE
First check if the brain changed:
☐ Was the LLM upgraded?
☐ Silent version switch by provider?
☐ Context window reduced?
☐ Safety layer tightened?
✅ If YES, root cause = provider-side model drift
2️⃣ PROMPT REGRESSION CHECK
Most model drift is actually prompt drift:
☐ Any change in system prompts?
☐ Added new guardrails?
☐ Tool calling schema changed?
☐ Role order changed?
✅ If YES, root cause = instruction drift
3️⃣ TEMPERATURE & SAMPLING
Behavior shifts from decoding params:
☐ Temperature increase?
☐ Top-p changed?
☐ Frequency penalty modified?
✅ If YES, root cause = generation randomness drift
4️⃣ TOOL CALLING & AGENT BEHAVIOR
Agent loops often start here:
☐ Tool response format changed?
☐ Schema mismatch?
☐ Failure retried without state?
☐ STOP condition removed?
✅ If YES, root cause = agent execution drift
5️⃣ EVALUATION SCORE REGRESSION
Objective detection:
☐ Drop in exact match?
☐ Increase in ungrounded claims?
☐ Spike in refusal / “I don’t know”?
☐ Human feedback trend worsened?
✅ If YES, root cause = behavioral performance drift
✅
First I confirm whether the source data changed. Then I validate embedding and retrieval consistency. Next I check reranker and prompt injection integrity. If data is correct, I validate if the LLM or decoding parameters changed. Finally, I verify agent tool execution and stop conditions.
✅
Context Drift → “My knowledge is outdated”
Model Drift → “My reasoning style changed”
“When we see wrong answers in production, I first classify it as either context drift or model drift. For context drift, I trace ingestion freshness, chunking integrity, embedding model consistency, vector store contamination, retrieval ranking, and prompt injection completeness. For model drift, I check LLM version changes, prompt regressions, decoding parameters, tool execution stability, and evaluation score drops. This layered checklist lets us isolate drift in minutes instead of days.”