
  • Writer: Anand Nerurkar
  • Nov 26, 2025
  • 9 min read

Model Drift, Context Drift

====


1. What is Context Drift?

Context Drift happens when a Large Language Model (LLM) or AI agent loses track of the conversation state, task state, or prior facts during a multi-step interaction.

Where it occurs

  • Multi-step agents

  • Complex workflows (KYC → Risk → Underwriting)

  • Long conversations

  • RAG-based interactions

  • Multi-agent orchestration

Symptoms

  • The agent forgets earlier instructions or contradicts previous steps.

  • The agent starts answering based on stale or missing information.

  • Different agents in the pipeline lose shared state.

  • It re-asks the same question or misidentifies the user/task.

  • Tool results come back but LLM ignores them.

Root Causes

  1. Token Limit Truncation: Earlier conversation gets dropped when the prompt becomes too large.

  2. Weak State Management: The agent relies only on the LLM's memory instead of a structured session memory.

  3. Bad Retrieval: RAG retrieves the wrong documents due to low similarity or wrong embeddings.

  4. Lost/Overridden Variables: The orchestrator does not pass required metadata (sessionId, customerId, etc.).

  5. No Grounding Layer: The AI rewrites context incorrectly when summarizing or condensing memory.

How to Prevent It

  • Use a state store (Redis/Postgres) for canonical memory.

  • Use rolling summaries for long interactions.

  • Apply minimum similarity thresholds for RAG retrieval.

  • Implement context rehydration on every step.

  • Keep prompts stable and versioned.

2. What is Model Drift?

Model Drift is when an AI/ML model’s predictions become less accurate over time because the real-world data has changed from the training data distribution.

This applies to ML models (fraud, scoring, segmentation) and also LLM fine-tuned models.

Symptoms

  • Increasing error rate or decreasing accuracy.

  • Fraud model missing new fraud patterns.

  • Credit scoring model rejecting good customers or approving risky ones.

  • Chatbot/LLM giving less relevant answers suddenly.

  • RAG retrieval mismatches because embeddings no longer match the vector DB.

Root Causes

  1. Data Drift: The input data distribution changes (e.g., new PAN formats, new invoice layouts, new fraud vectors).

  2. Concept Drift: The relationship between input and output changes (e.g., what indicates a “risky borrower” changes after new regulations).

  3. Feature Drift: Some features stop being useful or lose predictive power.

  4. Model Aging: The embedding model or LLM version becomes outdated as the external world evolves.

  5. Operational Changes: Upstream systems modify schemas, fields, or patterns.

How to Prevent It

  • Continuous monitoring of:

    • Prediction confidence

    • Drift metrics such as PSI, KS, AUC, recall, and F1 (a PSI sketch follows this list)

    • Feature distribution

  • Periodic retraining (daily, weekly, monthly depending on domain).

  • Automate an MLOps pipeline for:

    • Data validation

    • Drift detection

    • Auto-retraining

    • A/B testing

    • Canary deploys

  • Maintain benchmark datasets for regression testing.

  • Maintain lineage & versioning for models, data, embeddings.
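As a concrete illustration of the drift metrics listed above, here is a minimal PSI (Population Stability Index) sketch in Python. The 10-bin setup and the 0.1/0.2 interpretation bands are common rules of thumb rather than fixed standards, and the loan-amount example data is purely illustrative.

import numpy as np

def population_stability_index(expected, actual, bins=10):
    # Compare a feature's training-time (expected) vs live (actual) distribution.
    # Rough interpretation: < 0.1 stable, 0.1-0.2 moderate drift, > 0.2 significant drift.
    edges = np.histogram_bin_edges(expected, bins=bins)          # bins come from training data
    expected_pct = np.histogram(expected, bins=edges)[0] / len(expected)
    actual_pct = np.histogram(actual, bins=edges)[0] / len(actual)
    # Guard against empty bins before taking the log
    expected_pct = np.clip(expected_pct, 1e-6, None)
    actual_pct = np.clip(actual_pct, 1e-6, None)
    return float(np.sum((actual_pct - expected_pct) * np.log(actual_pct / expected_pct)))

# Illustrative example: loan amounts used in training vs last week's live traffic
train_amounts = np.random.lognormal(mean=12.0, sigma=0.4, size=10_000)
live_amounts = np.random.lognormal(mean=12.3, sigma=0.5, size=2_000)
print(population_stability_index(train_amounts, live_amounts))

The same pattern extends to KS tests (scipy.stats.ks_2samp) and to tracking AUC/recall/F1 against a labelled benchmark set.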

🆚 Context Drift vs Model Drift (Quick Comparison)

Aspect          | Context Drift                                 | Model Drift
Where it occurs | LLM conversations, agents, RAG, orchestration | ML models, fine-tuned LLMs
What changes    | The conversation or agent memory              | The underlying data distribution
Symptom         | LLM loses track of meaning                    | Model accuracy deteriorates
Cause           | Context mismanagement or token limits         | Changing world, new behavior patterns
Fix             | Better memory, retrieval, orchestration       | Monitoring, retraining, versioning
Detected by     | Agent loops, inconsistent answers             | Drift dashboards, performance decline

🎯 Takeaway

  • Context Drift → “Agent forgets the task.”

  • Model Drift → “Model forgets the world.”


Minimum Similarity & Context Rehydration on Every Step

===

1. What Is Minimum Similarity (in RAG)?

When a user query comes in, the system performs vector search to find the most relevant documents.

Each document returns a similarity score (typically cosine similarity) → between 0 and 1.

Example:

  • 0.92 = Highly relevant

  • 0.75 = Maybe relevant

  • 0.52 = Low relevance

  • 0.20 = Not relevant

If you don’t set a minimum similarity threshold, RAG may return wrong or unrelated documents. This leads to context drift, hallucination, and wrong answers.

✅ Minimum Similarity Threshold

You define a cut-off score like:

  • 0.80+ → Only return if highly relevant

  • 0.70–0.80 → Return with caution (fallback context)

  • <0.70 → Don’t return any document, use fallback response

Why it matters

Without minimum similarity:

  • LLM receives irrelevant documents

  • It uses incorrect context

  • It “drifts” into the wrong answer

  • Agents produce incorrect actions

Example for BFSI

If the user asks: “What is the maximum LRS limit for outward remittance?”

but vector search returns an unrelated RBI circular (low similarity, say 0.45), the LLM will answer incorrectly.

Setting a similarity threshold protects you from wrong grounding.
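A minimal sketch of how this cut-off can be enforced in code. The vector_store.search call and the 0.85 default are illustrative placeholders; the real client API and threshold depend on your stack and domain (see the recommendations below).

FALLBACK_MESSAGE = ("I could not find a sufficiently relevant policy document. "
                    "Please rephrase the question or contact the policy desk.")

def retrieve_with_threshold(query_embedding, vector_store, k=5, min_similarity=0.85):
    # vector_store.search is assumed to return (chunk_text, cosine_similarity) pairs,
    # highest score first; swap in your actual client (pgvector, Pinecone, FAISS, ...)
    results = vector_store.search(query_embedding, top_k=k)
    relevant = [(chunk, score) for chunk, score in results if score >= min_similarity]
    if not relevant:
        # Nothing clears the bar: answer from a fallback instead of weak context
        return [], FALLBACK_MESSAGE
    return relevant, None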

🔧 Recommended Thresholds

Domain            | Threshold
BFSI / Regulatory | 0.85
Technical KB      | 0.80
General FAQ       | 0.75

High-risk domains require higher thresholds.

🎯 Takeaway

“Minimum similarity ensures that RAG only returns documents that are truly relevant. Anything below the threshold is ignored to avoid hallucinations and context drift.”

2. What Is Context Rehydration on Every Step?

In multi-step agent workflows, the LLM can’t remember the entire conversation because:

  • Token limits

  • Summaries lose details

  • The agent works across many tools and calls

  • State gets fragmented

Context Rehydration means:

At every step of the agent or workflow, we rebuild the full state of the conversation/task from a canonical memory store, instead of relying only on the LLM’s temporary memory.

The principle:

“Never trust the LLM to remember — always rehydrate the context.”

🧠 How it works (step-by-step)

Step 1 — Store structured state

Store state in:

  • Redis

  • Postgres

  • MongoDB

  • Vector DB with metadata

  • Agent memory store

Example:

{
  "customerId": "12345",
  "loanAmount": "15L",
  "riskScore": "Moderate",
  "documents": ["PAN.pdf", "Aadhaar.pdf"],
  "lastAction": "RiskAgentCompleted"
}

Step 2 — Every time an agent runs

Before calling the model → pull fresh state from the memory store.

Step 3 — Rebuild prompt with accurate context

Inject the complete state into the agent’s input:

You are the Loan Evaluation Agent.
Here is the complete state:

- Customer ID: 12345
- Risk Score: Moderate
- Last step completed: RiskAgent
- Required documents: PAN, Aadhaar
- Missing documents: Bank statement

Step 4 — Agent produces next step

No forgetting. No drift. No contradiction.

Step 5 — Save updated state back

After each decision, update the state and save it.
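Putting the five steps together, a rehydration loop might look like the sketch below. It assumes a Redis store keyed by session ID and a generic call_llm function passed in by the orchestrator; both are illustrative, not a specific framework API.

import json
import redis

r = redis.Redis(host="localhost", port=6379, decode_responses=True)

def load_state(session_id):
    # Step 2: pull the canonical state before every agent call
    raw = r.get(f"loan_session:{session_id}")
    return json.loads(raw) if raw else {}

def build_prompt(state):
    # Step 3: rebuild the full context from the authoritative store
    return (
        "You are the Loan Evaluation Agent.\n"
        "Here is the complete state:\n"
        f"- Customer ID: {state.get('customerId')}\n"
        f"- Risk Score: {state.get('riskScore')}\n"
        f"- Last step completed: {state.get('lastAction')}\n"
        f"- Documents on file: {', '.join(state.get('documents', []))}\n"
    )

def run_agent_step(session_id, call_llm):
    state = load_state(session_id)             # rehydrate
    decision = call_llm(build_prompt(state))   # Step 4: the agent reasons on fresh state
    state["lastAction"] = decision             # Step 5: persist the updated state
    r.set(f"loan_session:{session_id}", json.dumps(state))
    return decision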

🎯 Why Context Rehydration Prevents Drift

Without rehydration:

  • Agent forgets previous values

  • Agents contradict each other

  • Workflow loops (infinite loops)

  • Wrong decisions based on stale memory

With rehydration:

  • Deterministic

  • Repeatable

  • Auditable

  • Stable

  • No drift

🏦 Example in Digital Lending

Underwriting Agent receives fresh state:

  • All documents extracted

  • CIBIL score = 780

  • Fraud score = Low

  • Loan ask = 15 lakh

  • Income verified = Yes

This ensures:

  • accurate risk decision

  • no confusion between customers

  • consistent underwriting

🎯 Takeaway

“Context rehydration means the agent never depends on temporary model memory. On every step, it re-pulls the entire state from an authoritative store and reconstructs the full context before reasoning. This prevents loops, drift, and inconsistencies in multi-agent workflows.”


FIXING CONTEXT & DATA DRIFT IN A RAG SYSTEM

🎯 Business Use Case

We built a Loan Policy Q&A RAG system for the operations team:

  • Source: RBI circulars + internal lending policy PDFs

  • Users: Credit officers & branch ops

  • Stack:

    • Ingestion → Chunking → Embeddings → PGVector

    • Query → Hybrid Search → Re-ranking → LLM (GPT)

🚨 PRODUCTION ISSUE OBSERVED (DRIFT SYMPTOMS)

After 2–3 months in production:

  • Users reported:

    • “System is giving outdated answers”

    • “Prepayment charges look wrong”

  • Evaluation logs showed:

    • High semantic similarity

    • But answers that were incorrect from a business standpoint

  • Ground truth:

    • Policy document was updated

    • But old embeddings were still being retrieved

This is a classic case of:

Knowledge Drift + Embedding Drift + Metadata Drift

STEP-BY-STEP HOW I DIAGNOSED THE DRIFT

1️⃣ Detected via Monitoring

We had:

  • Query logs

  • Top-k retrieved chunks

  • Final answer

  • Human feedback flag (thumbs down)

Pattern observed:

  • Same chunk IDs repeatedly retrieved

  • Despite the document version being updated in SharePoint

2️⃣ Root Cause Analysis

We found 3 concrete causes:

Drift Type      | Cause
Data Drift      | New PDF uploaded but not re-ingested
Embedding Drift | Old document chunks still in the vector DB
Metadata Drift  | Document version field not updated

So the system was: ✅ semantically correct, ❌ temporally incorrect.

PRODUCTION FIX IMPLEMENTED (HARD GUARANTEES)

✅ 1. Document Versioning at Ingestion

Every document stored as:

doc_id
doc_version
doc_hash (SHA-256)
upload_timestamp
source_system

Now the version becomes part of the retrieval filter.
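A small sketch of how these fields can be attached at ingestion time; the function name and metadata shape are illustrative and would map onto whatever metadata schema your vector store accepts.

import hashlib
from datetime import datetime, timezone

def build_document_metadata(doc_id, doc_version, pdf_bytes, source_system="sharepoint"):
    # Version + hash metadata stored alongside every chunk so retrieval can filter on it
    return {
        "doc_id": doc_id,
        "doc_version": doc_version,
        "doc_hash": hashlib.sha256(pdf_bytes).hexdigest(),
        "upload_timestamp": datetime.now(timezone.utc).isoformat(),
        "source_system": source_system,
        "is_active": True,
    }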

✅ 2. Delta Re-Embedding Pipeline (Not Full Rebuild)

Instead of re-embedding everything:

For each incoming document:

  • Compute hash of each chunk

  • Compare with previous chunk hashes

  • Only:

    • New chunks → Embed

    • Modified chunks → Re-embed

    • Deleted chunks → Soft delete vectors

This reduced: ✅ cost by ~70%, ✅ ingestion time by ~60%.
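The hash comparison at the heart of the delta pipeline can be sketched as below, assuming the previous run's chunk hashes are kept (for example as vector metadata or in a side table). The embedding call and the soft delete themselves are left to your own pipeline.

import hashlib

def chunk_hash(text):
    return hashlib.sha256(text.encode("utf-8")).hexdigest()

def plan_delta(new_chunks, previous_hashes):
    # new_chunks:      dict of chunk_id -> chunk text for the incoming document version
    # previous_hashes: dict of chunk_id -> hash stored from the last ingestion run
    to_embed, unchanged = {}, []
    new_hashes = {cid: chunk_hash(text) for cid, text in new_chunks.items()}
    for cid, h in new_hashes.items():
        if previous_hashes.get(cid) != h:
            to_embed[cid] = new_chunks[cid]    # new or modified chunk -> (re-)embed
        else:
            unchanged.append(cid)              # identical content -> skip
    to_soft_delete = [cid for cid in previous_hashes if cid not in new_hashes]
    return to_embed, unchanged, to_soft_delete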

✅ 3. Hard Metadata Filters During Retrieval

At query time, we enforced:

  • doc_version = latest

  • is_active = true

  • business_unit = lending

So even if old vectors exist, they are never retrieved.
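The simplest way to illustrate the filter is as a post-retrieval guard, sketched below with an assumed candidate shape (a list of dicts carrying a metadata field). In practice you would push the same conditions into the vector store query itself (for example as pgvector WHERE clauses) so stale vectors never leave the database.

def apply_hard_filters(candidates, latest_version_by_doc, business_unit="lending"):
    # Drop any candidate chunk that is inactive, out of scope, or from a superseded version
    filtered = []
    for c in candidates:
        md = c["metadata"]
        if not md.get("is_active", False):
            continue                                              # soft-deleted vector
        if md.get("business_unit") != business_unit:
            continue                                              # wrong business domain
        if md.get("doc_version") != latest_version_by_doc.get(md.get("doc_id")):
            continue                                              # superseded document
        filtered.append(c)
    return filtered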

✅ 4. Similarity + Time Decay Scoring

Final relevance score:

final_score = (semantic_score * 0.8) + (recency_score * 0.2)

This ensures:

  • Newer policies naturally rank higher

  • Even if semantics are similar
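One way to derive the recency score is an exponential decay on document age. The 90-day half-life below is an assumption used only to illustrate the idea, and upload timestamps are assumed to be timezone-aware datetimes.

import math
from datetime import datetime, timezone

def recency_score(upload_timestamp, half_life_days=90):
    # Exponential decay: a document half_life_days old scores 0.5, a fresh one scores ~1.0
    age_days = (datetime.now(timezone.utc) - upload_timestamp).days
    return math.exp(-math.log(2) * max(age_days, 0) / half_life_days)

def final_score(semantic_score, upload_timestamp, semantic_weight=0.8, recency_weight=0.2):
    return semantic_weight * semantic_score + recency_weight * recency_score(upload_timestamp)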

✅ 5. Post-Answer Drift Validation

After the LLM generates the answer:

  • We run a claim-to-chunk verification.

  • If a claim is not supported by the latest document version, we reject the answer and respond:

“This policy has been recently updated. Please refer to the latest document.”
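A simplified sketch of the claim-to-chunk check. Real implementations usually use an NLI model or embedding similarity for the support test; this version uses plain lexical overlap only to show the control flow, and the 0.6 overlap cut-off is an assumption.

import re

def token_set(text):
    return set(re.findall(r"[a-z0-9]+", text.lower()))

def is_supported(claim, chunks, min_overlap=0.6):
    # Crude support test: does some latest-version chunk cover most of the claim's tokens?
    claim_tokens = token_set(claim)
    if not claim_tokens:
        return True
    return any(len(claim_tokens & token_set(c)) / len(claim_tokens) >= min_overlap
               for c in chunks)

def validate_answer(answer, latest_chunks):
    # Split the answer into sentence-level claims and reject if any are unsupported
    claims = [s.strip() for s in re.split(r"(?<=[.!?])\s+", answer) if s.strip()]
    if any(not is_supported(c, latest_chunks) for c in claims):
        return ("This policy has been recently updated. "
                "Please refer to the latest document.")
    return answer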

✅ 6. Automated Drift Detection Job (Nightly)

A scheduled job checks:

  • Vector DB vs Document Store

  • Missing re-embeddings

  • Version mismatches

  • Orphan vectors

And raises:

  • Slack alerts

  • Jira tickets automatically
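A sketch of the reconciliation logic behind the nightly job. The document_store, vector_store, and alert arguments are placeholders for your own clients (document/SharePoint API, vector DB queries, Slack or Jira webhooks).

def drift_audit(document_store, vector_store, alert):
    # Reconcile the source-of-truth document store against the vector DB
    findings = []
    docs = {d["doc_id"]: d for d in document_store.list_documents()}
    vector_versions = vector_store.list_document_versions()   # doc_id -> version in vector DB

    for doc_id, doc in docs.items():
        if doc_id not in vector_versions:
            findings.append(f"Missing embeddings for {doc_id}")
        elif vector_versions[doc_id] != doc["doc_version"]:
            findings.append(f"Version mismatch for {doc_id}: "
                            f"store={doc['doc_version']} vector_db={vector_versions[doc_id]}")

    findings += [f"Orphan vectors for {doc_id}"
                 for doc_id in vector_versions if doc_id not in docs]

    if findings:
        alert("\n".join(findings))   # e.g., post to Slack or open a Jira ticket
    return findings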

FINAL PRODUCTION RESULT

Metric                 | Before | After
Policy answer accuracy | 71%    | 96%
User complaints        | High   | Near zero
Hallucination rate     | 18%    | < 3%
Re-ingestion cost      | High   | Optimized

“In production, we observed RAG drift when updated loan policy documents were not reflected in answers. I diagnosed this as data drift and embedding drift caused by missing version control. We fixed it by introducing document versioning, delta re-embedding using chunk hashes, strict metadata filters during retrieval, and recency-weighted scoring. We also added a nightly drift audit job and post-answer claim validation. This reduced hallucinations below 3% and restored business trust.”

“We fixed RAG drift by enforcing document versioning, delta re-embedding, recency-aware ranking, and post-answer verification — turning a failing production system into a trusted enterprise AI platform.”

RAG CONTEXT DRIFT — DEBUGGING CHECKLIST

(Context drift = right model, wrong or outdated knowledge)

1️⃣ DOCUMENT & INGESTION CHECK

First confirm whether the source truth changed:

  • ☐ Has the source document changed? (PDF, policy, KB, DB)

  • ☐ Is there a new version or amendment?

  • ☐ Was ingestion pipeline triggered after the change?

  • ☐ Is last_ingested_timestamp ≥ last_updated_timestamp?

✅ If NO, root cause = data pipeline failure
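The freshness check in this step can be automated with a few lines. The field names follow the checklist item above, and the record shape (a list of dicts with timezone-aware timestamps) is an assumption.

def find_stale_documents(doc_records):
    # doc_records: list of dicts with last_updated_timestamp / last_ingested_timestamp fields;
    # returns the doc_ids whose ingestion lags behind the source document update
    return [d["doc_id"] for d in doc_records
            if d["last_ingested_timestamp"] < d["last_updated_timestamp"]]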

2️⃣ CHUNKING DRIFT CHECK

Poor chunking causes semantic distortion:

  • ☐ Are chunk boundaries still valid after doc update?

  • ☐ Any headers merged with wrong sections?

  • ☐ Chunk size increased beyond embedding sweet spot (512–1024 tokens)?

  • ☐ Overlap missing → context break?

✅ If YES, root cause = structural context drift

3️⃣ EMBEDDING DRIFT CHECK

Embedding mismatch leads to silent failure:

  • ☐ Was embedding model changed?

  • ☐ Are old and new versions mixed?

  • ☐ Same tokenizer used for ingestion & query?

  • ☐ Any silent fallback to another embedding model?

✅ If YES, root cause = vector space corruption

4️⃣ VECTOR STORE INTEGRITY

Retrieve-level corruption:

  • ☐ Are deleted docs still returning?

  • ☐ Is soft-delete respected?

  • ☐ Are duplicate vectors present?

  • ☐ Metadata filters applied correctly?

✅ If NO, root cause = retrieval contamination

5️⃣ RETRIEVAL QUALITY CHECK

Validate actual retrieval, not just generation:

  • ☐ Are top-5 chunks actually relevant on manual review?

  • ☐ Similarity scores abnormally high for wrong content?

  • ☐ Too many low-score chunks crossing threshold?

✅ If YES, root cause = retriever ranking drift

6️⃣ RERANKER HEALTH CHECK

Cross-encoder or reranker failure:

  • ☐ Is reranker enabled?

  • ☐ Model updated silently?

  • ☐ Latency spike causing timeout fallback?

  • ☐ Reranker scores uniformly flat?

✅ If YES, root cause = precision layer collapse

7️⃣ PROMPT CONTEXT INTEGRITY

Sometimes drift is prompt, not data:

  • ☐ Is full context injected?

  • ☐ Any truncation due to token limits?

  • ☐ Instruction order changed?

  • ☐ Context accidentally summarized before injection?

✅ If YES, root cause = prompt injection loss

8️⃣ POST-ANSWER VERIFICATION

Grounding failure detection:

  • ☐ Does each generated claim map to a retrieved chunk?

  • ☐ Any sentence unsupported by source?

  • ☐ Is citation missing?

✅ If YES, root cause = grounding enforcement failure

MODEL DRIFT — DEBUGGING CHECKLIST

(Model drift = same data, different behavior)

1️⃣ MODEL VERSION CHANGE

First check if the brain changed:

  • ☐ Was the LLM upgraded?

  • ☐ Silent version switch by provider?

  • ☐ Context window reduced?

  • ☐ Safety layer tightened?

✅ If YES, root cause = provider-side model drift

2️⃣ PROMPT REGRESSION CHECK

Most model drift is actually prompt drift:

  • ☐ Any change in system prompts?

  • ☐ Added new guardrails?

  • ☐ Tool calling schema changed?

  • ☐ Role order changed?

✅ If YES, root cause = instruction drift

3️⃣ TEMPERATURE & SAMPLING

Behavior shifts from decoding params:

  • ☐ Temperature increase?

  • ☐ Top-p changed?

  • ☐ Frequency penalty modified?

✅ If YES, root cause = generation randomness drift

4️⃣ TOOL CALLING & AGENT BEHAVIOR

Agent loops often start here:

  • ☐ Tool response format changed?

  • ☐ Schema mismatch?

  • ☐ Failure retried without state?

  • ☐ STOP condition removed?

✅ If YES, root cause = agent execution drift

5️⃣ EVALUATION SCORE REGRESSION

Objective detection:

  • ☐ Drop in exact match?

  • ☐ Increase in ungrounded claims?

  • ☐ Spike in refusal / “I don’t know”?

  • ☐ Human feedback trend worsened?

✅ If YES, root cause = behavioral performance drift

In practice, the debugging order is: first I confirm whether the source data changed. Then I validate embedding and retrieval consistency. Next I check reranker and prompt injection integrity. If the data is correct, I validate whether the LLM or decoding parameters changed. Finally, I verify agent tool execution and stop conditions.

  • Context Drift → “My knowledge is outdated”

  • Model Drift → “My reasoning style changed”


“When we see wrong answers in production, I first classify it as either context drift or model drift. For context drift, I trace ingestion freshness, chunking integrity, embedding model consistency, vector store contamination, retrieval ranking, and prompt injection completeness. For model drift, I check LLM version changes, prompt regressions, decoding parameters, tool execution stability, and evaluation score drops. This layered checklist lets us isolate drift in minutes instead of days.”

 
 
 
