How to Debug Issues in an AI System?
- Anand Nerurkar
- Nov 24
- 7 min read
Updated: Nov 26
1) Debugging production agent blockers — a systematic checklist
When you see symptoms like “agent looping” or “context drift”, follow this rapid diagnostic flow:
A. Reproduce & Observe
Capture the full conversation transcript + timestamps + request/response IDs.
Reproduce with same inputs in a controlled environment (staging) and record all agent steps.
B. Check orchestration / control flow
Verify orchestrator state machine: are termination conditions implemented? (max_steps, stop_signals, timeouts)
Look for missing or incorrect done/success signals between agents.
Ensure message deduplication (idempotency) — looping is often caused by re-processing the same event.
C. Inspect prompts & tool calls
Dump the prompt templates and interpolated inputs used for each agent invocation.
Confirm tool responses are valid and not causing agent to requeue the same task.
D. Trace context propagation
Confirm context is passed correctly (session id, step id, vector DB pointers).
Check for context truncation (token limits) or accidental resets.
E. Observe model outputs for hallucination/uncertainty
Look for repetitious tokens, contradictions, or “I don’t know” loops.
Check temperature/response randomness settings — high temperature can cause wandering behaviors.
F. Telemetry & Logs to examine
Per-request logs: prompt, model config (temp, top_p, max_tokens), returned tokens, latency, tool calls.
Orchestration traces: which agent called which tool and with which outcome.
Vector DB access: query, similarity score, doc id returned.
Business metrics: retries, average steps per request, % escalations to human.
G. Quick fixes
Add step/time limits (max_steps, TTL).
Add sanity checks: if state unchanged for N steps, escalate/abort.
Harden stop conditions and guardrails in logic, not only in prompt.
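A minimal sketch of what these guardrails can look like inside the orchestrator loop. run_step() and escalate_to_human() are hypothetical hooks into your own framework, not part of any specific library.

import hashlib
import time

MAX_STEPS = 10
TTL_SECONDS = 60
MAX_UNCHANGED_STEPS = 3

def run_agent(task, state, run_step, escalate_to_human):
    # Hard stop conditions enforced in code, not only in the prompt.
    started = time.time()
    unchanged = 0
    last_fingerprint = None
    for _ in range(MAX_STEPS):
        if time.time() - started > TTL_SECONDS:
            return escalate_to_human(task, reason="ttl_exceeded")
        state = run_step(task, state)  # one agent/tool step (hypothetical hook)
        fingerprint = hashlib.sha256(repr(state).encode()).hexdigest()
        if fingerprint == last_fingerprint:
            unchanged += 1
            if unchanged >= MAX_UNCHANGED_STEPS:
                # State unchanged for N steps: escalate instead of looping.
                return escalate_to_human(task, reason="state_stalled")
        else:
            unchanged = 0
        last_fingerprint = fingerprint
        if state.get("done"):  # explicit termination signal from the agent
            return state.get("answer")
    return escalate_to_human(task, reason="max_steps_exceeded")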
2) Diagnosing context drift (why agent loses track)
Symptoms: the agent references the wrong entity, forgets previous facts, or uses stale data.
Causes & checks
Token truncation: conversation exceeds model context window — check size and implement summarization or RAG.
Bad retrieval: retrieval returns low-similarity docs; check similarity scores and recall.
Stateless calls: orchestrator not re-supplying required context (user profile, last actions).
Inconsistent representation: multiple embeddings or vector DBs with different encoders.
Mitigations
Use rolling summaries (condense older history into summary + store in memory).
Store canonical state in a lightweight session DB and pass only needed pieces to model.
Use retrieval filters (metadata + recency) and require min similarity threshold.
Detect drift: log semantic similarity between current prompt and last-step context; if below threshold, rehydrate state.
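A sketch of the drift detector described in the last point. embed() and rehydrate_state() are hypothetical hooks (your embedding client and session-DB loader), and the threshold is illustrative and must be tuned on real traffic.

import logging
import numpy as np

logger = logging.getLogger("agent.drift")
DRIFT_THRESHOLD = 0.55  # illustrative; tune on your own traffic

def cosine(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def check_drift(current_prompt, last_step_context, embed, rehydrate_state):
    # Compare the current prompt against the last-step context.
    sim = cosine(embed(current_prompt), embed(last_step_context))
    logger.info("drift_check similarity=%.3f", sim)
    if sim < DRIFT_THRESHOLD:
        # Similarity dropped below threshold: reload canonical session state.
        return rehydrate_state()
    return None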
3) Observability & instrumentation (must-haves)
Structured logs: JSON with fields {request_id, user_id, agent_id, step, prompt_hash, model, temp, embedding_id, similarity_score, latency, outcome}
Tracing: distributed tracing across orchestrator → model → tool calls (Jaeger/OpenTelemetry).
Metrics: steps-per-request, avg similarity, avg token usage, failure rate, human escalations.
Alerts: spike in avg steps, falling similarity, high latency, increased retries.
Playback & replay: ability to replay any request through staging with identical inputs.
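A minimal sketch of the structured log record. Field names follow the list above; hashing the prompt instead of logging raw text is an assumption made here to control PII exposure and payload size.

import hashlib
import json
import logging
import time

logger = logging.getLogger("agent")

def log_step(request_id, user_id, agent_id, step, prompt, model, temp,
             embedding_id, similarity_score, latency_ms, outcome):
    record = {
        "ts": time.time(),
        "request_id": request_id,
        "user_id": user_id,
        "agent_id": agent_id,
        "step": step,
        "prompt_hash": hashlib.sha256(prompt.encode()).hexdigest()[:16],
        "model": model,
        "temp": temp,
        "embedding_id": embedding_id,
        "similarity_score": similarity_score,
        "latency_ms": latency_ms,
        "outcome": outcome,
    }
    logger.info(json.dumps(record))  # one JSON object per line, easy to index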
How do you debug hallucination or agent looping? Please take one example and explain it step by step.
=====
I’ll walk through a single, concrete incident end-to-end so you can explain it in an interview. I’ll use a realistic BFSI-flavored example where an assistant agent that calls a search_docs() tool gets into a loop and returns a hallucinated regulatory clause. I’ll show how to detect, reproduce, triage the root cause, implement fixes (short-term + long-term), test, and monitor.
Example incident (one-liner)
A GenAI agent answering a compliance question about “early repayment penalty for personal loans” repeatedly calls search_docs() and then returns a confident but made-up clause claiming RBI rule X requires a 2% penalty — the claim is false. The agent also loops calling search_docs() 8 times before returning.
1) Detection — how you notice the problem
User flags the answer as incorrect (or a compliance reviewer raises it).
Automated monitor: “unsupported-claim” alerts fired because answer had no provenance or supporting spans.
Observability trace shows the agent made 8 tool calls (normal is 1–3) and a latency spike.
What to collect immediately:
Full trace: prompt, model response, tool inputs/outputs, retriever candidate IDs, prompt version, model params (temperature, max_tokens).
Retrieval top-k results (IDs + similarity scores).
Tool logs (search_docs responses).
Agent step timestamps.
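One way to snapshot these artifacts in a single structure so the trace can be replayed later. The field names are illustrative, not a fixed schema.

import json
from dataclasses import asdict, dataclass, field

@dataclass
class IncidentTrace:
    request_id: str
    prompt: str
    prompt_version: str
    model_params: dict                                  # temperature, top_p, max_tokens
    model_response: str
    tool_calls: list = field(default_factory=list)      # [{"tool", "input", "output", "ts"}]
    retrieval_topk: list = field(default_factory=list)  # [{"doc_id", "score"}]
    step_timestamps: list = field(default_factory=list)

    def dump(self, path):
        # Persist the full trace as JSON for later replay in staging.
        with open(path, "w") as f:
            json.dump(asdict(self), f, indent=2)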
2) Reproduce (quickly, in dev)
Re-run the exact trace with the saved inputs (prompt + tool outputs) against the same model and tool stubs to reproduce the loop and hallucination.
If the trace was not saved, ask the user to reproduce with verbose logging enabled (AGENT_TRACE=true).
Why reproduce: confirms deterministic failure mode and produces exact artifacts to analyze.
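A sketch of the replay step, assuming a trace captured in the structure above. call_model() stands in for your model-invocation wrapper; recorded tool outputs are served as stubs so only the model’s behavior can vary.

def replay_trace(trace, call_model):
    # Answer tool calls from the recorded outputs (stubs), so the model
    # sees exactly the same inputs as it did in production.
    recorded = {(t["tool"], t["input"]): t["output"] for t in trace.tool_calls}

    def stubbed_tool(tool_name, tool_input):
        return recorded[(tool_name, tool_input)]

    return call_model(
        prompt=trace.prompt,
        params=trace.model_params,
        tool_handler=stubbed_tool,  # hypothetical hook; adapt to your framework
    )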
3) Root-cause analysis — follow a checklist in order
A. Inspect retrieval results
Were the retrieved documents relevant? Query the vector store, e.g.: SELECT id, score, metadata FROM vectors WHERE qid = '<trace_query_id>';
Observation (our example): retriever returned 1 highly similar doc about loan foreclosure (not early repayment) and 2 low-quality forum pages.
Conclusion: retrieval returned no direct source about “early repayment penalty”.
B. Inspect reranker / ranking
Was a reranker enabled? If yes, did it demote relevant docs? In this example: the reranker was disabled in that environment (cost-saving).
C. Inspect tool output and parsing
Did search_docs() parse OCR or HTML poorly and return garbled snippets? Example: search_docs() returned an API summary, “charges may apply”, which is ambiguous.
D. Inspect prompt & agent logic
Is the agent allowed to answer without explicit evidence? Example: the prompt did not require citations; the system message was simply “Answer helpfully.”
Agent policy for tool calls: does it have step limits or a circuit breaker? Example: none, so it kept querying in the hope of better docs.
E. Model generation settings
Temperature/top_p? Example: temperature = 0.7 (too high for factual answers).
Any post-generation verification step? Example: none.
F. Detect loop cause
Why the repeated search_docs() calls? Agent logic: if the tool returns “no clear answer”, it tries alternative search terms (synonyms) and keeps iterating. Because no good doc existed, it never satisfied the “answer found” condition and looped until max steps (which was set high).
G. Why hallucination happened
Without supporting text, and with a permissive prompt plus high temperature, the model invented a clause to be helpful and sounded confident.
4) Immediate tactical fixes (fast, low-risk)
Apply these during the live incident or in a hotfix release.
Circuit breaker & step cap
Enforce MAX_TOOL_CALLS = 3. If exceeded, return safe fallback: “I don’t have enough information; please consult compliance.”
Pseudocode:
if tool_calls > MAX_TOOL_CALLS: return "I don't have enough supported evidence to answer. Escalating to human review."
Force evidence requirement in prompt
Change system prompt to:
“Only answer if you can cite a supporting document. If you cannot find a supporting passage, respond: ‘Insufficient documented evidence’.”
Lower temperature for factual paths
Route factual/legal Qs to temperature = 0.0–0.2.
Return provenance only
If no direct snippet matches, return retrieved doc list and ask human reviewer.
Add immediate monitor rule
Alert on any response with >0 unsupported assertions and with tool_call_count > 3.
These fixes prevent further bad answers while you build longer-term solutions.
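The monitor rule above can be expressed as a small check over response records. unsupported_claims is assumed to come from the verification step described in the next section.

def should_alert(response_record):
    # Fire when an answer carries unsupported assertions AND the agent
    # needed an abnormal number of tool calls to produce it.
    return (response_record.get("unsupported_claims", 0) > 0
            and response_record.get("tool_call_count", 0) > 3)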
5) Medium/long-term fixes (design & engineering)
A. Improve retrieval quality
Use hybrid search (BM25 + dense embeddings).
Increase k for candidate set, then apply cross-encoder reranker to top-N for precision.
Add domain-specific embedding model or fine-tune embeddings on regulatory text.
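A sketch of the hybrid-retrieval-plus-reranker flow, assuming the rank_bm25 and sentence-transformers packages are available. dense_search() is a hypothetical hook into your vector DB, the model name is illustrative, and in practice the BM25 index is built once rather than per query.

from rank_bm25 import BM25Okapi
from sentence_transformers import CrossEncoder

reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")  # illustrative model

def hybrid_retrieve(query, corpus, dense_search, k=50, top_n=5):
    # Sparse (keyword) candidates via BM25.
    bm25 = BM25Okapi([doc.split() for doc in corpus])  # build once in real code
    sparse_scores = bm25.get_scores(query.split())
    sparse_ids = sorted(range(len(corpus)), key=lambda i: -sparse_scores[i])[:k]

    # Dense candidates from the vector store (hypothetical hook).
    dense_ids = list(dense_search(query, k))

    # Union of both candidate sets, deduplicated, then cross-encoder reranking.
    candidates = list(dict.fromkeys(sparse_ids + dense_ids))
    scores = reranker.predict([(query, corpus[i]) for i in candidates])
    ranked = sorted(zip(candidates, scores), key=lambda x: -x[1])
    return [corpus[i] for i, _ in ranked[:top_n]]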
B. Add extractive verification step (post-generation)
For each factual claim extracted from model output, check if claim maps to a span in retrieved docs:
Extract claims (named entities, dates, percentages) via an information-extraction step.
For each claim, run fuzzy span matching or embed the claim and run similarity vs candidate spans.
If match similarity < threshold, mark claim unsupported.
If any claim unsupported → force fallback.
Pseudocode:
claims = extract_claims(generated_text)  # extract_claims: the IE step above
flagged = False
for claim in claims:
    # find_supporting_span: claim-to-span matcher (see verifier sketch in section 6)
    if not find_supporting_span(claim, retrieved_docs, thresh=0.78):
        flagged = True
if flagged:
    return "Insufficient documented evidence."
C. Agent policy & tooling
Add explicit tool-response schemas and validators — e.g., search_docs() returns {"snippets":[{"text":..., "score":...}], "status":"success"} and validate.
Implement loop detection: detect repeated queries with similar intent and stop.
Use tool-call journaling with reasoning steps logged for audit.
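A plain-Python validator for the tool-response contract above. The schema comes from the text; the error handling shown here is illustrative.

def validate_search_docs_response(resp):
    # Validate the expected search_docs() contract before the agent acts on it.
    if not isinstance(resp, dict) or resp.get("status") != "success":
        raise ValueError("search_docs returned a non-success or malformed response")
    snippets = resp.get("snippets")
    if not isinstance(snippets, list):
        raise ValueError("snippets missing or not a list")
    for s in snippets:
        if not isinstance(s.get("text"), str) or not isinstance(s.get("score"), (int, float)):
            raise ValueError("snippet missing text/score fields")
    return snippets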
D. Prompt & model orchestration
Maintain prompt templates with {{evidence}} injection and {{instruction}} that forbids unsupported generation.
Keep prompt/versioning and enforce AB testing of prompt changes with regression suite.
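A sketch of evidence-first prompt building. The template name/version and snippet fields are illustrative; the key idea is refusing to generate when there is nothing to inject into {{evidence}}.

PROMPT_TEMPLATE_V7 = """{{instruction}}

Evidence (quote only from these passages):
{{evidence}}

Question: {{question}}
"""

def build_prompt(instruction, evidence_snippets, question):
    # Refuse to build an answer prompt when there is no evidence to inject.
    if not evidence_snippets:
        raise ValueError("no evidence retrieved; route to fallback instead of generating")
    evidence = "\n".join(f"- [{s['doc_id']}] {s['text']}" for s in evidence_snippets)
    return (PROMPT_TEMPLATE_V7
            .replace("{{instruction}}", instruction)
            .replace("{{evidence}}", evidence)
            .replace("{{question}}", question))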
E. QA & CI
Create a hallucination testset (200 queries) and run nightly checks; block the release if the hallucination rate regresses.
Add unit tests for agent logic (simulate tool returns and ensure fallback behavior).
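Illustrative unit tests for the points above. agent_module, AgentSession, CircuitBreaker and answer_question are stand-ins for your real entry points; the stubbed tool simulates retrieval that never finds supporting snippets.

# test_agent_fallbacks.py
import pytest
import agent_module
from agent_module import AgentSession, CircuitBreaker, answer_question

@pytest.fixture(autouse=True)
def stub_search_docs(monkeypatch):
    # Simulate a tool that never returns supporting snippets.
    monkeypatch.setattr(agent_module, "run_tool",
                        lambda tool, query: {"snippets": [], "status": "success"})

def test_circuit_breaker_trips_after_three_tool_calls():
    session = AgentSession()
    for q in ["early repayment penalty", "prepayment charge", "foreclosure fee"]:
        session.call_tool("search_docs", q)
    with pytest.raises(CircuitBreaker):
        session.call_tool("search_docs", "loan closure penalty")

def test_fallback_when_no_supporting_evidence():
    answer = answer_question("Is there an early repayment penalty on personal loans?")
    assert "Insufficient documented evidence" in answer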
F. Human-in-the-loop & escalation
For legal/regulatory categories, route answers to human reviewer before publishing.
Provide reviewers with highlighted candidate snippets and confidence scores.
6) Implementation artifacts — concrete examples you can mention
Sample safe system prompt
System: “You are an assistant for ABC compliance. For any legal/regulatory question you must only answer when you can cite a document and quote the supporting passage (≤200 chars). If you cannot find a supporting passage, respond exactly: ‘Insufficient documented evidence — escalate to legal.’ Do not invent clauses.”
Loop counter code snippet
class AgentSession:
    def __init__(self):
        self.tool_calls = 0
        self.seen_queries = set()

    def call_tool(self, tool_name, query):
        self.tool_calls += 1
        if self.tool_calls > 3:
            raise CircuitBreaker("too many tool calls")
        # normalize(): collapse trivial query variants (case, whitespace, synonyms)
        key = (tool_name, normalize(query))
        if key in self.seen_queries:
            raise CircuitBreaker("repetitive query detected")
        self.seen_queries.add(key)
        # run_tool(): the actual tool dispatcher (defined elsewhere in the agent)
        return run_tool(tool_name, query)
Simple extractive verifier (concept)
Use sentence-level embedding similarity: embed each sentence in retrieved docs and embed each extracted claim; check cosine similarity > 0.8.
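A minimal implementation sketch of that verifier, assuming the sentence-transformers package. The model name and the naive sentence split are illustrative; the 0.8 threshold matches the text above.

from sentence_transformers import SentenceTransformer, util

_model = SentenceTransformer("all-MiniLM-L6-v2")  # swap for a domain-tuned model

def find_supporting_span(claim, retrieved_docs, thresh=0.8):
    # Return the best-matching sentence from the retrieved docs if its
    # cosine similarity to the claim clears the threshold, else None.
    sentences = [s.strip() for doc in retrieved_docs
                 for s in doc.split(".") if s.strip()]  # naive sentence split
    if not sentences:
        return None
    sims = util.cos_sim(_model.encode(claim), _model.encode(sentences))[0]
    best = int(sims.argmax())
    return sentences[best] if float(sims[best]) >= thresh else None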
7) Testing & rollout
Unit tests for:
Circuit breaker triggers.
Extractive verifier flags unsupported claims.
Prompt change enforced.
Integration test: run prior broken trace — should now return fallback with logged reason.
Canary roll-out: enable changes for 5% of traffic, monitor hallucination and support-request metrics.
Full rollout after metrics stable for N days.
8) Monitoring & ops (post-fix)
Dashboards:
Hallucination rate (claims without spans) — target < 1% for legal queries.
Tool call distribution (median ~1, P95 < 3).
Circuit-breaker events.
Alerts:
Hallucination rate spike > X% in 1 hour.
Circuit-breaker rate increases — investigate retrieval regressions.
Postmortem: capture incident, root cause, fix timeline, and add to knowledge base.
9) How to narrate this incident (interview summary)
“We had an agent that looped through the search tool and ultimately hallucinated a regulatory clause. Repro and trace logs showed retrieval returned no supporting docs, the prompt allowed free-form answers, and temperature was high. Short-term I added a circuit breaker, forced evidence-first prompts, and lowered temperature. Medium-term we added a reranker, extractive verification (claims → span matching), and human-in-the-loop for legal answers. After fixes, hallucination alerts dropped by ~90% and circuit-breaker incidents dropped significantly after improving retrieval.”