AI Best Practices
- Anand Nerurkar
- Nov 26
1) Debugging production agent blockers — a systematic checklist
When you see symptoms like “agent looping” or “context drift”, follow this rapid diagnostic flow:
A. Reproduce & Observe
Capture the full conversation transcript + timestamps + request/response IDs.
Reproduce with same inputs in a controlled environment (staging) and record all agent steps.
B. Check orchestration / control flow
Verify orchestrator state machine: are termination conditions implemented? (max_steps, stop_signals, timeouts)
Look for missing or incorrect done/success signals between agents.
Ensure message deduplication (idempotency): looping is often caused by re-processing the same event (see the sketch below).
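A minimal dedup guard in Python, assuming events carry a stable identifier (the event_id field and EventDeduplicator class are illustrative, not a specific framework's API):

```python
import hashlib

class EventDeduplicator:
    """Skip events the orchestrator has already handled (idempotency guard)."""

    def __init__(self):
        self._seen: set[str] = set()

    def key(self, event: dict) -> str:
        # Prefer an explicit event_id; fall back to hashing the payload.
        return event.get("event_id") or hashlib.sha256(
            repr(sorted(event.items())).encode()
        ).hexdigest()

    def should_process(self, event: dict) -> bool:
        k = self.key(event)
        if k in self._seen:
            return False          # duplicate -> do not re-queue the same task
        self._seen.add(k)
        return True

dedup = EventDeduplicator()
for evt in [{"event_id": "e1", "task": "summarize"}, {"event_id": "e1", "task": "summarize"}]:
    print(dedup.should_process(evt))   # True, then False
```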
C. Inspect prompts & tool calls
Dump the prompt templates and interpolated inputs used for each agent invocation.
Confirm tool responses are valid and are not causing the agent to requeue the same task.
D. Trace context propagation
Confirm context is passed correctly (session id, step id, vector DB pointers).
Check for context truncation (token limits) or accidental resets.
E. Observe model outputs for hallucination/uncertainty
Look for repetitious tokens, contradictions, or “I don’t know” loops.
Check temperature/response randomness settings — high temperature can cause wandering behaviors.
F. Telemetry & Logs to examine
Per-request logs: prompt, model config (temp, top_p, max_tokens), returned tokens, latency, tool calls.
Orchestration traces: which agent called which tool and with which outcome.
Vector DB access: query, similarity score, doc id returned.
Business metrics: retries, average steps per request, % escalations to human.
G. Quick fixes
Add step/time limits (max_steps, TTL).
Add sanity checks: if state unchanged for N steps, escalate/abort.
Harden stop conditions and guardrails in logic, not only in prompt.
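A sketch of these quick fixes wired together, assuming the orchestrator exposes a step function you can wrap; MAX_STEPS, TTL_SECONDS, and MAX_UNCHANGED are illustrative defaults to tune for your workload:

```python
import hashlib
import json
import time

MAX_STEPS = 12        # hard cap on agent steps per request
TTL_SECONDS = 60      # wall-clock budget per request
MAX_UNCHANGED = 3     # abort if state is identical for this many consecutive steps

def state_hash(state: dict) -> str:
    return hashlib.sha256(json.dumps(state, sort_keys=True).encode()).hexdigest()

def run_agent(step_fn, state: dict) -> dict:
    """Run the agent step function until done, or abort/escalate on loop or timeout."""
    start = time.monotonic()
    unchanged = 0
    last_hash = state_hash(state)
    for step in range(MAX_STEPS):
        if time.monotonic() - start > TTL_SECONDS:
            return {"outcome": "aborted", "reason": "ttl_exceeded", "step": step}
        state = step_fn(state)
        if state.get("done"):
            return {"outcome": "success", "step": step}
        current = state_hash(state)
        unchanged = unchanged + 1 if current == last_hash else 0
        last_hash = current
        if unchanged >= MAX_UNCHANGED:
            return {"outcome": "escalated", "reason": "state_unchanged", "step": step}
    return {"outcome": "aborted", "reason": "max_steps", "step": MAX_STEPS}
```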
2) Diagnosing context drift (why agent loses track)
Symptoms: the agent references the wrong entity, forgets previous facts, or uses stale data.
Causes & checks
Token truncation: conversation exceeds model context window — check size and implement summarization or RAG.
Bad retrieval: retrieval returns low-similarity docs; check similarity scores and recall.
Stateless calls: orchestrator not re-supplying required context (user profile, last actions).
Inconsistent representation: multiple embeddings or vector DBs with different encoders.
Mitigations
Use rolling summaries (condense older history into summary + store in memory).
Store canonical state in a lightweight session DB and pass only needed pieces to model.
Use retrieval filters (metadata + recency) and require min similarity threshold.
Detect drift: log semantic similarity between current prompt and last-step context; if below threshold, rehydrate state.
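A sketch of the drift detector from the last point; embed() stands in for whichever encoder you use, and the 0.55 threshold is an assumption to calibrate against your own logs:

```python
import numpy as np

DRIFT_THRESHOLD = 0.55   # illustrative; tune from your similarity distribution

def cosine(a: np.ndarray, b: np.ndarray) -> float:
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-9))

def check_drift(embed, current_prompt: str, last_context: str) -> dict:
    """Compare the current prompt to the last-step context; flag drift if similarity drops."""
    sim = cosine(embed(current_prompt), embed(last_context))
    return {
        "similarity": sim,
        "drifted": sim < DRIFT_THRESHOLD,   # if True, rehydrate state before the next call
    }
```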
3) Observability & instrumentation (must-haves)
Structured logs: JSON with fields {request_id, user_id, agent_id, step, prompt_hash, model, temp, embedding_id, similarity_score, latency, outcome}
Tracing: distributed tracing across orchestrator → model → tool calls (Jaeger/OpenTelemetry).
Metrics: steps-per-request, avg similarity, avg token usage, failure rate, human escalations.
Alerts: spike in avg steps, falling similarity, high latency, increased retries.
Playback & replay: ability to replay any request through staging with identical inputs.
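A minimal structured-logging helper matching the fields above (pure standard library; field names follow the list in point 3, values here are placeholders):

```python
import json
import logging

logger = logging.getLogger("agent")
logging.basicConfig(level=logging.INFO, format="%(message)s")

def log_step(**fields):
    """Emit one structured JSON record per agent step."""
    record = {
        "request_id": None, "user_id": None, "agent_id": None, "step": None,
        "prompt_hash": None, "model": None, "temp": None, "embedding_id": None,
        "similarity_score": None, "latency_ms": None, "outcome": None,
    }
    record.update(fields)
    logger.info(json.dumps(record, default=str))

log_step(request_id="r-123", agent_id="planner", step=3,
         model="<model-name>", temp=0.2, similarity_score=0.81,
         latency_ms=420, outcome="tool_call")
```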
4) AI Engineering Best Practices — teachable checklist
Prompt versioning
Keep prompts as code artifacts, versioned in Git.
Tag prompt versions with a semantic version (e.g., v1.0.0) and include a changelog.
Use feature flags / rollout to compare new prompt versions with a small % of traffic.
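A sketch of a percentage rollout between two prompt versions, using a deterministic hash bucket so each user consistently gets the same version; the prompt texts and the 5% split are illustrative:

```python
import hashlib

PROMPTS = {
    "v1.0.0": "You are a support agent. Answer using only the provided context.",
    "v1.1.0": "You are a support agent. Cite the document id for every claim.",
}
ROLLOUT_PERCENT = 5   # % of traffic on the candidate version

def pick_prompt(user_id: str) -> tuple[str, str]:
    """Deterministically bucket users so each one always sees the same prompt version."""
    bucket = int(hashlib.sha256(user_id.encode()).hexdigest(), 16) % 100
    version = "v1.1.0" if bucket < ROLLOUT_PERCENT else "v1.0.0"
    return version, PROMPTS[version]

print(pick_prompt("user-42"))
```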
Evaluation
Unit tests for prompts (golden inputs → expected outputs or properties).
Continuous evaluation: run a battery of tests (accuracy, safety, faithfulness) nightly.
Use human-in-the-loop evaluations for subjective metrics; maintain labelled corpora.
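A sketch of a golden-input prompt test in pytest style; answer_question() is a stub for your model-call wrapper, and the properties checked (required facts, length budget) are examples:

```python
# test_prompts.py -- run with pytest.
GOLDEN_CASES = [
    {"input": "Summarize: The invoice is due on 2024-06-01.",
     "must_contain": ["2024-06-01"], "max_words": 40},
]

def answer_question(text: str) -> str:
    # Stub: replace with a call to your model client.
    return "The invoice is due on 2024-06-01."

def check_case(output: str, case: dict) -> list[str]:
    """Return a list of property violations for one golden case."""
    failures = []
    for needle in case["must_contain"]:
        if needle not in output:
            failures.append(f"missing required fact: {needle}")
    if len(output.split()) > case["max_words"]:
        failures.append("answer exceeds length budget")
    return failures

def test_golden_cases():
    for case in GOLDEN_CASES:
        assert check_case(answer_question(case["input"]), case) == []
```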
Retrieval tuning
Log top-K results + similarity scores.
Tune top_k, threshold, and re-rankers; consider hybrid search (sparse + dense).
Use a chunking strategy that matches chunk size to the model context window and to semantic units.
Logging & observability
Log raw prompts (or their hashes), model outputs, embedding IDs, top-K docs, and similarity scores.
Capture red-team and safety-related flags.
Ensure PII handling: mask or redact logs per compliance rules.
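A minimal redaction pass before logs leave the service; the regex patterns are illustrative only, and real compliance needs your organization's vetted PII policy and tooling:

```python
import re

PII_PATTERNS = [
    (re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+"), "<EMAIL>"),
    (re.compile(r"\b\d{10,12}\b"), "<PHONE>"),
]

def redact(text: str) -> str:
    """Mask obvious PII before the text reaches the log pipeline."""
    for pattern, token in PII_PATTERNS:
        text = pattern.sub(token, text)
    return text

print(redact("Contact jane.doe@example.com or 9876543210 for the refund."))
# -> Contact <EMAIL> or <PHONE> for the refund.
```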
Testing
Unit: prompt-level expectations (format, presence of key facts).
Integration: agent + retrieval + tool interactions.
Regression: run previous incidents through new model/prompt to prevent recurrence.
Stress: concurrency and latency under load.
Versioned deployments
Canary rollout for new models/prompts.
Shadow testing (run new model in parallel, do not serve output).
Automated rollback on regressions.
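A sketch of shadow testing: serve the primary model's answer, run the candidate in parallel, and log both for offline comparison; call_primary and call_candidate are stubs for your own clients:

```python
import concurrent.futures
import json

def call_primary(prompt: str) -> str:
    return "primary answer"     # stub: current production model client

def call_candidate(prompt: str) -> str:
    return "candidate answer"   # stub: new model/prompt under evaluation

def handle_request(prompt: str) -> str:
    """Serve the primary response; run the candidate in shadow and log both."""
    with concurrent.futures.ThreadPoolExecutor(max_workers=2) as pool:
        primary_future = pool.submit(call_primary, prompt)
        shadow_future = pool.submit(call_candidate, prompt)
        primary = primary_future.result()
        try:
            shadow = shadow_future.result(timeout=10)
        except Exception as exc:            # shadow failures must never affect users
            shadow = f"<shadow-error: {exc}>"
    print(json.dumps({"prompt": prompt, "primary": primary, "shadow": shadow}))
    return primary                          # only the primary output is served

handle_request("How do I reset my password?")
```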
5) Retrieval & vector DB tuning
Embedding model consistency: Use same encoder for ingestion & queries.
Chunk size & stride: chunk at semantic boundaries; avoid too large chunks.
Top-K vs threshold: return top-K then filter by min similarity; fallback to QA chains if below threshold.
Hybrid search: combine BM25/Elastic with vector search for recall.
Re-ranker: use lightweight cross-encoder or model-based relevance scoring to re-rank candidates before prompt.
Monitoring: track fraction of queries with similarity < X — set alert.
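Putting the top-K-then-threshold pattern into code; vector_search is a stand-in for your vector DB client, and the 0.6 minimum similarity is an assumption to tune from your monitoring:

```python
TOP_K = 8
MIN_SIMILARITY = 0.6   # illustrative threshold; alert on the fraction of queries below it

def retrieve(vector_search, query: str) -> dict:
    """Return top-K hits above the threshold, or signal a fallback path."""
    hits = vector_search(query, top_k=TOP_K)          # expected shape: [(doc_id, score), ...]
    kept = [(doc_id, score) for doc_id, score in hits if score >= MIN_SIMILARITY]
    return {
        "docs": kept,
        "fallback": len(kept) == 0,   # e.g. hand off to a QA chain or ask a clarifying question
        "min_score": min((score for _, score in hits), default=None),
    }
```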
6) Model-tool fit: how to evaluate & recommend (OpenAI / Anthropic / Gemini / self-hosted)
Use a short evaluation matrix — judge across Capability, Safety, Cost, Latency, Compliance, Fine-tuning/PII control, Ecosystem.
Criteria
Task type: generation, summarization, code, instruction-following, reasoning.
Safety & guardrails: built-in safety layers and moderation.
Latency & throughput: interactive vs bulk jobs.
Customization: fine-tuning, embeddings, system prompts, RAG support.
Data residency & compliance: can you keep training data private? Are enterprise contracts available?
Cost: per-token cost + inference cost at scale.
Ecosystem & tooling: SDKs, monitoring, prompt tuning tools, connectors.
Model license & openness: ability to self-host vs API-only.
Example recommendations (rule-of-thumb)
OpenAI (GPT family) — Great for high-quality generation, broad capabilities, strong ecosystem, but check enterprise data residency & cost. Good for chatbots, summarization, code generation.
Anthropic — Strong safety-first models (Claude) — good when safety and instruction-following constraints are paramount.
Gemini — (if available) strong multimodal & reasoning; evaluate latency/cost tradeoffs.
Self-hosted LLMs (Llama, Mistral, etc.) — Good when data residency, fine-tuning on private corpora, or cost control is crucial, but they require infra and MLOps investment.
Decision process
Define top-3 business priorities (accuracy, latency, cost).
Run a benchmark suite on representative workloads (generation quality, instruction fidelity, hallucination metrics, latency).
Evaluate safety/regulatory posture and contract terms.
Select candidate(s) and run canary + shadow deploys.
Continuously re-evaluate — model landscape evolves fast.
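A toy weighted scoring matrix for the decision step; the criteria weights and 1-5 scores below are placeholders to be filled from your own benchmark runs and contract reviews, not vendor ratings:

```python
WEIGHTS = {"capability": 0.3, "safety": 0.2, "cost": 0.2, "latency": 0.15, "compliance": 0.15}

# Scores on a 1-5 scale; replace the placeholder values with your benchmark results.
CANDIDATES = {
    "hosted-api-model": {"capability": 5, "safety": 4, "cost": 3, "latency": 4, "compliance": 3},
    "self-hosted-llm":  {"capability": 4, "safety": 3, "cost": 4, "latency": 3, "compliance": 5},
}

def weighted_score(scores: dict) -> float:
    return round(sum(WEIGHTS[k] * scores[k] for k in WEIGHTS), 2)

for name, scores in CANDIDATES.items():
    print(name, weighted_score(scores))
```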
7) Operational patterns to prevent looping & drift
Deterministic guardrails: max steps, step-idempotency checks, loop counters.
State diffs: if agent state doesn’t change across steps, abort + escalate.
Confidence thresholds: if low confidence, escalate to human or fallback.
Tool-call validation: validate tool outputs and add sanity checks before feeding back to agent.
Human-in-the-loop gates for high-risk actions.
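A minimal tool-output validation gate, as described in the fourth point above; the required fields and allowed status values are illustrative:

```python
def validate_tool_output(output: dict, required_fields: tuple = ("status", "data")) -> tuple[bool, str]:
    """Sanity-check a tool response before it is fed back to the agent."""
    if not isinstance(output, dict):
        return False, "tool returned a non-dict payload"
    missing = [field for field in required_fields if field not in output]
    if missing:
        return False, f"missing fields: {missing}"
    if output.get("status") not in {"ok", "partial"}:
        return False, f"unexpected status: {output.get('status')}"
    return True, "ok"

ok, reason = validate_tool_output({"status": "ok", "data": {"order_id": 123}})
print(ok, reason)   # on False, escalate or retry instead of looping the agent
```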
8) Quick playbook you can teach the team (one-liner checklist)
Version prompts & models in Git.
Log prompts, model params, tool responses, embeddings, and similarity scores.
Canary & shadow new models/prompts.
Implement step/time limits and loop detectors.
Use hybrid retrieval + re-ranker and monitor similarity distribution.
Automate nightly evaluation suites + human review for edge cases.
Alert on rising steps-per-request, falling similarity, and escalation rates.