
AI Best Practices

  • Writer: Anand Nerurkar
  • Nov 26
  • 4 min read

1) Debugging production agent blockers — a systematic checklist

When you see symptoms like “agent looping” or “context drift”, follow this rapid diagnostic flow:

A. Reproduce & Observe

  • Capture the full conversation transcript + timestamps + request/response IDs.

  • Reproduce with same inputs in a controlled environment (staging) and record all agent steps.

B. Check orchestration / control flow

  • Verify the orchestrator state machine: are termination conditions (max_steps, stop_signals, timeouts) implemented?

  • Look for missing or incorrect done/success signals between agents.

  • Ensure message deduplication (idempotency); looping is often caused by re-processing the same event (see the sketch below).
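
As a sketch of the deduplication point above, the snippet below drops duplicate event deliveries before they reach the agent. The event shape (an event_id field) and the in-memory set are illustrative assumptions; in practice you would back this with Redis or a database keyed by event ID with a TTL.

```python
# Idempotency guard: skip events the orchestrator has already processed.
# The in-memory set is a stand-in for Redis/DB storage with a TTL.
processed_ids: set[str] = set()

def run_agent_step(event: dict) -> None:
    print(f"processing {event['event_id']}")  # placeholder for the real agent step

def handle_event(event: dict) -> None:
    event_id = event["event_id"]
    if event_id in processed_ids:
        # Duplicate delivery: acknowledge and drop instead of re-running the agent.
        print(f"skipping duplicate event {event_id}")
        return
    processed_ids.add(event_id)
    run_agent_step(event)

handle_event({"event_id": "evt-1"})
handle_event({"event_id": "evt-1"})  # second delivery is ignored, so no loop
```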

C. Inspect prompts & tool calls

  • Dump the prompt templates and interpolated inputs used for each agent invocation.

  • Confirm tool responses are valid and are not causing the agent to requeue the same task.

D. Trace context propagation

  • Confirm context is passed correctly (session id, step id, vector DB pointers).

  • Check for context truncation (token limits) or accidental resets.
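
A quick way to catch truncation is to budget tokens before each call. The sketch below assumes an OpenAI-style tokenizer via tiktoken and made-up context-window numbers; swap in your own tokenizer and limits.

```python
# Token-budget check before a model call. Tokenizer choice and limits are assumptions.
import tiktoken

CONTEXT_WINDOW = 128_000   # assumed model context window, in tokens
RESPONSE_BUDGET = 4_000    # tokens reserved for the model's reply

def fits_in_context(messages: list[str]) -> bool:
    enc = tiktoken.get_encoding("cl100k_base")
    used = sum(len(enc.encode(m)) for m in messages)
    return used + RESPONSE_BUDGET <= CONTEXT_WINDOW

history = ["You are a support agent.", "User: my payment failed twice ..."]
if not fits_in_context(history):
    # Summarize or retrieve instead of letting the window silently truncate context.
    print("context too large: summarize older turns before the next call")
```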

E. Observe model outputs for hallucination/uncertainty

  • Look for repetitious tokens, contradictions, or “I don’t know” loops.

  • Check temperature/response randomness settings — high temperature can cause wandering behaviors.

F. Telemetry & Logs to examine

  • Per-request logs: prompt, model config (temp, top_p, max_tokens), returned tokens, latency, tool calls.

  • Orchestration traces: which agent called which tool and with which outcome.

  • Vector DB access: query, similarity score, doc id returned.

  • Business metrics: retries, average steps per request, % escalations to human.

G. Quick fixes

  • Add step/time limits (max_steps, TTL).

  • Add sanity checks: if state is unchanged for N steps, escalate or abort (see the sketch below).

  • Harden stop conditions and guardrails in code, not only in the prompt.
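
Here is a minimal sketch of those two quick fixes combined: a hard step budget plus a stall detector that escalates when the state hash stops changing. agent_step() and escalate() are placeholder hooks, not a specific framework API.

```python
# Loop guard: hard step budget plus a stall detector on the agent state hash.
import hashlib
import json

MAX_STEPS = 20
STALL_LIMIT = 3  # abort if the state is unchanged for this many steps

def state_hash(state: dict) -> str:
    return hashlib.sha256(json.dumps(state, sort_keys=True).encode()).hexdigest()

def agent_step(state: dict) -> dict:
    return state  # placeholder for one real orchestrator step (model/tool call)

def escalate(state: dict, reason: str) -> None:
    print(f"escalating to a human: {reason}")  # placeholder escalation hook

def run_agent(initial_state: dict) -> dict:
    state, last_hash, stalled = initial_state, None, 0
    for _ in range(MAX_STEPS):
        state = agent_step(state)
        current = state_hash(state)
        stalled = stalled + 1 if current == last_hash else 0
        last_hash = current
        if state.get("done"):
            break
        if stalled >= STALL_LIMIT:
            escalate(state, reason=f"state unchanged for {STALL_LIMIT} steps")
            break
    else:
        escalate(state, reason="max_steps exhausted without completion")
    return state

run_agent({"task": "reconcile payment", "done": False})
```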

2) Diagnosing context drift (why agent loses track)

Symptoms: the agent references the wrong entity, forgets previous facts, or uses stale data.

Causes & checks

  • Token truncation: the conversation exceeds the model's context window; check its size and implement summarization or RAG.

  • Bad retrieval: retrieval returns low-similarity docs; check similarity scores and recall.

  • Stateless calls: orchestrator not re-supplying required context (user profile, last actions).

  • Inconsistent representation: multiple embedding models or vector DBs built with different encoders.

Mitigations

  • Use rolling summaries (condense older history into a summary and store it in memory).

  • Store canonical state in a lightweight session DB and pass only the needed pieces to the model.

  • Use retrieval filters (metadata + recency) and require min similarity threshold.

  • Detect drift: log semantic similarity between current prompt and last-step context; if below threshold, rehydrate state.
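
A minimal drift detector along these lines: embed the current prompt and the last-step context, compare cosine similarity, and rehydrate when it falls below a threshold. The embedding vectors are assumed to come from whatever encoder you already use; the 0.6 threshold is a placeholder to tune on real traffic.

```python
# Drift check: compare the current prompt embedding with the last-step context embedding.
import math

DRIFT_THRESHOLD = 0.6  # placeholder; tune on real traffic

def cosine(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def check_drift(prompt_vec: list[float], context_vec: list[float], session_id: str) -> bool:
    sim = cosine(prompt_vec, context_vec)
    print({"session_id": session_id, "drift_similarity": round(sim, 3)})  # to structured logs in practice
    # True means: re-fetch canonical session state and re-inject it into the next prompt.
    return sim < DRIFT_THRESHOLD

drifted = check_drift([0.1, 0.9, 0.0], [0.8, 0.1, 0.1], session_id="sess-42")
print("rehydrate state" if drifted else "context looks consistent")
```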

3) Observability & instrumentation (must-haves)

  • Structured logs: JSON with fields {request_id, user_id, agent_id, step, prompt_hash, model, temp, embedding_id, similarity_score, latency, outcome} (a minimal emitter is sketched at the end of this list)

  • Tracing: distributed tracing across orchestrator → model → tool calls (Jaeger/OpenTelemetry).

  • Metrics: steps-per-request, avg similarity, avg token usage, failure rate, human escalations.

  • Alerts: spike in avg steps, falling similarity, high latency, increased retries.

  • Playback & replay: ability to replay any request through staging with identical inputs.
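
A minimal emitter for the structured-log fields above might look like this; the field values and the print-to-stdout sink are illustrative, and in production the record would go to your log pipeline.

```python
# Structured-log emitter mirroring the field list above; values here are illustrative.
import json
import time
import uuid

def log_agent_step(**fields) -> None:
    record = {"ts": time.time(), "request_id": str(uuid.uuid4()), **fields}
    print(json.dumps(record, sort_keys=True))  # stdout stands in for the log pipeline

log_agent_step(
    user_id="u-123", agent_id="kyc-agent", step=4,
    prompt_hash="a1b2c3", model="gpt-4o", temp=0.2,
    embedding_id="emb-789", similarity_score=0.82,
    latency=640, outcome="tool_call:crm_lookup",
)
```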

4) AI Engineering Best Practices — teachable checklist

Prompt versioning

  • Keep prompts as code artifacts, versioned in Git.

  • Tag prompt versions with a semantic version (v1.0.0) and include a changelog.

  • Use feature flags / rollout to compare new prompt versions with a small % of traffic.
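
One way to wire versioned prompts to a percentage rollout is sketched below. The file-in-Git layout is implied rather than shown (templates are inlined here), and the hash-based traffic bucketing is an assumption, not a prescribed mechanism.

```python
# Prompt registry with semantic versions and a percentage rollout per prompt.
import hashlib

PROMPTS = {
    "summarize_ticket": {
        "v1.0.0": "Summarize the support ticket below in 3 bullet points:\n{ticket}",
        "v1.1.0": "Summarize the ticket below in 3 bullets, citing ticket fields:\n{ticket}",
    }
}
ROLLOUT_PERCENT = {"summarize_ticket": {"v1.1.0": 10}}  # 10% of traffic on the new version

def select_prompt(name: str, request_id: str) -> tuple[str, str]:
    # Deterministic bucketing so the same request always sees the same version.
    bucket = int(hashlib.md5(request_id.encode()).hexdigest(), 16) % 100
    if bucket < ROLLOUT_PERCENT.get(name, {}).get("v1.1.0", 0):
        return "v1.1.0", PROMPTS[name]["v1.1.0"]
    return "v1.0.0", PROMPTS[name]["v1.0.0"]

version, template = select_prompt("summarize_ticket", request_id="req-42")
print(version, template.format(ticket="Card declined twice, customer blocked."))
```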

Evaluation

  • Unit tests for prompts (golden inputs → expected outputs or properties); see the test example after this list.

  • Continuous evaluation: run a battery of tests (accuracy, safety, faithfulness) nightly.

  • Use human-in-the-loop evaluations for subjective metrics; maintain labelled corpora.
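
A golden-input prompt test, pytest style, might look like the sketch below. call_model() is a stand-in for your model client (or a recorded response fixture), and the assertions check properties of the output rather than exact strings.

```python
# Golden-input tests for a prompt: assert properties, not exact strings.
def call_model(prompt: str) -> str:
    # Stand-in for the real model client or a recorded response fixture.
    return "REFUND-ELIGIBLE: order 1042, amount 49.99 EUR"

def test_refund_prompt_keeps_key_facts():
    output = call_model("Decide refund eligibility for order 1042 ...")
    assert "order 1042" in output          # key fact preserved
    assert output.startswith("REFUND-")    # required output format

def test_refund_prompt_states_a_known_currency():
    output = call_model("Decide refund eligibility for order 1042 ...")
    assert "EUR" in output or "USD" in output
```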

Retrieval tuning

  • Log top-K results + similarity scores.

  • Tune top_k, threshold, and re-rankers; consider hybrid search (sparse + dense).

  • Use a chunking strategy that matches chunk size to the model context window and to semantic units.
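
For illustration, a paragraph-boundary chunker with a rough token budget could look like this; the 4-characters-per-token estimate and the 300-token budget are assumptions to replace with your tokenizer and context limits.

```python
# Paragraph-boundary chunker with a rough token budget.
MAX_TOKENS_PER_CHUNK = 300  # placeholder budget

def est_tokens(text: str) -> int:
    return max(1, len(text) // 4)  # rough 4-chars-per-token estimate

def chunk_document(doc: str) -> list[str]:
    chunks, current = [], []
    for para in (p.strip() for p in doc.split("\n\n") if p.strip()):
        if current and est_tokens("\n\n".join(current + [para])) > MAX_TOKENS_PER_CHUNK:
            chunks.append("\n\n".join(current))
            current = []
        current.append(para)  # an oversized single paragraph still becomes its own chunk
    if current:
        chunks.append("\n\n".join(current))
    return chunks

print(chunk_document("First paragraph.\n\nSecond paragraph.\n\nThird paragraph."))
```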

Logging & observability

  • Log raw prompts (or their hashes), model outputs, embedding IDs, top-K docs, and similarity scores.

  • Capture red-team and safety-related flags.

  • Ensure PII handling: mask or redact logs per compliance rules.
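
A simple redaction pass before logs leave the service is sketched below. The regex patterns (emails, long digit runs) are illustrative only; real deployments should use a vetted PII library and compliance-approved rules.

```python
# Redaction pass applied to log payloads before they are shipped.
import re

EMAIL = re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+")
LONG_NUMBER = re.compile(r"\b\d{10,}\b")  # account / card-like digit runs

def redact(text: str) -> str:
    return LONG_NUMBER.sub("<NUMBER>", EMAIL.sub("<EMAIL>", text))

print(redact("Customer ravi.k@example.com, account 123456789012, asked for a refund."))
```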

Testing

  • Unit: prompt-level expectations (format, presence of key facts).

  • Integration: agent + retrieval + tool interactions.

  • Regression: run previous incidents through new model/prompt to prevent recurrence.

  • Stress: concurrency and latency under load.

Versioned deployments

  • Canary rollout for new models/prompts.

  • Shadow testing (run the new model in parallel, but do not serve its output; see the sketch below).

  • Automated rollback on regressions.
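
A shadow-testing sketch: serve the current model's answer, run the candidate in parallel, and log both for offline comparison. The two client functions are placeholders for whatever SDKs you actually use.

```python
# Shadow testing: only the current model's output is served; the candidate is logged.
import concurrent.futures
import json

def call_current_model(prompt: str) -> str:
    return "current answer"    # placeholder for the production client

def call_candidate_model(prompt: str) -> str:
    return "candidate answer"  # placeholder for the new model/prompt under test

def handle_request(prompt: str) -> str:
    with concurrent.futures.ThreadPoolExecutor(max_workers=1) as pool:
        shadow = pool.submit(call_candidate_model, prompt)
        served = call_current_model(prompt)
        try:
            candidate = shadow.result(timeout=5)
        except Exception as exc:  # the shadow path must never break serving
            candidate = f"<shadow failed: {exc}>"
    print(json.dumps({"prompt": prompt, "served": served, "shadow": candidate}))
    return served

handle_request("Summarize my last three transactions")
```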

5) Retrieval & vector DB tuning

  • Embedding model consistency: use the same encoder for ingestion and queries.

  • Chunk size & stride: chunk at semantic boundaries; avoid overly large chunks.

  • Top-K vs threshold: return top-K, then filter by a minimum similarity; fall back to a plain QA chain if nothing clears the threshold (see the sketch after this list).

  • Hybrid search: combine BM25/Elastic with vector search for recall.

  • Re-ranker: use a lightweight cross-encoder or model-based relevance scoring to re-rank candidates before prompting.

Monitoring: track the fraction of queries with similarity < X and set an alert on it.
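
Putting the top-K-plus-threshold pattern together, a retrieval wrapper might look like the sketch below; search() and rerank() are stand-ins for your vector DB client and cross-encoder, and the threshold and K values are placeholders.

```python
# Retrieval wrapper: top-K, minimum-similarity filter, optional re-rank, fallback.
MIN_SIMILARITY = 0.75  # placeholder threshold
TOP_K = 8

def search(query: str, k: int) -> list[dict]:
    # Stand-in for the vector DB client; returns doc id, similarity score, text.
    return [{"doc_id": "kb-12", "score": 0.81, "text": "..."},
            {"doc_id": "kb-40", "score": 0.58, "text": "..."}]

def rerank(query: str, docs: list[dict]) -> list[dict]:
    return docs  # stand-in for a cross-encoder re-ranker

def retrieve(query: str):
    candidates = [d for d in search(query, TOP_K) if d["score"] >= MIN_SIMILARITY]
    if not candidates:
        return None  # caller falls back to a plain QA chain or a clarifying question
    return rerank(query, candidates)

docs = retrieve("What is the chargeback window?")
print(docs if docs else "fallback path")
```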

6) Model-tool fit: how to evaluate & recommend (OpenAI / Anthropic / Gemini / self-hosted)

Use a short evaluation matrix — judge across Capability, Safety, Cost, Latency, Compliance, Fine-tuning/PII control, Ecosystem.

Criteria

  • Task type: generation, summarization, code, instruction-following, reasoning.

  • Safety & guardrails: built-in safety layers and moderation.

  • Latency & throughput: interactive vs bulk jobs.

  • Customization: fine-tuning, embeddings, system prompts, RAG support.

  • Data residency & compliance: can you keep training data private? Are enterprise contracts available?

  • Cost: per-token cost + inference cost at scale.

  • Ecosystem & tooling: SDKs, monitoring, prompt tuning tools, connectors.

  • Model license & openness: ability to self-host vs API-only.

Example recommendations (rule-of-thumb)

  • OpenAI (GPT family) — Great for high-quality generation, broad capabilities, strong ecosystem, but check enterprise data residency & cost. Good for chatbots, summarization, code generation.

  • Anthropic — Strong safety-first models (Claude) — good when safety and instruction-following constraints are paramount.

  • Gemini — strong multimodal and reasoning capabilities (where available); evaluate latency/cost tradeoffs.

  • Self-hosted LLMs (Llama, Mistral, etc.) — good when data residency, fine-tuning on private corpora, or cost control is crucial, but they require infra and MLOps investment.

Decision process

  1. Define the top three business priorities (e.g., accuracy, latency, cost).

  2. Run a benchmark suite on representative workloads (generation quality, instruction fidelity, hallucination metrics, latency); a weighted-scoring sketch follows this list.

  3. Evaluate safety/regulatory posture and contract terms.

  4. Select candidate(s) and run canary + shadow deploys.

  5. Continuously re-evaluate — model landscape evolves fast.
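
Steps 1 and 2 can feed a simple weighted-scoring matrix like the sketch below. The criteria weights, candidate names, and scores are illustrative placeholders; real numbers come from your benchmark suite and procurement review.

```python
# Weighted scoring of candidate models against the agreed criteria.
WEIGHTS = {"capability": 0.30, "safety": 0.20, "latency": 0.15,
           "cost": 0.15, "compliance": 0.20}

SCORES = {  # 1 (poor) to 5 (excellent), filled in from your own benchmarks
    "hosted_api_a": {"capability": 5, "safety": 4, "latency": 3, "cost": 2, "compliance": 3},
    "hosted_api_b": {"capability": 4, "safety": 5, "latency": 3, "cost": 3, "compliance": 4},
    "self_hosted":  {"capability": 3, "safety": 3, "latency": 4, "cost": 4, "compliance": 5},
}

def weighted_score(scores: dict) -> float:
    return round(sum(WEIGHTS[c] * scores[c] for c in WEIGHTS), 2)

for candidate, s in sorted(SCORES.items(), key=lambda kv: -weighted_score(kv[1])):
    print(candidate, weighted_score(s))
```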

7) Operational patterns to prevent looping & drift

  • Deterministic guardrails: max steps, step-idempotency checks, loop counters.

  • State diffs: if agent state doesn’t change across steps, abort + escalate.

  • Confidence thresholds: if low confidence, escalate to human or fallback.

  • Tool-call validation: validate tool outputs and add sanity checks before feeding them back to the agent (see the sketch below).

  • Human-in-the-loop gates for high-risk actions.
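
As an example of tool-call validation, the sketch below checks a hypothetical balance-lookup response for shape and basic sanity before it re-enters the agent loop; the schema and allowed currencies are assumptions.

```python
# Validate a hypothetical balance-lookup tool response before it re-enters the loop.
def validate_balance_lookup(result: dict) -> dict:
    required = {"account_id", "balance", "currency"}
    missing = required - result.keys()
    if missing:
        raise ValueError(f"tool response missing fields: {missing}")
    if not isinstance(result["balance"], (int, float)) or result["balance"] < 0:
        raise ValueError("balance must be a non-negative number")
    if result["currency"] not in {"INR", "USD", "EUR"}:
        raise ValueError(f"unexpected currency: {result['currency']}")
    return result

# A malformed response is rejected here instead of silently re-entering the agent loop.
try:
    validate_balance_lookup({"account_id": "acc-1", "balance": "NaN", "currency": "INR"})
except ValueError as err:
    print("tool call rejected:", err)
```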

8) Quick playbook you can teach the team (one-liner checklist)

  1. Version prompts & models in Git.

  2. Log prompts, model params, tool responses, embeddings, and similarity scores.

  3. Canary & shadow new models/prompts.

  4. Implement step/time limits and loop detectors.

  5. Use hybrid retrieval + re-ranker and monitor similarity distribution.

  6. Automate nightly evaluation suites + human review for edge cases.

  7. Alert on rising steps-per-request, falling similarity, and escalation rates.

 
 
 
