AI Engineering Best Practices
- Anand Nerurkar
- Nov 25
- 4 min read
AI engineering best practices: prompt versioning, evaluation, retrieval tuning, logging, and testing.
This is how a GenAI Lead / Advisory Architect should approach each of these areas.
✅ 1. Prompt Versioning
What it means: Treat prompts like source code — version-controlled, reviewed, tested, and released.
Why enterprises need it
Different business units use slightly different prompts
Prompts evolve with product features
One small change can break a workflow
Compliance requires audit history (banking, insurance)
Best practices
Store prompts in Git with semantic versioning → loan_policy_v3.2
Use environment-specific prompts → dev, staging, production
Maintain a prompt registry (similar to model registry)
Add metadata: author, date, change purpose, expected behavior
Track dependencies: prompt → vector store version → model version
What I teach teams
“Think of prompts as APIs. Clear contract, changelog, version, and tests.”
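As a concrete illustration, here is a minimal sketch of what one prompt-registry entry could look like in plain Python. The schema (field names, the example model and vector-store versions) is illustrative, not a specific product's format.

```python
from dataclasses import dataclass
from datetime import date

@dataclass(frozen=True)
class PromptVersion:
    """One immutable entry in a prompt registry (illustrative schema)."""
    prompt_id: str             # e.g. "loan_policy"
    semver: str                # e.g. "3.2.0"
    template: str              # prompt text with placeholders
    author: str
    created: date
    change_purpose: str        # why this version exists
    expected_behavior: str     # the "contract" reviewers test against
    environment: str           # dev | staging | production
    vector_store_version: str  # dependency tracking
    model_version: str         # dependency tracking

# Example entry, mirroring the loan_policy_v3.2 example above
loan_policy_v3_2 = PromptVersion(
    prompt_id="loan_policy",
    semver="3.2.0",
    template=("You are a lending-policy assistant. Answer only from the provided context:\n"
              "{context}\n\nQuestion: {question}"),
    author="platform-team",
    created=date(2025, 11, 25),
    change_purpose="Tighten grounding instruction to reduce hallucinations",
    expected_behavior="Refuses to answer when context is empty",
    environment="staging",
    vector_store_version="lending_docs_2025_11",
    model_version="chat-model-2025-08",
)
```

Keeping entries immutable and storing them in Git gives you the changelog, review trail, and dependency links the audit teams ask for.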
✅ 2. Evaluation (LLM Eval Harness)
Purpose: Ensure reliability, accuracy, and safety, and catch regressions early.
What we evaluate
Correctness
Hallucinations
Toxicity / bias
Consistency across repeated runs
Multi-agent stability
Tool-calling accuracy
How we evaluate
Golden datasets (human-approved inputs & outputs)
Unit tests for prompts
Scenario tests (KYC mismatch, fraud, edge cases)
A/B testing across prompt versions
Automatic scoring using LLM-as-a-judge
Enterprise example
For a Lending Policy RAG system, we create 200 test questions covering KYC rules, eligibility edge cases, exceptions, and RBI compliance scenarios.
This prevents regressions when documents or prompts change.
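A minimal sketch of how such a golden-dataset harness can be wired up in plain Python. The judge_with_llm() stand-in and the 0.8 pass threshold are assumptions; in practice you would swap in a real LLM-as-a-judge call.

```python
import json
from pathlib import Path

def judge_with_llm(question: str, expected: str, actual: str) -> float:
    """Placeholder judge: swap in a real LLM-as-a-judge call here.

    This naive stand-in scores keyword overlap so the harness runs end to end.
    """
    expected_terms = set(expected.lower().split())
    actual_terms = set(actual.lower().split())
    return len(expected_terms & actual_terms) / max(len(expected_terms), 1)

def run_golden_eval(golden_path: str, answer_fn, pass_threshold: float = 0.8) -> dict:
    """Run every golden Q&A pair through the system and score it."""
    cases = json.loads(Path(golden_path).read_text())  # [{"question": ..., "expected": ...}, ...]
    failures = []
    for case in cases:
        actual = answer_fn(case["question"])            # your RAG / agent pipeline
        score = judge_with_llm(case["question"], case["expected"], actual)
        if score < pass_threshold:
            failures.append({"question": case["question"], "score": score, "actual": actual})
    return {"total": len(cases), "failed": len(failures), "failures": failures}
```

Run it nightly and on every prompt or document change, and store the results so regressions show up as a diff rather than a production incident.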
✅ 3. Retrieval Tuning (RAG Optimization)
Poor retrieval = hallucination.
What to tune
Chunking strategy
Embedding model
Similarity metric (cosine vs dot product)
Number of retrieved chunks (k)
Metadata filters
Min similarity threshold
Query rewriting
Best practices I teach
Use hybrid chunking (structural + semantic)
Store metadata → section, policy version, expiry date
Use reranking models to select the most relevant chunks
Continuously run retrieval evals: Recall@k, MRR, precision
Example
If the lending policy changes, we re-embed only the affected chunks and validate retrieval via automated Recall@5 metrics.
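The retrieval metrics mentioned above are simple to compute yourself. Here is a small, self-contained sketch of Recall@k and MRR over a labelled retrieval eval set; the document IDs, query, and k=5 cutoff are illustrative.

```python
def recall_at_k(retrieved_ids: list[str], relevant_ids: set[str], k: int = 5) -> float:
    """Fraction of relevant chunks that appear in the top-k retrieved results."""
    if not relevant_ids:
        return 0.0
    hits = sum(1 for doc_id in retrieved_ids[:k] if doc_id in relevant_ids)
    return hits / len(relevant_ids)

def mrr(retrieved_ids: list[str], relevant_ids: set[str]) -> float:
    """Reciprocal rank of the first relevant chunk (0 if none is retrieved)."""
    for rank, doc_id in enumerate(retrieved_ids, start=1):
        if doc_id in relevant_ids:
            return 1.0 / rank
    return 0.0

# Example: validate retrieval after re-embedding changed policy chunks
eval_set = [
    {"query": "Minimum CIBIL score for personal loans?",
     "relevant": {"policy_v3.2#sec4.1"},
     "retrieved": ["policy_v3.2#sec4.1", "policy_v3.2#sec2.0", "faq#12"]},
]
avg_recall = sum(recall_at_k(c["retrieved"], c["relevant"], k=5) for c in eval_set) / len(eval_set)
avg_mrr = sum(mrr(c["retrieved"], c["relevant"]) for c in eval_set) / len(eval_set)
print(f"Recall@5={avg_recall:.2f}  MRR={avg_mrr:.2f}")
```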
✅ 4. Logging & Observability
You cannot operate GenAI systems blindly.
What to log
Per-request logs
Prompt
Model config
Tokens in/out
Latency
Final answer
RAG logs
Retrieved documents
Score
Metadata
Agent/Tool logs
Tool calls
Inputs & outputs
Errors / retries
Business metrics
% automation
Failures → escalated to human
Distribution of query types
Why enterprises need this
Audit compliance (banking, healthcare)
Debugging unexpected decisions
SLA measurement
Model performance drift
What I teach teams
“Every LLM request should be fully reconstructible end-to-end.”
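A minimal sketch of a structured per-request log record that makes a request reconstructible. Field names are illustrative; the JSON lines would normally flow into whatever logging and observability stack you already run.

```python
import json
import logging
import time
import uuid

logger = logging.getLogger("genai.requests")
logging.basicConfig(level=logging.INFO)

def log_llm_request(prompt_version: str, model_config: dict, prompt: str,
                    retrieved: list[dict], answer: str,
                    tokens_in: int, tokens_out: int, started_at: float) -> None:
    """Emit one JSON log line containing everything needed to replay the request."""
    record = {
        "request_id": str(uuid.uuid4()),
        "prompt_version": prompt_version,    # ties back to the prompt registry
        "model_config": model_config,        # model name, temperature, etc.
        "prompt": prompt,
        "retrieved_chunks": retrieved,       # id, score, metadata per chunk
        "answer": answer,
        "tokens": {"in": tokens_in, "out": tokens_out},
        "latency_ms": round((time.time() - started_at) * 1000, 1),
    }
    logger.info(json.dumps(record))
```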
✅ 5. Testing Framework
Testing = reliability.
Types of tests
✔ Prompt Unit Tests
Check expected structure & tone.
✔ RAG Regression Tests
Ensure retrieval is not broken after re-indexing.
✔ Tool Integration Tests
Validate the agent's tool calls and their inputs/outputs.
✔ Safety Tests
Red-teaming
PII leakage tests
Jailbreak attempts
✔ Load Tests
Check concurrency, throughput, cost.
Best practices
Build an evaluation harness (Python + YAML test cases)
Automate nightly AI tests
Enforce a quality gate → a release cannot be deployed if AI tests fail
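A minimal sketch of the "Python + YAML" idea: test cases live in YAML, a small runner executes them, and CI fails the build when any case misses its expectation. PyYAML is assumed, and the case schema (name, question, must_contain) is illustrative.

```python
import sys
import yaml  # pip install pyyaml

# cases.yaml (illustrative):
#   cases:
#     - name: kyc_mismatch
#       question: "Can a loan be approved if KYC documents do not match?"
#       must_contain: "cannot be approved"

def load_cases(path: str) -> list[dict]:
    with open(path) as f:
        return yaml.safe_load(f)["cases"]

def run_quality_gate(cases_path: str, answer_fn) -> None:
    """Run every YAML case through the pipeline; exit non-zero on any failure."""
    failures = []
    for case in load_cases(cases_path):
        answer = answer_fn(case["question"])
        if case["must_contain"].lower() not in answer.lower():
            failures.append(case["name"])
    if failures:
        print(f"Quality gate FAILED: {failures}")
        sys.exit(1)   # non-zero exit blocks the deployment pipeline
    print("Quality gate passed.")
```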
“In my role, I ensure engineering teams not only build GenAI features but build them responsibly — with prompt versioning, retrieval tuning, evaluation pipelines, advanced observability, and structured AI testing. This reduces hallucinations, prevents regressions, ensures compliance, and brings predictability to LLM-driven systems — exactly what enterprises need to operationalize AI at scale.”
✅ Top AI Testing Frameworks (Industry Standard)
1️⃣ LangSmith (from LangChain) — Most popular for LLM evals
✔ Prompt testing
✔ RAG evaluation
✔ Agent tracing
✔ Regression testing
✔ Dataset-driven evaluations
✔ LLM-as-a-judge scoring
✔ Observability + debugging
When to use:
You want a complete suite for prompt testing + multi-agent traces
You are using LangChain or building multi-agent systems
2️⃣ TruLens (from TruEra) — Enterprise-grade RAG evaluation
✔ Faithfulness score (hallucination detection)
✔ Context relevance score
✔ Groundedness testing
✔ Drift detection
✔ RAG trace analysis
✔ Human feedback integration
When to use:
You need explainability + quality metrics (grounded, relevant, safe)
Ideal for BFSI (due to compliance maturity)
3️⃣ DeepEval — Testing-first framework for LLM unit tests
✔ YAML test definitions
✔ LLM unit tests
✔ End-to-end tests
✔ LLM-as-a-judge scoring
✔ Built-in metrics: factuality, relevance, toxicity
When to use:
You want CI/CD integration for prompts
You want pytest-style testing for LLM apps
Lightweight + easy to integrate
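For illustration, a pytest-style DeepEval test might look roughly like the sketch below. It follows DeepEval's documented LLMTestCase / assert_test pattern, but verify the exact names against the version you install; my_rag_pipeline is a placeholder for your own application code.

```python
# pip install deepeval  (API names per DeepEval docs; verify against your installed version)
from deepeval import assert_test
from deepeval.metrics import AnswerRelevancyMetric
from deepeval.test_case import LLMTestCase

def my_rag_pipeline(question: str) -> str:
    """Placeholder for your real RAG / agent pipeline."""
    return "A minimum monthly income of ₹25,000 is required."

def test_loan_eligibility_answer():
    test_case = LLMTestCase(
        input="What is the minimum income for a personal loan?",
        actual_output=my_rag_pipeline("What is the minimum income for a personal loan?"),
        retrieval_context=["Personal loans require a minimum monthly income of ₹25,000."],
    )
    # Fails the pytest run if the relevancy score drops below the threshold
    assert_test(test_case, [AnswerRelevancyMetric(threshold=0.7)])
```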
4️⃣ Ragas (by ExplodingGradients) — RAG-specific evaluation
✔ Computes advanced RAG metrics:
Context precision / recall
Faithfulness
Answer correctness
Hallucination score
When to use:
You want deep insights into retrieval quality
You are tuning chunking/embedding/similarity thresholds
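A rough sketch of what a Ragas evaluation run looks like, based on the documented evaluate() + datasets.Dataset pattern. Metric and column names differ between Ragas versions, so treat this as an assumption to verify rather than a drop-in recipe.

```python
# pip install ragas datasets  (column and metric names follow the Ragas docs; verify for your version)
from datasets import Dataset
from ragas import evaluate
from ragas.metrics import answer_correctness, context_precision, context_recall, faithfulness

eval_data = Dataset.from_dict({
    "question": ["What is the minimum CIBIL score for a home loan?"],
    "answer": ["The policy requires a minimum CIBIL score of 750."],        # system output
    "contexts": [["Home loans require a CIBIL score of at least 750."]],    # retrieved chunks
    "ground_truth": ["A minimum CIBIL score of 750 is required."],          # human-approved answer
})

result = evaluate(eval_data, metrics=[faithfulness, context_precision,
                                      context_recall, answer_correctness])
print(result)  # per-metric scores you can track run-over-run
```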
5️⃣ PromptFoo — Great for prompt versioning + regression testing
✔ A/B testing of prompts
✔ CLI runner
✔ Dataset-driven eval
✔ GitHub CI/CD integration
✔ Prompt version diff
✔ Multi-model comparison
When to use:
You want version-controlled prompt testing
You want to compare prompts across providers: OpenAI, Gemini, Llama, Azure
6️⃣ OpenAI Evals — Baseline evaluation harness
✔ Model performance tests
✔ Scoring with GPT
✔ Regression testing
✔ Dataset-based evaluation
When to use:
You want basic evaluation tied directly to OpenAI models
7️⃣ Weights & Biases (W&B) — LLMOps, experiment tracking
✔ Experiment tracking
✔ Benchmarks
✔ Dataset management
✔ Model comparisons
✔ Model drift detection
When to use:
You need ML + LLM tracking in one place
You want dashboards for leadership
8️⃣ MLflow + Custom LLM Testing Pipelines
✔ Prompt versioning
✔ Model registry
✔ Metadata logging
✔ Experiment tracking
When to use:
You have existing MLOps and want LLMOps inside that stack
You want full control / self-hosting
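If MLflow is already your MLOps backbone, prompt versions and eval results can ride on the same tracking APIs. A minimal sketch using standard mlflow.set_experiment / log_param / log_metric / log_text calls; the experiment name, versions, and metric values are illustrative.

```python
import mlflow  # pip install mlflow

mlflow.set_experiment("lending-policy-rag")

with mlflow.start_run(run_name="loan_policy_v3.2-eval"):
    # Version metadata: which prompt / index / model combination was evaluated
    mlflow.log_param("prompt_version", "loan_policy_v3.2")
    mlflow.log_param("vector_store_version", "lending_docs_2025_11")
    mlflow.log_param("embedding_model", "text-embedding-3-large")

    # Results from the nightly eval harness (illustrative numbers)
    mlflow.log_metric("recall_at_5", 0.91)
    mlflow.log_metric("faithfulness", 0.88)
    mlflow.log_metric("pct_automated", 0.76)

    # Keep the prompt text itself alongside the run for auditability
    mlflow.log_text("You are a lending-policy assistant ...", "prompts/loan_policy_v3.2.txt")
```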
🎯 Summary Table — Best Tool for Each Area
| Area | Best Tool(s) | Why |
| --- | --- | --- |
| Prompt Regression Testing | PromptFoo, LangSmith | A/B tests, version control |
| RAG Evaluation | Ragas, TruLens | Deep retrieval scoring + groundedness |
| End-to-End LLM Testing | DeepEval, LangSmith | Unit tests + scenario tests |
| Enterprise Observability & Tracing | LangSmith, W&B | Traces, metrics, debugging |
| Drift Detection | TruLens, W&B | Explainability + drift |
| CI/CD Integration | PromptFoo, DeepEval | Lightweight, YAML-based |
| Audit Compliance (Banking/Insurance) | TruLens | Explainable metrics |
For BFSI / large-scale AI projects, the best combination is:
LangSmith + TruLens + Ragas + PromptFoo
Why:
LangSmith → Multi-agent traces + prompt testing
Ragas → Retrieval quality (Recall@k, precision, context relevance)
TruLens → Hallucination & groundedness
PromptFoo → Prompt versioning & CI/CD testing
This gives you a complete AI testing ecosystem.