AI Engineering Best Practices
- Anand Nerurkar
- Nov 25
- 4 min read
AI engineering best practices: prompt versioning, evaluation, retrieval tuning, logging, and testing.
This is how a GenAI Lead / Advisory Architect should approach each of these areas.
✅ 1. Prompt Versioning
What it means: Treat prompts like source code — version-controlled, reviewed, tested, and released.
Why enterprises need it
Different business units use slightly different prompts
Prompts evolve with product features
One small change can break a workflow
Compliance requires audit history (banking, insurance)
Best practices
Store prompts in Git with semantic versioning → loan_policy_v3.2
Use environment-specific prompts → dev, staging, production
Maintain a prompt registry (similar to model registry)
Add metadata: author, date, change purpose, expected behavior
Track dependencies: prompt → vector store version → model version
What I teach teams
“Think of prompts as APIs. Clear contract, changelog, version, and tests.”
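As a concrete illustration, here is a minimal sketch of what one prompt-registry entry could look like in plain Python. The schema (field names, the example model and vector-store versions) is illustrative, not a specific product's format.

```python
from dataclasses import dataclass
from datetime import date

@dataclass(frozen=True)
class PromptVersion:
    """One immutable entry in a prompt registry (illustrative schema)."""
    prompt_id: str             # e.g. "loan_policy"
    semver: str                # e.g. "3.2.0"
    template: str              # prompt text with placeholders
    author: str
    created: date
    change_purpose: str        # why this version exists
    expected_behavior: str     # the "contract" reviewers test against
    environment: str           # dev | staging | production
    vector_store_version: str  # dependency tracking
    model_version: str         # dependency tracking

# Example entry, mirroring the loan_policy_v3.2 example above
loan_policy_v3_2 = PromptVersion(
    prompt_id="loan_policy",
    semver="3.2.0",
    template=("You are a lending-policy assistant. Answer only from the provided context:\n"
              "{context}\n\nQuestion: {question}"),
    author="platform-team",
    created=date(2025, 11, 25),
    change_purpose="Tighten grounding instruction to reduce hallucinations",
    expected_behavior="Refuses to answer when context is empty",
    environment="staging",
    vector_store_version="lending_docs_2025_11",
    model_version="chat-model-2025-08",
)
```

Keeping entries immutable and storing them in Git gives you the changelog, review trail, and dependency links the audit teams ask for.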
✅ 2. Evaluation (LLM Eval Harness)
Purpose: Ensure reliability, accuracy, and safety, and catch regressions early.
What we evaluate
Correctness
Hallucinations
Toxicity / bias
Consistency across repeated runs
Multi-agent stability
Tool-calling accuracy
How we evaluate
Golden datasets (human-approved inputs & outputs)
Unit tests for prompts
Scenario tests (KYC mismatch, fraud, edge cases)
A/B testing across prompt versions
Automatic scoring using LLM-as-a-judge
Enterprise example
For a Lending Policy RAG system, we create 200 test questions covering KYC rules, eligibility edge cases, exceptions, and RBI compliance scenarios.
This prevents regressions when documents or prompts change.
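A minimal sketch of how such a golden-dataset harness can be wired up in plain Python. The judge_with_llm() stand-in and the 0.8 pass threshold are assumptions; in practice you would swap in a real LLM-as-a-judge call.

```python
import json
from pathlib import Path

def judge_with_llm(question: str, expected: str, actual: str) -> float:
    """Placeholder judge: swap in a real LLM-as-a-judge call here.

    This naive stand-in scores keyword overlap so the harness runs end to end.
    """
    expected_terms = set(expected.lower().split())
    actual_terms = set(actual.lower().split())
    return len(expected_terms & actual_terms) / max(len(expected_terms), 1)

def run_golden_eval(golden_path: str, answer_fn, pass_threshold: float = 0.8) -> dict:
    """Run every golden Q&A pair through the system and score it."""
    cases = json.loads(Path(golden_path).read_text())  # [{"question": ..., "expected": ...}, ...]
    failures = []
    for case in cases:
        actual = answer_fn(case["question"])            # your RAG / agent pipeline
        score = judge_with_llm(case["question"], case["expected"], actual)
        if score < pass_threshold:
            failures.append({"question": case["question"], "score": score, "actual": actual})
    return {"total": len(cases), "failed": len(failures), "failures": failures}
```

Run it nightly and on every prompt or document change, and store the results so regressions show up as a diff rather than a production incident.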
✅ 3. Retrieval Tuning (RAG Optimization)
Poor retrieval = hallucination.
What to tune
Chunking strategy
Embedding model
Similarity metric (cosine vs dot product)
Number of retrieved chunks (k)
Metadata filters
Min similarity threshold
Query rewriting
Best practices I teach
Use hybrid chunking (structural + semantic)
Store metadata → section, policy version, expiry date
Use reranking models to select the most relevant chunks
Continuously run retrieval evals: Recall@k, MRR, precision
Example
If the lending policy changes, we re-embed only the affected chunks and validate retrieval via automated Recall@5 metrics.
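The retrieval metrics mentioned above are simple to compute yourself. Here is a small, self-contained sketch of Recall@k and MRR over a labelled retrieval eval set; the document IDs, query, and k=5 cutoff are illustrative.

```python
def recall_at_k(retrieved_ids: list[str], relevant_ids: set[str], k: int = 5) -> float:
    """Fraction of relevant chunks that appear in the top-k retrieved results."""
    if not relevant_ids:
        return 0.0
    hits = sum(1 for doc_id in retrieved_ids[:k] if doc_id in relevant_ids)
    return hits / len(relevant_ids)

def mrr(retrieved_ids: list[str], relevant_ids: set[str]) -> float:
    """Reciprocal rank of the first relevant chunk (0 if none is retrieved)."""
    for rank, doc_id in enumerate(retrieved_ids, start=1):
        if doc_id in relevant_ids:
            return 1.0 / rank
    return 0.0

# Example: validate retrieval after re-embedding changed policy chunks
eval_set = [
    {"query": "Minimum CIBIL score for personal loans?",
     "relevant": {"policy_v3.2#sec4.1"},
     "retrieved": ["policy_v3.2#sec4.1", "policy_v3.2#sec2.0", "faq#12"]},
]
avg_recall = sum(recall_at_k(c["retrieved"], c["relevant"], k=5) for c in eval_set) / len(eval_set)
avg_mrr = sum(mrr(c["retrieved"], c["relevant"]) for c in eval_set) / len(eval_set)
print(f"Recall@5={avg_recall:.2f}  MRR={avg_mrr:.2f}")
```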
✅ 4. Logging & Observability
You cannot operate GenAI systems blindly.
What to log
Per-request logs
Prompt
Model config
Tokens in/out
Latency
Final answer
RAG logs
Retrieved documents
Score
Metadata
Agent/Tool logs
Tool calls
Inputs & outputs
Errors / retries
Business metrics
% automation
Failures → escalated to human
Distribution of query types
Why enterprises need this
Audit compliance (banking, healthcare)
Debugging unexpected decisions
SLA measurement
Model performance drift
What I teach teams
“Every LLM request should be fully reconstructible end-to-end.”
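A minimal sketch of a structured per-request log record that makes a request reconstructible. Field names are illustrative; the JSON lines would normally flow into whatever logging and observability stack you already run.

```python
import json
import logging
import time
import uuid

logger = logging.getLogger("genai.requests")
logging.basicConfig(level=logging.INFO)

def log_llm_request(prompt_version: str, model_config: dict, prompt: str,
                    retrieved: list[dict], answer: str,
                    tokens_in: int, tokens_out: int, started_at: float) -> None:
    """Emit one JSON log line containing everything needed to replay the request."""
    record = {
        "request_id": str(uuid.uuid4()),
        "prompt_version": prompt_version,    # ties back to the prompt registry
        "model_config": model_config,        # model name, temperature, etc.
        "prompt": prompt,
        "retrieved_chunks": retrieved,       # id, score, metadata per chunk
        "answer": answer,
        "tokens": {"in": tokens_in, "out": tokens_out},
        "latency_ms": round((time.time() - started_at) * 1000, 1),
    }
    logger.info(json.dumps(record))
```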
✅ 5. Testing Framework
Testing = reliability.
Types of tests
✔ Prompt Unit Tests
Check expected structure & tone.
✔ RAG Regression Tests
Ensure retrieval is not broken after re-indexing.
✔ Tool Integration Tests
Validate the agent's tool calls and their inputs/outputs.
✔ Safety Tests
Red-teaming
PII leakage tests
Jailbreak attempts
✔ Load Tests
Check concurrency, throughput, cost.
Best practices
Build an evaluation harness (Python + YAML test cases)
Automate nightly AI tests
Enforce a quality gate → a release cannot be deployed if AI tests fail
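A minimal sketch of the "Python + YAML" idea: test cases live in YAML, a small runner executes them, and CI fails the build when any case misses its expectation. PyYAML is assumed, and the case schema (name, question, must_contain) is illustrative.

```python
import sys
import yaml  # pip install pyyaml

# cases.yaml (illustrative):
#   cases:
#     - name: kyc_mismatch
#       question: "Can a loan be approved if KYC documents do not match?"
#       must_contain: "cannot be approved"

def load_cases(path: str) -> list[dict]:
    with open(path) as f:
        return yaml.safe_load(f)["cases"]

def run_quality_gate(cases_path: str, answer_fn) -> None:
    """Run every YAML case through the pipeline; exit non-zero on any failure."""
    failures = []
    for case in load_cases(cases_path):
        answer = answer_fn(case["question"])
        if case["must_contain"].lower() not in answer.lower():
            failures.append(case["name"])
    if failures:
        print(f"Quality gate FAILED: {failures}")
        sys.exit(1)   # non-zero exit blocks the deployment pipeline
    print("Quality gate passed.")
```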
“In my role, I ensure engineering teams not only build GenAI features but build them responsibly — with prompt versioning, retrieval tuning, evaluation pipelines, advanced observability, and structured AI testing. This reduces hallucinations, prevents regressions, ensures compliance, and brings predictability to LLM-driven systems — exactly what enterprises need to operationalize AI at scale.”
✅ Top AI Testing Frameworks (Industry Standard)
1️⃣ LangSmith (from LangChain) — Most popular for LLM evals
✔ Prompt testing
✔ RAG evaluation
✔ Agent tracing
✔ Regression testing
✔ Dataset-driven evaluations
✔ LLM-as-a-judge scoring
✔ Observability + debugging
When to use:
You want a complete suite for prompt testing + multi-agent traces
You are using LangChain or building multi-agent systems
2️⃣ TruLens (from TruEra) — Enterprise-grade RAG evaluation
✔ Faithfulness score (hallucination detection)
✔ Context relevance score
✔ Groundedness testing
✔ Drift detection
✔ RAG trace analysis
✔ Human feedback integration
When to use:
You need explainability + quality metrics (grounded, relevant, safe)
Ideal for BFSI (due to compliance maturity)
3️⃣ DeepEval — Testing-first framework for LLM unit tests
✔ YAML test definitions
✔ LLM unit tests
✔ End-to-end tests
✔ LLM-as-a-judge scoring
✔ Built-in metrics: factuality, relevance, toxicity
When to use:
You want CI/CD integration for prompts
You want pytest-style testing for LLM apps
Lightweight + easy to integrate
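For illustration, a pytest-style DeepEval test might look roughly like the sketch below. It follows DeepEval's documented LLMTestCase / assert_test pattern, but verify the exact names against the version you install; my_rag_pipeline is a placeholder for your own application code.

```python
# pip install deepeval  (API names per DeepEval docs; verify against your installed version)
from deepeval import assert_test
from deepeval.metrics import AnswerRelevancyMetric
from deepeval.test_case import LLMTestCase

def my_rag_pipeline(question: str) -> str:
    """Placeholder for your real RAG / agent pipeline."""
    return "A minimum monthly income of ₹25,000 is required."

def test_loan_eligibility_answer():
    test_case = LLMTestCase(
        input="What is the minimum income for a personal loan?",
        actual_output=my_rag_pipeline("What is the minimum income for a personal loan?"),
        retrieval_context=["Personal loans require a minimum monthly income of ₹25,000."],
    )
    # Fails the pytest run if the relevancy score drops below the threshold
    assert_test(test_case, [AnswerRelevancyMetric(threshold=0.7)])
```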
4️⃣ Ragas (by ExplodingGradients) — RAG-specific evaluation
✔ Computes advanced RAG metrics:
Context precision / recall
Faithfulness
Answer correctness
Hallucination score
When to use:
You want deep insights into retrieval quality
You are tuning chunking/embedding/similarity thresholds
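A rough sketch of what a Ragas evaluation run looks like, based on the documented evaluate() + datasets.Dataset pattern. Metric and column names differ between Ragas versions, so treat this as an assumption to verify rather than a drop-in recipe.

```python
# pip install ragas datasets  (column and metric names follow the Ragas docs; verify for your version)
from datasets import Dataset
from ragas import evaluate
from ragas.metrics import answer_correctness, context_precision, context_recall, faithfulness

eval_data = Dataset.from_dict({
    "question": ["What is the minimum CIBIL score for a home loan?"],
    "answer": ["The policy requires a minimum CIBIL score of 750."],        # system output
    "contexts": [["Home loans require a CIBIL score of at least 750."]],    # retrieved chunks
    "ground_truth": ["A minimum CIBIL score of 750 is required."],          # human-approved answer
})

result = evaluate(eval_data, metrics=[faithfulness, context_precision,
                                      context_recall, answer_correctness])
print(result)  # per-metric scores you can track run-over-run
```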
5️⃣ PromptFoo — Great for prompt versioning + regression testing
✔ A/B testing of prompts
✔ CLI runner
✔ Dataset-driven eval
✔ GitHub CI/CD integration
✔ Prompt version diff
✔ Multi-model comparison
When to use:
You want version-controlled prompt testing
You want to compare prompts across providers: OpenAI, Gemini, Llama, Azure
6️⃣ OpenAI Evals — Baseline evaluation harness
✔ Model performance tests
✔ Scoring with GPT
✔ Regression testing
✔ Dataset-based evaluation
When to use:
You want basic evaluation tied directly to OpenAI models
7️⃣ Weights & Biases (W&B) — LLMOps, experiment tracking
✔ Experiment tracking
✔ Benchmarks
✔ Dataset management
✔ Model comparisons
✔ Model drift detection
When to use:
You need ML + LLM tracking in one place
You want dashboards for leadership
8️⃣ MLflow + Custom LLM Testing Pipelines
✔ Prompt versioning
✔ Model registry
✔ Metadata logging
✔ Experiment tracking
When to use:
You have existing MLOps and want LLMOps inside that stack
You want full control / self-hosting
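If MLflow is already your MLOps backbone, prompt versions and eval results can ride on the same tracking APIs. A minimal sketch using standard mlflow.set_experiment / log_param / log_metric / log_text calls; the experiment name, versions, and metric values are illustrative.

```python
import mlflow  # pip install mlflow

mlflow.set_experiment("lending-policy-rag")

with mlflow.start_run(run_name="loan_policy_v3.2-eval"):
    # Version metadata: which prompt / index / model combination was evaluated
    mlflow.log_param("prompt_version", "loan_policy_v3.2")
    mlflow.log_param("vector_store_version", "lending_docs_2025_11")
    mlflow.log_param("embedding_model", "text-embedding-3-large")

    # Results from the nightly eval harness (illustrative numbers)
    mlflow.log_metric("recall_at_5", 0.91)
    mlflow.log_metric("faithfulness", 0.88)
    mlflow.log_metric("pct_automated", 0.76)

    # Keep the prompt text itself alongside the run for auditability
    mlflow.log_text("You are a lending-policy assistant ...", "prompts/loan_policy_v3.2.txt")
```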
🎯 Summary Table — Best Tool for Each Area
| Area | Best Tool(s) | Why |
| --- | --- | --- |
| Prompt Regression Testing | PromptFoo, LangSmith | A/B tests, version control |
| RAG Evaluation | Ragas, TruLens | Deep retrieval scoring + groundedness |
| End-to-End LLM Testing | DeepEval, LangSmith | Unit tests + scenario tests |
| Enterprise Observability & Tracing | LangSmith, W&B | Traces, metrics, debugging |
| Drift Detection | TruLens, W&B | Explainability + drift |
| CI/CD Integration | PromptFoo, DeepEval | Lightweight, YAML-based |
| Audit Compliance (Banking/Insurance) | TruLens | Explainable metrics |
For BFSI / large-scale AI projects, the best combination is:
LangSmith + TruLens + Ragas + PromptFoo
Why:
LangSmith → Multi-agent traces + prompt testing
Ragas → Retrieval quality (Recall@k, precision, context relevance)
TruLens → Hallucination & groundedness
PromptFoo → Prompt versioning & CI/CD testing
This gives you a complete AI testing ecosystem.