DevOps pipeline for Spring AI
- Anand Nerurkar
- Nov 26
1️⃣ Core Principle (Very Important for Interview)
For Java-based Spring AI systems:
Spring AI → serves production traffic
PromptFlow / DeepEval / RAGAS → run as external evaluation workers
CI/CD orchestrates them as quality gates
No Python code runs inside the Java microservice
Think of these tools as:
“Post-deployment quality scanners for GenAI, not runtime dependencies.”
2️⃣ Where These Tools Sit in the Architecture
Git Push
   |
   v
CI Pipeline
   |
   +--> Build & Test Java (JUnit, Integration Tests)
   |
   +--> Deploy to AI-Dev Namespace (AKS)
   |
   +--> Run AI Evaluation Job (Python Pod)
   |       |
   |       +--> Calls Spring AI Endpoint
   |       |
   |       +--> Uses PromptFlow / RAGAS / DeepEval
   |       |
   |       +--> Publishes Evaluation Metrics
   |
   +--> Quality Gate (pass/fail)
   |
   +--> Promote to Staging / Prod
✅ Java never imports these libraries
✅ They act like black-box testers against your REST API
3️⃣ Tool-by-Tool Usage in a Java CI/CD Pipeline
✅ 1. PromptFlow (Microsoft)
What it is used for:
End-to-end prompt + RAG flow testing
Variant comparison (prompt v1 vs v2)
Dataset-driven batch evaluation
Azure-native governance
How it is used with Spring AI:
You deploy the Spring AI RAG endpoint to Dev AKS
PromptFlow calls your REST API
It evaluates:
Answer quality
Groundedness
Relevance
Safety
Results go to:
Azure ML Workspace
Blob Storage
DevOps pipeline logs
CI/CD YAML Pattern (Conceptual)
- stage: ai-evaluation
  jobs:
    - job: promptflow_eval
      container: mcr.microsoft.com/azureml/promptflow
      steps:
        - script: |
            pf run create \
              --flow policy_qna \
              --data golden_dataset.json \
              --run-name spring-ai-eval
✅ Gate deployment based on PromptFlow scores
✅ Works perfectly with Spring AI
✅ 2. RAGAS (For RAG Quality)
What it evaluates:
Context Recall
Faithfulness
Answer Relevance
Context Precision
How it integrates with Java:
RAGAS container calls:
POST /api/rag/ask
You store:
Question
Retrieved context
Answer
RAGAS computes metrics externally
Scores are pushed to:
S3/Blob
Or as CI pipeline variables
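Conceptually, the eval job first replays the golden questions against the Spring AI API and collects the question / retrieved-context / answer triples RAGAS needs. A minimal sketch, assuming an endpoint path, JSON field names, and golden-dataset format (all placeholders):

import requests
from datasets import Dataset  # RAGAS evaluates a HuggingFace-style dataset

rows = {"question": [], "answer": [], "contexts": [], "ground_truth": []}

for item in golden_dataset:  # loaded from golden_dataset.json (assumed format)
    resp = requests.post(
        "http://spring-ai-rag-dev/api/rag/ask",       # Spring AI endpoint in Dev AKS (assumed URL)
        json={"question": item["question"]},
        timeout=30,
    ).json()
    rows["question"].append(item["question"])
    rows["answer"].append(resp["answer"])             # assumed response field
    rows["contexts"].append(resp["retrievedChunks"])  # assumed: list of retrieved chunk texts
    rows["ground_truth"].append(item["expected_answer"])

dataset = Dataset.from_dict(rows)  # consumed by the RAGAS snippet below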
Python Eval Job (External)
from ragas import evaluate
from ragas.metrics import faithfulness, answer_relevancy

# "dataset" is the question / retrieved-context / answer set collected above
result = evaluate(
    dataset,
    metrics=[faithfulness, answer_relevancy]
)

if result["faithfulness"] < 0.85:
    raise Exception("Quality gate failed")
✅ Java just exposes the API
✅ Python job validates quality
✅ Pipeline decides promote/rollback
✅ 3. DeepEval (Hallucination & Guardrails)
What it is used for:
Hallucination detection
Toxicity checks
Factual grounding
Regression testing on prompts
How it works with Spring AI:
DeepEval treats your Java service as a black-box API:
from deepeval import evaluate
from deepeval.test_case import LLMTestCase
from deepeval.metrics import HallucinationMetric

# call_spring_ai() is a helper that POSTs the question to the Spring AI REST endpoint
test_case = LLMTestCase(
    input="What is RBI KYC rule?",
    actual_output=call_spring_ai(),
    expected_output="RBI requires..."
)

metric = HallucinationMetric()
evaluate([test_case], [metric])
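Here call_spring_ai() is just a thin HTTP helper against the black-box endpoint; a minimal sketch (URL and response field are assumptions):

import requests

def call_spring_ai(question: str = "What is RBI KYC rule?") -> str:
    # POST the question to the Spring AI RAG endpoint and return the answer text
    resp = requests.post(
        "http://spring-ai-rag-dev/api/rag/ask",  # assumed Dev AKS endpoint
        json={"question": question},
        timeout=30,
    )
    resp.raise_for_status()
    return resp.json()["answer"]  # assumed response field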
✅ If hallucination score crosses threshold → pipeline fails
✅ No Java changes required
4️⃣ Full CI/CD Pipeline for Java + AI Evaluation (Realistic)
✅ Stage 1 — Build & Unit Test (Java)
Maven/Gradle build
JUnit + Mockito
Static analysis (SonarQube)
✅ Stage 2 — Integration Test
Testcontainers for Postgres/pgvector
SpringBootTest + real RAG flow (mock LLM)
✅ Stage 3 — Deploy to Dev AKS
Docker build → ACR
Helm deploy spring-ai-rag-dev
✅ Stage 4 — AI Evaluation Stage (Python)
Spin up ephemeral Python job
Run:
PromptFlow
RAGAS
DeepEval
Call Java endpoints
Compute:
Faithfulness
Relevance
Hallucination rate
Toxicity
Publish scores as pipeline variables
✅ Stage 5 — Quality Gate
If hallucination_rate > 2%
OR faithfulness < 0.85
THEN fail pipeline
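A minimal gate script along these lines (thresholds mirror the rule above; the ##vso line is the Azure DevOps logging command for publishing pipeline variables):

import sys

# scores aggregated from the PromptFlow / RAGAS / DeepEval runs in Stage 4
scores = {"hallucination_rate": 0.01, "faithfulness": 0.92}

# publish each score as a pipeline variable
for name, value in scores.items():
    print(f"##vso[task.setvariable variable={name}]{value}")

# quality gate: a non-zero exit fails the stage, so Stage 6 promotion never runs
if scores["hallucination_rate"] > 0.02 or scores["faithfulness"] < 0.85:
    print("AI quality gate failed")
    sys.exit(1)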
✅ Stage 6 — Promote to Staging / Prod
Only if all AI metrics pass
5️⃣ How This Looks in AKS (Production Pattern)
spring-ai-rag (JAVA POD)
prompt-eval-job (PYTHON POD)
otel-collector
vector-db
Python eval pods are:
Short-lived
Run only on pipeline trigger
Java pods are:
Long-running production services
This separation is mandatory in regulated BFSI environments.
6️⃣ What You Should Say in Interview (Perfect Answer)
You can safely say:
“Although we use Spring AI in Java for production runtime, for LLM quality evaluation we use Python-based tools like PromptFlow, RAGAS, and DeepEval as external CI/CD evaluation jobs. These tools call the Spring AI endpoints as black-box APIs, compute metrics like faithfulness, context recall, and hallucination rate using golden datasets, and then act as quality gates before promoting builds to staging or production. The Java runtime remains clean and vendor-neutral, while AI evaluation evolves independently.”
This answer shows:
✅ Platform maturity
✅ Regulatory alignment
✅ Proper separation of concerns
✅ Real production experience
7️⃣ When Each Tool Is Best (Quick Matrix)
| Tool | Best For | Use in Java Runtime? |
| --- | --- | --- |
| PromptFlow | End-to-end flow testing & A/B | ❌ External job |
| RAGAS | RAG accuracy & grounding | ❌ External job |
| DeepEval | Hallucination / safety | ❌ External job |
| JUnit | Java logic testing | ✅ Inside runtime |
| OpenTelemetry | Runtime observability | ✅ Inside runtime |
✅ Final Summary
“In a Java-based Spring AI system, PromptFlow, RAGAS, and DeepEval are never embedded in the runtime; they are executed as independent Python evaluation jobs in the CI/CD pipeline that call the Spring AI APIs and act as automated quality gates before production promotion.”
Python Ecosystem
====
✅ End-to-End GenAI Lifecycle — Python Ecosystem (LangChain / LangGraph)
I’ll walk through it in the exact enterprise order:
Requirements
Design
Build
Test (AI + software)
CI/CD
Deploy on AKS
Monitor & Observe
Evaluate & Optimize
Use case example: Policy / Regulatory Q&A using RAG.
1️⃣ Requirements Phase (AI-Native)
Business
Use case: Policy Q&A, Loan underwriting, Fraud agent, etc.
Target users: Ops, Compliance, Customers
KPIs: Accuracy > 90%, Hallucination < 2%, Latency < 800ms
Data
PDFs, DOCs, Scans → OCR needed
Data sensitivity: PII, PCI, Regulatory docs
Governance
Human-in-the-loop
Full audit logging
Explainability (retrieved context + answer)
✅ Deliverables
Use case definition
Risk register
Golden dataset (Q → Expected A)
RAG design decision (chunking, embedding, vector DB)
2️⃣ Design Phase (Python Architecture)
Client
↓
Azure APIM
↓
FastAPI (Python)
↓
LangChain / LangGraph
↓
Retriever (pgvector / Azure Search)
↓
Azure OpenAI / Local LLM
Core Services
api-service → FastAPI
agent-service → LangChain/LangGraph
ingestion-service → OCR + Chunk + Embed
evaluation-service → PromptFlow/RAGAS/DeepEval (offline)
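To make the split concrete, here is a minimal api-service sketch using LangChain directly (deployment names, connection string, and response shape are assumptions; the agent-service would add LangGraph orchestration on top of this):

from fastapi import FastAPI
from pydantic import BaseModel
from langchain_openai import AzureChatOpenAI, AzureOpenAIEmbeddings
from langchain_community.vectorstores import PGVector

app = FastAPI()

class AskRequest(BaseModel):
    question: str

# long-lived clients created at startup; endpoints/keys come from env vars backed by Key Vault
embeddings = AzureOpenAIEmbeddings(azure_deployment="text-embedding-3-large")
store = PGVector(
    connection_string="postgresql+psycopg2://user:pass@pgvector:5432/rag",
    collection_name="policy_docs",
    embedding_function=embeddings,
)
llm = AzureChatOpenAI(azure_deployment="gpt-4o")

@app.post("/api/rag/ask")
def ask(req: AskRequest):
    # retrieve top-k chunks, then ground the answer in them
    docs = store.similarity_search(req.question, k=4)
    context = "\n\n".join(d.page_content for d in docs)
    answer = llm.invoke(
        f"Answer using only this context:\n{context}\n\nQuestion: {req.question}"
    )
    return {"answer": answer.content, "contexts": [d.page_content for d in docs]}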
Security
Azure AD / OAuth
Key Vault for secrets
Private endpoints for LLM & DB
3️⃣ Build Phase (Python Toolchain)
Runtime Stack
FastAPI – REST API
LangChain / LangGraph – orchestration
Pydantic – request validation
SQLAlchemy – metadata DB
Redis – cache
pgvector / Azure Search – vector store
Prompt Management
Prompts as code in Git
Versioning: prompt_v1.md, prompt_v2.md
Optional prompt registry in Postgres
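For example, a tiny prompts-as-code loader (directory layout and the active-version constant are assumptions):

from pathlib import Path

PROMPT_DIR = Path("prompts")            # prompt_v1.md, prompt_v2.md, ... versioned in Git
ACTIVE_PROMPT_VERSION = "prompt_v2.md"  # bumped via a normal pull request / review

def load_prompt(version: str = ACTIVE_PROMPT_VERSION) -> str:
    # prompts are plain files, so every change is diffable and auditable
    return (PROMPT_DIR / version).read_text(encoding="utf-8")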
Ingestion Pipeline
Azure Form Recognizer / Tesseract OCR
Text splitter (LangChain)
Embedding model
Vector DB upsert
Metadata tagging
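A minimal ingestion sketch, assuming langchain-community's PGVector store and Azure OpenAI embeddings (the OCR helper, file name, and connection string are placeholders):

from langchain_text_splitters import RecursiveCharacterTextSplitter
from langchain_openai import AzureOpenAIEmbeddings
from langchain_community.vectorstores import PGVector

# OCR output (Form Recognizer / Tesseract) arrives here as plain text
ocr_text = load_ocr_text("kyc_master_direction.pdf")  # hypothetical OCR helper

# 1. Chunk
splitter = RecursiveCharacterTextSplitter(chunk_size=800, chunk_overlap=100)
docs = splitter.create_documents(
    [ocr_text],
    metadatas=[{"source": "kyc_master_direction.pdf"}],  # metadata tagging
)

# 2. Embed + 3. Upsert into pgvector
embeddings = AzureOpenAIEmbeddings(azure_deployment="text-embedding-3-large")
PGVector.from_documents(
    documents=docs,
    embedding=embeddings,
    collection_name="policy_docs",
    connection_string="postgresql+psycopg2://user:pass@pgvector:5432/rag",  # from Key Vault in practice
)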
4️⃣ Testing Phase (Python-Specific)
This is where PromptFlow, RAGAS, DeepEval come in.
✅ A. Traditional Software Testing
Unit tests: pytest, pytest-mock
API tests: FastAPI TestClient
Integration tests: testcontainers-python (pgvector, redis)
Contract tests: schemathesis or pact
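For example, a minimal pytest + TestClient check (the app module path and response shape are assumptions; the LLM call is mocked at this layer):

from fastapi.testclient import TestClient
from app.main import app  # hypothetical FastAPI app module

client = TestClient(app)

def test_rag_ask_returns_answer_and_context():
    # asserts the API contract only; retrieval/LLM dependencies are mocked or stubbed
    resp = client.post("/api/rag/ask", json={"question": "What is the KYC policy?"})
    assert resp.status_code == 200
    body = resp.json()
    assert "answer" in body and "contexts" in body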
✅ B. Prompt & LLM Evaluation Tools (Key Part)
These tools do NOT replace pytest. They test AI quality, not code.
🔹 1. PromptFlow (End-to-End Flow Testing)
Used for:
Full RAG flow validation
Prompt comparison (v1 vs v2)
Batch testing using datasets
A/B testing
How it works:
PromptFlow calls your FastAPI RAG endpoint
Uses datasets of questions
Evaluates output against metrics
When used:
Pre-prod validation
Model/prompt A/B testing
🔹 2. RAGAS (RAG Quality Metrics)
Used for:
Context Recall
Faithfulness
Answer Relevance
Context Precision
How it works:
You log:
Question
Retrieved chunks
Final answer
RAGAS calculates RAG accuracy scores
When used:
Nightly batch evaluation
Regression on retriever + chunking changes
🔹 3. DeepEval (Safety & Hallucination)
Used for:
Hallucination detection
Toxicity/safety checks
Prompt regression
When used:
CI quality gates
Guardrail validation before prod
5️⃣ CI/CD Pipeline (Python + GenAI)
✅ Typical Pipeline Stages
Code Quality
ruff / flake8 / black
Unit Tests
pytest
Build Image
Docker → ACR
Deploy to Dev AKS
Helm / ArgoCD
AI Evaluation Stage (Python Jobs)
PromptFlow
RAGAS
DeepEval
Quality Gate
Fail if hallucination > threshold
Promote to Staging/Prod
✅ How the AI Evaluation Stage Works
These tools run as short-lived Python jobs:
CI Pipeline
↓
Spin up Python Eval Pod
↓
Call FastAPI RAG Endpoints
↓
Run PromptFlow / RAGAS / DeepEval
↓
Export Scores
↓
Gate Deployment
They never run inside the production FastAPI container.
6️⃣ Deployment on AKS (Python)
Runtime
Dockerized FastAPI + LangChain
Gunicorn + Uvicorn workers
Horizontal Pod Autoscaling (HPA)
Secrets
Azure Key Vault → Kubernetes Secrets
No hardcoded API keys
Networking
Private AKS
Ingress + WAF
Azure APIM at edge
7️⃣ Observability & Monitoring (Python)
✅ Infrastructure Observability
OpenTelemetry
Prometheus
Grafana
Azure Application Insights
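As a sketch, auto-instrumenting the FastAPI app with OpenTelemetry (the collector endpoint is an assumption; the collector then fans out to Prometheus / App Insights):

from fastapi import FastAPI
from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor
from opentelemetry.exporter.otlp.proto.grpc.trace_exporter import OTLPSpanExporter
from opentelemetry.instrumentation.fastapi import FastAPIInstrumentor

app = FastAPI()  # or import the existing app instance

# send spans to the in-cluster otel-collector
provider = TracerProvider()
provider.add_span_processor(
    BatchSpanProcessor(OTLPSpanExporter(endpoint="http://otel-collector:4317", insecure=True))
)
trace.set_tracer_provider(provider)

# every request through the FastAPI app now emits traces
FastAPIInstrumentor.instrument_app(app)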
✅ LLM Observability (Optional but Powerful)
LangSmith
Prompt traces
Token usage
Agent graphs
Tool calls
✅ Security Monitoring
Audit logs (who asked what)
Prompt & response retention
PII masking logs
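A simple illustration of PII masking before audit-log writes (the patterns shown are examples, not an exhaustive BFSI list):

import re

PII_PATTERNS = {
    "pan": re.compile(r"\b[A-Z]{5}[0-9]{4}[A-Z]\b"),      # PAN-like identifiers
    "aadhaar": re.compile(r"\b\d{4}\s?\d{4}\s?\d{4}\b"),  # Aadhaar-like 12-digit numbers
}

def mask_pii(text: str) -> str:
    # mask known PII patterns before prompts/responses are written to audit logs
    for name, pattern in PII_PATTERNS.items():
        text = pattern.sub(f"<{name}-masked>", text)
    return text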
8️⃣ Optimization & Continuous Improvement
RAG Optimization
Chunk size tuning
Re-embedding strategy
Reranker tuning
Model Optimization
Model routing (cheap vs premium)
Temperature control
Context window optimization
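A simplistic illustration of model routing (thresholds and deployment names are assumptions, not recommendations):

def pick_model(question: str, context_tokens: int) -> str:
    # route short, low-context lookups to the cheap model; everything else to the premium one
    if context_tokens < 2000 and len(question.split()) < 30:
        return "gpt-4o-mini"  # assumed "cheap" deployment
    return "gpt-4o"           # assumed "premium" deployment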
Business Optimization
Token cost tracking
SLA-based scaling
Human feedback loop
✅ How PromptFlow, RAGAS, DeepEval Fit Together (Python)
| Tool | Purpose | When Used |
| --- | --- | --- |
| PromptFlow | End-to-end flow testing, A/B | Pre-prod, UAT |
| RAGAS | RAG retrieval quality | Nightly batch |
| DeepEval | Hallucination & safety | CI quality gate |
| pytest | Code correctness | Every commit |
✅ What You Should Say in Interview (Python Stack)
“In our Python GenAI stack, we use FastAPI with LangChain and LangGraph, containerized and deployed on AKS. Traditional testing is handled through pytest and Testcontainers. For AI-specific validation, we use PromptFlow for end-to-end RAG flow testing and prompt A/B comparison, RAGAS for retrieval quality metrics like faithfulness and context recall, and DeepEval for hallucination and safety validation. These tools run as independent Python evaluation jobs inside the CI/CD pipeline, act as automated quality gates, and never run inside the production runtime. Observability is handled via OpenTelemetry and Azure Application Insights, with optional LangSmith for deep LLM tracing.”
✅ Key Distinction to Call Out
“Software tests validate code correctness, but PromptFlow, RAGAS, and DeepEval validate model and prompt behavior. We treat them as first-class CI quality gates just like unit tests.”
This shows true AI-native engineering maturity.
✅ Ultra-Short Cheat Sheet
Dev: FastAPI + LangChain + LangGraph
Test: pytest + PromptFlow + RAGAS + DeepEval
CI/CD: GitHub Actions / Azure DevOps
Deploy: Docker → ACR → AKS
Observe: OpenTelemetry + AppInsights + (optional LangSmith)
Optimize: prompt tuning + retriever tuning + model routing