DevOps pipeline for Spring AI
- Anand Nerurkar
- Nov 26
1️⃣ Core Principle (Very Important for Interview)
For Java-based Spring AI systems:
Spring AI → serves production traffic
PromptFlow / DeepEval / RAGAS → run as external evaluation workers
CI/CD orchestrates them as quality gates
No Python code runs inside the Java microservice
Think of these tools as:
“Post-deployment quality scanners for GenAI, not runtime dependencies.”
2️⃣ Where These Tools Sit in the Architecture
Git Push
   |
   v
CI Pipeline
   |
   +--> Build & Test Java (JUnit, Integration Tests)
   |
   +--> Deploy to AI-Dev Namespace (AKS)
   |
   +--> Run AI Evaluation Job (Python Pod)
   |       |
   |       +--> Calls Spring AI Endpoint
   |       |
   |       +--> Uses PromptFlow / RAGAS / DeepEval
   |       |
   |       +--> Publishes Evaluation Metrics
   |
   +--> Quality Gate (pass/fail)
   |
   +--> Promote to Staging / Prod
✅ Java never imports these libraries
✅ They act like black-box testers against your REST API
3️⃣ Tool-by-Tool Usage in a Java CI/CD Pipeline
✅ 1. PromptFlow (Microsoft)
What it is used for:
End-to-end prompt + RAG flow testing
Variant comparison (prompt v1 vs v2)
Dataset-driven batch evaluation
Azure-native governance
How it is used with Spring AI:
You deploy the Spring AI RAG endpoint to Dev AKS
PromptFlow calls your REST API
It evaluates:
Answer quality
Groundedness
Relevance
Safety
Results go to:
Azure ML Workspace
Blob Storage
DevOps pipeline logs
CI/CD YAML Pattern (Conceptual)
- stage: ai-evaluation
  jobs:
    - job: promptflow_eval
      container: mcr.microsoft.com/azureml/promptflow
      steps:
        - script: |
            pf run create \
              --flow policy_qna \
              --data golden_dataset.json \
              --run-name spring-ai-eval
✅ Gate deployment based on PromptFlow scores
✅ Works perfectly with Spring AI
✅ 2. RAGAS (For RAG Quality)
What it evaluates:
Context Recall
Faithfulness
Answer Relevance
Context Precision
How it integrates with Java:
RAGAS container calls:
POST /api/rag/ask
You store:
Question
Retrieved context
Answer
RAGAS computes metrics externally
Scores are pushed to:
S3/Blob
Or as CI pipeline variables
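Conceptually, the eval job first replays the golden questions against the Spring AI API and collects the question / retrieved-context / answer triples RAGAS needs. A minimal sketch, assuming an endpoint path, JSON field names, and golden-dataset format (all placeholders):

import requests
from datasets import Dataset  # RAGAS evaluates a HuggingFace-style dataset

rows = {"question": [], "answer": [], "contexts": [], "ground_truth": []}

for item in golden_dataset:  # loaded from golden_dataset.json (assumed format)
    resp = requests.post(
        "http://spring-ai-rag-dev/api/rag/ask",       # Spring AI endpoint in Dev AKS (assumed URL)
        json={"question": item["question"]},
        timeout=30,
    ).json()
    rows["question"].append(item["question"])
    rows["answer"].append(resp["answer"])             # assumed response field
    rows["contexts"].append(resp["retrievedChunks"])  # assumed: list of retrieved chunk texts
    rows["ground_truth"].append(item["expected_answer"])

dataset = Dataset.from_dict(rows)  # consumed by the RAGAS snippet below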
Python Eval Job (External)
from ragas import evaluate
from ragas.metrics import faithfulness, answer_relevancy

# "dataset" is the question / retrieved-context / answer set collected above
result = evaluate(
    dataset,
    metrics=[faithfulness, answer_relevancy]
)

if result["faithfulness"] < 0.85:
    raise Exception("Quality gate failed")
✅ Java just exposes the API
✅ Python job validates quality
✅ Pipeline decides promote/rollback
✅ 3. DeepEval (Hallucination & Guardrails)
What it is used for:
Hallucination detection
Toxicity checks
Factual grounding
Regression testing on prompts
How it works with Spring AI:
DeepEval treats your Java service as a black-box API:
from deepeval import evaluate
from deepeval.test_case import LLMTestCase
from deepeval.metrics import HallucinationMetric

# call_spring_ai() is a helper that POSTs the question to the Spring AI REST endpoint
test_case = LLMTestCase(
    input="What is RBI KYC rule?",
    actual_output=call_spring_ai(),
    expected_output="RBI requires..."
)

metric = HallucinationMetric()
evaluate([test_case], [metric])
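Here call_spring_ai() is just a thin HTTP helper against the black-box endpoint; a minimal sketch (URL and response field are assumptions):

import requests

def call_spring_ai(question: str = "What is RBI KYC rule?") -> str:
    # POST the question to the Spring AI RAG endpoint and return the answer text
    resp = requests.post(
        "http://spring-ai-rag-dev/api/rag/ask",  # assumed Dev AKS endpoint
        json={"question": question},
        timeout=30,
    )
    resp.raise_for_status()
    return resp.json()["answer"]  # assumed response field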
✅ If hallucination score crosses threshold → pipeline fails
✅ No Java changes required
4️⃣ Full CI/CD Pipeline for Java + AI Evaluation (Realistic)
✅ Stage 1 — Build & Unit Test (Java)
Maven/Gradle build
JUnit + Mockito
Static analysis (SonarQube)
✅ Stage 2 — Integration Test
Testcontainers for Postgres/pgvector
SpringBootTest + real RAG flow (mock LLM)
✅ Stage 3 — Deploy to Dev AKS
Docker build → ACR
Helm deploy spring-ai-rag-dev
✅ Stage 4 — AI Evaluation Stage (Python)
Spin up ephemeral Python job
Run:
PromptFlow
RAGAS
DeepEval
Call Java endpoints
Compute:
Faithfulness
Relevance
Hallucination rate
Toxicity
Publish scores as pipeline variables
✅ Stage 5 — Quality Gate
If hallucination_rate > 2%
OR faithfulness < 0.85
THEN fail pipeline
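A minimal gate script along these lines (thresholds mirror the rule above; the ##vso line is the Azure DevOps logging command for publishing pipeline variables):

import sys

# scores aggregated from the PromptFlow / RAGAS / DeepEval runs in Stage 4
scores = {"hallucination_rate": 0.01, "faithfulness": 0.92}

# publish each score as a pipeline variable
for name, value in scores.items():
    print(f"##vso[task.setvariable variable={name}]{value}")

# quality gate: a non-zero exit fails the stage, so Stage 6 promotion never runs
if scores["hallucination_rate"] > 0.02 or scores["faithfulness"] < 0.85:
    print("AI quality gate failed")
    sys.exit(1)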
✅ Stage 6 — Promote to Staging / Prod
Only if all AI metrics pass
5️⃣ How This Looks in AKS (Production Pattern)
spring-ai-rag (JAVA POD)
prompt-eval-job (PYTHON POD)
otel-collector
vector-db
Python eval pods are:
Short-lived
Run only on pipeline trigger
Java pods are:
Long-running production services
This separation is mandatory in regulated BFSI environments.
6️⃣ What You Should Say in Interview (Perfect Answer)
You can safely say:
“Although we use Spring AI in Java for production runtime, for LLM quality evaluation we use Python-based tools like PromptFlow, RAGAS, and DeepEval as external CI/CD evaluation jobs. These tools call the Spring AI endpoints as black-box APIs, compute metrics like faithfulness, context recall, and hallucination rate using golden datasets, and then act as quality gates before promoting builds to staging or production. The Java runtime remains clean and vendor-neutral, while AI evaluation evolves independently.”
This answer shows:
✅ Platform maturity
✅ Regulatory alignment
✅ Proper separation of concerns
✅ Real production experience
7️⃣ When Each Tool Is Best (Quick Matrix)
| Tool | Best For | Use in Java Runtime? |
| --- | --- | --- |
| PromptFlow | End-to-end flow testing & A/B | ❌ External job |
| RAGAS | RAG accuracy & grounding | ❌ External job |
| DeepEval | Hallucination / safety | ❌ External job |
| JUnit | Java logic testing | ✅ Inside runtime |
| OpenTelemetry | Runtime observability | ✅ Inside runtime |
✅ Final Summary
“In a Java-based Spring AI system, PromptFlow, RAGAS, and DeepEval are never embedded in the runtime; they are executed as independent Python evaluation jobs in the CI/CD pipeline that call the Spring AI APIs and act as automated quality gates before production promotion.”
Python Ecosystem
====
✅ End-to-End GenAI Lifecycle — Python Ecosystem (LangChain / LangGraph)
I’ll walk through it in the exact enterprise order:
Requirements
Design
Build
Test (AI + software)
CI/CD
Deploy on AKS
Monitor & Observe
Evaluate & Optimize
Use case example: Policy / Regulatory Q&A using RAG.
1️⃣ Requirements Phase (AI-Native)
Business
Use case: Policy Q&A, Loan underwriting, Fraud agent, etc.
Target users: Ops, Compliance, Customers
KPIs: Accuracy > 90%, Hallucination < 2%, Latency < 800ms
Data
PDFs, DOCs, Scans → OCR needed
Data sensitivity: PII, PCI, Regulatory docs
Governance
Human-in-the-loop
Full audit logging
Explainability (retrieved context + answer)
✅ Deliverables
Use case definition
Risk register
Golden dataset (Q → Expected A)
RAG design decision (chunking, embedding, vector DB)
2️⃣ Design Phase (Python Architecture)
Client
↓
Azure APIM
↓
FastAPI (Python)
↓
LangChain / LangGraph
↓
Retriever (pgvector / Azure Search)
↓
Azure OpenAI / Local LLM
Core Services
api-service → FastAPI
agent-service → LangChain/LangGraph
ingestion-service → OCR + Chunk + Embed
evaluation-service → PromptFlow/RAGAS/DeepEval (offline)
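To make the split concrete, here is a minimal api-service sketch using LangChain directly (deployment names, connection string, and response shape are assumptions; the agent-service would add LangGraph orchestration on top of this):

from fastapi import FastAPI
from pydantic import BaseModel
from langchain_openai import AzureChatOpenAI, AzureOpenAIEmbeddings
from langchain_community.vectorstores import PGVector

app = FastAPI()

class AskRequest(BaseModel):
    question: str

# long-lived clients created at startup; endpoints/keys come from env vars backed by Key Vault
embeddings = AzureOpenAIEmbeddings(azure_deployment="text-embedding-3-large")
store = PGVector(
    connection_string="postgresql+psycopg2://user:pass@pgvector:5432/rag",
    collection_name="policy_docs",
    embedding_function=embeddings,
)
llm = AzureChatOpenAI(azure_deployment="gpt-4o")

@app.post("/api/rag/ask")
def ask(req: AskRequest):
    # retrieve top-k chunks, then ground the answer in them
    docs = store.similarity_search(req.question, k=4)
    context = "\n\n".join(d.page_content for d in docs)
    answer = llm.invoke(
        f"Answer using only this context:\n{context}\n\nQuestion: {req.question}"
    )
    return {"answer": answer.content, "contexts": [d.page_content for d in docs]}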
Security
Azure AD / OAuth
Key Vault for secrets
Private endpoints for LLM & DB
3️⃣ Build Phase (Python Toolchain)
Runtime Stack
FastAPI – REST API
LangChain / LangGraph – orchestration
Pydantic – request validation
SQLAlchemy – metadata DB
Redis – cache
pgvector / Azure Search – vector store
Prompt Management
Prompts as code in Git
Versioning: prompt_v1.md, prompt_v2.md
Optional prompt registry in Postgres
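For example, a tiny prompts-as-code loader (directory layout and the active-version constant are assumptions):

from pathlib import Path

PROMPT_DIR = Path("prompts")            # prompt_v1.md, prompt_v2.md, ... versioned in Git
ACTIVE_PROMPT_VERSION = "prompt_v2.md"  # bumped via a normal pull request / review

def load_prompt(version: str = ACTIVE_PROMPT_VERSION) -> str:
    # prompts are plain files, so every change is diffable and auditable
    return (PROMPT_DIR / version).read_text(encoding="utf-8")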
Ingestion Pipeline
Azure Form Recognizer / Tesseract OCR
Text splitter (LangChain)
Embedding model
Vector DB upsert
Metadata tagging
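A minimal ingestion sketch, assuming langchain-community's PGVector store and Azure OpenAI embeddings (the OCR helper, file name, and connection string are placeholders):

from langchain_text_splitters import RecursiveCharacterTextSplitter
from langchain_openai import AzureOpenAIEmbeddings
from langchain_community.vectorstores import PGVector

# OCR output (Form Recognizer / Tesseract) arrives here as plain text
ocr_text = load_ocr_text("kyc_master_direction.pdf")  # hypothetical OCR helper

# 1. Chunk
splitter = RecursiveCharacterTextSplitter(chunk_size=800, chunk_overlap=100)
docs = splitter.create_documents(
    [ocr_text],
    metadatas=[{"source": "kyc_master_direction.pdf"}],  # metadata tagging
)

# 2. Embed + 3. Upsert into pgvector
embeddings = AzureOpenAIEmbeddings(azure_deployment="text-embedding-3-large")
PGVector.from_documents(
    documents=docs,
    embedding=embeddings,
    collection_name="policy_docs",
    connection_string="postgresql+psycopg2://user:pass@pgvector:5432/rag",  # from Key Vault in practice
)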
4️⃣ Testing Phase (Python-Specific)
This is where PromptFlow, RAGAS, DeepEval come in.
✅ A. Traditional Software Testing
Unit tests: pytest, pytest-mock
API tests: FastAPI TestClient
Integration tests: testcontainers-python (pgvector, redis)
Contract tests: schemathesis or pact
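For example, a minimal pytest + TestClient check (the app module path and response shape are assumptions; the LLM call is mocked at this layer):

from fastapi.testclient import TestClient
from app.main import app  # hypothetical FastAPI app module

client = TestClient(app)

def test_rag_ask_returns_answer_and_context():
    # asserts the API contract only; retrieval/LLM dependencies are mocked or stubbed
    resp = client.post("/api/rag/ask", json={"question": "What is the KYC policy?"})
    assert resp.status_code == 200
    body = resp.json()
    assert "answer" in body and "contexts" in body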
✅ B. Prompt & LLM Evaluation Tools (Key Part)
These tools do NOT replace pytest. They test AI quality, not code.
🔹 1. PromptFlow (End-to-End Flow Testing)
Used for:
Full RAG flow validation
Prompt comparison (v1 vs v2)
Batch testing using datasets
A/B testing
How it works:
PromptFlow calls your FastAPI RAG endpoint
Uses datasets of questions
Evaluates output against metrics
When used:
Pre-prod validation
Model/prompt A/B testing
🔹 2. RAGAS (RAG Quality Metrics)
Used for:
Context Recall
Faithfulness
Answer Relevance
Context Precision
How it works:
You log:
Question
Retrieved chunks
Final answer
RAGAS calculates RAG accuracy scores
When used:
Nightly batch evaluation
Regression on retriever + chunking changes
🔹 3. DeepEval (Safety & Hallucination)
Used for:
Hallucination detection
Toxicity/safety checks
Prompt regression
When used:
CI quality gates
Guardrail validation before prod
5️⃣ CI/CD Pipeline (Python + GenAI)
✅ Typical Pipeline Stages
Code Quality
ruff / flake8 / black
Unit Tests
pytest
Build Image
Docker → ACR
Deploy to Dev AKS
Helm / ArgoCD
AI Evaluation Stage (Python Jobs)
PromptFlow
RAGAS
DeepEval
Quality Gate
Fail if hallucination > threshold
Promote to Staging/Prod
✅ How the AI Evaluation Stage Works
These tools run as short-lived Python jobs:
CI Pipeline
↓
Spin up Python Eval Pod
↓
Call FastAPI RAG Endpoints
↓
Run PromptFlow / RAGAS / DeepEval
↓
Export Scores
↓
Gate Deployment
They never run inside the production FastAPI container.
6️⃣ Deployment on AKS (Python)
Runtime
Dockerized FastAPI + LangChain
Gunicorn + Uvicorn workers
Horizontal Pod Autoscaling (HPA)
Secrets
Azure Key Vault → Kubernetes Secrets
No hardcoded API keys
Networking
Private AKS
Ingress + WAF
Azure APIM at edge
7️⃣ Observability & Monitoring (Python)
✅ Infrastructure Observability
OpenTelemetry
Prometheus
Grafana
Azure Application Insights
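As a sketch, auto-instrumenting the FastAPI app with OpenTelemetry (the collector endpoint is an assumption; the collector then fans out to Prometheus / App Insights):

from fastapi import FastAPI
from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor
from opentelemetry.exporter.otlp.proto.grpc.trace_exporter import OTLPSpanExporter
from opentelemetry.instrumentation.fastapi import FastAPIInstrumentor

app = FastAPI()  # or import the existing app instance

# send spans to the in-cluster otel-collector
provider = TracerProvider()
provider.add_span_processor(
    BatchSpanProcessor(OTLPSpanExporter(endpoint="http://otel-collector:4317", insecure=True))
)
trace.set_tracer_provider(provider)

# every request through the FastAPI app now emits traces
FastAPIInstrumentor.instrument_app(app)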
✅ LLM Observability (Optional but Powerful)
LangSmith
Prompt traces
Token usage
Agent graphs
Tool calls
✅ Security Monitoring
Audit logs (who asked what)
Prompt & response retention
PII masking logs
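A simple illustration of PII masking before audit-log writes (the patterns shown are examples, not an exhaustive BFSI list):

import re

PII_PATTERNS = {
    "pan": re.compile(r"\b[A-Z]{5}[0-9]{4}[A-Z]\b"),      # PAN-like identifiers
    "aadhaar": re.compile(r"\b\d{4}\s?\d{4}\s?\d{4}\b"),  # Aadhaar-like 12-digit numbers
}

def mask_pii(text: str) -> str:
    # mask known PII patterns before prompts/responses are written to audit logs
    for name, pattern in PII_PATTERNS.items():
        text = pattern.sub(f"<{name}-masked>", text)
    return text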
8️⃣ Optimization & Continuous Improvement
RAG Optimization
Chunk size tuning
Re-embedding strategy
Reranker tuning
Model Optimization
Model routing (cheap vs premium)
Temperature control
Context window optimization
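A simplistic illustration of model routing (thresholds and deployment names are assumptions, not recommendations):

def pick_model(question: str, context_tokens: int) -> str:
    # route short, low-context lookups to the cheap model; everything else to the premium one
    if context_tokens < 2000 and len(question.split()) < 30:
        return "gpt-4o-mini"  # assumed "cheap" deployment
    return "gpt-4o"           # assumed "premium" deployment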
Business Optimization
Token cost tracking
SLA-based scaling
Human feedback loop
✅ How PromptFlow, RAGAS, DeepEval Fit Together (Python)
| Tool | Purpose | When Used |
| --- | --- | --- |
| PromptFlow | End-to-end flow testing, A/B | Pre-prod, UAT |
| RAGAS | RAG retrieval quality | Nightly batch |
| DeepEval | Hallucination & safety | CI quality gate |
| pytest | Code correctness | Every commit |
✅ What You Should Say in Interview (Python Stack)
“In our Python GenAI stack, we use FastAPI with LangChain and LangGraph, containerized and deployed on AKS. Traditional testing is handled through pytest and Testcontainers. For AI-specific validation, we use PromptFlow for end-to-end RAG flow testing and prompt A/B comparison, RAGAS for retrieval quality metrics like faithfulness and context recall, and DeepEval for hallucination and safety validation. These tools run as independent Python evaluation jobs inside the CI/CD pipeline, act as automated quality gates, and never run inside the production runtime. Observability is handled via OpenTelemetry and Azure Application Insights, with optional LangSmith for deep LLM tracing.”
✅ Key Distinction to Call Out
“Software tests validate code correctness, but PromptFlow, RAGAS, and DeepEval validate model and prompt behavior. We treat them as first-class CI quality gates just like unit tests.”
This shows true AI-native engineering maturity.
✅ Ultra-Short Cheat Sheet
Dev: FastAPI + LangChain + LangGraph
Test: pytest + PromptFlow + RAGAS + DeepEval
CI/CD: GitHub Actions / Azure DevOps
Deploy: Docker → ACR → AKS
Observe: OpenTelemetry + AppInsights + (optional LangSmith)
Optimize: prompt tuning + retriever tuning + model routing