
DevOps pipeline for Spring AI

  • Writer: Anand Nerurkar
  • Nov 26
  • 6 min read

1️⃣ Core Principle (Very Important for Interview)

For Java-based Spring AI systems:

  • Spring AI → serves production traffic

  • PromptFlow / DeepEval / RAGAS → run as external evaluation workers

  • CI/CD orchestrates them as quality gates

  • No Python code runs inside the Java microservice

Think of these tools as:

“Post-deployment quality scanners for GenAI, not runtime dependencies.”

2️⃣ Where These Tools Sit in the Architecture

Git Push
   |
   v
CI Pipeline
   |
   +--> Build & Test Java (JUnit, Integration Tests)
   |
   +--> Deploy to AI-Dev Namespace (AKS)
   |
   +--> Run AI Evaluation Job (Python Pod)
           |
           +--> Calls Spring AI Endpoint
           |
           +--> Uses:
               - PromptFlow / RAGAS / DeepEval
           |
           +--> Publishes Evaluation Metrics
   |
   +--> Quality Gate (pass/fail)
   |
   +--> Promote to Staging / Prod

✅ Java never imports these libraries
✅ They act like black-box testers against your REST API

3️⃣ Tool-by-Tool Usage in a Java CI/CD Pipeline

✅ 1. PromptFlow (Microsoft)

What it is used for:

  • End-to-end prompt + RAG flow testing

  • Variant comparison (prompt v1 vs v2)

  • Dataset-driven batch evaluation

  • Azure-native governance

How it is used with Spring AI:

  1. You deploy the Spring AI RAG endpoint to the Dev AKS namespace

  2. PromptFlow calls your REST API

  3. It evaluates:

    • Answer quality

    • Groundedness

    • Relevance

    • Safety

  4. Results go to:

    • Azure ML Workspace

    • Blob Storage

    • DevOps pipeline logs

CI/CD YAML Pattern (Conceptual)

- stage: ai-evaluation
  jobs:
    - job: promptflow_eval
      container: mcr.microsoft.com/azureml/promptflow
      steps:
        - script: |
            pf run create \
              --flow policy_qna \
              --data golden_dataset.json \
              --run-name spring-ai-eval

✅ Gate deployment based on PromptFlow scores
✅ Works perfectly with Spring AI
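The golden_dataset.json referenced above is simply a versioned file of questions paired with expected, grounded answers. A minimal sketch of how one might be generated (the field names and values are illustrative, not a PromptFlow requirement; map them to whatever inputs your flow defines):

import json

# Illustrative golden-dataset records (fields are assumptions, not a fixed schema)
golden_records = [
    {
        "question": "What is the KYC re-verification period for high-risk customers?",
        "expected_answer": "High-risk customers must complete re-KYC at least every 2 years.",
        "source": "rbi_kyc_master_direction.pdf",
    },
]

with open("golden_dataset.json", "w") as f:
    json.dump(golden_records, f, indent=2)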

✅ 2. RAGAS (For RAG Quality)

What it evaluates:

  • Context Recall

  • Faithfulness

  • Answer Relevance

  • Context Precision

How it integrates with Java:

  1. RAGAS container calls:

    POST /api/rag/ask

  2. You store:

    • Question

    • Retrieved context

    • Answer

  3. RAGAS computes metrics externally

  4. Scores are pushed to:

    • S3/Blob

    • Or as CI pipeline variables
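A minimal collection harness for steps 1 and 2 above, assuming the Spring AI endpoint returns JSON with answer and contexts fields (the URL, field names, and timeout are illustrative, not a fixed contract):

import requests

# Hypothetical Dev-namespace URL and response shape; adjust to your actual API contract
SPRING_AI_URL = "http://spring-ai-rag-dev/api/rag/ask"

questions = ["What is the KYC re-verification period for high-risk customers?"]

records = []
for q in questions:
    resp = requests.post(SPRING_AI_URL, json={"question": q}, timeout=30)
    resp.raise_for_status()
    body = resp.json()
    records.append({
        "question": q,
        "contexts": body["contexts"],  # retrieved chunks returned by the Java service
        "answer": body["answer"],      # generated answer
    })
# `records` is then converted into the dataset that RAGAS evaluates (see below)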

Python Eval Job (External)

from datasets import Dataset
from ragas import evaluate
from ragas.metrics import faithfulness, answer_relevancy

# Build the dataset from the records captured against the Spring AI API
dataset = Dataset.from_dict({
    "question": [r["question"] for r in records],
    "contexts": [r["contexts"] for r in records],
    "answer": [r["answer"] for r in records],
})

result = evaluate(
    dataset,
    metrics=[faithfulness, answer_relevancy]
)

# Aggregate scores read like a dict and drive the quality gate
if result["faithfulness"] < 0.85:
    raise Exception("Quality gate failed")

✅ Java just exposes the API
✅ The Python job validates quality
✅ The pipeline decides promote/rollback

✅ 3. DeepEval (Hallucination & Guardrails)

What it is used for:

  • Hallucination detection

  • Toxicity checks

  • Factual grounding

  • Regression testing on prompts

How it works with Spring AI:

DeepEval treats your Java service as a black-box API:

from deepeval import evaluate
from deepeval.test_case import LLMTestCase
from deepeval.metrics import HallucinationMetric

# call_spring_ai() is a helper that POSTs the question to the Spring AI REST endpoint
# and returns the generated answer plus the retrieved context chunks
answer, contexts = call_spring_ai()

test_case = LLMTestCase(
    input="What is RBI KYC rule?",
    actual_output=answer,
    expected_output="RBI requires...",
    context=contexts  # HallucinationMetric scores the answer against this retrieved context
)

metric = HallucinationMetric()
evaluate([test_case], [metric])

✅ If the hallucination score crosses the threshold → the pipeline fails
✅ No Java changes required

4️⃣ Full CI/CD Pipeline for Java + AI Evaluation (Realistic)

✅ Stage 1 — Build & Unit Test (Java)

  • Maven/Gradle build

  • JUnit + Mockito

  • Static analysis (SonarQube)

✅ Stage 2 — Integration Test

  • Testcontainers for Postgres/pgvector

  • SpringBootTest + real RAG flow (mock LLM)

✅ Stage 3 — Deploy to Dev AKS

  • Docker build → ACR

  • Helm deploy spring-ai-rag-dev

✅ Stage 4 — AI Evaluation Stage (Python)

  • Spin up ephemeral Python job

  • Run:

    • PromptFlow

    • RAGAS

    • DeepEval

  • Call Java endpoints

  • Compute:

    • Faithfulness

    • Relevance

    • Hallucination rate

    • Toxicity

  • Publish scores as pipeline variables

✅ Stage 5 — Quality Gate

If hallucination_rate > 2%
OR faithfulness < 0.85
THEN fail pipeline
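A minimal sketch of that gate as a Python step in the pipeline, assuming the Stage 4 jobs exported their scores to a JSON file (they could equally be read from pipeline variables; the file name and keys are illustrative):

import json
import sys

# Illustrative file written by the PromptFlow / RAGAS / DeepEval jobs in Stage 4
with open("ai_eval_scores.json") as f:
    scores = json.load(f)

failures = []
if scores["hallucination_rate"] > 0.02:
    failures.append(f"hallucination_rate={scores['hallucination_rate']:.3f} > 0.02")
if scores["faithfulness"] < 0.85:
    failures.append(f"faithfulness={scores['faithfulness']:.2f} < 0.85")

if failures:
    print("AI quality gate failed: " + "; ".join(failures))
    sys.exit(1)  # non-zero exit fails the pipeline stage
print("AI quality gate passed")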

✅ Stage 6 — Promote to Staging / Prod

  • Only if all AI metrics pass

5️⃣ How This Looks in AKS (Production Pattern)

spring-ai-rag (JAVA POD)
prompt-eval-job (PYTHON POD)
otel-collector
vector-db

  • Python eval pods are:

    • Short-lived

    • Run only on pipeline trigger

  • Java pods are:

    • Long-running production services

This separation is mandatory in regulated BFSI environments.

6️⃣ What You Should Say in Interview (Perfect Answer)

You can safely say:

“Although we use Spring AI in Java for production runtime, for LLM quality evaluation we use Python-based tools like PromptFlow, RAGAS, and DeepEval as external CI/CD evaluation jobs. These tools call the Spring AI endpoints as black-box APIs, compute metrics like faithfulness, context recall, and hallucination rate using golden datasets, and then act as quality gates before promoting builds to staging or production. The Java runtime remains clean and vendor-neutral, while AI evaluation evolves independently.”

This answer shows:
✅ Platform maturity
✅ Regulatory alignment
✅ Proper separation of concerns
✅ Real production experience

7️⃣ When Each Tool Is Best (Quick Matrix)

| Tool          | Best For                      | Use in Java Runtime? |
|---------------|-------------------------------|----------------------|
| PromptFlow    | End-to-end flow testing & A/B | ❌ External job      |
| RAGAS         | RAG accuracy & grounding      | ❌ External job      |
| DeepEval      | Hallucination / safety        | ❌ External job      |
| JUnit         | Java logic testing            | ✅ Inside runtime    |
| OpenTelemetry | Runtime observability         | ✅ Inside runtime    |


✅ Final Summary

“In a Java-based Spring AI system, PromptFlow, RAGAS, and DeepEval are never embedded in the runtime; they are executed as independent Python evaluation jobs in the CI/CD pipeline that call the Spring AI APIs and act as automated quality gates before production promotion.”

Python Ecosystem

====

✅ End-to-End GenAI Lifecycle — Python Ecosystem (LangChain / LangGraph)

I’ll walk through it in the exact enterprise order:

  1. Requirements

  2. Design

  3. Build

  4. Test (AI + software)

  5. CI/CD

  6. Deploy on AKS

  7. Monitor & Observe

  8. Evaluate & Optimize

Use case example: Policy / Regulatory Q&A using RAG.

1️⃣ Requirements Phase (AI-Native)

Business

  • Use case: Policy Q&A, Loan underwriting, Fraud agent, etc.

  • Target users: Ops, Compliance, Customers

  • KPIs: Accuracy > 90%, Hallucination < 2%, Latency < 800ms

Data

  • PDFs, DOCs, Scans → OCR needed

  • Data sensitivity: PII, PCI, Regulatory docs

Governance

  • Human-in-the-loop

  • Full audit logging

  • Explainability (retrieved context + answer)

Deliverables

  • Use case definition

  • Risk register

  • Golden dataset (Q → Expected A)

  • RAG design decision (chunking, embedding, vector DB)

2️⃣ Design Phase (Python Architecture)

Client
  ↓
Azure APIM
  ↓
FastAPI (Python)
  ↓
LangChain / LangGraph
  ↓
Retriever (pgvector / Azure Search)
  ↓
Azure OpenAI / Local LLM

Core Services

  • api-service → FastAPI

  • agent-service → LangChain/LangGraph

  • ingestion-service → OCR + Chunk + Embed

  • evaluation-service → PromptFlow/RAGAS/DeepEval (offline)

Security

  • Azure AD / OAuth

  • Key Vault for secrets

  • Private endpoints for LLM & DB

3️⃣ Build Phase (Python Toolchain)

Runtime Stack

  • FastAPI – REST API

  • LangChain / LangGraph – orchestration

  • Pydantic – request validation

  • SQLAlchemy – metadata DB

  • Redis – cache

  • pgvector / Azure Search – vector store
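A minimal sketch of how these runtime pieces fit together in the api-service, assuming the retriever and LLM are built elsewhere (build_retriever, build_llm, and the module path app.deps are placeholders, not an established API; returning the retrieved chunks alongside the answer is what lets the evaluation jobs compute groundedness later):

from fastapi import FastAPI
from pydantic import BaseModel
from langchain_core.prompts import ChatPromptTemplate
from langchain_core.output_parsers import StrOutputParser

from app.deps import build_retriever, build_llm  # placeholder factories for pgvector retriever + Azure OpenAI

app = FastAPI()
retriever = build_retriever()
llm = build_llm()

prompt = ChatPromptTemplate.from_template(
    "Answer only from the context below.\n\nContext:\n{context}\n\nQuestion: {question}"
)
chain = prompt | llm | StrOutputParser()

class AskRequest(BaseModel):
    question: str

@app.post("/api/rag/ask")
def ask(req: AskRequest):
    docs = retriever.invoke(req.question)                    # vector search (pgvector / Azure Search)
    context = "\n\n".join(d.page_content for d in docs)
    answer = chain.invoke({"context": context, "question": req.question})
    # Return the retrieved chunks too, so evaluation jobs can score groundedness
    return {"answer": answer, "contexts": [d.page_content for d in docs]}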

Prompt Management

  • Prompts as code in Git

  • Versioning: prompt_v1.md, prompt_v2.md

  • Optional prompt registry in Postgres

Ingestion Pipeline

  • Azure Form Recognizer / Tesseract OCR

  • Text splitter (LangChain)

  • Embedding model

  • Vector DB upsert

  • Metadata tagging
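A condensed sketch of that ingestion flow, assuming OCR has already produced plain text upstream and pgvector is the store (the chunk sizes, embedding deployment name, collection name, and connection string are placeholders to tune for your corpus):

from langchain_text_splitters import RecursiveCharacterTextSplitter
from langchain_openai import AzureOpenAIEmbeddings
from langchain_community.vectorstores import PGVector

def ingest(doc_text: str, doc_name: str) -> None:
    # 1. Chunk the OCR'd text (chunk size/overlap are tuning knobs, not fixed values)
    splitter = RecursiveCharacterTextSplitter(chunk_size=800, chunk_overlap=100)
    chunks = splitter.create_documents([doc_text], metadatas=[{"source": doc_name}])

    # 2. Embed and upsert into pgvector (deployment, collection, connection are placeholders)
    embeddings = AzureOpenAIEmbeddings(azure_deployment="text-embedding-3-large")
    PGVector.from_documents(
        documents=chunks,
        embedding=embeddings,
        collection_name="policy_docs",
        connection_string="postgresql+psycopg2://user:pass@pgvector:5432/vectors",
    )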

4️⃣ Testing Phase (Python-Specific)

This is where PromptFlow, RAGAS, DeepEval come in.

✅ A. Traditional Software Testing

  • Unit tests: pytest, pytest-mock

  • API tests: FastAPI TestClient

  • Integration tests: testcontainers-python (pgvector, redis)

  • Contract tests: schemathesis or pact
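A minimal API test sketch with FastAPI's TestClient, assuming the chain and retriever are module-level objects that can be monkeypatched so no real LLM or vector DB is touched (the module path app.main is illustrative):

from fastapi.testclient import TestClient

from app.main import app  # the FastAPI app from the api-service (module path is illustrative)

client = TestClient(app)

class FakeDoc:
    page_content = "stubbed context chunk"

class FakeRetriever:
    def invoke(self, _query):
        return [FakeDoc()]

class FakeChain:
    """Deterministic stand-in for the LangChain pipeline."""
    def invoke(self, _inputs):
        return "stubbed answer"

def test_ask_returns_answer_and_contexts(monkeypatch):
    # Stub the chain and the retriever so the test stays fast and deterministic
    monkeypatch.setattr("app.main.chain", FakeChain())
    monkeypatch.setattr("app.main.retriever", FakeRetriever())
    resp = client.post("/api/rag/ask", json={"question": "What is the KYC rule?"})
    assert resp.status_code == 200
    body = resp.json()
    assert "answer" in body and "contexts" in body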

✅ B. Prompt & LLM Evaluation Tools (Key Part)

These tools do NOT replace pytest. They test AI quality, not code.

🔹 1. PromptFlow (End-to-End Flow Testing)

Used for:

  • Full RAG flow validation

  • Prompt comparison (v1 vs v2)

  • Batch testing using datasets

  • A/B testing

How it works:

  • PromptFlow calls your FastAPI RAG endpoint

  • Uses datasets of questions

  • Evaluates output against metrics

When used:

  • Pre-prod validation

  • Model/prompt A/B testing

🔹 2. RAGAS (RAG Quality Metrics)

Used for:

  • Context Recall

  • Faithfulness

  • Answer Relevance

  • Context Precision

How it works:

  • You log:

    • Question

    • Retrieved chunks

    • Final answer

  • RAGAS calculates RAG accuracy scores

When used:

  • Nightly batch evaluation

  • Regression on retriever + chunking changes

🔹 3. DeepEval (Safety & Hallucination)

Used for:

  • Hallucination detection

  • Toxicity/safety checks

  • Prompt regression

When used:

  • CI quality gates

  • Guardrail validation before prod

5️⃣ CI/CD Pipeline (Python + GenAI)

✅ Typical Pipeline Stages

  1. Code Quality

    • ruff / flake8 / black

  2. Unit Tests

    • pytest

  3. Build Image

    • Docker → ACR

  4. Deploy to Dev AKS

    • Helm / ArgoCD

  5. AI Evaluation Stage (Python Jobs)

    • PromptFlow

    • RAGAS

    • DeepEval

  6. Quality Gate

    • Fail if hallucination > threshold

  7. Promote to Staging/Prod

✅ How the AI Evaluation Stage Works

These tools run as short-lived Python jobs:

CI Pipeline
   ↓
Spin up Python Eval Pod
   ↓
Call FastAPI RAG Endpoints
   ↓
Run PromptFlow / RAGAS / DeepEval
   ↓
Export Scores
   ↓
Gate Deployment

They never run inside the production FastAPI container.

6️⃣ Deployment on AKS (Python)

Runtime

  • Dockerized FastAPI + LangChain

  • Gunicorn + Uvicorn workers

  • Horizontal Pod Autoscaling (HPA)

Secrets

  • Azure Key Vault → Kubernetes Secrets

  • No hardcoded API keys

Networking

  • Private AKS

  • Ingress + WAF

  • Azure APIM at edge

7️⃣ Observability & Monitoring (Python)

✅ Infrastructure Observability

  • OpenTelemetry

  • Prometheus

  • Grafana

  • Azure Application Insights
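A minimal sketch of wiring OpenTelemetry into the FastAPI service, assuming an in-cluster otel-collector is reachable at the endpoint shown (the endpoint, service name, and app module path are placeholders):

from opentelemetry import trace
from opentelemetry.sdk.resources import Resource
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor
from opentelemetry.exporter.otlp.proto.grpc.trace_exporter import OTLPSpanExporter
from opentelemetry.instrumentation.fastapi import FastAPIInstrumentor

from app.main import app  # the FastAPI app (module path is illustrative)

# Send spans to the in-cluster otel-collector (endpoint is a placeholder)
provider = TracerProvider(resource=Resource.create({"service.name": "api-service"}))
exporter = OTLPSpanExporter(endpoint="http://otel-collector:4317")
provider.add_span_processor(BatchSpanProcessor(exporter))
trace.set_tracer_provider(provider)

# Auto-instrument every FastAPI route (latency, status codes, trace context)
FastAPIInstrumentor.instrument_app(app)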

✅ LLM Observability (Optional but Powerful)

  • LangSmith

    • Prompt traces

    • Token usage

    • Agent graphs

    • Tool calls
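LangSmith tracing usually needs no code changes; it is switched on through environment variables, which in AKS would be injected from Key Vault-backed secrets rather than set in code. A minimal sketch (the project name is illustrative):

import os

# Standard LangSmith environment variables; in AKS these come from Kubernetes secrets
os.environ["LANGCHAIN_TRACING_V2"] = "true"
os.environ["LANGCHAIN_PROJECT"] = "policy-qna-prod"   # illustrative project name
# LANGCHAIN_API_KEY must also be present (from Key Vault), never hardcoded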

✅ Security Monitoring

  • Audit logs (who asked what)

  • Prompt & response retention

  • PII masking logs

8️⃣ Optimization & Continuous Improvement

RAG Optimization

  • Chunk size tuning

  • Re-embedding strategy

  • Reranker tuning

Model Optimization

  • Model routing (cheap vs premium; see the sketch after this list)

  • Temperature control

  • Context window optimization
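A minimal routing sketch, assuming two pre-configured model clients (for example a small and a large Azure OpenAI deployment); the heuristic, thresholds, and module names are illustrative, not a recommended policy:

from app.llm import cheap_llm, premium_llm  # hypothetical module wrapping two Azure OpenAI deployments

COMPLEX_HINTS = ("compare", "explain why", "summarise across", "calculate")

def pick_model(question: str, retrieved_chunks: list[str]):
    """Route to the premium model only when the question looks genuinely hard."""
    looks_complex = (
        len(question.split()) > 40
        or any(hint in question.lower() for hint in COMPLEX_HINTS)
        or len(retrieved_chunks) > 6  # lots of context usually means a harder synthesis task
    )
    return premium_llm if looks_complex else cheap_llm

# Usage inside the RAG endpoint:
#   llm = pick_model(req.question, contexts)
#   answer = (prompt | llm | StrOutputParser()).invoke({"context": context, "question": req.question})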

Business Optimization

  • Token cost tracking

  • SLA-based scaling

  • Human feedback loop

✅ How PromptFlow, RAGAS, DeepEval Fit Together (Python)

| Tool       | Purpose                      | When Used       |
|------------|------------------------------|-----------------|
| PromptFlow | End-to-end flow testing, A/B | Pre-prod, UAT   |
| RAGAS      | RAG retrieval quality        | Nightly batch   |
| DeepEval   | Hallucination & safety       | CI quality gate |
| pytest     | Code correctness             | Every commit    |

“In our Python GenAI stack, we use FastAPI with LangChain and LangGraph, containerized and deployed on AKS. Traditional testing is handled through pytest and Testcontainers. For AI-specific validation, we use PromptFlow for end-to-end RAG flow testing and prompt A/B comparison, RAGAS for retrieval quality metrics like faithfulness and context recall, and DeepEval for hallucination and safety validation. These tools run as independent Python evaluation jobs inside the CI/CD pipeline, act as automated quality gates, and never run inside the production runtime. Observability is handled via OpenTelemetry and Azure Application Insights, with optional LangSmith for deep LLM tracing.”

“Software tests validate code correctness, but PromptFlow, RAGAS, and DeepEval validate model and prompt behavior. We treat them as first-class CI quality gates just like unit tests.”

This shows true AI-native engineering maturity.

✅ Ultra-Short Cheat Sheet

  • Dev: FastAPI + LangChain + LangGraph

  • Test: pytest + PromptFlow + RAGAS + DeepEval

  • CI/CD: GitHub Actions / Azure DevOps

  • Deploy: Docker → ACR → AKS

  • Observe: OpenTelemetry + AppInsights + (optional LangSmith)

  • Optimize: prompt tuning + retriever tuning + model routing



 
 
 
