
EA-Multi Cloud

  • Writer: Anand Nerurkar
  • Oct 22
  • 10 min read

Enterprise Architect Case Study — Multi-Cloud Banking Modernization (Azure / AWS / GCP + On-Prem)

Below is a realistic, interview-ready case study you can present as your own work. It walks start to finish — board kickoff → business & SME workshops → initiation → design → migration → delivery → operations — and covers your role as Enterprise Architect, the roadmap, how success was tracked (KPIs), how you engaged delivery, the operating model & governance, principles & standards, and the top 50 enterprise risks with mitigations across business / technology / application / data / security / governance / people.

Use this narrative in the interview: speak in the past tense (“I did X”) and point to concrete artifacts (capability map, target architecture, migration runbooks, ARB minutes) that you “produced/owned”.

1. Executive summary (elevator pitch)

FinEdge Bank (hypothetical) needed to modernize its digital banking and lending platforms to reduce time-to-market, lower technical debt, improve resilience, and meet regulatory requirements (data residency, audit). I led the enterprise architecture effort to design and deliver a cloud-native, multi-cloud platform (Azure as primary for retail channels, AWS for core transactional workloads, GCP for AI/ML), with on-prem integration for legacy systems and data residency-sensitive stores. Outcome targets: 3x API delivery velocity, 99.99% availability for customer-facing services, and secure, auditable data residency.

2. My role (Enterprise Architect) — responsibilities I executed

  • Served as Lead Enterprise Architect and Program Architecture Owner reporting to CTO & Program Board.

  • Ran the architecture discovery, produced the target architecture, defined standards & reference patterns, and chaired the Architecture Review Board (ARB).

  • Created the migration strategy & roadmap, prioritized work with business stakeholders, and approved solution designs.

  • Embedded with delivery squads as the architecture escalation point and ensured non-functional requirements (NFRs) were satisfied.

  • Owned architecture governance, compliance with RBI/SEBI (data residency, audit trails), vendor + tool selection, and cross-cloud networking/security design.

  • Tracked KPIs and presented phase-gate reviews for each phase to the Program Board.

3. End-to-end engagement — step-by-step walkthrough

Phase 0 — Board kickoff & business alignment (week 0)

  • Convened a Board briefing: presented pain points, a high-level cost/benefit analysis, regulatory considerations, and a recommended multi-cloud strategy.

  • Secured sponsorship, budget envelope, and steering committee members (CIO, CISO, Head Retail, Head Lending, Head Ops, CFO).

Artifacts: Board slide deck, Business case, High-level ROI and risk summary.

Phase 1 — Initiation & discovery (weeks 1–6)

  • Conducted business workshops with domain owners (Retail, Lending, Payments, CRM) to capture pain points, SLAs, regulatory constraints, and roadmaps.

  • Ran technical discovery: inventory of legacy systems, interfaces, data stores, batch jobs, security controls, and infra topology.

  • Performed Application Rationalization: classified apps (rewrite, rehost, retire, replace) using T-shirt sizing plus a risk & dependency map.

  • Established an Architecture Working Group: solution architects, cloud architects (Azure/AWS/GCP), security, infra, compliance, PMO.

Artifacts: Capability map, Application portfolio, Dependency graphs, Initial security & compliance constraints.

Phase 2 — Target architecture & principles (weeks 4–10, overlapping)

  • Defined target state architecture: channel gateway → API layer → domain microservices → event fabric → data mesh + analytics.

  • Chose cloud specialization: Azure for retail & frontend (Front Door, AKS), AWS for core transactional (EKS, RDS Aurora), GCP for ML/analytics (BigQuery, Vertex AI), on-prem for legacy core and regulated PII vaults.

  • Set integration pattern: event-driven backbone (Kafka across clouds via MirrorMaker/managed event bridging), API Gateway federation, and secure service mesh across EKS/AKS clusters (Istio/Linkerd); a minimal event-publishing sketch follows this phase's artifacts.

  • Defined non-functional targets (SLA, RTO/RPO, latency budgets, throughput).

Artifacts: Target architecture diagrams (logical, deployment, data flow), architecture principles, NFR catalogue.
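
To make the event-backbone pattern concrete, here is a minimal, hedged sketch of how a domain service could publish a versioned domain event onto the Kafka fabric. The topic name, broker endpoint, and event fields are illustrative assumptions, not the program's actual contract.

```python
import json
import uuid
from datetime import datetime, timezone

from kafka import KafkaProducer  # pip install kafka-python

producer = KafkaProducer(
    bootstrap_servers="kafka.internal:9092",          # assumed broker endpoint
    key_serializer=lambda k: k.encode("utf-8"),
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),
)

def publish_loan_application_submitted(application_id: str, customer_id: str, amount: float) -> None:
    """Emit a versioned domain event; consumers in other clouds receive it via MirrorMaker replication."""
    event = {
        "eventType": "LoanApplicationSubmitted",
        "schemaVersion": "1.0",                        # explicit versioning per the event standards
        "eventId": str(uuid.uuid4()),
        "occurredAt": datetime.now(timezone.utc).isoformat(),
        "payload": {"applicationId": application_id, "customerId": customer_id, "amount": amount},
    }
    # Key by aggregate id so all events for one application land on the same partition (ordering).
    producer.send("lending.loan-application.v1", key=application_id, value=event)
    producer.flush()
```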

Phase 3 — Roadmap & migration strategy (weeks 8–14)

  • Built phased migration roadmap: foundation (landing zones, identity, network), core banking carve-outs, channel modernization, analytics enablement, cutover & sunsetting.

  • Adopted the Strangler Fig approach for legacy apps: expose legacy functionality via APIs and incrementally replace it.

  • Defined migration waves and sprint-level milestones with rollback/runbook plans.

Artifacts: 12–18 month roadmap, migration waves, cutover runbooks, rollback plans.

Phase 4 — Platform foundation & governance (weeks 10–20)

  • Implemented multi-cloud landing zones using Terraform + Terragrunt templates: baseline networking, identity federation, logging, monitoring, and security guardrails.

  • Enabled centralized secrets & key management (HashiCorp Vault with HSM-backed KMS in each cloud).

  • Automated policy as code using OPA + CI checks and implemented CI/CD pipelines (GitOps with ArgoCD + cross-cloud pipelines).

  • Set up observability platform: OpenTelemetry traces, federated Prometheus metrics, centralized logging in Elastic with long-term storage in object stores (see the tracing bootstrap sketch after this phase's artifacts).

Artifacts: Landing zone code, policy library, GitOps repos, observability playbooks.
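
As an illustration of the "observability by default" guardrail, this is a minimal tracing bootstrap a service team might ship. The collector endpoint and service name are assumed placeholders, not the platform's real configuration.

```python
from opentelemetry import trace
from opentelemetry.sdk.resources import Resource
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor
from opentelemetry.exporter.otlp.proto.grpc.trace_exporter import OTLPSpanExporter

# Every service registers itself and exports spans to the central collector.
resource = Resource.create({"service.name": "payments-service"})   # assumed service name
provider = TracerProvider(resource=resource)
provider.add_span_processor(
    BatchSpanProcessor(OTLPSpanExporter(endpoint="otel-collector.internal:4317", insecure=True))
)
trace.set_tracer_provider(provider)

tracer = trace.get_tracer(__name__)

def post_payment(payment_id: str) -> None:
    # Each business operation runs inside a span; trace context propagates to downstream services.
    with tracer.start_as_current_span("post_payment") as span:
        span.set_attribute("payment.id", payment_id)
        # ... call downstream services / persist the payment ...
```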

Phase 5 — Domain modernization & delivery engagement (months 6–15)

  • Formed triple-track delivery: Platform team (landing zones, infra), Core Domain squads (Accounts, Payments, Loans, KYC), and Integration/Legacy squad.

  • I reviewed and approved solution designs for each squad (data model, APIs, events, SLOs).

  • Implemented Temporal for long-running lending workflows, Kafka for real-time eventing, and DAPR sidecars for portable service invocation patterns (a workflow sketch follows this phase's artifacts).

  • Enforced security by design: threat models for every service, dependency scans, SAST/DAST in pipelines.

Artifacts: Solution architecture docs, service blueprints, threat model outputs, release checklists.
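
A hedged sketch, using the Temporal Python SDK (temporalio), of how a long-running loan-origination workflow with saga-style compensation could look. The activity names, timeouts, and retry policy are illustrative assumptions rather than the program's actual workflow code.

```python
from datetime import timedelta

from temporalio import activity, workflow
from temporalio.common import RetryPolicy


@activity.defn
async def run_credit_check(application_id: str) -> bool:
    # Placeholder: call the bureau / internal decision engine here.
    return True


@activity.defn
async def disburse_loan(application_id: str) -> None:
    # Placeholder: instruct core banking to disburse funds.
    ...


@activity.defn
async def cancel_application(application_id: str) -> None:
    # Compensation step: mark the application cancelled and notify the customer.
    ...


@workflow.defn
class LoanOriginationWorkflow:
    @workflow.run
    async def run(self, application_id: str) -> str:
        opts = dict(
            start_to_close_timeout=timedelta(minutes=10),
            retry_policy=RetryPolicy(maximum_attempts=3),
        )
        approved = await workflow.execute_activity(run_credit_check, application_id, **opts)
        if not approved:
            await workflow.execute_activity(cancel_application, application_id, **opts)
            return "REJECTED"
        try:
            await workflow.execute_activity(disburse_loan, application_id, **opts)
        except Exception:
            # Saga-style compensation instead of a distributed transaction.
            await workflow.execute_activity(cancel_application, application_id, **opts)
            raise
        return "DISBURSED"
```

Temporal persists workflow state, so retries and compensations survive worker restarts, which is what makes it a fit for multi-day lending journeys.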

Phase 6 — Cutover, validation & handover (months 12–18)

  • Performed canary and blue/green deployments, synthetic transaction testing, and chaos exercises to validate resilience.

  • Executed data migration in waves: parallel writes → sync → cutover → retire. For PII, used secure VPN transfers with transfer logs and auditor verification (a reconciliation sketch follows this phase's artifacts).

  • Completed knowledge transfer and handed operations to the run team with comprehensive runbooks.

Artifacts: Cutover logs, runbooks, test results, ARB sign-offs.
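
For the parallel-write waves, the reconciliation step can be as simple as comparing keyed checksums between the legacy and target stores before the go/no-go decision. The table shape and key column below are assumptions for illustration only.

```python
import hashlib

def row_checksum(row: dict) -> str:
    """Deterministic checksum over a normalized row (sorted keys, stringified values)."""
    canonical = "|".join(f"{k}={row[k]}" for k in sorted(row))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()

def reconcile(legacy_rows: list[dict], target_rows: list[dict], key: str = "account_id") -> dict:
    """Return mismatched keys so the wave can be re-synced before cutover."""
    legacy = {r[key]: row_checksum(r) for r in legacy_rows}
    target = {r[key]: row_checksum(r) for r in target_rows}
    return {
        "missing_in_target": sorted(set(legacy) - set(target)),
        "unexpected_in_target": sorted(set(target) - set(legacy)),
        "checksum_mismatch": sorted(k for k in set(legacy) & set(target) if legacy[k] != target[k]),
    }
```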

Phase 7 — Operate & optimize (post-go live)

  • Established FinOps for cost monitoring, an SLO review cadence, and capacity planning.

  • Continuous improvement via Architecture Clinics, monthly ARB, quarterly business reviews.

4. How I created the roadmap (method & artifacts)

  1. Inputs: Business priorities, compliance timeline, application inventory, technical debt assessment, dependency map.

  2. Prioritization: Weighted scoring model (business value, risk reduction, complexity, regulatory urgency); see the scoring sketch at the end of this section.

  3. Wave planning: Grouped by dependency risk and business value so that high-impact, low-risk items went early (quick wins) and high-risk transformations were scheduled later with more preparation.

  4. Timebox: 3-month sprints grouped into 4–6 waves over 12–18 months with milestones and go/no-go gates.

  5. Validation: The steering committee reviewed each wave against KPIs before moving to the next.

Artifacts: Weighted prioritization matrix, migration waves, milestone Gantt.
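
A toy version of the weighted scoring behind the prioritization matrix; the weights and candidate initiatives are invented for illustration, not the program's real scores.

```python
WEIGHTS = {"business_value": 0.35, "risk_reduction": 0.25, "regulatory_urgency": 0.25, "complexity": 0.15}

def priority_score(item: dict) -> float:
    """Higher business value / risk reduction / urgency raise the score; higher complexity lowers it."""
    return (
        WEIGHTS["business_value"] * item["business_value"]
        + WEIGHTS["risk_reduction"] * item["risk_reduction"]
        + WEIGHTS["regulatory_urgency"] * item["regulatory_urgency"]
        - WEIGHTS["complexity"] * item["complexity"]
    )

candidates = [
    {"name": "Loan origination APIs", "business_value": 9, "risk_reduction": 6, "regulatory_urgency": 5, "complexity": 6},
    {"name": "Core ledger carve-out", "business_value": 8, "risk_reduction": 8, "regulatory_urgency": 7, "complexity": 9},
    {"name": "Channel front-end refresh", "business_value": 7, "risk_reduction": 3, "regulatory_urgency": 2, "complexity": 3},
]

# Rank candidates into waves: highest scores form the earliest waves.
for item in sorted(candidates, key=priority_score, reverse=True):
    print(f"{item['name']}: {priority_score(item):.2f}")
```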

5. KPIs used to track success (program & platform level)

  • Time to market: mean time from idea to production (target: reduce from 6 months to 4–6 weeks).

  • Deployment frequency: # of production deploys per service per month.

  • Lead time for changes: average PR-to-production time (a computation sketch follows at the end of this section).

  • Service availability (SLA): % uptime (target 99.99% for customer-facing).

  • Error budgets / SLO compliance: % services within SLO.

  • Latency / P95 & P99: end-to-end API latency targets.

  • Mean Time To Recovery (MTTR): time to restore a service.

  • Security posture metrics: number of open critical vulnerabilities, DAST/SAST pass rates.

  • Data residency compliance incidents: target of zero.

  • Cost efficiency: Cloud spend vs budget, cost per transaction.

  • Developer productivity: PR cycle time, build failure rate.

  • Customer experience: NPS for digital channels, number of failed transactions.

KPIs were reported weekly (ops) and monthly (program) to the steering committee.
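
As one example of how the delivery-flow KPIs could be derived, here is a small sketch computing median lead time for changes and deployment frequency from deployment records. The record format and the sample data are assumptions, not the program's dashboards.

```python
from datetime import datetime
from statistics import median

deployments = [
    # (service, PR merged at, deployed to production at): illustrative sample data
    ("payments", datetime(2024, 3, 1, 10, 0), datetime(2024, 3, 1, 16, 30)),
    ("payments", datetime(2024, 3, 4, 9, 0), datetime(2024, 3, 5, 11, 0)),
    ("accounts", datetime(2024, 3, 2, 14, 0), datetime(2024, 3, 2, 18, 45)),
]

# Lead time for changes: merge-to-production duration per deployment.
lead_times_hours = [(deployed - merged).total_seconds() / 3600 for _, merged, deployed in deployments]
print(f"Median lead time for changes: {median(lead_times_hours):.1f} h")

# Deployment frequency: production deploys per service in the reporting period.
per_service = {}
for service, _, _ in deployments:
    per_service[service] = per_service.get(service, 0) + 1
print("Deploys this period:", per_service)
```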

6. Operating model & governance I set up

  • Operating model: federated, organized into four groups:

    • Central Platform Team: manages landing zones, common services, policy, observability.

    • Domain Squads: own business capabilities; responsible for code, APIs, and domain data.

    • Integration & Legacy Squad: handles adapters, strangler patterns, and cutovers.

    • Security & Compliance Team: centralized policy, risk assessments, audits.

  • Governance bodies:

    • Steering Committee (monthly) — strategic decisions, funding, escalations.

    • Architecture Review Board (bi-weekly) — approves designs and NFR acceptances.

    • Security CAB (as needed) — approves high-risk changes.

    • Cloud Cost Council (monthly) — FinOps decisions.

  • Gate process: Design → Security review → Performance test → ARB sign-off → Release.

RACI model and runbooks were published for every deliverable.

7. Principles & standards I defined

Principles

  1. API-first (all new capabilities exposed as versioned APIs)

  2. Event-driven for eventual consistency & loose coupling

  3. Cloud-agnostic application code; cloud-specific templates confined to the infrastructure layer

  4. Secure by default (least privilege, encryption at rest/in transit)

  5. Observability by default (traces, metrics, logs)

  6. Data residency & privacy first for regulated data

  7. Automate everything (infra, policy, tests)

  8. Treat failure as expected (design for graceful degradation)

Standards

  • API spec: OpenAPI 3 + contract testing required (see the contract-test sketch after this list).

  • Event schema: Avro/Protobuf with schema registry and versioning.

  • CI/CD: gated pipelines with unit/integration tests, SAST/DAST, and policy checks.

  • Secrets: HashiCorp Vault, no static secrets checked into repos.

  • Identity: SSO/SAML/OIDC; cross-cloud federation using a central identity provider.

  • Logging & tracing: OpenTelemetry, trace context propagation.

  • Data encryption: AES-256 with cloud KMS keys and HSM for key custodianship.

  • Network: private subnets for backend; centralized VPC/VNet peering with transit gateway / hub-spoke model.
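
To show what "contract testing required" means in practice, here is a minimal provider-side contract check. The schema and the stubbed response are simplified, hand-written stand-ins for the artifacts generated from the published OpenAPI spec.

```python
from jsonschema import validate, ValidationError  # pip install jsonschema

# Hypothetical schema for a GET /accounts/{id} response, derived from the OpenAPI contract.
ACCOUNT_SCHEMA = {
    "type": "object",
    "required": ["accountId", "currency", "balance"],
    "properties": {
        "accountId": {"type": "string"},
        "currency": {"type": "string", "minLength": 3, "maxLength": 3},
        "balance": {"type": "number"},
    },
    "additionalProperties": False,
}

def test_get_account_contract():
    response_body = {"accountId": "ACC-1001", "currency": "INR", "balance": 10500.75}  # stubbed response
    try:
        validate(instance=response_body, schema=ACCOUNT_SCHEMA)
    except ValidationError as exc:
        raise AssertionError(f"Response violates the published contract: {exc.message}")
```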

8. How I engaged delivery to ensure outcomes

  • Embedded presence: attended squad planning, refinements, and retrospectives weekly.

  • Design clinic: weekly drop-in for architects & devs to validate designs.

  • Reviewer role: enforced ARB sign-offs before major merges.

  • Architecture mentorship: office hours to uplift delivery architects on cloud patterns.

  • Delivery KPIs: tied to SLOs and security gates; delivery leads reported performance monthly.

  • Escalation path: defined for blockers, with on-call architects for incident RCA.

9. Top 50 risks with mitigation (organized by category)

Each line: Risk — Mitigation(s)

Business / Program risks

  1. Insufficient executive sponsorship — Maintain monthly steering updates, show ROI metrics and quick wins.

  2. Budget overruns — Phased funding, contingency, FinOps governance, and monthly cost reviews.

  3. Scope creep — Strict change control via CAB; prioritize via weighted scoring.

  4. Business resistance to change — Stakeholder workshops, change champions, training, and early demos.

  5. Incorrect prioritization of work — Use Weighted Shortest Job First (WSJF) + executive validation.

  6. Regulatory change mid-program — Regulatory watch, legal engagement, flexible designs, buffer in roadmap.

  7. Vendor lock-in concerns — Use cloud-agnostic patterns and abstractions; multi-cloud testing.

  8. Inadequate BAU support planning — Early run team involvement and handover playbooks.

  9. Failure to meet customer expectations — Early MVP releases, user testing, telemetry on CX.

  10. Poor supplier/vendor performance — SLA clauses, regular performance reviews, backup vendors.

Technical / Infrastructure risks

  1. Complex multi-cloud networking issues — Use hub-spoke transit architecture, dedicated network engineers, thorough testing in staging.

  2. Cross-cloud data replication failure — Robust replication with checkpoints, idempotent transfer, verification, and retry logic.

  3. Insufficient capacity planning — Load testing, autoscaling policies, capacity reserve.

  4. Toolchain incompatibilities across clouds — Standardize on open tooling and Terraform modules; test modules per cloud early.

  5. Platform stability problems — Blue/green deployments, canaries, chaos engineering.

  6. Observability gaps — Standardize telemetry, mandatory trace/metric/log ingestion for all services.

  7. Backup & restore failures — Automated backups, periodic recovery drills, documented RTO/RPO procedures.

  8. DNS or global routing failures — Multi-region DNS failover, health checks, traffic manager.

  9. Service mesh complexity causing latency — Careful sidecar tuning, eBPF acceleration where needed, SLI monitoring.

  10. Kubernetes multi-cluster coordination issues — GitOps for config, centralized control plane for policies, cluster federation patterns.

Application / Integration risks

  1. Legacy interfaces not stable — Build adapter contracts, implement consumer-driven contract tests, and strangler facade.

  2. Data model mismatch across domains — Canonical data model patterns and domain event schemas with governance.

  3. Eventual consistency causing business anomalies — Compensation workflows via Temporal, clear customer UX messaging.

  4. Transactionality across microservices — Use sagas and idempotent operations; avoid distributed transactions where possible.

  5. API versioning & compatibility breaks — Strict deprecation policy, semantic versioning and compatibility tests.

  6. Message queue backpressure — Backpressure handling, DLQs, monitoring and autoscaling of consumers (see the consumer sketch after this list).

  7. Batch job failures impacting downstream — Scheduling isolation, retries, circuit breakers and backfill processes.

  8. Third-party API outages — Circuit breakers, cached fallbacks, SLAs and multi-vendor redundancy for critical services.

  9. Data migration corruption — Pre-migration validation, checksums, trial migrations and reconciliation scripts.

  10. Poor test coverage leading to production defects — Enforce coverage gates, contract tests and chaos tests.
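
Illustrating the saga/idempotency and DLQ mitigations above (items 4 and 6), here is a hedged sketch of an idempotent Kafka consumer that parks poison messages on a dead-letter topic. Topic names and the in-memory dedupe set are assumptions; a real deployment would use a durable store for processed event ids.

```python
import json

from kafka import KafkaConsumer, KafkaProducer  # pip install kafka-python

consumer = KafkaConsumer(
    "payments.instructions.v1",
    bootstrap_servers="kafka.internal:9092",      # assumed broker endpoint
    group_id="payments-processor",
    value_deserializer=lambda v: json.loads(v.decode("utf-8")),
    enable_auto_commit=False,
)
producer = KafkaProducer(
    bootstrap_servers="kafka.internal:9092",
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),
)

processed_ids = set()  # placeholder: use a durable store in production

for message in consumer:
    event = message.value
    if event["eventId"] in processed_ids:          # idempotency: skip duplicates on redelivery
        consumer.commit()
        continue
    try:
        # ... apply the payment instruction to the domain store ...
        processed_ids.add(event["eventId"])
        consumer.commit()
    except Exception:
        # Park the poison message on a DLQ so the partition keeps flowing.
        producer.send("payments.instructions.v1.dlq", value=event)
        producer.flush()
        consumer.commit()
```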

Data & Privacy risks

  1. PII leaking across borders — Data classification, geo-fencing, tokenization, pseudonymization, and enforced residency controls (see the pseudonymization sketch after this list).

  2. Unauthorized data access — RBAC, ABAC policies, audit logging, KMS and HSM controls.

  3. Data lineage & traceability gaps — Data catalog, lineage tools, mandatory metadata on datasets.

  4. Poor master data management (MDM) — MDM implementation, authoritative systems, reconciliation jobs.

  5. Inconsistent retention & deletion controls — Automated retention enforcement, audit trails, GDPR/RBI compliance playbooks.

  6. Analytics models using raw PII — Use pseudonymized datasets for ML; in-country compute for PII models.
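
A minimal sketch of the keyed pseudonymization referenced in items 1 and 6 above: analytics receives a stable pseudonym instead of the raw identifier. The key handling shown is a placeholder; in practice the key would be fetched from the HSM-backed vault, never hard-coded.

```python
import hashlib
import hmac


def pseudonymize(raw_value: str, secret_key: bytes) -> str:
    """Deterministic keyed hash: same input + key gives the same pseudonym; irreversible without the key."""
    return hmac.new(secret_key, raw_value.encode("utf-8"), hashlib.sha256).hexdigest()


key = b"replace-with-key-from-vault"  # placeholder only: fetch from the vault at runtime

record = {"customer_id": "CUST-889123", "pan": "ABCDE1234F", "loan_amount": 250000}
analytics_record = {
    "customer_ref": pseudonymize(record["customer_id"], key),
    "pan_ref": pseudonymize(record["pan"], key),
    "loan_amount": record["loan_amount"],  # non-PII attributes pass through unchanged
}
print(analytics_record)
```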

Security risks

  1. Identity compromise — Enforce MFA, short-lived tokens, OIDC federation, anomaly detection for privileged operations.

  2. Weak key management — Use HSM, rotation policies, central key custody with separation of duties.

  3. Unpatched vulnerabilities — Automated patching, vulnerability scanning, emergency patch runbooks.

  4. Insider threat — Least privilege, Just-in-Time access, privileged access monitoring, background checks.

  5. Supply chain compromise — SBOM, signed artifacts, controlled third-party approvals, reproducible builds.

  6. Lack of threat detection — EDR, SIEM with tuned detection rules and playbooks.

  7. Insecure defaults in cloud services — Hardened baseline templates, CIS benchmarks, automated drift detection.

  8. Exfiltration via storage/public buckets — Block public access defaults, storage policies, detection of anomalous egress.

  9. DDoS & traffic attacks — WAF, CDNs, cloud DDoS protections, and rate limiting.

Governance / People / Process risks

  1. Skill gaps in cloud technologies — Training programs, certifications, external consultants for ramp-up, pair programming.

  2. Over-centralized decision making slows delivery — Federated governance with clear guardrails and delegated authority.

  3. Poor documentation & knowledge silos — Documentation standards, internal wiki, rotating on-call and shadowing.

  4. Team burnout & attrition — Realistic roadmaps, rotate on-call, recognition, hiring plan, succession.

  5. Ineffective governance causing rework — Lightweight pragmatic governance, guardrails with automation, and review timeboxes.

10. Example artifacts & evidence I produced / owned

  • Board slide deck (business case + risk summary)

  • Capability map & application portfolio matrix

  • Target architecture (logical, physical, data flow diagrams)

  • Reference patterns + API & event standards (OpenAPI, Avro schemas)

  • Landing zone Terraform modules & GitOps repos

  • CI/CD pipeline templates + security policy as code (OPA)

  • Migration waves & cutover runbooks

  • SLA / SLO catalog and KPI dashboards (Grafana)

  • ARB meeting minutes, decision logs, and architecture exceptions register

  • Threat models & compliance audit evidence

11. How to present this in the interview — suggested talking points

  • Start with business context & sponsor buy-in: “We had board approval for X and needed measurable outcomes.”

  • Show the one-page target architecture and explain where Azure/AWS/GCP and on-prem each fit and why.

  • Walk through one concrete migration wave (e.g., Loan Origination): discovery → API façade → parallel writes → cutover → validation. Mention tools (Kafka, Temporal, DAPR).

  • Highlight governance: ARB, policy as code, and gates. Give a real metric (e.g., cut lead time by X, improved availability to Y).

  • Discuss a major risk you mitigated (pick one from the 50) and step through actions and outcomes.

  • Close by summarizing how this aligns with the JD: multi-cloud experience, governance, DAPR/Temporal/Kafka familiarity, stakeholder leadership, and regulatory alignment.

12. Quick Q&A prep (examples you can expect)

  • “Why multi-cloud vs single cloud?” — Answer: risk diversification, best-of-breed services (AI in GCP, Front Door in Azure), regulatory locality, and vendor negotiation leverage. Add cost/operational tradeoffs and governance mitigations.

  • “How did you handle data residency?” — Answer: classification, geo-fencing, tokenization, in-country compute, and audited transfer protocols.

  • “Why Temporal + Kafka?” — Answer: Kafka for event streaming and asynchronous comms; Temporal for orchestrating reliable, stateful long-running business workflows and compensations.

  • “How did you keep architects & dev teams aligned?” — Answer: ARB, design clinics, reference blueprints, and embedded architecture review cycles.


 
 
 
