Capacity Planning - Digital Banking
- Anand Nerurkar
- May 16
- 5 min read
Capacity Planning Guide for AKS Cluster — Banking Use Case
1. Understand the Inputs
| Parameter | Banking Standard / Example Input |
| --- | --- |
| Concurrent Users | 10,000+ |
| TPS (Transactions per Second) | 1,000 to 2,000 |
| Microservices | ~30 Spring Boot microservices |
| Replicas per service | 3 replicas (for HA and load balancing) |
| Sidecars | 1 Istio Envoy proxy per app pod |
| Messaging | Kafka cluster with 3 brokers + 3 ZK |
| Observability stack | Prometheus, Grafana, ELK (~6 pods) |
| Node capacity | ~7 pods per node (accounting for overhead) |
2. Calculate Total Pods Required Per Region
| Component | Pod Count Calculation | Result |
| --- | --- | --- |
| Microservice pods | 30 services × 3 replicas | 90 |
| Istio sidecars (one per pod) | 1 × 90 app pods | 90 |
| Kafka brokers + Zookeeper | 3 + 3 | 6 |
| Istio control plane | Pilot, Mixer, Ingress Gateway (approximate) | 6 |
| Observability stack | Prometheus, Grafana, ELK (~6 pods) | 6 |
| System/DaemonSet pods | Kubernetes system pods, monitoring agents (approx.) | 10 |
| Total pods | Sum of all of the above | 208 |
3. Calculate Number of Nodes Needed
Assume 7 pods per node as a safe average (leaves room for system overhead and sidecars)
Total nodes per region = Total pods / pods per node
Nodes per region = 208 / 7 ≈ 30
Add 30–50% buffer for autoscaling, failures, and burst loads
Max nodes per region = 30 × 1.5 = 45
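To keep the arithmetic in one place, here is a minimal Python sketch of the sizing above. The pod counts, the 7-pods-per-node density, and the 50% buffer are the planning assumptions from this guide, not AKS limits; adjust them to your own workload.

```python
import math

# Assumed inputs from the sections above (tune for your own workload)
app_pods = 30 * 3            # 30 microservices x 3 replicas
sidecar_pods = app_pods      # one Istio Envoy sidecar budgeted per app pod
infra_pods = 6 + 6 + 6 + 10  # Kafka+ZK, Istio control plane, observability, system pods
pods_per_node = 7            # conservative packing density per AKS node
buffer = 1.5                 # ~50% headroom for autoscaling, failures, bursts

total_pods = app_pods + sidecar_pods + infra_pods
min_nodes = math.ceil(total_pods / pods_per_node)
max_nodes = math.ceil(min_nodes * buffer)

print(f"Total pods:        {total_pods}")  # 208
print(f"Min nodes/region:  {min_nodes}")   # 30
print(f"Max nodes/region:  {max_nodes}")   # 45
```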
4. Scale Pods and Nodes Based on TPS
Estimate TPS per pod:
Industry benchmark: 50–100 TPS per Spring Boot pod (depends on JVM tuning, DB latency)
For 2000 TPS peak load:
Pods required = 2000 / 75 ≈ 27 pods
Since you have 30 services × 3 replicas = 90 pods baseline (more than enough), scaling is often driven by load spikes on specific services.
Use Horizontal Pod Autoscaler (HPA) to increase replicas for high load microservices.
When pods cannot be scheduled due to lack of resources, Cluster Autoscaler (CA) adds nodes.
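To make the TPS math concrete, the sketch below shows the kind of estimate HPA effectively makes when replicas are driven by a throughput-style metric. The 75 TPS-per-pod figure, the min/max replica bounds, and the 600 TPS "hot service" spike are illustrative assumptions; measure your own services before relying on them.

```python
import math

def replicas_for_tps(peak_tps: float,
                     tps_per_pod: float = 75.0,
                     min_replicas: int = 3,
                     max_replicas: int = 10) -> int:
    """Estimate replicas needed for a target throughput, clamped to HPA-style bounds."""
    needed = math.ceil(peak_tps / tps_per_pod)
    return max(min_replicas, min(needed, max_replicas))

# Cluster-wide peak: 2000 TPS / 75 TPS per pod ~= 27 pods across all services
print(replicas_for_tps(2000, max_replicas=90))  # 27

# A single hot service (hypothetical example) spiking to 600 TPS needs 8 replicas
print(replicas_for_tps(600))                    # 8
```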
5. Node Pools Design
| Node Pool | Purpose | Approx. Nodes | Notes |
| --- | --- | --- | --- |
| App Services | Spring Boot microservices | 20–25 | Main business logic |
| Kafka Cluster | Kafka brokers and Zookeeper | 6 | High IOPS, persistent storage |
| Istio Components | Ingress gateway, control plane | 3–5 | Control plane stability |
| Observability | Prometheus, Grafana, ELK | 3–5 | Monitoring and logging |
| System | Kubernetes system pods | 1–3 | Essential node-level system services |
6. Active-Active Setup
Deploy two identical AKS clusters, one in South India and one in West India.
Each cluster sized as above (min ~30 nodes, max ~45 nodes).
Use Kafka MirrorMaker2 or Confluent Replicator for geo-replication.
Multi-AZ enabled inside each region for fault tolerance.
Traffic routed with Azure Front Door or Traffic Manager.
7. Scaling Triggers and Strategy
| What to Scale | How / When |
| --- | --- |
| Pods (HPA) | CPU > 70%, memory > 75%, or Kafka consumer lag above threshold |
| Nodes (Cluster Autoscaler) | Pods unschedulable due to resource limits |
| Kafka brokers | Monitor throughput and disk usage; add brokers if lag spikes |
| Istio / observability | Scale with HPA or manually for stability |
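These triggers can be read as a simple policy. The Python sketch below only illustrates the decision logic; in a real cluster HPA, Cluster Autoscaler, and Kafka monitoring enforce it, and the Kafka lag threshold shown is an assumed value, not a standard default.

```python
def scale_out_pods(cpu_pct: float, mem_pct: float, kafka_lag: int,
                   lag_threshold: int = 10_000) -> bool:
    """Mirror of the HPA triggers above: CPU > 70%, memory > 75%, or consumer lag too high."""
    return cpu_pct > 70 or mem_pct > 75 or kafka_lag > lag_threshold

def scale_out_nodes(unschedulable_pods: int) -> bool:
    """Cluster Autoscaler adds nodes only when pods cannot be scheduled."""
    return unschedulable_pods > 0

print(scale_out_pods(cpu_pct=82, mem_pct=60, kafka_lag=1_200))  # True (CPU above 70%)
print(scale_out_nodes(unschedulable_pods=0))                    # False
```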
8. Additional Considerations
JVM tuning for microservices is critical to TPS performance.
Use KEDA for event-driven autoscaling of Kafka consumers (a simplified sketch of that lag-based scaling logic follows this list).
Use PodDisruptionBudgets to ensure minimum availability during scaling/updates.
Use taints/tolerations for node pool isolation.
Monitor latency, error rates, and throughput continuously using Prometheus and Application Insights.
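As noted above, here is a simplified model of the lag-based decision a KEDA Kafka scaler makes: desired consumer replicas grow with total consumer lag divided by a lag threshold and are capped by the topic's partition count (one consumer per partition). The threshold, partition count, and minimum replicas below are assumed values for illustration, not KEDA defaults.

```python
import math

def kafka_consumer_replicas(total_lag: int,
                            lag_threshold: int = 5_000,  # assumed per-replica lag target
                            partitions: int = 12,        # assumed partition count for the topic
                            min_replicas: int = 3) -> int:
    """Approximate event-driven scaling: more lag -> more consumers, capped at partition count."""
    desired = math.ceil(total_lag / lag_threshold)
    return max(min_replicas, min(desired, partitions))

print(kafka_consumer_replicas(total_lag=4_000))    # 3  (lag below threshold, stay at minimum)
print(kafka_consumer_replicas(total_lag=42_000))   # 9  (scale out with lag)
print(kafka_consumer_replicas(total_lag=400_000))  # 12 (capped by partition count)
```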
Summary Table per Region
| Component | Pods | Nodes (Min) | Nodes (Max) |
| --- | --- | --- | --- |
| Spring Boot microservices | 90 | | |
| Istio sidecars | 90 | | |
| Kafka + Zookeeper | 6 | | |
| Istio control plane | 6 | | |
| Observability stack | 6 | | |
| System / DaemonSets | 10 | | |
| Total pods | 208 | 30 | 45 |
Step-by-Step Capacity Planning Calculation
Step 1: Estimate Number of Application Pods
You have 30 microservices, each deployed with 3 replicas for availability and load balancing.
App pods = 30 × 3 = 90 pods
Step 2: Add Sidecar Pods for Istio
Istio injects one Envoy sidecar container into every app pod for service mesh features. Strictly speaking the sidecar is an extra container inside the app pod rather than a separate pod, but budgeting it as its own "pod equivalent" is a simple, conservative way to account for its CPU and memory:
Sidecar pods = App pods = 90 pods
Step 3: Add Platform and System Pods
| Component | Pods Estimate |
| --- | --- |
| Kafka brokers + Zookeeper | 3 + 3 = 6 pods |
| Istio control plane + ingress | ~6 pods |
| Observability stack (Prometheus, Grafana, ELK) | ~6 pods |
| System DaemonSets (DNS, monitoring, logging) | ~10 pods |
Infra pods = 6 + 6 + 6 + 10 = 28 pods
Step 4: Calculate Total Pods Per Region
Total pods = App pods + Sidecar pods + Infra pods = 90 + 90 + 28 = 208 pods
Step 5: Estimate Pods Per Node
As a conservative planning figure, this guide assumes ~7 pods per AKS node, which leaves CPU/memory headroom for Kubernetes system components and sidecars and keeps nodes stable under load. AKS itself allows far more pods per node, so treat this as a sizing assumption rather than a hard limit.
Pods per node ≈ 7
Step 6: Calculate Number of Nodes Needed
Divide total pods by pods per node:
Min nodes = 208 / 7 ≈ 30 nodes
Add buffer for autoscaling and failover (~50%):
Max nodes = 30 × 1.5 = 45 nodes
Step 7: Verify TPS and Pod Throughput
Assume each Spring Boot pod handles ~75 TPS based on JVM, DB, and network tuning.
For peak 2000 TPS:
Pods required for TPS = 2000 / 75 ≈ 27 pods
Since you have 90 app pods running, the system can theoretically handle 75 TPS × 90 = 6,750 TPS at peak, which leaves ample headroom for retries, traffic spikes, and future growth.
Step 8: Active-Active Setup
The above sizing is per region (South India and West India).
Each region runs a full AKS cluster sized at 30–45 nodes for HA and compliance.
Kafka is geo-replicated for data consistency.
Summary:
| Calculation Aspect | Value |
| --- | --- |
| App pods | 90 |
| Sidecar pods | 90 |
| Infra pods | 28 |
| Total pods | 208 |
| Pods per node | 7 |
| Min nodes | 30 |
| Max nodes | 45 |
| TPS per pod | 75 |
| TPS supported (90 pods) | 6,750 TPS (theoretical max) |
The “Add Platform and System Pods” estimates come from the typical set of infrastructure and platform services running in an AKS cluster that supports a microservices banking app with Kafka, Istio, and observability tooling.
Here’s how each part contributes to the pod count:
1. Kafka Brokers + Zookeeper
Kafka usually runs as a cluster of brokers, with 3 brokers for fault tolerance and throughput.
Zookeeper is needed by Kafka for metadata management, typically 3 nodes for HA.
Total: 3 (Kafka) + 3 (ZK) = 6 pods
2. Istio Control Plane + Ingress Gateway
Istio’s control plane consists of multiple components:
Pilot (service discovery)
Mixer (policy and telemetry; removed in newer Istio releases, where the control plane is consolidated into a single istiod deployment)
Citadel (security)
Ingress Gateway (edge routing)
Typically, you run:
4 control plane pods (for redundancy and load)
2 ingress gateway pods (for HA ingress routing)
Total: 4 + 2 = 6 pods
3. Observability Stack
The monitoring and logging stack is usually composed of:
Prometheus server
Grafana dashboard
ELK stack (Elasticsearch, Logstash, Kibana)
For HA and scaling, these run on multiple pods:
Prometheus: 2 pods (primary + replica)
Grafana: 2 pods
ELK components: 2 pods (could be more depending on scale)
Estimated total: ~6 pods
4. System / DaemonSet Pods
Kubernetes system components include:
CoreDNS
kube-proxy
Azure monitoring agents
Fluentd/Log collectors
Other DaemonSets for node-level operations
DaemonSets run one pod per node, so a 30-node cluster actually carries roughly 30 or more of these pods. For this cluster-level estimate, a flat allowance of ~10 pods is used to represent the system services that consume meaningful resources (CoreDNS, kube-proxy, monitoring and log agents); the remaining per-node DaemonSet overhead is already absorbed by the conservative ~7 pods-per-node packing assumption.
Summary Table:
| Platform Component | Pods Estimate | Reasoning |
| --- | --- | --- |
| Kafka brokers + Zookeeper | 6 | 3 brokers + 3 ZK nodes |
| Istio control plane + gateway | 6 | 4 control plane + 2 ingress pods |
| Observability (Prometheus, Grafana, ELK) | 6 | Multi-pod HA for the monitoring stack |
| Kubernetes system/DaemonSets | 10 | Node agents and system pods |
| Total | 28 | |
Why approximate?
Exact counts depend on cluster size, workload, and HA requirements.
These values represent a typical production-grade setup for a banking system with multiple microservices and observability tools.
You can tune up or down depending on your architecture maturity and scale.
Is this for only one region (South India)? What about the other region (West India)?
Yes, the pod and node counts I shared are for a single AKS cluster in one region — for example, South India.
What about the other region (West India)?
You would deploy an identical AKS cluster in West India as part of your active-active setup.
This means the same pod counts and node counts apply per region, assuming the workload and scale are similar.
Each region operates independently but replicates data (e.g., via Kafka MirrorMaker2) for resiliency and disaster recovery.
Summary:
| Region | Pods (approx.) | Nodes (min) | Nodes (max) |
| --- | --- | --- | --- |
| South India | ~208 | ~30 | ~45 |
| West India | ~208 | ~30 | ~45 |
Why replicate clusters?
High availability: If one region goes down, the other continues serving.
Low latency: Users routed to the nearest region.
Regulatory compliance: Data residency and disaster recovery mandates.
Scalability: Each region scales independently based on local demand.