Capacity Planning - Digital Banking
- Anand Nerurkar
- May 16
- 5 min read
Capacity Planning Guide for AKS Cluster — Banking Use Case
1. Understand the Inputs
| Parameter | Banking Standard / Example Input |
| --- | --- |
| Concurrent Users | 10,000+ |
| TPS (Transactions per Second) | 1,000 to 2,000 |
| Microservices | ~30 Spring Boot microservices |
| Replicas per service | 3 replicas (for HA and load balancing) |
| Sidecars | 1 Istio Envoy proxy per app pod |
| Messaging | Kafka cluster with 3 brokers + 3 ZK |
| Observability stack | Prometheus, Grafana, ELK (~6 pods) |
| Node capacity | ~7 pods per node (accounting for overhead) |
2. Calculate Total Pods Required Per Region
| Component | Pod Count Calculation | Result |
| --- | --- | --- |
| Microservice pods | 30 services × 3 replicas | 90 |
| Istio sidecars (one per pod) | 1 × 90 app pods | 90 |
| Kafka brokers + Zookeeper | 3 + 3 | 6 |
| Istio control plane | Pilot, Mixer, Ingress Gateway (approximate) | 6 |
| Observability stack | Prometheus, Grafana, ELK (~6 pods) | 6 |
| System/DaemonSet pods | Kubernetes system pods, monitoring agents (approx.) | 10 |
| Total pods | Sum of all of the above | 208 |
3. Calculate Number of Nodes Needed
Assume 7 pods per node as a safe average (leaves room for system overhead and sidecars)
Total nodes per region = Total pods / pods per node
Nodes per region = 208 / 7 ≈ 30
Add 30–50% buffer for autoscaling, failures, and burst loads
Max nodes per region = 30 × 1.5 = 45
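To keep the arithmetic in one place, here is a minimal Python sketch of the sizing above. The pod counts, the 7-pods-per-node density, and the 50% buffer are the planning assumptions from this guide, not AKS limits; adjust them to your own workload.

```python
import math

# Assumed inputs from the sections above (tune for your own workload)
app_pods = 30 * 3            # 30 microservices x 3 replicas
sidecar_pods = app_pods      # one Istio Envoy sidecar budgeted per app pod
infra_pods = 6 + 6 + 6 + 10  # Kafka+ZK, Istio control plane, observability, system pods
pods_per_node = 7            # conservative packing density per AKS node
buffer = 1.5                 # ~50% headroom for autoscaling, failures, bursts

total_pods = app_pods + sidecar_pods + infra_pods
min_nodes = math.ceil(total_pods / pods_per_node)
max_nodes = math.ceil(min_nodes * buffer)

print(f"Total pods:        {total_pods}")  # 208
print(f"Min nodes/region:  {min_nodes}")   # 30
print(f"Max nodes/region:  {max_nodes}")   # 45
```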
4. Scale Pods and Nodes Based on TPS
Estimate TPS per pod:
Industry benchmark: 50–100 TPS per Spring Boot pod (depends on JVM tuning, DB latency)
For 2000 TPS peak load:
Pods required = 2000 / 75 ≈ 27 pods
Since you have 30 services × 3 replicas = 90 pods baseline (more than enough), scaling is often driven by load spikes on specific services.
Use Horizontal Pod Autoscaler (HPA) to increase replicas for high load microservices.
When pods cannot be scheduled due to lack of resources, Cluster Autoscaler (CA) adds nodes.
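To make the TPS math concrete, the sketch below shows the kind of estimate HPA effectively makes when replicas are driven by a throughput-style metric. The 75 TPS-per-pod figure, the min/max replica bounds, and the 600 TPS "hot service" spike are illustrative assumptions; measure your own services before relying on them.

```python
import math

def replicas_for_tps(peak_tps: float,
                     tps_per_pod: float = 75.0,
                     min_replicas: int = 3,
                     max_replicas: int = 10) -> int:
    """Estimate replicas needed for a target throughput, clamped to HPA-style bounds."""
    needed = math.ceil(peak_tps / tps_per_pod)
    return max(min_replicas, min(needed, max_replicas))

# Cluster-wide peak: 2000 TPS / 75 TPS per pod ~= 27 pods across all services
print(replicas_for_tps(2000, max_replicas=90))  # 27

# A single hot service (hypothetical example) spiking to 600 TPS needs 8 replicas
print(replicas_for_tps(600))                    # 8
```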
5. Node Pools Design
| Node Pool | Purpose | Approx. Nodes | Notes |
| --- | --- | --- | --- |
| App Services | Spring Boot microservices | 20–25 | Main business logic |
| Kafka Cluster | Kafka brokers and Zookeeper | 6 | High IOPS, persistent storage |
| Istio Components | Ingress gateway, control plane | 3–5 | Control plane stability |
| Observability | Prometheus, Grafana, ELK | 3–5 | Monitoring and logging |
| System | Kubernetes system pods | 1–3 | Essential node-level system services |
6. Active-Active Setup
Deploy two identical AKS clusters, one in South India and one in West India.
Each cluster sized as above (min ~30 nodes, max ~45 nodes).
Use Kafka MirrorMaker2 or Confluent Replicator for geo-replication.
Multi-AZ enabled inside each region for fault tolerance.
Traffic routed with Azure Front Door or Traffic Manager.
7. Scaling Triggers and Strategy
| What to Scale | How / When |
| --- | --- |
| Pods (HPA) | CPU > 70%, memory > 75%, or Kafka consumer lag above threshold |
| Nodes (Cluster Autoscaler) | Pods unschedulable due to resource limits |
| Kafka brokers | Monitor throughput and disk usage; add brokers if lag spikes |
| Istio / observability | Scale with HPA or manually for stability |
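These triggers can be read as a simple policy. The Python sketch below only illustrates the decision logic; in a real cluster HPA, Cluster Autoscaler, and Kafka monitoring enforce it, and the Kafka lag threshold shown is an assumed value, not a standard default.

```python
def scale_out_pods(cpu_pct: float, mem_pct: float, kafka_lag: int,
                   lag_threshold: int = 10_000) -> bool:
    """Mirror of the HPA triggers above: CPU > 70%, memory > 75%, or consumer lag too high."""
    return cpu_pct > 70 or mem_pct > 75 or kafka_lag > lag_threshold

def scale_out_nodes(unschedulable_pods: int) -> bool:
    """Cluster Autoscaler adds nodes only when pods cannot be scheduled."""
    return unschedulable_pods > 0

print(scale_out_pods(cpu_pct=82, mem_pct=60, kafka_lag=1_200))  # True (CPU above 70%)
print(scale_out_nodes(unschedulable_pods=0))                    # False
```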
8. Additional Considerations
JVM tuning for microservices is critical to TPS performance.
Use KEDA for event-driven autoscaling of Kafka consumers (a simplified sketch of that lag-based scaling logic follows this list).
Use PodDisruptionBudgets to ensure minimum availability during scaling/updates.
Use taints/tolerations for node pool isolation.
Monitor latency, error rates, and throughput continuously using Prometheus and Application Insights.
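As noted above, here is a simplified model of the lag-based decision a KEDA Kafka scaler makes: desired consumer replicas grow with total consumer lag divided by a lag threshold and are capped by the topic's partition count (one consumer per partition). The threshold, partition count, and minimum replicas below are assumed values for illustration, not KEDA defaults.

```python
import math

def kafka_consumer_replicas(total_lag: int,
                            lag_threshold: int = 5_000,  # assumed per-replica lag target
                            partitions: int = 12,        # assumed partition count for the topic
                            min_replicas: int = 3) -> int:
    """Approximate event-driven scaling: more lag -> more consumers, capped at partition count."""
    desired = math.ceil(total_lag / lag_threshold)
    return max(min_replicas, min(desired, partitions))

print(kafka_consumer_replicas(total_lag=4_000))    # 3  (lag below threshold, stay at minimum)
print(kafka_consumer_replicas(total_lag=42_000))   # 9  (scale out with lag)
print(kafka_consumer_replicas(total_lag=400_000))  # 12 (capped by partition count)
```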
Summary Table per Region
| Component | Pods | Nodes (Min) | Nodes (Max) |
| --- | --- | --- | --- |
| Spring Boot microservices | 90 | | |
| Istio sidecars | 90 | | |
| Kafka + Zookeeper | 6 | | |
| Istio control plane | 6 | | |
| Observability stack | 6 | | |
| System / DaemonSets | 10 | | |
| Total pods | 208 | 30 | 45 |
Step-by-Step Capacity Planning Calculation
Step 1: Estimate Number of Application Pods
You have 30 microservices, each deployed with 3 replicas for availability and load balancing.
App pods = 30 × 3 = 90 pods
Step 2: Add Sidecar Pods for Istio
Istio injects one Envoy sidecar container into every app pod for service mesh features. Strictly speaking the sidecar is an extra container inside the app pod rather than a separate pod, but budgeting it as its own "pod equivalent" is a simple, conservative way to account for its CPU and memory:
Sidecar pods = App pods = 90 pods
Step 3: Add Platform and System Pods
| Component | Pods Estimate |
| --- | --- |
| Kafka brokers + Zookeeper | 3 + 3 = 6 pods |
| Istio control plane + ingress | ~6 pods |
| Observability stack (Prometheus, Grafana, ELK) | ~6 pods |
| System DaemonSets (DNS, monitoring, logging) | ~10 pods |
Infra pods = 6 + 6 + 6 + 10 = 28 pods
Step 4: Calculate Total Pods Per Region
Total pods = App pods + Sidecar pods + Infra pods = 90 + 90 + 28 = 208 pods
Step 5: Estimate Pods Per Node
As a conservative planning figure, this guide assumes ~7 pods per AKS node, which leaves CPU/memory headroom for Kubernetes system components and sidecars and keeps nodes stable under load. AKS itself allows far more pods per node, so treat this as a sizing assumption rather than a hard limit.
Pods per node ≈ 7
Step 6: Calculate Number of Nodes Needed
Divide total pods by pods per node:
Min nodes = 208 / 7 ≈ 30 nodes
Add buffer for autoscaling and failover (~50%):
Max nodes = 30 × 1.5 = 45 nodes
Step 7: Verify TPS and Pod Throughput
Assume each Spring Boot pod handles ~75 TPS based on JVM, DB, and network tuning.
For peak 2000 TPS:
Pods required for TPS = 2000 / 75 ≈ 27 pods
Since you have 90 app pods running, the system can theoretically handle 75 TPS × 90 = 6,750 TPS at peak, which leaves ample headroom for retries, traffic spikes, and future growth.
Step 8: Active-Active Setup
The above sizing is per region (South India and West India).
Each region runs a full AKS cluster sized at 30–45 nodes for HA and compliance.
Kafka is geo-replicated for data consistency.
Summary:
| Calculation Aspect | Value |
| --- | --- |
| App pods | 90 |
| Sidecar pods | 90 |
| Infra pods | 28 |
| Total pods | 208 |
| Pods per node | 7 |
| Min nodes | 30 |
| Max nodes | 45 |
| TPS per pod | 75 |
| TPS supported (90 pods) | 6,750 TPS (theoretical max) |
The “Add Platform and System Pods” estimates come from the typical set of infrastructure and platform services running in an AKS cluster that supports a microservices banking app with Kafka, Istio, and observability tooling.
Here’s how each part contributes to the pod count:
1. Kafka Brokers + Zookeeper
Kafka usually runs as a cluster of brokers, with 3 brokers for fault tolerance and throughput.
Zookeeper is needed by Kafka for metadata management, typically 3 nodes for HA.
Total: 3 (Kafka) + 3 (ZK) = 6 pods
2. Istio Control Plane + Ingress Gateway
Istio’s control plane consists of multiple components:
Pilot (service discovery)
Mixer (policy and telemetry; removed in newer Istio releases, where the control plane is consolidated into a single istiod deployment)
Citadel (security)
Ingress Gateway (edge routing)
Typically, you run:
4 control plane pods (for redundancy and load)
2 ingress gateway pods (for HA ingress routing)
Total: 4 + 2 = 6 pods
3. Observability Stack
The monitoring and logging stack is usually composed of:
Prometheus server
Grafana dashboard
ELK stack (Elasticsearch, Logstash, Kibana)
For HA and scaling, these run on multiple pods:
Prometheus: 2 pods (primary + replica)
Grafana: 2 pods
ELK components: 2 pods (could be more depending on scale)
Estimated total: ~6 pods
4. System / DaemonSet Pods
Kubernetes system components include:
CoreDNS
kube-proxy
Azure monitoring agents
Fluentd/Log collectors
Other DaemonSets for node-level operations
DaemonSets run one pod per node, so a 30-node cluster actually carries roughly 30 or more of these pods. For this cluster-level estimate, a flat allowance of ~10 pods is used to represent the system services that consume meaningful resources (CoreDNS, kube-proxy, monitoring and log agents); the remaining per-node DaemonSet overhead is already absorbed by the conservative ~7 pods-per-node packing assumption.
Summary Table:
| Platform Component | Pods Estimate | Reasoning |
| --- | --- | --- |
| Kafka brokers + Zookeeper | 6 | 3 brokers + 3 ZK nodes |
| Istio control plane + gateway | 6 | 4 control plane + 2 ingress pods |
| Observability (Prometheus, Grafana, ELK) | 6 | Multi-pod HA for the monitoring stack |
| Kubernetes system/DaemonSets | 10 | Node agents and system pods |
| Total | 28 | |
Why approximate?
Exact counts depend on cluster size, workload, and HA requirements.
These values represent a typical production-grade setup for a banking system with multiple microservices and observability tools.
You can tune up or down depending on your architecture maturity and scale.
Is this for only one region (South India)? What about the other region (West India)?
Yes, the pod and node counts I shared are for a single AKS cluster in one region — for example, South India.
What about the other region (West India)?
You would deploy an identical AKS cluster in West India as part of your active-active setup.
This means the same pod counts and node counts apply per region, assuming the workload and scale are similar.
Each region operates independently but replicates data (e.g., via Kafka MirrorMaker2) for resiliency and disaster recovery.
Summary:
| Region | Pods (approx.) | Nodes (min) | Nodes (max) |
| --- | --- | --- | --- |
| South India | ~208 | ~30 | ~45 |
| West India | ~208 | ~30 | ~45 |
Why replicate clusters?
High availability: If one region goes down, the other continues serving.
Low latency: Users routed to the nearest region.
Regulatory compliance: Data residency and disaster recovery mandates.
Scalability: Each region scales independently based on local demand.