Key Performance Metrics in Microservices Architecture
- Anand Nerurkar
- Aug 20
🔹 1. Infrastructure & Platform Metrics
Track system health and scalability.
CPU, Memory, Disk, Network Utilization → To ensure pods/containers are not resource-starved.
Container & Pod Restarts (Kubernetes) → Indicates stability issues.
Autoscaling Effectiveness → How quickly and accurately the HPA (Horizontal Pod Autoscaler) or cluster autoscaler reacts to load changes.
Error Events (K8s, Istio, etc.) → CrashLoopBackOff, OOMKilled, etc.
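To judge whether the HPA is reacting sensibly, it helps to know what replica count it is aiming for. The sketch below applies the scaling formula from the Kubernetes documentation, desiredReplicas = ceil(currentReplicas × currentMetric / targetMetric), with the default 10% tolerance. The function name and parameters are illustrative, and the sketch deliberately ignores min/max replica bounds, pod readiness, and stabilization windows.

```python
import math


def desired_replicas(current_replicas: int,
                     current_metric: float,
                     target_metric: float,
                     tolerance: float = 0.1) -> int:
    """Replica count the HPA formula would aim for (simplified sketch)."""
    if target_metric <= 0:
        raise ValueError("target_metric must be positive")
    ratio = current_metric / target_metric
    # Inside the tolerance band the HPA leaves the replica count unchanged.
    if abs(ratio - 1.0) <= tolerance:
        return current_replicas
    return math.ceil(current_replicas * ratio)


# 4 pods averaging 90% CPU against a 60% target
print(desired_replicas(4, 90, 60))  # → 6
```

Comparing this expected value with the replica count actually observed over time is one simple way to spot an autoscaler that lags behind load.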
🔹 2. Service-Level Metrics (Golden Signals from SRE)
These come from Istio/Envoy and application telemetry.
Latency (p95, p99) → Response time of each service/API.
Traffic → Requests per second (RPS) or queries per second (QPS).
Errors → 4xx (client issues), 5xx (server failures), gRPC errors.
Saturation → Queue depth, thread pool usage.
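As a rough illustration of how three of these signals are derived from raw request data, the sketch below computes RPS, p95/p99 latency, and the 5xx error rate over one time window. The `Request` record and function names are hypothetical; in a real mesh these numbers usually come from Envoy/Prometheus histograms rather than raw samples. Saturation is left out because it needs resource data such as queue depth, not request logs.

```python
import math
from dataclasses import dataclass


@dataclass
class Request:
    latency_ms: float
    status: int  # HTTP status code


def percentile(values, pct):
    """Nearest-rank percentile; pct is a whole number such as 95 or 99."""
    if not values:
        raise ValueError("no samples")
    ordered = sorted(values)
    rank = math.ceil(pct * len(ordered) / 100)
    return ordered[max(rank, 1) - 1]


def golden_signals(requests, window_seconds):
    """Summarise latency, traffic and errors for one observation window."""
    latencies = [r.latency_ms for r in requests]
    server_errors = sum(1 for r in requests if r.status >= 500)
    return {
        "rps": len(requests) / window_seconds,
        "p95_ms": percentile(latencies, 95),
        "p99_ms": percentile(latencies, 99),
        "error_rate": server_errors / len(requests),
    }
```

Tail percentiles are used instead of the mean because a handful of slow requests can hide behind a healthy-looking average.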
🔹 3. Application Metrics
API Success vs Failure Rate → Helps in SLA/SLO tracking.
Dependency Latency → DB queries, Kafka topics, downstream services.
Circuit Breaker Metrics (Istio/Resilience4j) → Open/half-open states, retries.
Cache Hit Ratio (Redis, CDN) → Efficiency of caching strategy.
DB Metrics → Slow queries, connection pool utilization, replication lag.
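Most of the application metrics above reduce to simple ratios over counters. The sketch below shows three of them: API success rate, cache hit ratio, and the share of an SLO error budget already spent. The error-budget framing is a common SRE convention that the list above does not name, and all function names and the example numbers are assumptions for illustration.

```python
def success_rate(total: int, failed: int) -> float:
    """Fraction of API calls that succeeded."""
    if total <= 0:
        raise ValueError("total must be positive")
    return (total - failed) / total


def cache_hit_ratio(hits: int, misses: int) -> float:
    """Fraction of lookups served from the cache."""
    lookups = hits + misses
    if lookups == 0:
        raise ValueError("no cache lookups recorded")
    return hits / lookups


def error_budget_consumed(total: int, failed: int, slo_target: float) -> float:
    """Share of the allowed failures already used (1.0 means budget exhausted)."""
    allowed_failures = (1.0 - slo_target) * total
    if allowed_failures <= 0:
        raise ValueError("SLO target leaves no error budget")
    return failed / allowed_failures


# Hypothetical month: 1,000,000 calls, 250 failures, 99.9% SLO
print(round(error_budget_consumed(1_000_000, 250, 0.999), 2))  # → 0.25
```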
🔹 4. Reliability & Resilience Metrics
MTTR (Mean Time to Recovery)
MTBF (Mean Time Between Failures)
Service Availability (uptime %)
Request Retry Count & Fallback Usage
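The first three reliability metrics are linked: availability can be written as MTBF / (MTBF + MTTR). The sketch below derives all three from a list of outage durations over a reporting period. It assumes one common definition of MTBF (total uptime divided by the number of failures); the function name and inputs are illustrative.

```python
def reliability_summary(outage_hours, period_hours):
    """Return (MTTR, MTBF, availability) for one reporting period.

    outage_hours: duration of each outage, in hours
    period_hours: length of the reporting period, in hours
    """
    if not outage_hours:
        raise ValueError("no outages recorded; MTTR and MTBF are undefined")
    failures = len(outage_hours)
    downtime = sum(outage_hours)
    uptime = period_hours - downtime
    mttr = downtime / failures        # mean time to recovery
    mtbf = uptime / failures          # mean time between failures
    availability = uptime / period_hours
    return mttr, mtbf, availability


# Three outages of 1h, 2h and 3h in a 720-hour (30-day) month
mttr, mtbf, availability = reliability_summary([1.0, 2.0, 3.0], 720.0)
print(mttr, mtbf)  # → 2.0 238.0
```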
🔹 5. Security Metrics
Auth/Token Failures (OAuth/JWT/OPA)
mTLS handshake failures (Istio)
Unauthorized access attempts
Vulnerability scan & patch compliance
🔹 6. Business & Domain Metrics
Directly tied to business outcomes.
Transaction Success Rate (e.g., successful payments/loans approved).
User Onboarding Conversion Rate
Revenue-impacting Latency (checkout, payment APIs).
Abandonment Rate (loan journey drop-offs).
Fraud Detection Accuracy (false positives/negatives).
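For fraud detection, a single "accuracy" number can mislead because fraud is rare, so false positives and false negatives are usually tracked separately. The sketch below derives precision, recall, and false positive rate from confusion-matrix counts. The function name and the example counts are made up for illustration.

```python
def fraud_detection_metrics(tp: int, fp: int, tn: int, fn: int) -> dict:
    """Quality metrics from confusion-matrix counts.

    tp: fraud correctly flagged      fp: legitimate transactions flagged
    tn: legitimate correctly passed  fn: fraud that slipped through
    """
    if min(tp + fp, tp + fn, fp + tn) == 0:
        raise ValueError("not enough data in one of the classes")
    total = tp + fp + tn + fn
    return {
        "precision": tp / (tp + fp),            # flagged cases that were real fraud
        "recall": tp / (tp + fn),               # real fraud that was caught
        "false_positive_rate": fp / (fp + tn),  # good customers wrongly blocked
        "accuracy": (tp + tn) / total,
    }


metrics = fraud_detection_metrics(tp=80, fp=20, tn=880, fn=20)
print(metrics["precision"], metrics["recall"])  # → 0.8 0.8
```

In this example accuracy is 96% even though one in five fraud cases is missed, which is why recall and false positive rate deserve their own dashboards.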
🔹 7. Observability & Tracing Metrics
Distributed Tracing (Jaeger/Zipkin/Tempo) → End-to-end latency.
Log Volume & Error Patterns (ELK, Loki, Splunk).
Service-to-Service Dependency Graph (Kiali for Istio).
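To make the tracing bullet concrete, the sketch below computes end-to-end latency and time per service from the spans of a single trace. The span shape (a dict with `service`, `start_ms`, `end_ms`) is a simplified assumption, not the actual Jaeger, Zipkin, or Tempo data model, and the per-service totals include time spent waiting on child calls.

```python
from collections import defaultdict


def trace_summary(spans):
    """Return (end_to_end_ms, per_service_ms) for the spans of one trace."""
    if not spans:
        raise ValueError("trace has no spans")
    # End-to-end latency: earliest span start to latest span end.
    end_to_end = max(s["end_ms"] for s in spans) - min(s["start_ms"] for s in spans)
    per_service = defaultdict(float)
    for s in spans:
        per_service[s["service"]] += s["end_ms"] - s["start_ms"]
    return end_to_end, dict(per_service)


# Hypothetical trace: a gateway call that fans out to auth and payments
spans = [
    {"service": "gateway", "start_ms": 0, "end_ms": 120},
    {"service": "auth", "start_ms": 5, "end_ms": 25},
    {"service": "payments", "start_ms": 30, "end_ms": 110},
]
end_to_end, per_service = trace_summary(spans)
print(end_to_end)  # → 120
```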
✅ In practice, you should track a combination of Golden Signals + Business KPIs.
Golden Signals (Latency, Traffic, Errors, Saturation) → For SRE/DevOps.
Business KPIs → For Product/Management.