Reliability-SLI/SLO/SLA
- Anand Nerurkar
- Feb 21
- 6 min read
Updated: Mar 1
⭐ Core SRE Principles (Explain Like This in Interview)
1️⃣ Define SLIs, SLOs, SLAs
SLI (Service Level Indicator) → Measurable metric (e.g., API response time)
SLO (Service Level Objective) → Target value (e.g., 99.95% availability)
SLA (Service Level Agreement) → Business commitment
🎤 Say this:
Reliability must be measurable. I start by defining SLIs and SLOs aligned to business criticality.
2️⃣ Error Budget Concept
If SLO = 99.9% availability, then 0.1% downtime is your allowed failure window.
If error budget is exhausted:
👉 Stop new feature releases
👉 Focus on stability
This balances speed vs reliability.
3️⃣ Automation Over Manual Ops
CI/CD pipelines
Infrastructure as Code
Auto-scaling
Automated failover
4️⃣ Observability
Not just monitoring.
Must include:
Metrics
Logs
Traces
Real-time alerts
🏦 Now — How to Implement SRE in Banking (Practical Walkthrough)
Let’s take an example: a Digital Lending Platform integrated with the core (e.g., the Finacle banking platform).
🟦 Step 1: Classify Criticality
Not all services are equal.
Example:
Loan disbursement API → Tier 1 critical
Notification service → Tier 2
Define SLOs accordingly.
🟦 Step 2: Define SLOs Per Service
Example:
Loan approval API: 99.95% uptime
Core posting integration: 99.99% consistency
Tie to business impact.
🟦 Step 3: Build Observability Stack
Implement:
Centralized logging
Distributed tracing
Real-time alerting
Dashboard for leadership
🎤 Say this:
Reliability must be transparent to both engineering and business stakeholders.
🟦 Step 4: Implement Resilience Engineering
Circuit breakers
Retry logic
Bulkheads
Load balancing
Failover nodes
For banking, redundancy is mandatory.
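As a sketch of the first pattern above, a minimal circuit breaker can fail fast when a downstream system (e.g., the core) keeps failing, then probe again after a cooldown. The class name and thresholds are illustrative, not a specific library:

```python
import time

class CircuitBreaker:
    """Minimal circuit breaker: opens after max_failures consecutive
    failures, fails fast while open, and half-opens after a cooldown."""

    def __init__(self, max_failures=3, cooldown_s=30.0):
        self.max_failures = max_failures
        self.cooldown_s = cooldown_s
        self.failures = 0
        self.opened_at = None

    def call(self, fn, fallback):
        if self.opened_at is not None:
            if time.monotonic() - self.opened_at < self.cooldown_s:
                return fallback()          # open: fail fast, don't hit the core
            self.opened_at = None          # half-open: allow one probe call
        try:
            result = fn()
            self.failures = 0              # success closes the breaker
            return result
        except Exception:
            self.failures += 1
            if self.failures >= self.max_failures:
                self.opened_at = time.monotonic()
            return fallback()
```

In production you would use a hardened implementation (e.g., Resilience4j on the JVM), but the state machine is the same: closed → open → half-open.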
🟦 Step 5: Incident Management Governance
Runbooks
RCA process
Blameless postmortems
Regulatory reporting readiness
This is especially important in the Indian banking environment.
🟦 Step 6: Error Budget Governance
If error budget breached:
Freeze feature rollout
Stabilize system
Prioritize reliability sprint
This shows maturity.
🎤 Say this:
I implement SRE by first defining measurable SLOs aligned to business criticality. Then I enforce observability, automation, and resilience patterns such as circuit breakers and redundancy. I promote error budget governance to balance innovation speed with stability. In banking, SRE must ensure operational continuity, regulatory compliance, and customer trust.
🔹 1️⃣ SLI – Service Level Indicator
Definition: A measurement of system performance.
It is a metric.
Examples in banking:
API success rate = 99.97%
Average response time = 180 ms
Core transaction completion rate = 99.99%
Batch job completion time = 2 hours
Think of SLI as:
📊 “What are we measuring?”
It is raw performance data.
🔹 2️⃣ SLO – Service Level Objective
Definition: The target you set for the SLI.
It is an internal engineering goal.
Example:
SLI: Transaction success rate
SLO: ≥ 99.95% monthly success rate
Another:
SLI: Payment API latency
SLO: 95% of requests < 300 ms
Think of SLO as:
🎯 “What performance are we aiming for?”
SLO drives:
Architecture decisions
Reliability engineering
Investment priorities
Alerting thresholds
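To make a latency SLO like “95% of requests < 300 ms” concrete, here is a minimal check against a sample of observed latencies. The function name and the nearest-rank percentile method are my choices for illustration:

```python
import math

def meets_latency_slo(latencies_ms, pct=95, threshold_ms=300):
    """True if the pct-th percentile latency is below threshold_ms
    (nearest-rank percentile over the observed samples)."""
    ordered = sorted(latencies_ms)
    rank = math.ceil(pct * len(ordered) / 100) - 1
    return ordered[rank] < threshold_ms

# 20 requests: one 900 ms outlier is tolerated at p95, so the SLO holds
samples = [200] * 19 + [900]
print(meets_latency_slo(samples))  # True
```

This is exactly why p95/p99 targets are more useful than averages: one slow outlier does not hide behind a healthy mean.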
🔹 3️⃣ SLA – Service Level Agreement
Definition: A contractual commitment to customers or business.
It includes:
Financial penalties
Legal implications
Compensation clauses
Example:
Bank guarantees 99.9% uptime for internet banking
If breached → penalty or fee waiver
Think of SLA as:
📜 “What we legally promise externally.”
SLA is usually slightly lower than SLO.
Why? Because you need a buffer.
🔹 4️⃣ Error Budget
This is where modern reliability thinking becomes powerful.
If your SLO is:
99.95% availability per month
That means:
Total allowable downtime = 100% − SLO = 100% − 99.95% = 0.05%
In a 30-day month:
Total minutes = 43,200
0.05% = ~21.6 minutes
That 21.6 minutes is your error budget.
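The arithmetic above can be captured in a small helper (a minimal sketch; the function name is mine):

```python
def error_budget_minutes(slo_pct: float, days: int = 30) -> float:
    """Allowed downtime per window for a given availability SLO."""
    total_minutes = days * 24 * 60              # 43,200 for a 30-day month
    return total_minutes * (100.0 - slo_pct) / 100.0

print(round(error_budget_minutes(99.95), 1))    # 21.6 minutes/month
print(round(error_budget_minutes(99.99), 1))    # 4.3 minutes/month
```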
Think of error budget as:
💰 “How much failure we can afford.”
If you exceed it:
Freeze releases
Prioritize stability
Reduce change velocity
If you’re within budget:
Continue innovation
Deploy faster
This balances speed vs reliability.
🔥 Simple Comparison Table
| Term | What It Is | Audience | Example |
| --- | --- | --- | --- |
| SLI | Measured metric | Engineering | 99.97% API success |
| SLO | Target for metric | Internal goal | ≥ 99.95% success |
| SLA | Legal commitment | Customer/Business | 99.9% uptime guarantee |
| Error Budget | Allowed failure | Engineering governance | 21 mins downtime/month |
🏦 Banking Context Example (Core Banking)
SLI: Core transaction success rate
SLO: ≥ 99.99%
SLA: 99.9% uptime to customers
Error budget: ~4.3 minutes/month
🧠 Architect-Level Insight
In modern banks:
SLO drives engineering culture
SLA drives legal exposure
Error budget drives release governance
🏦 1️⃣ SLO Model Specifically for Core Banking
When designing SLOs for Core Banking, we don’t treat everything equally.
We classify by criticality tier.
🔴 Tier 0 – Real-Time Financial Transactions (Highest Criticality)
Examples:
Fund transfer (IMPS/NEFT/RTGS)
Account debit/credit
Balance update
ATM withdrawal
SLIs
Transaction success rate
End-to-end latency
Data consistency rate
Reconciliation mismatch rate
SLO
Success rate ≥ 99.99%
95th percentile latency < 500ms
Data inconsistency < 0.001%
Zero data loss
Error Budget
99.99% availability = ~4.3 minutes/month
If error budget exhausted:
Release freeze
Incident RCA within 24 hrs
Executive review
🟠 Tier 1 – Customer Channels (Mobile / Internet Banking)
SLIs
Channel availability
Login success rate
API latency
SLO
Availability ≥ 99.95%
Login success ≥ 99.9%
API p95 latency < 800ms
Error budget ≈ 21 minutes/month
🟡 Tier 2 – Back Office / Batch Processing
SLIs
EOD batch completion time
Report generation success rate
SLO
100% batch completion before 6 AM
≥ 99.5% job success rate
🧠 Architectural Controls to Meet These SLOs
Active-active DC setup
Synchronous DB replication
Circuit breakers
Retry patterns
Graceful degradation
Observability with golden signals (latency, traffic, errors, saturation)
SLO must influence architecture — not just monitoring.
🔥 2️⃣ “Is 99.95% Enough for Banking?”
99.95% sounds high. Why isn’t that enough?
“99.95% translates to roughly 21 minutes of downtime per month. For core financial posting systems, even 5–10 minutes during peak business hours can cause regulatory exposure, reconciliation impact, and reputational damage.
For customer channels, 99.95% may be acceptable. For core transaction posting — it is not.
We tier SLOs based on business criticality rather than applying a flat percentage across the bank.”
Then why not 100%?
“100% reliability is economically impractical. Achieving 100% requires extreme redundancy and operational cost. Instead, we define error budgets aligned to business risk appetite and regulatory exposure.”
What happens when you exhaust error budget?
“We automatically slow down release velocity, prioritize reliability engineering, and escalate to architecture governance. It becomes a business decision, not just an engineering one.”
That’s the maturity answer.
📊 3️⃣ Enterprise Reliability Governance Framework
1️⃣ Business Tiering
Tier 0: Core Financial Posting – 99.99%
Tier 1: Channels – 99.95%
Tier 2: Back Office – 99.5%
2️⃣ Defined SLIs
Availability
Latency (p95 / p99)
Error rate
Data integrity
3️⃣ Error Budget Governance
Defined per service
Linked to release velocity
Breach triggers:
RCA
Change freeze
Architecture review
4️⃣ SLA Alignment
SLA < SLO (buffer maintained)
Financial penalties modeled
Risk quantified
Reliability is a governed risk decision, not an afterthought metric.
🧠
“As an Enterprise Architect, I don’t define uptime percentages. I design reliability as a risk-aligned, business-tiered governance model with measurable error budgets.”
🎯 Step 1: Define What Reliability Means in Banking
In banking, reliability means:
No data loss
No duplicate transactions
No financial inconsistencies
High availability (99.99%+)
Predictable performance
Fast recovery from failure
So reliability = Correctness + Availability + Resilience + Recoverability
🏗 Step 2: Reliability at Each Layer
We ensure reliability across 7 layers.
1️⃣ Infrastructure Reliability
In Cloud (Digital Layer)
Multi-AZ deployment
Auto-scaling groups
Managed Kubernetes (AKS / EKS)
Load balancers with health checks
Distributed Kafka cluster (replication factor ≥ 3)
On-Prem (CBS Side)
Active-passive or active-active CBS
Database replication
Redundant network paths
Dual firewalls
2️⃣ Service-Level Reliability (Microservices)
Each microservice must:
✅ Be Stateless
No session stored locally.
✅ Use Circuit Breaker
If CBS is slow:
Stop calling
Return fallback response
Prevent cascading failure
✅ Timeouts + Retries
Set strict timeout (e.g., 2s)
Retry with exponential backoff
Max retry threshold
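The three rules above can be sketched as a small retry wrapper (function name, delays, and the jitter choice are illustrative):

```python
import random
import time

def call_with_retries(fn, max_attempts=3, base_delay_s=0.2):
    """Retry fn with exponential backoff and jitter; the callable itself
    should enforce its own strict timeout (e.g., 2s on the HTTP client)."""
    for attempt in range(max_attempts):
        try:
            return fn()
        except Exception:
            if attempt == max_attempts - 1:
                raise                       # max retry threshold reached
            # exponential backoff with jitter: ~0.2s, ~0.4s, ~0.8s ...
            time.sleep(base_delay_s * (2 ** attempt) * random.uniform(0.5, 1.0))
```

Jitter matters: without it, many clients retry in lockstep and hammer an already-struggling CBS at the same instant.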
✅ Bulkhead Pattern
Separate connection pools for:
CBS calls
LOS calls
LMS calls
Prevents one failure from affecting the entire system.
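A bulkhead can be as simple as a bounded concurrency pool per downstream system, so a slow CBS cannot exhaust the capacity needed for LOS/LMS traffic. This sketch uses semaphores in place of real connection pools; the pool sizes and names are illustrative:

```python
import threading

# One bounded pool per downstream dependency (sizes are examples).
POOLS = {
    "CBS": threading.BoundedSemaphore(10),
    "LOS": threading.BoundedSemaphore(20),
    "LMS": threading.BoundedSemaphore(20),
}

def call_downstream(system, fn):
    """Run fn inside the bulkhead for `system`; shed load if the pool is full
    instead of queueing and tying up threads needed by other systems."""
    sem = POOLS[system]
    if not sem.acquire(blocking=False):
        raise RuntimeError(f"{system} bulkhead full — shedding load")
    try:
        return fn()
    finally:
        sem.release()
```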
3️⃣ Data Reliability
This is most critical in banking.
🔐 Idempotency
Every request includes:
X-Idempotency-Key = LoanID or TransactionID
This prevents duplicate loan creation.
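The server-side handling can be sketched like this. An in-memory dict stands in for the idempotency store (production would use a database table with a unique constraint on the key); all names here are illustrative:

```python
# Maps idempotency key -> previously returned result.
_processed: dict[str, dict] = {}

def create_loan(idempotency_key: str, payload: dict) -> dict:
    """If this key was already processed, replay the stored result
    instead of creating a second loan."""
    if idempotency_key in _processed:
        return _processed[idempotency_key]      # duplicate request detected
    result = {"loan_id": idempotency_key, "status": "CREATED", **payload}
    _processed[idempotency_key] = result
    return result

first = create_loan("LN-1001", {"amount": 500000})
retry = create_loan("LN-1001", {"amount": 500000})  # network retry, same key
assert first is retry                               # same loan, no duplicate
```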
🔄 Outbox Pattern
When service updates DB:
Save business data
Save event in outbox table
Background process publishes event to Kafka
Guarantees:
No event loss
No partial update
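The steps above can be sketched end to end. SQLite stands in for the service’s own database and a callback stands in for the Kafka producer; table and function names are mine:

```python
import json
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE loans (id TEXT PRIMARY KEY, amount INTEGER);
    CREATE TABLE outbox (id INTEGER PRIMARY KEY AUTOINCREMENT,
                         event TEXT, published INTEGER DEFAULT 0);
""")

def save_loan(loan_id, amount):
    # Business row and outbox event commit in ONE local transaction:
    # either both are written or neither is, so no event can be lost.
    with conn:
        conn.execute("INSERT INTO loans VALUES (?, ?)", (loan_id, amount))
        conn.execute("INSERT INTO outbox (event) VALUES (?)",
                     (json.dumps({"type": "LoanCreated", "loan_id": loan_id}),))

def relay_once(publish):
    """Background relay: publish pending events, then mark them as sent."""
    for row_id, event in conn.execute(
            "SELECT id, event FROM outbox WHERE published = 0").fetchall():
        publish(json.loads(event))          # e.g. a Kafka producer send
        with conn:
            conn.execute("UPDATE outbox SET published = 1 WHERE id = ?",
                         (row_id,))

save_loan("LN-1", 250000)
sent = []
relay_once(sent.append)                     # sent now holds the LoanCreated event
```

If the relay crashes after publishing but before marking the row, the event is published again on restart, so consumers must be idempotent (at-least-once delivery).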
📦 Kafka Reliability
Replication factor ≥ 3
Acknowledgment level = ALL
Dead letter queues for failed messages
Consumer offset tracking
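As a sketch, the producer and topic settings that back these guarantees look like this (key names follow the confluent-kafka / librdkafka style; adjust for your client library, and the broker addresses are placeholders):

```python
# Producer configuration for at-least-once, duplicate-free publishing.
producer_config = {
    "bootstrap.servers": "broker1:9092,broker2:9092,broker3:9092",
    "acks": "all",               # leader + in-sync replicas must confirm
    "enable.idempotence": True,  # no duplicate writes on producer retries
    "retries": 5,
}

# Topic-level settings (set at topic creation / broker config).
TOPIC_REPLICATION_FACTOR = 3     # survives the loss of up to two brokers
MIN_INSYNC_REPLICAS = 2          # acks=all requires at least this many alive
```

With replication factor 3 and `min.insync.replicas=2`, an acknowledged write survives a single broker failure without data loss.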
4️⃣ Transaction Reliability (Saga Pattern)
Loan creation involves:
Loan approval
CBS account creation
LMS loan schedule
Disbursement
We use Orchestrated Saga Pattern:
If CBS fails:
Compensate → mark loan as FAILED
Do not proceed to LMS
This prevents inconsistent state.
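The orchestration logic above can be sketched as a tiny saga runner. Step names and the simulated CBS failure are illustrative, not a real framework API:

```python
def run_loan_saga(steps):
    """Run steps in order; on any failure, run the compensations of the
    completed steps in reverse order and stop (no further steps run)."""
    done = []
    for name, action, compensate in steps:
        try:
            action()
            done.append((name, compensate))
        except Exception:
            for _, undo in reversed(done):
                undo()                      # compensate what already succeeded
            return "FAILED"                 # loan marked FAILED; LMS never called
    return "COMPLETED"

log = []
def approve(): log.append("loan approved")
def undo_approve(): log.append("approval reversed")
def cbs_create_account(): raise RuntimeError("CBS timeout")  # simulated failure

status = run_loan_saga([
    ("approve",      approve,            undo_approve),
    ("cbs_account",  cbs_create_account, lambda: None),
    ("lms_schedule", lambda: log.append("lms scheduled"), lambda: None),
])
# status == "FAILED"; approval was compensated; the LMS step never ran
```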
5️⃣ Hybrid Network Reliability
Between Cloud ↔ On-Prem:
Private connectivity (ExpressRoute / MPLS)
Mutual TLS
Retry logic at adapter
Secondary failover endpoint
If primary CBS endpoint fails:
Switch to secondary
6️⃣ Monitoring & Observability
Reliability without visibility is impossible.
Metrics
API latency
Error rate
CBS response time
Kafka lag
Tracing
Track full journey:
User → Gateway → LoanSvc → Adapter → CBS
Alerting
SLA breach alerts
Kafka consumer lag alerts
DB replication lag alerts
7️⃣ Recovery & Reconciliation
Even with best design, failures happen.
So we implement:
Near Real-Time Reconciliation
Digital vs LOS vs LMS vs CBS comparison.
End-of-Day ARB
Financial reconciliation.
Replay Capability
Kafka allows replaying events from offset.
Manual Override Dashboard
Ops team can:
Retry
Reconcile
Re-trigger events
🛡 Reliability Example Scenario
Scenario:
Loan account created in CBS but response lost.
Without reliability → duplicate loan risk.
With reliability:
Adapter uses idempotency key.
If retry happens:
CBS detects duplicate
Returns existing loan account
Reconciliation confirms consistency.
No financial corruption.
📊 How will you ensure reliability in hybrid banking architecture?
“I design reliability across infrastructure, service, data, and operational layers. I ensure stateless microservices with circuit breakers, idempotent APIs, outbox-based event publishing, Kafka replication, saga-based transaction management, and continuous reconciliation between digital and core systems. Additionally, we implement observability and automated recovery mechanisms to maintain 99.99% availability.”
That answer shows maturity.
🧠 Bonus: Reliability Pyramid
Infrastructure Stability
↓ Service Resilience
↓ Data Consistency
↓ Transaction Integrity
↓ Monitoring & Recovery
