Reliability-SLI/SLO/SLA
- Anand Nerurkar
- Feb 21
- 6 min read
Updated: Mar 1
⭐ Core SRE Principles (Explain Like This in Interview)
1️⃣ Define SLIs, SLOs, SLAs
SLI (Service Level Indicator) → Measurable metric (e.g., API response time)
SLO (Service Level Objective) → Target value (e.g., 99.95% availability)
SLA (Service Level Agreement) → Business commitment
🎤 Say this:
Reliability must be measurable. I start by defining SLIs and SLOs aligned to business criticality.
2️⃣ Error Budget Concept
If SLO = 99.9% availability, then 0.1% downtime is your allowed failure window.
If error budget is exhausted:
👉 Stop new feature releases
👉 Focus on stability
This balances speed vs reliability.
3️⃣ Automation Over Manual Ops
CI/CD pipelines
Infrastructure as Code
Auto-scaling
Automated failover
4️⃣ Observability
Not just monitoring.
Must include:
Metrics
Logs
Traces
Real-time alerts
🏦 Now — How to Implement SRE in Banking (Practical Walkthrough)
Let’s take an example: a Digital Lending Platform integrated with the core (e.g., the Finacle banking platform).
🟦 Step 1: Classify Criticality
Not all services are equal.
Example:
Loan disbursement API → Tier 1 critical
Notification service → Tier 2
Define SLOs accordingly.
🟦 Step 2: Define SLOs Per Service
Example:
Loan approval API: 99.95% uptime
Core posting integration: 99.99% consistency
Tie to business impact.
🟦 Step 3: Build Observability Stack
Implement:
Centralized logging
Distributed tracing
Real-time alerting
Dashboard for leadership
🎤 Say this:
Reliability must be transparent to both engineering and business stakeholders.
🟦 Step 4: Implement Resilience Engineering
Circuit breakers
Retry logic
Bulkheads
Load balancing
Failover nodes
For banking, redundancy is mandatory.
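As a sketch of the first pattern above, a minimal circuit breaker can fail fast when a downstream system (e.g., the core) keeps failing, then probe again after a cooldown. The class name and thresholds are illustrative, not a specific library:

```python
import time

class CircuitBreaker:
    """Minimal circuit breaker: opens after max_failures consecutive
    failures, fails fast while open, and half-opens after a cooldown."""

    def __init__(self, max_failures=3, cooldown_s=30.0):
        self.max_failures = max_failures
        self.cooldown_s = cooldown_s
        self.failures = 0
        self.opened_at = None

    def call(self, fn, fallback):
        if self.opened_at is not None:
            if time.monotonic() - self.opened_at < self.cooldown_s:
                return fallback()          # open: fail fast, don't hit the core
            self.opened_at = None          # half-open: allow one probe call
        try:
            result = fn()
            self.failures = 0              # success closes the breaker
            return result
        except Exception:
            self.failures += 1
            if self.failures >= self.max_failures:
                self.opened_at = time.monotonic()
            return fallback()
```

In production you would use a hardened implementation (e.g., Resilience4j on the JVM), but the state machine is the same: closed → open → half-open.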
🟦 Step 5: Incident Management Governance
Runbooks
RCA process
Blameless postmortems
Regulatory reporting readiness
This is especially important in the Indian banking environment.
🟦 Step 6: Error Budget Governance
If error budget breached:
Freeze feature rollout
Stabilize system
Prioritize reliability sprint
This shows maturity.
🎤 Say this:
I implement SRE by first defining measurable SLOs aligned to business criticality. Then I enforce observability, automation, and resilience patterns such as circuit breakers and redundancy. I promote error budget governance to balance innovation speed with stability. In banking, SRE must ensure operational continuity, regulatory compliance, and customer trust.
🔹 1️⃣ SLI – Service Level Indicator
Definition: A measurement of system performance.
It is a metric.
Examples in banking:
API success rate = 99.97%
Average response time = 180 ms
Core transaction completion rate = 99.99%
Batch job completion time = 2 hours
Think of SLI as:
📊 “What are we measuring?”
It is raw performance data.
🔹 2️⃣ SLO – Service Level Objective
Definition: The target you set for the SLI.
It is an internal engineering goal.
Example:
SLI: Transaction success rate
SLO: ≥ 99.95% monthly success rate
Another:
SLI: Payment API latency
SLO: 95% of requests < 300 ms
Think of SLO as:
🎯 “What performance are we aiming for?”
SLO drives:
Architecture decisions
Reliability engineering
Investment priorities
Alerting thresholds
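To make a latency SLO like “95% of requests < 300 ms” concrete, here is a minimal check against a sample of observed latencies. The function name and the nearest-rank percentile method are my choices for illustration:

```python
import math

def meets_latency_slo(latencies_ms, pct=95, threshold_ms=300):
    """True if the pct-th percentile latency is below threshold_ms
    (nearest-rank percentile over the observed samples)."""
    ordered = sorted(latencies_ms)
    rank = math.ceil(pct * len(ordered) / 100) - 1
    return ordered[rank] < threshold_ms

# 20 requests: one 900 ms outlier is tolerated at p95, so the SLO holds
samples = [200] * 19 + [900]
print(meets_latency_slo(samples))  # True
```

This is exactly why p95/p99 targets are more useful than averages: one slow outlier does not hide behind a healthy mean.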
🔹 3️⃣ SLA – Service Level Agreement
Definition: A contractual commitment to customers or business.
It includes:
Financial penalties
Legal implications
Compensation clauses
Example:
Bank guarantees 99.9% uptime for internet banking
If breached → penalty or fee waiver
Think of SLA as:
📜 “What we legally promise externally.”
SLA is usually slightly lower than SLO.
Why? Because you need a buffer.
🔹 4️⃣ Error Budget
This is where modern reliability thinking becomes powerful.
If your SLO is:
99.95% availability per month
That means:
Total allowable downtime = 100% − SLO = 100% − 99.95% = 0.05%
In a 30-day month:
Total minutes = 43,200
0.05% = ~21.6 minutes
That 21.6 minutes is your error budget.
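The arithmetic above can be captured in a small helper (a minimal sketch; the function name is mine):

```python
def error_budget_minutes(slo_pct: float, days: int = 30) -> float:
    """Allowed downtime per window for a given availability SLO."""
    total_minutes = days * 24 * 60              # 43,200 for a 30-day month
    return total_minutes * (100.0 - slo_pct) / 100.0

print(round(error_budget_minutes(99.95), 1))    # 21.6 minutes/month
print(round(error_budget_minutes(99.99), 1))    # 4.3 minutes/month
```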
Think of error budget as:
💰 “How much failure we can afford.”
If you exceed it:
Freeze releases
Prioritize stability
Reduce change velocity
If you’re within budget:
Continue innovation
Deploy faster
This balances speed vs reliability.
🔥 Simple Comparison Table
| Term | What It Is | Audience | Example |
| --- | --- | --- | --- |
| SLI | Measured metric | Engineering | 99.97% API success |
| SLO | Target for metric | Internal goal | ≥ 99.95% success |
| SLA | Legal commitment | Customer/Business | 99.9% uptime guarantee |
| Error Budget | Allowed failure | Engineering governance | 21 mins downtime/month |
🏦 Banking Context Example (Core Banking)
SLI: Core transaction success rate
SLO: ≥ 99.99%
SLA: 99.9% uptime to customers
Error budget: ~4.3 minutes/month
🧠 Architect-Level Insight
In modern banks:
SLO drives engineering culture
SLA drives legal exposure
Error budget drives release governance
🏦 1️⃣ SLO Model Specifically for Core Banking
When designing SLOs for Core Banking, we don’t treat everything equally.
We classify by criticality tier.
🔴 Tier 0 – Real-Time Financial Transactions (Highest Criticality)
Examples:
Fund transfer (IMPS/NEFT/RTGS)
Account debit/credit
Balance update
ATM withdrawal
SLIs
Transaction success rate
End-to-end latency
Data consistency rate
Reconciliation mismatch rate
SLO
Success rate ≥ 99.99%
95th percentile latency < 500ms
Data inconsistency < 0.001%
Zero data loss
Error Budget
99.99% availability = ~4.3 minutes/month
If error budget exhausted:
Release freeze
Incident RCA within 24 hrs
Executive review
🟠 Tier 1 – Customer Channels (Mobile / Internet Banking)
SLIs
Channel availability
Login success rate
API latency
SLO
Availability ≥ 99.95%
Login success ≥ 99.9%
API p95 latency < 800ms
Error budget ≈ 21 minutes/month
🟡 Tier 2 – Back Office / Batch Processing
SLIs
EOD batch completion time
Report generation success rate
SLO
100% batch completion before 6 AM
≥ 99.5% job success rate
🧠 Architectural Controls to Meet These SLOs
Active-active DC setup
Synchronous DB replication
Circuit breakers
Retry patterns
Graceful degradation
Observability with golden signals (latency, traffic, errors, saturation)
SLO must influence architecture — not just monitoring.
🔥 2️⃣ “Is 99.95% Enough for Banking?”
99.95% sounds high. Why isn’t that enough?
“99.95% translates to roughly 21 minutes of downtime per month. For core financial posting systems, even 5–10 minutes during peak business hours can cause regulatory exposure, reconciliation impact, and reputational damage.
For customer channels, 99.95% may be acceptable. For core transaction posting — it is not.
We tier SLOs based on business criticality rather than applying a flat percentage across the bank.”
Then why not 100%?
“100% reliability is economically impractical. Achieving 100% requires extreme redundancy and operational cost. Instead, we define error budgets aligned to business risk appetite and regulatory exposure.”
What happens when you exhaust error budget?
“We automatically slow down release velocity, prioritize reliability engineering, and escalate to architecture governance. It becomes a business decision, not just an engineering one.”
That’s the maturity answer.
📊 3️⃣ Enterprise Reliability Governance Framework
1️⃣ Business Tiering
Tier 0: Core Financial Posting – 99.99%
Tier 1: Channels – 99.95%
Tier 2: Back Office – 99.5%
2️⃣ Defined SLIs
Availability
Latency (p95 / p99)
Error rate
Data integrity
3️⃣ Error Budget Governance
Defined per service
Linked to release velocity
Breach triggers:
RCA
Change freeze
Architecture review
4️⃣ SLA Alignment
SLA < SLO (buffer maintained)
Financial penalties modeled
Risk quantified
Reliability is a governed risk decision, not an afterthought metric.
🧠
“As an Enterprise Architect, I don’t define uptime percentages. I design reliability as a risk-aligned, business-tiered governance model with measurable error budgets.”
🎯 Step 1: Define What Reliability Means in Banking
In banking, reliability means:
No data loss
No duplicate transactions
No financial inconsistencies
High availability (99.99%+)
Predictable performance
Fast recovery from failure
So reliability = Correctness + Availability + Resilience + Recoverability
🏗 Step 2: Reliability at Each Layer
We ensure reliability across 7 layers.
1️⃣ Infrastructure Reliability
In Cloud (Digital Layer)
Multi-AZ deployment
Auto-scaling groups
Managed Kubernetes (AKS / EKS)
Load balancers with health checks
Distributed Kafka cluster (replication factor ≥ 3)
On-Prem (CBS Side)
Active-passive or active-active CBS
Database replication
Redundant network paths
Dual firewalls
2️⃣ Service-Level Reliability (Microservices)
Each microservice must:
✅ Be Stateless
No session stored locally.
✅ Use Circuit Breaker
If CBS is slow:
Stop calling
Return fallback response
Prevent cascading failure
✅ Timeouts + Retries
Set strict timeout (e.g., 2s)
Retry with exponential backoff
Max retry threshold
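The three rules above can be sketched as a small retry wrapper (function name, delays, and the jitter choice are illustrative):

```python
import random
import time

def call_with_retries(fn, max_attempts=3, base_delay_s=0.2):
    """Retry fn with exponential backoff and jitter; the callable itself
    should enforce its own strict timeout (e.g., 2s on the HTTP client)."""
    for attempt in range(max_attempts):
        try:
            return fn()
        except Exception:
            if attempt == max_attempts - 1:
                raise                       # max retry threshold reached
            # exponential backoff with jitter: ~0.2s, ~0.4s, ~0.8s ...
            time.sleep(base_delay_s * (2 ** attempt) * random.uniform(0.5, 1.0))
```

Jitter matters: without it, many clients retry in lockstep and hammer an already-struggling CBS at the same instant.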
✅ Bulkhead Pattern
Separate connection pools for:
CBS calls
LOS calls
LMS calls
Prevents one failure from affecting the entire system.
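A bulkhead can be as simple as a bounded concurrency pool per downstream system, so a slow CBS cannot exhaust the capacity needed for LOS/LMS traffic. This sketch uses semaphores in place of real connection pools; the pool sizes and names are illustrative:

```python
import threading

# One bounded pool per downstream dependency (sizes are examples).
POOLS = {
    "CBS": threading.BoundedSemaphore(10),
    "LOS": threading.BoundedSemaphore(20),
    "LMS": threading.BoundedSemaphore(20),
}

def call_downstream(system, fn):
    """Run fn inside the bulkhead for `system`; shed load if the pool is full
    instead of queueing and tying up threads needed by other systems."""
    sem = POOLS[system]
    if not sem.acquire(blocking=False):
        raise RuntimeError(f"{system} bulkhead full — shedding load")
    try:
        return fn()
    finally:
        sem.release()
```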
3️⃣ Data Reliability
This is most critical in banking.
🔐 Idempotency
Every request includes:
X-Idempotency-Key = LoanID or TransactionID
This prevents duplicate loan creation.
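The server-side handling can be sketched like this. An in-memory dict stands in for the idempotency store (production would use a database table with a unique constraint on the key); all names here are illustrative:

```python
# Maps idempotency key -> previously returned result.
_processed: dict[str, dict] = {}

def create_loan(idempotency_key: str, payload: dict) -> dict:
    """If this key was already processed, replay the stored result
    instead of creating a second loan."""
    if idempotency_key in _processed:
        return _processed[idempotency_key]      # duplicate request detected
    result = {"loan_id": idempotency_key, "status": "CREATED", **payload}
    _processed[idempotency_key] = result
    return result

first = create_loan("LN-1001", {"amount": 500000})
retry = create_loan("LN-1001", {"amount": 500000})  # network retry, same key
assert first is retry                               # same loan, no duplicate
```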
🔄 Outbox Pattern
When service updates DB:
Save business data
Save event in outbox table
Background process publishes event to Kafka
Guarantees:
No event loss
No partial update
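The steps above can be sketched end to end. SQLite stands in for the service’s own database and a callback stands in for the Kafka producer; table and function names are mine:

```python
import json
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE loans (id TEXT PRIMARY KEY, amount INTEGER);
    CREATE TABLE outbox (id INTEGER PRIMARY KEY AUTOINCREMENT,
                         event TEXT, published INTEGER DEFAULT 0);
""")

def save_loan(loan_id, amount):
    # Business row and outbox event commit in ONE local transaction:
    # either both are written or neither is, so no event can be lost.
    with conn:
        conn.execute("INSERT INTO loans VALUES (?, ?)", (loan_id, amount))
        conn.execute("INSERT INTO outbox (event) VALUES (?)",
                     (json.dumps({"type": "LoanCreated", "loan_id": loan_id}),))

def relay_once(publish):
    """Background relay: publish pending events, then mark them as sent."""
    for row_id, event in conn.execute(
            "SELECT id, event FROM outbox WHERE published = 0").fetchall():
        publish(json.loads(event))          # e.g. a Kafka producer send
        with conn:
            conn.execute("UPDATE outbox SET published = 1 WHERE id = ?",
                         (row_id,))

save_loan("LN-1", 250000)
sent = []
relay_once(sent.append)                     # sent now holds the LoanCreated event
```

If the relay crashes after publishing but before marking the row, the event is published again on restart, so consumers must be idempotent (at-least-once delivery).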
📦 Kafka Reliability
Replication factor ≥ 3
Acknowledgment level = ALL
Dead letter queues for failed messages
Consumer offset tracking
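As a sketch, the producer and topic settings that back these guarantees look like this (key names follow the confluent-kafka / librdkafka style; adjust for your client library, and the broker addresses are placeholders):

```python
# Producer configuration for at-least-once, duplicate-free publishing.
producer_config = {
    "bootstrap.servers": "broker1:9092,broker2:9092,broker3:9092",
    "acks": "all",               # leader + in-sync replicas must confirm
    "enable.idempotence": True,  # no duplicate writes on producer retries
    "retries": 5,
}

# Topic-level settings (set at topic creation / broker config).
TOPIC_REPLICATION_FACTOR = 3     # survives the loss of up to two brokers
MIN_INSYNC_REPLICAS = 2          # acks=all requires at least this many alive
```

With replication factor 3 and `min.insync.replicas=2`, an acknowledged write survives a single broker failure without data loss.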
4️⃣ Transaction Reliability (Saga Pattern)
Loan creation involves:
Loan approval
CBS account creation
LMS loan schedule
Disbursement
We use Orchestrated Saga Pattern:
If CBS fails:
Compensate → mark loan as FAILED
Do not proceed to LMS
This prevents inconsistent state.
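The orchestration logic above can be sketched as a tiny saga runner. Step names and the simulated CBS failure are illustrative, not a real framework API:

```python
def run_loan_saga(steps):
    """Run steps in order; on any failure, run the compensations of the
    completed steps in reverse order and stop (no further steps run)."""
    done = []
    for name, action, compensate in steps:
        try:
            action()
            done.append((name, compensate))
        except Exception:
            for _, undo in reversed(done):
                undo()                      # compensate what already succeeded
            return "FAILED"                 # loan marked FAILED; LMS never called
    return "COMPLETED"

log = []
def approve(): log.append("loan approved")
def undo_approve(): log.append("approval reversed")
def cbs_create_account(): raise RuntimeError("CBS timeout")  # simulated failure

status = run_loan_saga([
    ("approve",      approve,            undo_approve),
    ("cbs_account",  cbs_create_account, lambda: None),
    ("lms_schedule", lambda: log.append("lms scheduled"), lambda: None),
])
# status == "FAILED"; approval was compensated; the LMS step never ran
```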
5️⃣ Hybrid Network Reliability
Between Cloud ↔ On-Prem:
Private connectivity (ExpressRoute / MPLS)
Mutual TLS
Retry logic at adapter
Secondary failover endpoint
If primary CBS endpoint fails:
Switch to secondary
6️⃣ Monitoring & Observability
Reliability without visibility is impossible.
Metrics
API latency
Error rate
CBS response time
Kafka lag
Tracing
Track full journey:
User → Gateway → LoanSvc → Adapter → CBS
Alerting
SLA breach alerts
Kafka consumer lag alerts
DB replication lag alerts
7️⃣ Recovery & Reconciliation
Even with best design, failures happen.
So we implement:
Near Real-Time Reconciliation
Digital vs LOS vs LMS vs CBS comparison.
End-of-Day ARB
Financial reconciliation.
Replay Capability
Kafka allows replaying events from offset.
Manual Override Dashboard
Ops team can:
Retry
Reconcile
Re-trigger events
🛡 Reliability Example Scenario
Scenario:
Loan account created in CBS but response lost.
Without reliability → duplicate loan risk.
With reliability:
Adapter uses idempotency key.
If retry happens:
CBS detects duplicate
Returns existing loan account
Reconciliation confirms consistency.
No financial corruption.
📊 How will you ensure reliability in hybrid banking architecture?
“I design reliability across infrastructure, service, data, and operational layers. I ensure stateless microservices with circuit breakers, idempotent APIs, outbox-based event publishing, Kafka replication, saga-based transaction management, and continuous reconciliation between digital and core systems. Additionally, we implement observability and automated recovery mechanisms to maintain 99.99% availability.”
That answer shows maturity.
🧠 Bonus: Reliability Pyramid
Infrastructure Stability
↓ Service Resilience
↓ Data Consistency
↓ Transaction Integrity
↓ Monitoring & Recovery
