
Reliability-SLI/SLO/SLA

  • Writer: Anand Nerurkar
  • Feb 21

Updated: Mar 1

⭐ Core SRE Principles (Explain Like This in Interview)

1️⃣ Define SLIs, SLOs, SLAs

  • SLI (Service Level Indicator) → Measurable metric (e.g., API response time)

  • SLO (Service Level Objective) → Target value (e.g., 99.95% availability)

  • SLA (Service Level Agreement) → Business commitment

🎤 Say this:

Reliability must be measurable. I start by defining SLIs and SLOs aligned to business criticality.

2️⃣ Error Budget Concept

If the SLO is 99.9% availability, then the remaining 0.1% is the allowed failure window (the error budget).

If the error budget is exhausted:

👉 Stop new feature releases

👉 Focus on stability

This balances speed vs reliability.

3️⃣ Automation Over Manual Ops

  • CI/CD pipelines

  • Infrastructure as Code

  • Auto-scaling

  • Automated failover

4️⃣ Observability

Not just monitoring.

Must include:

  • Metrics

  • Logs

  • Traces

  • Real-time alerts

🏦 Now — How to Implement SRE in Banking (Practical Walkthrough)

Let’s take an example: a digital lending platform integrated with a core banking system (e.g., the Finacle banking platform).

🟦 Step 1: Classify Criticality

Not all services are equally critical.

Example:

  • Loan disbursement API → Tier 1 critical

  • Notification service → Tier 2

Define each service's SLO according to its tier.

🟦 Step 2: Define SLOs Per Service

Example:

  • Loan approval API: 99.95% uptime

  • Core posting integration: 99.99% consistency

Tie to business impact.

🟦 Step 3: Build Observability Stack

Implement:

  • Centralized logging

  • Distributed tracing

  • Real-time alerting

  • Dashboard for leadership

Say:

Reliability must be transparent to both engineering and business stakeholders.

🟦 Step 4: Implement Resilience Engineering

  • Circuit breakers

  • Retry logic

  • Bulkheads

  • Load balancing

  • Failover nodes

For banking, redundancy is mandatory.

🟦 Step 5: Incident Management Governance

  • Runbooks

  • RCA process

  • Blameless postmortems

  • Regulatory reporting readiness

This is especially important in the Indian banking environment.

🟦 Step 6: Error Budget Governance

If the error budget is breached:

  • Freeze feature rollout

  • Stabilize system

  • Prioritize reliability sprint

This shows maturity.

🎤 Say this:

I implement SRE by first defining measurable SLOs aligned to business criticality. Then I enforce observability, automation, and resilience patterns such as circuit breakers and redundancy. I promote error budget governance to balance innovation speed with stability. In banking, SRE must ensure operational continuity, regulatory compliance, and customer trust.


🔹 1️⃣ SLI – Service Level Indicator

Definition: A measurement of system performance.

It is a metric.

Examples in banking:

  • API success rate = 99.97%

  • Average response time = 180 ms

  • Core transaction completion rate = 99.99%

  • Batch job completion time = 2 hours

Think of SLI as:

📊 “What are we measuring?”

It is raw performance data.

🔹 2️⃣ SLO – Service Level Objective

Definition: The target you set for the SLI.

It is an internal engineering goal.

Example:

  • SLI: Transaction success rate

  • SLO: ≥ 99.95% monthly success rate

Another:

  • SLI: Payment API latency

  • SLO: 95% of requests < 300 ms
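The two SLI/SLO pairs above can be checked with a few lines of arithmetic. This is a minimal sketch, not a monitoring implementation: the function names and the sample request counts and latencies are hypothetical, chosen only to illustrate the comparison of a measured SLI against its SLO target.

```python
def success_rate_percent(total_requests, failed_requests):
    """SLI: percentage of requests that succeeded in the window."""
    return 100.0 * (total_requests - failed_requests) / total_requests


def percent_under(latencies_ms, threshold_ms):
    """SLI: percentage of requests faster than the latency threshold."""
    fast = sum(1 for latency in latencies_ms if latency < threshold_ms)
    return 100.0 * fast / len(latencies_ms)


def meets_slo(sli_value, slo_target):
    """An SLO is met when the measured SLI is at or above the target."""
    return sli_value >= slo_target


# Hypothetical sample data for one measurement window.
sli_success = success_rate_percent(total_requests=10_000, failed_requests=3)
sli_latency = percent_under([120] * 96 + [450] * 4, threshold_ms=300)

print("Success-rate SLO met:", meets_slo(sli_success, 99.95))
print("Latency SLO met:", meets_slo(sli_latency, 95.0))
```

The SLI is the measured number; the SLO is only the threshold it is compared against.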

Think of SLO as:

🎯 “What performance are we aiming for?”

SLO drives:

  • Architecture decisions

  • Reliability engineering

  • Investment priorities

  • Alerting thresholds

🔹 3️⃣ SLA – Service Level Agreement

Definition: A contractual commitment to customers or business.

It includes:

  • Financial penalties

  • Legal implications

  • Compensation clauses

Example:

  • Bank guarantees 99.9% uptime for internet banking

  • If breached → penalty or fee waiver

Think of SLA as:

📜 “What we legally promise externally.”

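The SLA is usually set slightly lower (less strict) than the SLO.

Why? Because you need a buffer: the internal target should be missed well before the contractual commitment is.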

🔹 4️⃣ Error Budget

This is where modern reliability thinking becomes powerful.

If your SLO is:

99.95% availability per month

That means:

Total allowable downtime = 100% − availability SLO = 100% − 99.95% = 0.05%

In a 30-day month:

  • Total minutes = 43,200

  • 0.05% = ~21.6 minutes

That 21.6 minutes is your error budget.
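The arithmetic above generalizes to any availability SLO. A minimal sketch, assuming a 30-day month as in the text; the function name `error_budget_minutes` is illustrative.

```python
def error_budget_minutes(slo_percent, days=30):
    """Allowed downtime in minutes for an availability SLO over the period."""
    total_minutes = days * 24 * 60          # 43,200 for a 30-day month
    return total_minutes * (100.0 - slo_percent) / 100.0


for slo in (99.9, 99.95, 99.99):
    print(f"SLO {slo}% -> {error_budget_minutes(slo):.1f} minutes/month")
```

Each extra "nine" shrinks the budget tenfold, which is why the target should be chosen per service rather than bank-wide.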

Think of error budget as:

💰 “How much failure we can afford.”

If you exceed it:

  • Freeze releases

  • Prioritize stability

  • Reduce change velocity

If you’re within budget:

  • Continue innovation

  • Deploy faster

This balances speed vs reliability.
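The rule above can be expressed as a simple release gate. This is a sketch under stated assumptions: the function name, the action labels, and the idea of feeding it the downtime consumed so far are all hypothetical, and a budget that is exactly used up is treated as exhausted.

```python
def release_decision(slo_percent, downtime_minutes, days=30):
    """Return (action, remaining_budget_minutes) for the current period."""
    budget = days * 24 * 60 * (100.0 - slo_percent) / 100.0
    remaining = budget - downtime_minutes
    # Exhausted or exceeded budget -> freeze; otherwise keep shipping.
    action = "FREEZE_RELEASES" if remaining <= 0 else "CONTINUE_RELEASES"
    return action, remaining


print(release_decision(99.95, downtime_minutes=10.0))
print(release_decision(99.95, downtime_minutes=25.0))
```

In practice the downtime figure would come from the monitoring system, and the decision would feed a change-approval step in the CI/CD pipeline.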

🔥 Simple Comparison Table

| Term | What It Is | Audience | Example |
| --- | --- | --- | --- |
| SLI | Measured metric | Engineering | 99.97% API success |
| SLO | Target for metric | Internal goal | ≥ 99.95% success |
| SLA | Legal commitment | Customer/Business | 99.9% uptime guarantee |
| Error Budget | Allowed failure | Engineering governance | 21 mins downtime/month |

🏦 Banking Context Example (Core Banking)

  • SLI: Core transaction success rate

  • SLO: ≥ 99.99%

  • SLA: 99.9% uptime to customers

  • Error budget: ~4.3 minutes/month

🧠 Architect-Level Insight

In modern banks:

  • SLO drives engineering culture

  • SLA drives legal exposure

  • Error budget drives release governance


🏦 1️⃣ SLO Model Specifically for Core Banking

When designing SLOs for Core Banking, we don’t treat everything equally.

We classify by criticality tier.

🔴 Tier 0 – Real-Time Financial Transactions (Highest Criticality)

Examples:

  • Fund transfer (IMPS/NEFT/RTGS)

  • Account debit/credit

  • Balance update

  • ATM withdrawal

SLIs

  • Transaction success rate

  • End-to-end latency

  • Data consistency rate

  • Reconciliation mismatch rate

SLO

  • Success rate ≥ 99.99%

  • 95th percentile latency < 500ms

  • Data inconsistency < 0.001%

  • Zero data loss

Error Budget

99.99% availability = ~4.3 minutes/month

If the error budget is exhausted:

  • Release freeze

  • Incident RCA within 24 hrs

  • Executive review

🟠 Tier 1 – Customer Channels (Mobile / Internet Banking)

SLIs

  • Channel availability

  • Login success rate

  • API latency

SLO

  • Availability ≥ 99.95%

  • Login success ≥ 99.9%

  • API p95 latency < 800ms

Error budget ≈ 21 minutes/month

🟡 Tier 2 – Back Office / Batch Processing

SLIs

  • EOD batch completion time

  • Report generation success rate

SLO

  • 100% batch completion before 6 AM

  • ≥ 99.5% job success rate

🧠 Architectural Controls to Meet These SLOs

  • Active-active DC setup

  • Synchronous DB replication

  • Circuit breakers

  • Retry patterns

  • Graceful degradation

  • Observability with golden signals (latency, traffic, errors, saturation)

SLOs must influence architecture, not just monitoring.

🔥 2️⃣ “Is 99.95% Enough for Banking?”


99.95% sounds high. Why isn’t that enough?

“99.95% translates to roughly 21 minutes of downtime per month. For core financial posting systems, even 5–10 minutes during peak business hours can cause regulatory exposure, reconciliation impact, and reputational damage.

For customer channels, 99.95% may be acceptable. For core transaction posting, it is not.

We tier SLOs based on business criticality rather than applying a flat percentage across the bank.”


Then why not 100%?

“100% reliability is economically impractical. Achieving 100% requires extreme redundancy and operational cost. Instead, we define error budgets aligned to business risk appetite and regulatory exposure.”


What happens when you exhaust the error budget?

“We automatically slow down release velocity, prioritize reliability engineering, and escalate to architecture governance. It becomes a business decision, not just an engineering one.”

That’s the maturity answer.

📊 3️⃣ Enterprise Reliability Governance Framework

1️⃣ Business Tiering

  • Tier 0: Core Financial Posting – 99.99%

  • Tier 1: Channels – 99.95%

  • Tier 2: Back Office – 99.5%

2️⃣ Defined SLIs

  • Availability

  • Latency (p95 / p99)

  • Error rate

  • Data integrity

3️⃣ Error Budget Governance

  • Defined per service

  • Linked to release velocity

  • Breach triggers:

    • RCA

    • Change freeze

    • Architecture review

4️⃣ SLA Alignment

  • SLA < SLO (buffer maintained)

  • Financial penalties modeled

  • Risk quantified

Reliability is a governed risk decision, not an afterthought metric.

🧠

“As an Enterprise Architect, I don’t just define uptime percentages. I design reliability as a risk-aligned, business-tiered governance model with measurable error budgets.”


🎯 Step 1: Define What Reliability Means in Banking

In banking, reliability means:

  • No data loss

  • No duplicate transactions

  • No financial inconsistencies

  • High availability (99.99%+)

  • Predictable performance

  • Fast recovery from failure

So reliability = Correctness + Availability + Resilience + Recoverability

🏗 Step 2: Reliability at Each Layer

We ensure reliability across 7 layers.

1️⃣ Infrastructure Reliability

In Cloud (Digital Layer)

  • Multi-AZ deployment

  • Auto-scaling groups

  • Managed Kubernetes (AKS / EKS)

  • Load balancers with health checks

  • Distributed Kafka cluster (replication factor ≥ 3)

On-Prem (CBS Side)

  • Active-passive or active-active CBS

  • Database replication

  • Redundant network paths

  • Dual firewalls

2️⃣ Service-Level Reliability (Microservices)

Each microservice must:

✅ Be Stateless

No session stored locally.

✅ Use Circuit Breaker

If CBS is slow:

  • Stop calling

  • Return fallback response

  • Prevent cascading failure
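The three behaviours above (stop calling, return a fallback, prevent cascading failure) can be sketched as a small circuit breaker. This is an illustrative, single-threaded sketch, not a production library: the class, its thresholds, and the injectable `clock` parameter are assumptions made for the example.

```python
import time


class CircuitBreaker:
    """Opens after repeated failures, then allows a trial call after a cool-down."""

    def __init__(self, failure_threshold=3, reset_timeout_s=30.0, clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.reset_timeout_s = reset_timeout_s
        self.clock = clock
        self.failures = 0
        self.opened_at = None  # None means the circuit is closed

    def call(self, func, fallback):
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.reset_timeout_s:
                return fallback()  # open: do not call the slow dependency
            # Cool-down elapsed: fall through and allow one trial call.
        try:
            result = func()
        except Exception:
            self.failures += 1
            # A failed trial call, or too many failures, (re)opens the circuit.
            if self.opened_at is not None or self.failures >= self.failure_threshold:
                self.opened_at = self.clock()
            return fallback()
        # Success closes the circuit and resets the failure count.
        self.failures = 0
        self.opened_at = None
        return result
```

A caller would wrap each CBS request, for example `breaker.call(lambda: post_to_cbs(txn), lambda: {"status": "PENDING"})`, where `post_to_cbs` is a hypothetical adapter function.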

✅ Timeouts + Retries

  • Set strict timeout (e.g., 2s)

  • Retry with exponential backoff

  • Max retry threshold
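A minimal sketch of the retry policy above. It assumes the wrapped function enforces its own strict timeout (for example, 2 s on the underlying client call) and raises `TimeoutError` when it is exceeded; the function name, default delays, and injectable `sleep` are illustrative.

```python
import time


def call_with_retries(func, max_retries=3, base_delay_s=0.2, sleep=time.sleep):
    """Call func, retrying on TimeoutError with exponential backoff."""
    attempt = 0
    while True:
        try:
            return func()
        except TimeoutError:
            if attempt >= max_retries:
                raise  # retry threshold reached: surface the failure
            sleep(base_delay_s * (2 ** attempt))  # 0.2s, 0.4s, 0.8s, ...
            attempt += 1
```

Retries are only safe for operations that are idempotent, which is why this pattern is paired with idempotency keys for financial calls.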

✅ Bulkhead Pattern

Separate connection pools for:

  • CBS calls

  • LOS calls

  • LMS calls

Prevents one failure from affecting entire system.
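One way to sketch the bulkhead idea is a separate bounded set of slots per downstream system, so that exhausting the CBS slots cannot starve LOS or LMS calls. The class and exception names and the per-system limits are hypothetical; a real service would more likely configure separate connection pools in its HTTP or database client.

```python
import threading
from contextlib import contextmanager


class BulkheadFull(RuntimeError):
    """Raised when a downstream system has no free slots."""


class Bulkhead:
    def __init__(self, limits):
        # One independent semaphore per downstream system.
        self._slots = {name: threading.BoundedSemaphore(n) for name, n in limits.items()}

    @contextmanager
    def slot(self, downstream):
        sem = self._slots[downstream]
        if not sem.acquire(blocking=False):
            raise BulkheadFull(f"{downstream} bulkhead is full")
        try:
            yield
        finally:
            sem.release()


# Hypothetical limits per integration.
bulkhead = Bulkhead({"CBS": 10, "LOS": 5, "LMS": 5})
with bulkhead.slot("CBS"):
    pass  # the CBS call would go here
```

Rejecting immediately when a compartment is full keeps request threads from piling up behind a slow dependency.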

3️⃣ Data Reliability

This is the most critical layer in banking.

🔐 Idempotency

Every request includes:

X-Idempotency-Key = LoanID or TransactionID

Prevents duplicate loan creation.
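A minimal in-memory sketch of how a service might honour the idempotency key. The class, the account-number format, and the dictionary store are assumptions for illustration; a real service would persist the key in a database with a unique constraint.

```python
class LoanService:
    def __init__(self):
        self._by_key = {}   # idempotency key -> previously returned result
        self._next_id = 1

    def create_loan(self, idempotency_key, amount):
        # A repeated key returns the original result instead of creating again.
        if idempotency_key in self._by_key:
            return self._by_key[idempotency_key]
        loan = {"loan_account": f"LN{self._next_id:06d}", "amount": amount}
        self._next_id += 1
        self._by_key[idempotency_key] = loan
        return loan


service = LoanService()
first = service.create_loan("TXN-1001", 500_000)
retry = service.create_loan("TXN-1001", 500_000)  # e.g. client retried after a timeout
print(first["loan_account"] == retry["loan_account"])
```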

🔄 Outbox Pattern

When service updates DB:

  1. Save business data

  2. Save event in outbox table

  3. Background process publishes event to Kafka

Guarantees:

  • No event loss

  • No partial update
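The three steps above can be sketched with SQLite standing in for the service database and a plain callable standing in for the Kafka producer. Table names, the event shape, and the `publish` callable are hypothetical.

```python
import json
import sqlite3


def create_schema(conn):
    conn.executescript("""
        CREATE TABLE loans (loan_id TEXT PRIMARY KEY, status TEXT NOT NULL);
        CREATE TABLE outbox (
            id INTEGER PRIMARY KEY AUTOINCREMENT,
            payload TEXT NOT NULL,
            published INTEGER NOT NULL DEFAULT 0
        );
    """)


def approve_loan(conn, loan_id):
    # Steps 1 and 2: business row and event row commit in ONE transaction.
    with conn:
        conn.execute("INSERT INTO loans VALUES (?, ?)", (loan_id, "APPROVED"))
        event = json.dumps({"type": "LoanApproved", "loan_id": loan_id})
        conn.execute("INSERT INTO outbox (payload) VALUES (?)", (event,))


def relay_outbox(conn, publish):
    # Step 3: a background process publishes pending events in order.
    rows = conn.execute(
        "SELECT id, payload FROM outbox WHERE published = 0 ORDER BY id"
    ).fetchall()
    published = 0
    for row_id, payload in rows:
        publish(payload)  # if this raises, the row stays pending for the next run
        with conn:
            conn.execute("UPDATE outbox SET published = 1 WHERE id = ?", (row_id,))
        published += 1
    return published


conn = sqlite3.connect(":memory:")
create_schema(conn)
approve_loan(conn, "LN-1")
print(relay_outbox(conn, publish=print))
```

Note that this gives at-least-once delivery: if the relay crashes after publishing but before marking the row, the event is sent again, so consumers should be idempotent.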

📦 Kafka Reliability

  • Replication factor ≥ 3

  • Acknowledgment level = ALL

  • Dead letter queues for failed messages

  • Consumer offset tracking
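The first two settings above can be written down as configuration and checked before deployment. This is a sketch only: `acks`, `enable.idempotence`, and `min.insync.replicas` are standard Kafka configuration keys, but the broker host names, the `min.insync.replicas` value, the use of producer idempotence, and the checker function are assumptions not stated in the article. How the dictionary is passed to a producer depends on the client library.

```python
# Producer settings (standard Kafka producer configuration keys).
producer_config = {
    "bootstrap.servers": "kafka-1:9092,kafka-2:9092,kafka-3:9092",  # hypothetical hosts
    "acks": "all",               # wait for all in-sync replicas to acknowledge
    "enable.idempotence": True,  # assumption: avoid duplicates on producer retries
}

# Topic settings, applied when the topic is created.
topic_settings = {
    "replication_factor": 3,     # replication factor >= 3, as listed above
    "min.insync.replicas": 2,    # assumption: not stated in the article
}


def durability_problems(producer, topic):
    """Return a list of settings that violate the reliability rules above."""
    problems = []
    if str(producer.get("acks")) not in ("all", "-1"):  # "-1" is equivalent to "all"
        problems.append("acks must be 'all'")
    if topic.get("replication_factor", 0) < 3:
        problems.append("replication factor must be >= 3")
    return problems


print(durability_problems(producer_config, topic_settings))
```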

4️⃣ Transaction Reliability (Saga Pattern)

Loan creation involves:

  1. Loan approval

  2. CBS account creation

  3. LMS loan schedule

  4. Disbursement

We use Orchestrated Saga Pattern:

If CBS fails:

  • Compensate → mark loan as FAILED

  • Do not proceed to LMS

This prevents inconsistent state.
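A minimal sketch of an orchestrated saga for the four steps above. Each step is a hypothetical `(name, action, compensate)` triple; on failure the orchestrator runs the compensations of the steps that already completed, in reverse order, marks the loan as FAILED, and never reaches the later steps.

```python
def run_loan_saga(loan, steps):
    """Run steps in order; on failure, compensate completed steps and stop."""
    completed = []
    for name, action, compensate in steps:
        try:
            action(loan)
        except Exception:
            for _, undo in reversed(completed):
                undo(loan)
            loan["status"] = "FAILED"
            loan["failed_step"] = name
            return loan  # later steps (e.g. LMS, disbursement) are never called
        completed.append((name, compensate))
    loan["status"] = "COMPLETED"
    return loan
```

In a real system each action would call the corresponding service (approval, CBS, LMS, disbursement), and the saga state would be persisted so the orchestrator can resume after a crash.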

5️⃣ Hybrid Network Reliability

Between Cloud ↔ On-Prem:

  • Private connectivity (ExpressRoute / MPLS)

  • Mutual TLS

  • Retry logic at adapter

  • Secondary failover endpoint

If primary CBS endpoint fails:

  • Switch to secondary
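The failover rule can be sketched at the adapter as an ordered list of endpoints tried in turn. The endpoint URLs and the `send` callable are hypothetical placeholders for the real CBS client.

```python
def call_cbs(request, endpoints, send):
    """Try each CBS endpoint in order; return the first successful response."""
    last_error = None
    for endpoint in endpoints:
        try:
            return send(endpoint, request)
        except ConnectionError as exc:
            last_error = exc  # remember the failure and try the next endpoint
    raise ConnectionError("all CBS endpoints failed") from last_error


# Hypothetical endpoint list: primary first, secondary as failover.
CBS_ENDPOINTS = ["https://cbs-primary.example/api", "https://cbs-secondary.example/api"]
```

Because a failed call to the primary may still have been processed, the request should carry an idempotency key before it is re-sent to the secondary.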

6️⃣ Monitoring & Observability

Reliability without visibility is impossible.

Metrics

  • API latency

  • Error rate

  • CBS response time

  • Kafka lag

Tracing

Track full journey:

User → Gateway → LoanSvc → Adapter → CBS

Alerting

  • SLA breach alerts

  • Kafka consumer lag alerts

  • DB replication lag alerts

7️⃣ Recovery & Reconciliation

Even with best design, failures happen.

So we implement:

Near Real-Time Reconciliation

Digital vs LOS vs LMS vs CBS comparison.
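A sketch of what such a comparison might look like: each system reports loan ID → amount, and any loan on which the systems disagree (or which one of them is missing) is flagged as a break. The data layout and function name are assumptions for illustration.

```python
def reconcile(systems):
    """Return {loan_id: {system: amount_or_None}} for loans where systems disagree."""
    all_ids = set().union(*(records.keys() for records in systems.values()))
    breaks = {}
    for loan_id in sorted(all_ids):
        view = {name: records.get(loan_id) for name, records in systems.items()}
        if len(set(view.values())) > 1:  # more than one distinct value = mismatch
            breaks[loan_id] = view
    return breaks


# Hypothetical snapshot: LMS has not yet recorded loan L2.
snapshot = {
    "Digital": {"L1": 100_000, "L2": 50_000},
    "LOS": {"L1": 100_000, "L2": 50_000},
    "LMS": {"L1": 100_000},
    "CBS": {"L1": 100_000, "L2": 50_000},
}
print(reconcile(snapshot))
```

Flagged breaks would then feed the manual override dashboard or an automated retry.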

End-of-Day ARB

Financial reconciliation.

Replay Capability

Kafka allows replaying events from offset.

Manual Override Dashboard

Ops team can:

  • Retry

  • Reconcile

  • Re-trigger events

🛡 Reliability Example Scenario

Scenario:

Loan account created in CBS but response lost.

Without reliability: → Duplicate loan risk.

With reliability:

  1. Adapter uses idempotency key.

  2. If retry happens:

    • CBS detects duplicate

    • Returns existing loan account

  3. Reconciliation confirms consistency.

No financial corruption.

📊

How will you ensure reliability in hybrid banking architecture?

“I design reliability across infrastructure, service, data, and operational layers. I ensure stateless microservices with circuit breakers, idempotent APIs, outbox-based event publishing, Kafka replication, saga-based transaction management, and continuous reconciliation between digital and core systems. Additionally, we implement observability and automated recovery mechanisms to maintain 99.99% availability.”

That answer shows maturity.

🧠 Bonus: Reliability Pyramid

Infrastructure Stability
↓
Service Resilience
↓
Data Consistency
↓
Transaction Integrity
↓
Monitoring & Recovery

 
 
 
