Resiliency Best Practices

Anand Nerurkar
Oct 8, 2023

Resiliency: the ability of a system to keep providing acceptable behaviour even when one or more of its parts fail.

Engineering Best Practices

1. Build Resilient Architectures

· Capacity planning

· Identify the load beyond which the system starts to degrade, and use that threshold to trigger scale-out and scale-in with an Auto Scaling Group (ASG)

· Build resiliency into each service (the Resilience4j library provides the retry, bulkhead, rate limiter, time limiter and circuit breaker patterns below)

o Retry on failure

o Exponential backoff

o Bulkhead

o Rate limiter

o Time limiter

o Circuit breaker

o Chaos Engineering Principles

§ Build a hypothesis around steady state behaviour

§ Run experiments in production

§ Automate experiments to run continuously

§ Minimise blast radius

§ Tools to be used

· Chaos Monkey – terminates application instances

· Latency Monkey – introduces artificial delay

· Chaos Gorilla – takes out an availability zone (AZ)

· Chaos Kong – takes out a whole region

§ Vary real world events

· Kill the database

· Kill an app server

· Increase traffic
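The chaos engineering principles above can be sketched as a tiny experiment loop. This is only an illustration under stated assumptions: `Fleet`, `run_experiment`, and the `min_healthy` rule are hypothetical names for a toy in-memory model, not the API of Chaos Monkey or any cloud provider. It shows the three ideas together: verify steady state first, inject a failure, and cap the blast radius.

```python
import random


class Fleet:
    """Toy stand-in for a group of app instances (hypothetical, not a cloud API)."""

    def __init__(self, size):
        self.healthy = {f"app-{i}" for i in range(size)}

    def kill(self, instance):
        self.healthy.discard(instance)

    def success_rate(self, min_healthy):
        # Assumption for the sketch: requests succeed while enough instances remain.
        return 1.0 if len(self.healthy) >= min_healthy else 0.0


def run_experiment(fleet, max_blast_radius, min_healthy, steady_state=0.99, seed=0):
    """Kill a bounded number of instances and check the steady-state hypothesis."""
    rng = random.Random(seed)

    # 1. Only experiment on a system that is already in steady state.
    if fleet.success_rate(min_healthy) < steady_state:
        return {"ran": False, "reason": "system not in steady state"}

    # 2. Minimise blast radius: never kill more than max_blast_radius instances.
    count = min(max_blast_radius, len(fleet.healthy))
    victims = rng.sample(sorted(fleet.healthy), k=count)
    for victim in victims:
        fleet.kill(victim)

    # 3. Compare the observed behaviour with the hypothesis.
    observed = fleet.success_rate(min_healthy)
    return {"ran": True, "killed": victims, "hypothesis_holds": observed >= steady_state}
```

Running this continuously (for example from a scheduler) would correspond to the "automate experiments" principle; a real setup would measure steady state from production metrics rather than from a counter.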

· Use managed instance groups (MIGs)

· Autoscaling/autohealing

· Run VMs in a MIG behind global load balancing

· Use managed services as much as possible

· Use Cloud Load Balancing/Traffic Manager to distribute traffic across instances of an application in one or more regions

· Use regional persistent disks
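To make the service-level patterns in this section more concrete, here is a minimal sketch of retry with exponential backoff and of a circuit breaker. It is written in plain Python for illustration and is not the Resilience4j API; the names `retry_with_backoff`, `CircuitBreaker`, and `CircuitOpenError`, and all default values, are assumptions.

```python
import time


def retry_with_backoff(func, max_attempts=4, base_delay=0.1, factor=2.0, sleep=time.sleep):
    """Call func, retrying on any exception with exponentially growing delays."""
    for attempt in range(1, max_attempts + 1):
        try:
            return func()
        except Exception:
            if attempt == max_attempts:
                raise  # out of attempts: surface the last error
            # Delays grow as base_delay, base_delay*factor, base_delay*factor^2, ...
            sleep(base_delay * factor ** (attempt - 1))


class CircuitOpenError(Exception):
    """Raised when the breaker rejects a call without trying it."""


class CircuitBreaker:
    def __init__(self, failure_threshold=3, reset_timeout=30.0, clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.reset_timeout = reset_timeout
        self.clock = clock
        self.failures = 0
        self.opened_at = None  # None means the circuit is closed

    def call(self, func):
        was_open = self.opened_at is not None
        if was_open and self.clock() - self.opened_at < self.reset_timeout:
            raise CircuitOpenError("circuit is open; failing fast")
        # Either closed, or the timeout elapsed and this is a half-open trial call.
        try:
            result = func()
        except Exception:
            self.failures += 1
            if was_open or self.failures >= self.failure_threshold:
                self.opened_at = self.clock()  # open (or re-open) the circuit
            raise
        # Success closes the circuit and clears the failure count.
        self.failures = 0
        self.opened_at = None
        return result
```

The `sleep` and `clock` parameters exist only so the behaviour can be exercised without real waiting. Production code would usually add jitter to the backoff and retry only on errors known to be transient.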

2. Reliability

The ability of a system to recover from infrastructure or service interruptions, to dynamically acquire computing resources to meet demand, and to mitigate disruptions such as misconfiguration or transient network failures.

· Test recovery procedures

· Capacity planning

· Scale horizontally

· Automatically recover from failures

· Manage change through automation
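Capacity planning and horizontal scaling usually come down to a rule that turns an observed load metric into a desired instance count. The sketch below uses a simple target-tracking rule (scale the group so that average utilisation returns to a target). The function name, the parameters, and the choice of CPU utilisation in percent are assumptions for illustration; real autoscalers add cooldowns and smoothing on top of this.

```python
import math


def desired_instances(current, observed_util, target_util, min_size, max_size):
    """Return the instance count that brings average utilisation back to target.

    observed_util and target_util are percentages, e.g. 90 and 60.
    """
    if current <= 0 or target_util <= 0:
        raise ValueError("current and target_util must be positive")
    # Total load is current * observed_util; divide it by the per-instance target.
    wanted = math.ceil(current * observed_util / target_util)
    # Clamp to the configured group limits.
    return max(min_size, min(max_size, wanted))
```

For example, 4 instances at 90% utilisation with a 60% target gives a desired size of 6 (scale out), while 4 instances at 30% gives 2 (scale in).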

3. Operational Excellence

Includes the ability to run and monitor systems to deliver business value, and to continuously improve supporting processes and procedures.

· Perform operations as code (infrastructure as code, for example Terraform scripts)

· Annotate documentation

· Make frequent, small, reversible changes

· Refine operations procedures frequently

· Anticipate failures

· Learn from all operational failures

4. Have the right data available

· Use Cloud Monitoring for monitoring

· Install the logging agent to send logs to Cloud Logging
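Having the right data available is easier when the application emits structured logs that an agent can forward as-is. The following is a minimal sketch using only the Python standard library. It writes one JSON object per line with `severity` and `message` fields, on the assumption that the logging agent is configured to parse JSON lines; `JsonFormatter` and `build_logger` are hypothetical names.

```python
import json
import logging
import sys


class JsonFormatter(logging.Formatter):
    """Render each log record as a single JSON line."""

    def format(self, record):
        return json.dumps({
            "severity": record.levelname,
            "message": record.getMessage(),
            "logger": record.name,
        })


def build_logger(name, stream=sys.stdout):
    """Create a logger that writes JSON lines to the given stream."""
    logger = logging.getLogger(name)
    logger.setLevel(logging.INFO)
    logger.handlers.clear()  # avoid duplicate handlers on repeated calls
    handler = logging.StreamHandler(stream)
    handler.setFormatter(JsonFormatter())
    logger.addHandler(handler)
    logger.propagate = False  # keep output out of the root logger
    return logger
```

A call such as `build_logger("orders").warning("retry %d", 2)` then produces a single JSON line that can be filtered by severity downstream.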

5. Be prepared for the unexpected (and changes)

· Enable live migration and automatic restart when available

· Configure the right health checks
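One part of configuring the right health checks is choosing how many consecutive probe results are needed before an instance changes state, so that a single slow response does not trigger a restart. Here is a small sketch of that logic. `HealthTracker` and its threshold names are illustrative assumptions, loosely modelled on the healthy/unhealthy thresholds that load balancer health checks commonly expose.

```python
class HealthTracker:
    """Flip health state only after enough consecutive contradicting probes."""

    def __init__(self, unhealthy_threshold=3, healthy_threshold=2):
        self.unhealthy_threshold = unhealthy_threshold
        self.healthy_threshold = healthy_threshold
        self.healthy = True
        self._streak = 0  # consecutive probes that disagree with the current state

    def record(self, probe_ok):
        """Record one probe result and return the current health state."""
        if probe_ok == self.healthy:
            # The probe agrees with the current state: reset the streak.
            self._streak = 0
            return self.healthy
        self._streak += 1
        limit = self.unhealthy_threshold if self.healthy else self.healthy_threshold
        if self._streak >= limit:
            self.healthy = not self.healthy
            self._streak = 0
        return self.healthy
```

With the defaults, three failed probes in a row mark the instance unhealthy, and two successful probes in a row bring it back. An autohealing policy would act on the transition to unhealthy.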

6. Disaster recovery (BCP)

· Keep an up-to-date image copied to multiple regions

7. Security

Includes the ability to protect systems, assets and data while delivering business value, through risk assessment and mitigation strategies.

· Implement a strong identity foundation

· Enable traceability

· Apply security at all layers

· Zero trust model

· Automate security best practices

· Protect data in transit and at rest

· Keep people away from data

· Prepare for security events

· Protect the system from ransomware and other malware attacks

o Implement a DevSecOps model – Veracode integration

o Configure a WAF

o Configure DDoS protection

8. RPO & RTO

a. How do we measure how quickly we can recover from failure?

b. RPO (Recovery Point Objective): Maximum acceptable period of data loss

c. RTO (Recovery Time Objective): Maximum acceptable downtime

i. Hot standby –

Automatically synchronize data

Have a standby ready to pick up load

Use automatic failover from master to standby

ii. Warm standby –

Automatically synchronize data

Have a standby with minimum infrastructure

Scale it up when a failure happens

iii. Snapshots/transaction logs –

Create regular data snapshots and transaction logs

Recreate the database from snapshots and transaction logs when a failure happens
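The RPO and RTO definitions above translate into simple time arithmetic. The sketch below is one way to check a recovery drill against both objectives. The function names and the example intervals are assumptions for illustration, not values from this article.

```python
from datetime import datetime, timedelta


def worst_case_data_loss(snapshot_interval, log_ship_interval=None):
    """Worst-case loss is the gap between two consecutive durable copies.

    With transaction-log shipping, the shorter of the two intervals applies.
    """
    if log_ship_interval is None:
        return snapshot_interval
    return min(snapshot_interval, log_ship_interval)


def meets_objectives(failure_at, restored_at, last_recoverable_at, rpo, rto):
    """Compare measured downtime and data loss with the RTO and RPO."""
    downtime = restored_at - failure_at            # measured against RTO
    data_loss = failure_at - last_recoverable_at   # measured against RPO
    return {
        "downtime": downtime,
        "data_loss": data_loss,
        "rto_met": downtime <= rto,
        "rpo_met": data_loss <= rpo,
    }
```

For instance, daily snapshots alone give a worst-case loss of 24 hours, while adding log shipping every 5 minutes reduces it to 5 minutes. A hot standby mainly shortens the downtime side of the comparison.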

9. Backup & Recovery

· Backup

· Snapshot

· Restore images

· Restore DB/application
