Active-Active Setup: AWS-Azure
- Anand Nerurkar
This is a classic active-active multi-cloud design problem with aggressive RPO/RTO targets (5 min / 5 min). Below I'll give a clear, practical, step-by-step blueprint you can implement (or hand to an SRE/Cloud Architect), covering routing, health checks, data replication options, session/state, failover automation, runbooks, testing, and what to watch out for.
Goals / constraints (from your note)
Active-active across West India (Azure) and South India (AWS).
Users routed to nearest datacenter (south→AWS, west→Azure).
Both sites fully operational; if one fails, the other accepts traffic seamlessly.
RPO = 5 minutes, RTO = 5 minutes.
High-level approach (summary)
Use DNS/geolocation routing so users go to the nearest cloud.
Keep both app stacks live (stateless app servers; separate DB strategy).
Provide near-real-time data replication with <5min lag — choose either a geo-distributed DB (multi-master or global tables) or an event/CDC-based near-sync replication pipeline.
Use low-TTL DNS + health probes so failover switches quickly.
Keep session/state, queues, and object storage replicated or centrally accessible.
Build automation & runbooks to cut RTO to ≤5min and test regularly.
Step-by-step detailed plan
1) DNS & user routing (traffic management)
In Azure (West India), expose frontends via an Azure public endpoint (Application Gateway / Front Door / Azure Load Balancer). In AWS (South India), expose frontends via AWS public endpoints (ALB, plus CloudFront/Global Accelerator if used).
Use DNS geolocation to steer users:
For queries that originate in South India, point to the AWS endpoint.
For queries that originate in West India, point to the Azure endpoint.
Implementation: use Route 53 geolocation routing for records that should route to AWS, and Azure Traffic Manager geographic routing for Azure. (Operationally you'll pick one global DNS provider that supports multi-cloud geolocation, or coordinate Route 53 & Azure DNS via a federated approach.) A boto3 sketch at the end of this section shows one way to wire up the Route 53 side.
Configure DNS TTL = 30–60 seconds (tradeoff: faster failover vs. caching). Recommended starting point: 60s.
Configure health checks (DNS provider health checks):
Frequency: every 10–30 seconds.
Failover trigger: e.g., 3 consecutive failed checks.
Probe endpoint: lightweight /healthz that tests app + DB reachability.
Use weighted routing to gradually shift traffic during test/failover if needed.
Important note: DNS is eventually consistent; clients / intermediaries may cache records longer than TTL. Mitigate with: short TTL, client retry logic, and application-level idempotency.
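Here is a minimal sketch of the Route 53 side (Python + boto3; the hosted zone ID, record name, and endpoint hostnames are placeholders): a health check probing /healthz every 10 seconds with a 3-failure threshold, attached to a geolocation record for India. Note that Route 53 geolocation works at continent/country granularity, so a finer south-vs-west split within India may need latency-based or geoproximity routing, or a DNS provider that supports it.

```python
# Minimal sketch: health check + geolocation record with health-check-driven failover.
import boto3

route53 = boto3.client("route53")

# Health check: probe /healthz every 10 seconds, mark unhealthy after 3 failures.
hc = route53.create_health_check(
    CallerReference="aws-south-frontend-hc-1",               # must be unique per request
    HealthCheckConfig={
        "Type": "HTTPS",
        "FullyQualifiedDomainName": "aws-south.example.com",  # hypothetical AWS frontend
        "ResourcePath": "/healthz",
        "RequestInterval": 10,
        "FailureThreshold": 3,
    },
)

# Geolocation record for India -> AWS endpoint, with the health check attached
# so Route 53 stops answering with it when the probe fails.
route53.change_resource_record_sets(
    HostedZoneId="ZEXAMPLE123",                               # hypothetical hosted zone
    ChangeBatch={
        "Changes": [{
            "Action": "UPSERT",
            "ResourceRecordSet": {
                "Name": "app.example.com",
                "Type": "CNAME",
                "SetIdentifier": "india-aws-south",
                "GeoLocation": {"CountryCode": "IN"},
                "TTL": 60,
                "HealthCheckId": hc["HealthCheck"]["Id"],
                "ResourceRecords": [{"Value": "aws-south.example.com"}],
            },
        }],
    },
)
```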
2) Network & control plane connectivity
Establish a secure backbone for replication/control:
Azure ExpressRoute ↔ AWS Direct Connect via a co-location partner, or set up VPN tunnels between Azure VNet and AWS VPC(s).
Use private connectivity for database replication traffic to avoid public internet as the primary replication path.
Use strict IAM/service principals and cross-account roles for the replication/control automation.
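As a small illustration of the cross-account piece, a minimal sketch (Python + boto3; the role ARN is a placeholder) where the replication/control automation assumes a scoped AWS role and works with short-lived credentials instead of storing long-lived keys:

```python
# Minimal sketch: cross-account role assumption for the replication/control automation.
import boto3

sts = boto3.client("sts")
creds = sts.assume_role(
    RoleArn="arn:aws:iam::123456789012:role/replication-control",  # hypothetical role
    RoleSessionName="azure-replication-controller",
    DurationSeconds=3600,
)["Credentials"]

# Use the short-lived credentials only for the replication/control API calls.
route53 = boto3.client(
    "route53",
    aws_access_key_id=creds["AccessKeyId"],
    aws_secret_access_key=creds["SecretAccessKey"],
    aws_session_token=creds["SessionToken"],
)
```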
3) Application layer (stateless apps)
Design app servers as stateless: all persistent state stored in replicated services (DB, caches, object store).
Deploy identical app code & containers in both clouds; keep CI/CD pipeline to push simultaneous releases to both clouds.
Implement idempotent endpoints and correlation IDs for requests (to prevent duplicate processing during retries/replicas).
The health endpoint should validate connectivity to local dependencies and return granular status; a sketch follows below.
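A minimal sketch of such a health endpoint (Python + Flask; hostnames and ports are placeholders, and the TCP probe stands in for real dependency checks):

```python
# Minimal sketch: granular /healthz that checks region-local dependencies.
import socket
from flask import Flask, jsonify

app = Flask(__name__)

def tcp_check(host: str, port: int, timeout: float = 1.0) -> bool:
    """Cheap reachability probe for a local dependency (DB, cache)."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

@app.route("/healthz")
def healthz():
    # Hostnames/ports are placeholders for the region-local DB and cache.
    checks = {
        "db": tcp_check("db.local", 5432),
        "cache": tcp_check("cache.local", 6379),
    }
    healthy = all(checks.values())
    status = 200 if healthy else 503        # 503 lets the DNS health check fail over
    return jsonify({"status": "ok" if healthy else "degraded", "checks": checks}), status
```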
4) Data replication strategy (the most critical part)
You have two main choices: pick based on whether your database supports geo-distributed multi-master writes and on your latency tolerance.
Option A — True multi-region multi-master DB (recommended if you need active-active writes)
Use a database engineered for geo distribution & conflict resolution (examples: CockroachDB, YugabyteDB, Cosmos DB with multi-master writes, or a commercial DB that supports active-active).
Advantages: low application complexity, writes accepted locally, strong/causal consistency options.
Requirements: deploy DB cluster nodes across both regions, ensure quorum/consensus config supports geo latency, monitor commit latency.
Must test conflict resolution semantics and set replication/consistency level that meets RPO/RTO.
Option B — Active-active at application level with near-real-time replication (if DB is single-master)
If your primary DB is single-master (e.g., RDS, Azure SQL) and you cannot have cross-cloud multi-master:
Make one region a writable primary for a given shard, or implement sharding by customer geography to minimize cross-writes.
Use CDC (Change Data Capture) + an event log to replicate changes to the other region:
Tools: Debezium (CDC) -> Kafka (local) -> MirrorMaker / Replicator -> target DB apply.
Or use managed streaming services (MSK / Confluent + replication).
Target replication lag < 5 minutes. Monitor lag and raise alerts if > 1–2 minutes.
For writes that may land in the “wrong” region, either reject (force client to write to geo primary) or store writes in an inbound queue and replay with conflict resolution.
Use last-write-wins or application business rules for conflicts; prefer deterministic resolution (see the apply sketch after this section).
Important tradeoff: synchronous cross-cloud replication to guarantee zero RPO increases latency; usually not acceptable. Near-sync (asynchronous) with <5min lag is typical for RPO=5min.
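To make Option B concrete, here is a minimal sketch (Python + kafka-python; the topic name and the simplified Debezium-style event shape are assumptions) of an apply worker in the target region that uses a deterministic last-write-wins rule keyed on the source commit timestamp, so out-of-order replays converge:

```python
# Minimal sketch: apply mirrored CDC events with last-write-wins conflict resolution.
import json
from kafka import KafkaConsumer

consumer = KafkaConsumer(
    "replicated.customers",                  # hypothetical mirrored CDC topic
    bootstrap_servers="kafka.local:9092",
    group_id="cross-region-apply",
    value_deserializer=lambda v: json.loads(v.decode("utf-8")) if v else None,
    enable_auto_commit=False,
)

applied = {}  # stands in for the target DB in this sketch (row key -> last applied ts_ms)

for msg in consumer:
    event = msg.value
    if not event or event.get("after") is None:
        consumer.commit()                    # deletes/tombstones not handled in this sketch
        continue
    key = event["after"]["id"]
    ts_ms = event["ts_ms"]                   # source commit time carried in the change event
    if ts_ms >= applied.get(key, 0):         # last-write-wins: only newer changes apply
        applied[key] = ts_ms                 # real code would upsert the row into the DB here
    consumer.commit()                        # commit the offset only after a durable apply
```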
5) Caches, sessions & state
Don’t keep session affinity on app servers. Use a distributed session store that supports geo-replication:
Options: Redis Enterprise Active-Active (CRDT-based), DynamoDB Global Tables, Cosmos DB, or stateless sessions via signed JWTs (a sketch follows this section).
If you must use in-memory session, use client sticky sessions only for short operations and replicate critical session writes to the distributed store.
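For the stateless-session option, a minimal sketch (Python + PyJWT; the shared secret and claims are placeholders): both regions can validate tokens issued by the other, so failover does not depend on replicating server-side session state.

```python
# Minimal sketch: stateless JWT sessions valid in either region.
import time
import jwt  # PyJWT

SHARED_SECRET = "rotate-me-via-your-secrets-manager"   # placeholder; keep in a secrets store

def issue_session(user_id: str, region: str) -> str:
    claims = {
        "sub": user_id,
        "region": region,               # informational; both regions accept the token
        "iat": int(time.time()),
        "exp": int(time.time()) + 900,  # short-lived; refresh keeps the blast radius small
    }
    return jwt.encode(claims, SHARED_SECRET, algorithm="HS256")

def validate_session(token: str) -> dict:
    # Raises jwt.InvalidTokenError on tampering or expiry; both regions share the key.
    return jwt.decode(token, SHARED_SECRET, algorithms=["HS256"])
```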
6) Message queues & event bus
Use a replicated event streaming layer: Kafka across regions with MirrorMaker 2 (or Confluent Replicator).
Ensure consumer offsets are mirrored, and handle duplicates with idempotent consumers (see the sketch after this section).
Design producers to persist events locally first, then replicate.
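A minimal sketch of that producer/consumer idempotency pattern (Python + kafka-python; topic and field names are assumptions): producers stamp each event with an event_id before publishing to the local cluster, and consumers on the mirrored topic drop IDs they have already processed, so MirrorMaker duplicates or retries do not cause double processing.

```python
# Minimal sketch: event IDs for cross-region deduplication.
import json
import uuid
from kafka import KafkaProducer

producer = KafkaProducer(
    bootstrap_servers="kafka.local:9092",
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),
)

def publish_order_event(payload: dict) -> None:
    event = {
        "event_id": str(uuid.uuid4()),   # dedupe key carried across regions
        "payload": payload,
    }
    producer.send("orders.events", value=event)
    producer.flush()                     # persist locally before acknowledging the caller

# Consuming side: keep processed event_ids in a durable store (a set stands in here)
# and skip anything already seen.
processed_ids = set()

def handle(event: dict) -> None:
    if event["event_id"] in processed_ids:
        return                           # duplicate delivered by replication or retry
    processed_ids.add(event["event_id"])
    # ... apply business logic exactly once per event_id ...
```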
7) Object storage (files, blobs)
Mirror object storage across clouds:
AWS S3 <-> Azure Blob via cross-replication jobs (e.g., object replication tools, Lambda/Azure Functions, or commercial replication solutions); a simple mirroring sketch follows this section.
Use CDN in front (CloudFront / Azure CDN / Akamai) with origin failover to minimize user impact and edge caching.
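A minimal sketch of one mirroring direction (Python + boto3 + azure-storage-blob; bucket, container, and connection-string values are placeholders). In practice you would trigger this from S3 event notifications and run a matching Blob-to-S3 job for the reverse direction.

```python
# Minimal sketch: copy S3 objects under a prefix into an Azure Blob container.
import boto3
from azure.storage.blob import BlobServiceClient

s3 = boto3.client("s3")
blob_service = BlobServiceClient.from_connection_string("<azure-connection-string>")
container = blob_service.get_container_client("mirrored-assets")   # hypothetical container

def mirror_prefix(bucket: str, prefix: str) -> None:
    paginator = s3.get_paginator("list_objects_v2")
    for page in paginator.paginate(Bucket=bucket, Prefix=prefix):
        for obj in page.get("Contents", []):
            body = s3.get_object(Bucket=bucket, Key=obj["Key"])["Body"].read()
            container.upload_blob(name=obj["Key"], data=body, overwrite=True)

if __name__ == "__main__":
    mirror_prefix("assets-south-aws", "uploads/")   # hypothetical bucket/prefix
```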
8) Health monitoring & failover automation
Instrument metrics:
Replication lag (seconds/minutes), DB commit latency, queue lag, 4xx/5xx rates, request latency, error budget.
Set alerts:
Replication lag > 60s (warning), > 300s (critical).
App error rate spikes, health probe failures.
Automate failover:
DNS provider updates (health check triggers DNS change).
Or use a central orchestrator that modifies weights/records via API on health events.
Implement graceful draining and retry logic on clients.
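A minimal sketch of the lag-alerting loop (pure Python; get_replication_lag_seconds and notify are stubs for your metrics source and paging tool), using the warning/critical thresholds above:

```python
# Minimal sketch: replication-lag watchdog with warning/critical thresholds.
import time

WARNING_S = 60
CRITICAL_S = 300

def get_replication_lag_seconds() -> float:
    """Stub: read lag from the CDC pipeline / DB metrics endpoint."""
    return 12.0

def notify(level: str, message: str) -> None:
    """Stub: page via PagerDuty/Opsgenie/webhook."""
    print(f"[{level}] {message}")

def watch_replication(poll_interval_s: int = 30) -> None:
    while True:
        lag = get_replication_lag_seconds()
        if lag > CRITICAL_S:
            notify("CRITICAL", f"replication lag {lag:.0f}s exceeds {CRITICAL_S}s (RPO at risk)")
        elif lag > WARNING_S:
            notify("WARNING", f"replication lag {lag:.0f}s exceeds {WARNING_S}s")
        time.sleep(poll_interval_s)

if __name__ == "__main__":
    watch_replication()
```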
9) Failover mechanics (how traffic shifts)
Health probe in each cloud monitors app + dependency reachability.
If the AWS (South India) site fails, the Route 53 health check marks the AWS endpoint unhealthy and Route 53 serves the Azure endpoint to South India clients (provided the geolocation rules allow fallback). The same applies in reverse for Azure.
With a 60s TTL, a 10s health-check interval, and a 3-failure trigger, clients should receive the new DNS answer within roughly 30–90s in practice. Application reconnection logic may add time.
For in-flight transactions that hit a failed region: use idempotency keys and application replay from event logs to ensure no data loss. If replication lag is ≤5 min, you meet the RPO; if greater, manual reconciliation is needed.
10) Runbook for an outage (concise)
Detect: Alert triggers -> SRE team notified.
Confirm: Check health endpoints, replication lag dashboards.
Automatic action: DNS provider already redirected traffic if the health probe failed. If not, the SRE triggers DNS failover via script (a sketch follows this runbook).
Cutover checks: Verify the accepting region has fresh data (replication lag < 5 min). If not, enable write operations only if data parity is acceptable.
Communicate: Status page + customer comms.
After recovery: Reverse replication (replay events) and reconcile conflicts using deterministic rules.
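A minimal sketch of the "failover via script" step (Python + boto3; the zone ID, record name, and endpoints are placeholders), forcing the India geolocation record to point at the Azure frontend if automatic health-check failover has not fired:

```python
# Minimal sketch: manual DNS failover of the India record to the Azure frontend.
import boto3

route53 = boto3.client("route53")

def force_failover_to_azure(zone_id: str = "ZEXAMPLE123") -> str:
    resp = route53.change_resource_record_sets(
        HostedZoneId=zone_id,
        ChangeBatch={
            "Comment": "Manual failover: South India traffic -> Azure West India",
            "Changes": [{
                "Action": "UPSERT",
                "ResourceRecordSet": {
                    "Name": "app.example.com",
                    "Type": "CNAME",
                    "SetIdentifier": "india-aws-south",
                    "GeoLocation": {"CountryCode": "IN"},
                    "TTL": 60,
                    "ResourceRecords": [{"Value": "azure-west.example.com"}],
                },
            }],
        },
    )
    return resp["ChangeInfo"]["Status"]   # "PENDING" until the change propagates
```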
11) Testing & verification (must do frequently)
Chaos testing: Simulate full region failure (both planned and unplanned).
Disaster recovery drills: Verify RTO < 5 min and RPO < 5 min — measure times, document gaps.
Failback test: Restore primary and perform data reconciliation.
Daily/weekly synthetic transactions to verify end-to-end replication (see the probe sketch below).
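A minimal sketch of such a synthetic probe (Python + requests; the /synthetic endpoints are hypothetical): write a marker via the AWS region, poll the Azure region until it appears, and compare the measured lag against the 5-minute RPO.

```python
# Minimal sketch: end-to-end replication probe between regions.
import time
import uuid
import requests

AWS_WRITE_URL = "https://aws-south.example.com/synthetic"     # hypothetical write endpoint
AZURE_READ_URL = "https://azure-west.example.com/synthetic"   # hypothetical read endpoint

def measure_replication_lag(timeout_s: int = 300) -> float:
    marker = str(uuid.uuid4())
    requests.post(AWS_WRITE_URL, json={"marker": marker}, timeout=5)
    start = time.time()
    while time.time() - start < timeout_s:
        r = requests.get(f"{AZURE_READ_URL}/{marker}", timeout=5)
        if r.status_code == 200:              # marker visible in the other region
            return time.time() - start
        time.sleep(5)
    raise RuntimeError(f"marker {marker} not replicated within {timeout_s}s (RPO breach)")

if __name__ == "__main__":
    lag = measure_replication_lag()
    print(f"end-to-end replication lag: {lag:.1f}s")
```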
12) Key configuration recommendations & numbers
DNS TTL: 30–60 seconds.
Health check frequency: 10–30s. Failover threshold: 3 failed checks.
Replication monitoring alert: lag > 60s (warning), > 300s (critical).
Queue retention: long enough to replay safely (>= 24h).
Idempotency keys: mandatory for operations that change state.
Correlation IDs: attach to every request and event.
13) Monitoring & SLOs to track
Replication lag (seconds).
Successful commits per second and commit latency.
% requests served by local region vs cross-region.
Error rate and latency (p95/p99).
DNS propagation/failover time measured during tests.
14) Common pitfalls & mitigations
DNS caching — clients & ISP caches ignoring TTL. Mitigate with client retry logic and application idempotency.
Conflict resolution — if writes occur in both regions, need deterministic resolution. Prefer application rules, vector clocks, or CRDTs.
Synchronous replication over WAN — may hurt user latency; usually avoid cross-cloud sync.
Hidden dependencies — some services (certificates, third-party APIs) may not be replicated. Inventory all dependencies.
Credential sprawl & IAM mistakes — centralize secrets & rotation.
Cost — multi-cloud active-active doubles some infrastructure costs; justify with SLA needs.
15) Recommended architectures depending on your constraints
If you can change DB: Use a geo-distributed DB that supports active-active (CockroachDB / Yugabyte / Cosmos DB multi-master) — simplest application model.
If you must keep current DB: Use CDC -> Kafka -> Replicator approach with strict monitoring and sharding to minimize cross-region writes.
For sessions: Use Redis Enterprise Active-Active or stateless JWTs.
For global edge: Use CDN + Global Accelerator / Front Door to reduce user latency and provide origin failover.
16) Example simple failover timeline (optimistic)
t0: Region failure detected by health probes.
t0 + 30s: DNS health check registers 3 failures → DNS provider marks endpoint unhealthy.
t0 + 60s: DNS change propagated to majority of clients (TTL dependent).
t0 + 60–120s: Most clients reconnect to other region.
t0 + 2–5 min: Services stabilize; remaining sessions/timeouts handled by retry/idempotency.
This fits RTO = 5 min if all pieces are tuned and replication lag stays ≤ 5 min.
Quick checklist you can use right away
Short TTL (30–60s) set on DNS records.
Health probe endpoints implemented and configured in DNS provider.
Private connectivity between clouds for replication.
Determine DB strategy: multi-master vs CDC.
Implement idempotency & correlation IDs.
Replicated session store configured.
Kafka/queue replication in place.
Monitoring + alerts for replication lag.
Automated failover scripts + runbook.
Regular DR drills scheduled.