🔄 HA & DR
🔄 HA vs DR — Core Concepts
Two Different Problems
| | High Availability (HA) | Disaster Recovery (DR) |
|---|---|---|
| Scenario | Pod crashes, node fails, AZ down | Entire region unavailable |
| Goal | Zero downtime during partial failure | Recover from total failure |
| RTO | Seconds to minutes | Minutes to hours |
| Solution | Multiple replicas, anti-affinity, PDB | Multi-region, backups, runbooks |
| Cost | Medium (+50-100% infra) | High (+100-200% infra for active-passive) |
RTO/RPO/MTTR + CAP theorem
☸️ HA in Kubernetes
PDB, anti-affinity, topology spread, health probes
🗄️ Database HA
PostgreSQL HA, RDS Multi-AZ, backup strategy
🌍 Disaster Recovery
DR Tiers
| Strategy | RPO | RTO | Cost | Use when |
|---|---|---|---|---|
| Active-Active | ~0 | ~0 | 2x | RTO/RPO requirements are measured in seconds |
| Active-Passive (warm) | Minutes | 5-15 min | 1.5x | Business-critical, can afford 15 min downtime |
| Backup + Restore | Hours | 1-4 hours | 1.1x | Non-critical, cost-sensitive |
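The table can be read as a lookup: pick the cheapest strategy whose RPO and RTO both fit the requirement. The sketch below is one way to express that, not a sizing tool. The numeric bounds are assumptions chosen to stand in for the table's "~0", "Minutes" and "Hours" entries, and `DrTier` and `cheapest_tier` are hypothetical names.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class DrTier:
    name: str
    rpo_seconds: float      # assumed worst-case data loss for this tier
    rto_seconds: float      # assumed worst-case recovery time for this tier
    cost_multiplier: float  # relative infrastructure cost from the table


# Assumption: "~0" is modeled as 1 second, "Minutes" as 5 minutes,
# "Hours" as 24 hours. RTO values use the upper end of the table's ranges.
TIERS = [
    DrTier("Backup + Restore", 24 * 3600, 4 * 3600, 1.1),
    DrTier("Active-Passive (warm)", 5 * 60, 15 * 60, 1.5),
    DrTier("Active-Active", 1, 1, 2.0),
]


def cheapest_tier(required_rpo_seconds: float,
                  required_rto_seconds: float) -> Optional[DrTier]:
    """Return the lowest-cost tier that satisfies both objectives, or None."""
    candidates = [
        t for t in TIERS
        if t.rpo_seconds <= required_rpo_seconds
        and t.rto_seconds <= required_rto_seconds
    ]
    if not candidates:
        return None
    return min(candidates, key=lambda t: t.cost_multiplier)


# A workload that tolerates 10 minutes of data loss and 15 minutes of downtime.
choice = cheapest_tier(required_rpo_seconds=600, required_rto_seconds=900)
print(choice.name if choice else "no tier satisfies the requirement")
```

Replace the assumed bounds with numbers measured in your own DR drills before relying on the result.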
Velero K8s backup, DR runbook, chaos engineering
🎯 Interview Questions
HA/DR · ENGINEER
What is the difference between RTO and RPO? Give a concrete example.
RTO (Recovery Time Objective) is the maximum acceptable time your system can be down after a failure. It answers: how long until we must be back online? RPO (Recovery Point Objective) is the maximum acceptable amount of data loss, measured in time. It answers: how old can the most recent recoverable copy of our data be?

Concrete example: a payment processing system. The business decides that it cannot afford to be down for more than 15 minutes (RTO=15min) and cannot lose more than 1 minute of transaction data (RPO=1min). These requirements drive architecture decisions. An RTO of 15 minutes means you need a warm standby that can be promoted quickly, not a cold backup that takes 2 hours to restore. An RPO of 1 minute means you need synchronous or near-synchronous replication, because daily backups would give an RPO of 24 hours.

Lower RTO and RPO mean higher infrastructure cost. A system with RTO=0 and RPO=0 (no downtime, no data loss) requires an active-active multi-region architecture, which is very expensive. Choose RTO and RPO by weighing the business impact of downtime against the cost of HA infrastructure.
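As a rough sketch, the payment example can be checked in a few lines. The objectives (15 minutes, 1 minute) and the cold-backup figures (2-hour restore, 24-hour-old data) come from the answer above. The warm-standby measurements (10-minute promotion, 5 seconds of replication lag) are illustrative assumptions, and `meets_objectives` is a hypothetical helper.

```python
from datetime import timedelta


def meets_objectives(rto: timedelta, rpo: timedelta,
                     recovery_time: timedelta,
                     data_loss_window: timedelta) -> dict:
    """Compare what a design delivers against the RTO and RPO targets."""
    return {
        "rto_met": recovery_time <= rto,
        "rpo_met": data_loss_window <= rpo,
    }


RTO = timedelta(minutes=15)
RPO = timedelta(minutes=1)

# Assumed measurements for a warm standby with near-synchronous replication.
warm_standby = meets_objectives(
    RTO, RPO,
    recovery_time=timedelta(minutes=10),
    data_loss_window=timedelta(seconds=5),
)

# Cold daily backup: 2 hours to restore, up to 24 hours of lost data.
daily_backup = meets_objectives(
    RTO, RPO,
    recovery_time=timedelta(hours=2),
    data_loss_window=timedelta(hours=24),
)

print("warm standby:", warm_standby)
print("daily backup:", daily_backup)
```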
HA/DR · ARCHITECT
How do you design a highly available application on Kubernetes?
HA in Kubernetes requires several layers working together.

Application layer: run at least 3 replicas, never 1. Add a Pod Disruption Budget that keeps at least 2 pods running during voluntary disruptions such as node drains and upgrades. Configure liveness and readiness probes so Kubernetes can tell when a pod is unhealthy and stop routing traffic to it. Use anti-affinity rules to spread pods across availability zones: if all pods are in AZ-A and AZ-A goes down, you have zero replicas. Topology spread constraints are the modern way to enforce this.

Infrastructure layer: multiple nodes across multiple AZs, with node auto-scaling through the cluster autoscaler.

Database layer: PostgreSQL with streaming replication, plus connection pooling with PgBouncer for resilience during failovers.

Networking layer: Services with session affinity disabled (stateless pods), and graceful shutdown with terminationGracePeriodSeconds set at least as long as your request timeout.

The test: can you drain any single node without downtime? Run kubectl drain NODE_NAME --ignore-daemonsets. If this causes alerts or errors, your HA is incomplete. Run this test monthly in staging.
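Two of the checks above are simple arithmetic and can be sketched without a cluster: how many replicas survive the loss of the busiest zone, and whether a Pod Disruption Budget with a given minAvailable would permit one more eviction. The zone names and function names below are hypothetical. The eviction rule models the minAvailable case only.

```python
from collections import Counter
from typing import Sequence


def worst_case_after_zone_loss(pod_zones: Sequence[str]) -> int:
    """Replicas left if the zone hosting the most pods fails."""
    if not pod_zones:
        return 0
    counts = Counter(pod_zones)
    return len(pod_zones) - max(counts.values())


def eviction_allowed(healthy_pods: int, min_available: int) -> bool:
    """An eviction is permitted only if the budget still holds afterwards."""
    return healthy_pods - 1 >= min_available


# All three replicas in one zone: a single AZ failure leaves nothing.
stacked = ["az-a", "az-a", "az-a"]
# One replica per zone: any single AZ failure leaves two.
spread = ["az-a", "az-b", "az-c"]

print(worst_case_after_zone_loss(stacked))  # 0
print(worst_case_after_zone_loss(spread))   # 2

# 3 healthy pods with minAvailable=2: one drain-driven eviction is allowed,
# a second one is blocked until a replacement pod becomes ready.
print(eviction_allowed(healthy_pods=3, min_available=2))  # True
print(eviction_allowed(healthy_pods=2, min_available=2))  # False
```

The second result is why 3 replicas with a budget of 2 is the smallest setup that both tolerates a drain and keeps redundancy while the evicted pod is rescheduled.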
HA/DR · PRODUCTION
Production database just failed. Walk through your incident response.
Structured response, by phase.

First 2 minutes: assess, do not act. Is this a node failure (the standby should auto-promote), a network issue (routing problem), or true data loss? Check the database pod status with kubectl, the instance status in your cloud monitoring (CloudWatch for RDS, Azure Monitor for an Azure-hosted database), and the application error logs to understand when the issue started.

Minutes 2-5: confirm that automatic failover is under way. For RDS Multi-AZ, failover is automatic and typically completes in 60-120 seconds. Monitor progress with aws rds describe-events. For self-managed PostgreSQL with Patroni, run patronictl -c /etc/patroni/patroni.yml list, which shows the cluster state and should show the new leader.

Minutes 5-15: verify that applications have reconnected. Applications behind a connection pooler (PgBouncer) usually reconnect without a restart, although in-flight transactions fail and must be retried. Applications with direct connections may need a restart. Check application health endpoints and update the incident status channel.

Minutes 15-30: if automatic failover did not happen, fail over manually. For RDS: aws rds reboot-db-instance --db-instance-identifier DB_INSTANCE_ID --force-failover.

Post-incident: run a postmortem. Was backup restoration tested recently? Did the failover time meet the RTO? Update the runbook based on what took longer than expected.
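For the postmortem questions about whether failover met the RTO and what took longer than expected, a small script over the incident timeline is often enough. This is a minimal sketch: the timestamps and phase labels are invented for illustration, the 15-minute RTO is an assumed target, and `phase_durations` and `rto_report` are hypothetical helpers.

```python
from datetime import datetime, timedelta
from typing import List, Tuple

Timeline = List[Tuple[str, datetime]]


def phase_durations(timeline: Timeline) -> List[Tuple[str, timedelta]]:
    """Duration between each pair of consecutive timeline events."""
    return [
        (f"{a} -> {b}", tb - ta)
        for (a, ta), (b, tb) in zip(timeline, timeline[1:])
    ]


def rto_report(timeline: Timeline, rto: timedelta) -> dict:
    """Total recovery time, whether it met the RTO, and the slowest phase."""
    total = timeline[-1][1] - timeline[0][1]
    slowest_label, _ = max(phase_durations(timeline), key=lambda p: p[1])
    return {
        "recovery_time": total,
        "rto_met": total <= rto,
        "slowest_phase": slowest_label,
    }


# Hypothetical incident, events in chronological order.
incident = [
    ("failure detected", datetime(2024, 1, 1, 10, 0)),
    ("failover started", datetime(2024, 1, 1, 10, 4)),
    ("new primary promoted", datetime(2024, 1, 1, 10, 6)),
    ("applications reconnected", datetime(2024, 1, 1, 10, 13)),
]

report = rto_report(incident, rto=timedelta(minutes=15))
print(report)
```

In this made-up timeline the database failover itself took 2 minutes, while application reconnection took 7, which is the kind of finding that should drive the runbook update.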