HA & DR — LearnwithVishnu

🔄 HA & DR

High Availability and Disaster Recovery: RTO, RPO, Kubernetes HA, database failover

🔄 HA vs DR — Core Concepts

Two Different Problems

Aspect	High Availability (HA)	Disaster Recovery (DR)
Scenario	Pod crashes, node fails, AZ down	Entire region unavailable
Goal	Zero downtime during partial failure	Recover from total failure
RTO	Seconds to minutes	Minutes to hours
Solution	Multiple replicas, anti-affinity, PDB	Multi-region, backups, runbooks
Cost	Medium (+50-100% infra)	High (+100-200% infra for active-passive)

RTO/RPO/MTTR + CAP theorem

☸️ HA in Kubernetes

PDB, anti-affinity, topology spread, health probes

🗄️ Database HA

PostgreSQL HA, RDS Multi-AZ, backup strategy

🌍 Disaster Recovery

DR Tiers

Strategy	RPO	RTO	Cost	Use when
Active-Active	~0	~0	2x	RTO/RPO requirements are seconds
Active-Passive (warm)	Minutes	5-15 min	1.5x	Business-critical, can afford 15 min downtime
Backup + Restore	Hours	1-4 hours	1.1x	Non-critical, cost-sensitive
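
The table can be read as a lookup: pick the cheapest strategy whose RPO and RTO both fit the requirement. The following is a minimal Python sketch of that reading, not a sizing tool. The `Strategy` class and `cheapest_strategy` function are hypothetical names, the cost factors are assumed to be relative to a single-region baseline, and the numeric RPO values for the two cheaper tiers are assumptions standing in for the table's qualitative "Minutes" and "Hours".

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Strategy:
    name: str
    rpo_minutes: float  # worst-case data loss window
    rto_minutes: float  # worst-case time to recover
    cost_factor: float  # assumed: relative to a single-region baseline


# Upper bounds taken from the table. Entries marked "assumed" replace the
# table's qualitative values with illustrative numbers.
STRATEGIES = [
    Strategy("Active-Active", rpo_minutes=0, rto_minutes=0, cost_factor=2.0),
    Strategy("Active-Passive (warm)", rpo_minutes=5, rto_minutes=15, cost_factor=1.5),  # RPO assumed
    Strategy("Backup + Restore", rpo_minutes=240, rto_minutes=240, cost_factor=1.1),  # RPO assumed
]


def cheapest_strategy(max_rpo_minutes: float, max_rto_minutes: float) -> Optional[Strategy]:
    """Return the lowest-cost strategy that satisfies both objectives, or None."""
    candidates = [
        s for s in STRATEGIES
        if s.rpo_minutes <= max_rpo_minutes and s.rto_minutes <= max_rto_minutes
    ]
    if not candidates:
        return None
    return min(candidates, key=lambda s: s.cost_factor)


if __name__ == "__main__":
    for rpo, rto in [(15, 15), (1, 15), (24 * 60, 8 * 60)]:
        choice = cheapest_strategy(rpo, rto)
        print(f"RPO<={rpo} min, RTO<={rto} min -> {choice.name if choice else 'none'}")
```

Under the assumed 5-minute RPO for the warm standby, a 1-minute RPO requirement falls through to Active-Active. A warm standby with tighter replication lag would change that answer, which is why the assumed numbers should be replaced with measured ones.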

Velero K8s backup, DR runbook, chaos engineering

🎯 Interview Questions

HA/DR · ENGINEER

What is the difference between RTO and RPO? Give a concrete example.

RTO (Recovery Time Objective) is the maximum acceptable time your system can be down after a failure. It answers: how long until we must be back online? RPO (Recovery Point Objective) is the maximum acceptable amount of data loss, measured as a window of time. It answers: how old can our last good copy of the data be?

Concrete example: a payment processing system. The business decides it cannot afford to be down more than 15 minutes (RTO = 15 min) and cannot lose more than 1 minute of transaction data (RPO = 1 min). These requirements drive architecture decisions. An RTO of 15 minutes means you need a warm standby that can be promoted quickly, not a cold backup that takes 2 hours to restore. An RPO of 1 minute means you need synchronous or near-synchronous replication; daily backups would give an RPO of up to 24 hours.

Lower RTO and RPO mean higher infrastructure cost. A system with RTO = 0 and RPO = 0 (no downtime, no data loss) requires an active-active multi-region architecture, which is very expensive. Choose RTO and RPO by weighing the business impact of downtime against the cost of the HA infrastructure.
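
To make the payment example concrete, here is a small Python sketch that checks two recovery plans against RTO = 15 min and RPO = 1 min. `RecoveryPlan` and `meets_objectives` are hypothetical names. The daily-backup figures (24-hour gap, 2-hour restore) come from the answer above, while the warm-standby figures (30 seconds of replication lag, 10 minutes to promote) are illustrative assumptions.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RecoveryPlan:
    name: str
    max_data_gap_minutes: float  # longest gap between recoverable copies of the data
    recovery_minutes: float      # time to restore or promote and resume serving


def meets_objectives(plan: RecoveryPlan, rto_minutes: float, rpo_minutes: float) -> bool:
    """A plan qualifies only if it satisfies both the RTO and the RPO."""
    return (plan.recovery_minutes <= rto_minutes
            and plan.max_data_gap_minutes <= rpo_minutes)


# Daily backup restored from cold storage: figures from the text.
daily_backup = RecoveryPlan("daily backup, cold restore", 24 * 60, 120)
# Warm standby with streaming replication: lag and promotion time are assumed.
warm_standby = RecoveryPlan("warm standby, streaming replication", 0.5, 10)

if __name__ == "__main__":
    for plan in (daily_backup, warm_standby):
        verdict = "meets" if meets_objectives(plan, rto_minutes=15, rpo_minutes=1) else "misses"
        print(f"{plan.name}: {verdict} RTO=15min / RPO=1min")
```

The check is deliberately strict: a plan that recovers quickly but loses too much data fails just as a plan that preserves data but recovers slowly does.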

HA/DR · ARCHITECT

How do you design a highly available application on Kubernetes?

HA in Kubernetes requires multiple layers working together.

Application layer: run a minimum of 3 replicas, never 1. Add a Pod Disruption Budget (PDB) that keeps at least 2 pods running during voluntary disruptions such as node drains and upgrades. Configure liveness and readiness probes so Kubernetes knows when a pod is unhealthy or not ready for traffic. Use anti-affinity rules to spread pods across availability zones: if all pods are in AZ-A and AZ-A goes down, you have zero replicas. Topology spread constraints are the modern way to enforce this.

Infrastructure layer: multiple nodes across multiple AZs, with node auto-scaling through Cluster Autoscaler.

Database layer: PostgreSQL with streaming replication, plus connection pooling with PgBouncer for resilience during failovers.

Networking layer: Services with session affinity disabled (stateless pods), and graceful shutdown with terminationGracePeriodSeconds set at least as long as your request timeout so in-flight requests can finish.

The test: can you drain any single node without downtime? Run kubectl drain <node-name> --ignore-daemonsets. If this causes alerts or errors, your HA is incomplete. Run this test monthly in staging.
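
The application-layer settings can be sketched as manifests. The Python below builds a Deployment and a PodDisruptionBudget as plain dictionaries and prints them as JSON, which kubectl accepts in place of YAML. It is a minimal sketch under stated assumptions: the function name `ha_manifests`, the app name `web`, the image `example.com/web:1.0`, port 8080, and the probe paths `/ready` and `/live` are all hypothetical, and a real workload would also need resource requests and probe timing tuned to the application.

```python
import json


def ha_manifests(app, image, replicas=3, min_available=2, port=8080, grace_seconds=30):
    """Build a zone-spread Deployment plus a matching PodDisruptionBudget."""
    if min_available >= replicas:
        # A PDB that never allows an eviction would block every node drain.
        raise ValueError("min_available must be lower than replicas")
    labels = {"app": app}
    deployment = {
        "apiVersion": "apps/v1",
        "kind": "Deployment",
        "metadata": {"name": app},
        "spec": {
            "replicas": replicas,
            "selector": {"matchLabels": labels},
            "template": {
                "metadata": {"labels": labels},
                "spec": {
                    # Time allowed for in-flight requests to finish on shutdown.
                    "terminationGracePeriodSeconds": grace_seconds,
                    # Keep the per-zone pod counts within 1 of each other.
                    "topologySpreadConstraints": [{
                        "maxSkew": 1,
                        "topologyKey": "topology.kubernetes.io/zone",
                        "whenUnsatisfiable": "DoNotSchedule",
                        "labelSelector": {"matchLabels": labels},
                    }],
                    "containers": [{
                        "name": app,
                        "image": image,
                        "ports": [{"containerPort": port}],
                        # Hypothetical probe endpoints.
                        "readinessProbe": {"httpGet": {"path": "/ready", "port": port}},
                        "livenessProbe": {"httpGet": {"path": "/live", "port": port}},
                    }],
                },
            },
        },
    }
    pdb = {
        "apiVersion": "policy/v1",
        "kind": "PodDisruptionBudget",
        "metadata": {"name": f"{app}-pdb"},
        "spec": {"minAvailable": min_available, "selector": {"matchLabels": labels}},
    }
    return {"apiVersion": "v1", "kind": "List", "items": [deployment, pdb]}


if __name__ == "__main__":
    print(json.dumps(ha_manifests("web", "example.com/web:1.0"), indent=2))
```

The output could be piped to kubectl apply -f - in a staging cluster. The guard on `min_available` reflects a common pitfall: a budget equal to the replica count leaves no pod that may be evicted, so drains hang instead of proceeding.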

HA/DR · PRODUCTION

Production database just failed. Walk through your incident response.

Structured response:

First 2 minutes: assess, don't act. Is this a node failure (the standby should auto-promote), a network issue (a routing problem), or true data loss? Check the database pod status with kubectl, the managed instance status in CloudWatch for RDS (or Azure Monitor for an Azure-managed database), and the application error logs to establish when the issue started.

Minutes 2-5: confirm that automatic failover is under way. For RDS Multi-AZ, failover is automatic and typically takes 1-2 minutes; watch progress with aws rds describe-events. For self-managed PostgreSQL with Patroni, run patronictl -c /etc/patroni/patroni.yml list, which shows the cluster state and should show the new leader.

Minutes 5-15: verify that applications have reconnected. Applications behind a connection pooler (PgBouncer) usually recover without intervention once the pooler points at the new primary, although in-flight transactions fail and must be retried. Applications with direct connections may need a restart. Check application health endpoints and update the incident status channel.

Minutes 15-30: if automatic failover did not happen, fail over manually. For RDS: aws rds reboot-db-instance --db-instance-identifier <instance-id> --force-failover.

Post-incident: run a postmortem. Was backup restoration tested recently? Did the failover time meet the RTO? Update the runbook based on what took longer than expected.
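
One way an application with direct database connections could ride out a failover instead of needing a restart is to retry the connection with exponential backoff. The Python below is a minimal sketch of that idea, not taken from any particular driver. `connect_with_retry` is a hypothetical helper, `connect` is whatever callable opens a connection in your code, and catching `ConnectionError` is an assumption: real drivers raise their own exception types, which can be passed through `retry_on`.

```python
import time


def connect_with_retry(connect, attempts=8, base_delay=1.0, max_delay=30.0,
                       retry_on=(ConnectionError,), sleep=time.sleep):
    """Call connect() until it succeeds, backing off exponentially between failures.

    With the defaults the waits are 1, 2, 4, 8, 16, 30 and 30 seconds,
    about 90 seconds in total before giving up.
    """
    last_error = None
    for attempt in range(attempts):
        try:
            return connect()
        except retry_on as exc:
            last_error = exc
            if attempt < attempts - 1:
                # Double the wait each time, capped at max_delay.
                sleep(min(max_delay, base_delay * 2 ** attempt))
    raise ConnectionError(f"database unreachable after {attempts} attempts") from last_error


if __name__ == "__main__":
    calls = {"n": 0}

    def flaky_connect():
        # Simulated database that refuses the first three connection attempts.
        calls["n"] += 1
        if calls["n"] <= 3:
            raise ConnectionError("primary not available")
        return "connection"

    # sleep is stubbed out so the demo runs instantly.
    print(connect_with_retry(flaky_connect, sleep=lambda seconds: None))
```

The `sleep` parameter is injectable so the behaviour can be tested without waiting. The total retry budget should be sized against the failover time you actually observe, since giving up too early turns a brief failover into an application outage.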
