LearnwithVishnu
Basics → Production → Architect
🚨 Incident Response
Levels: Beginner · Engineer · Production · Architect
Production incident management: severity, runbooks, blameless postmortems

🚨 Incident Response Framework

Why structured incident response matters

Unstructured incident response: multiple engineers duplicating each other's work, no one communicating with stakeholders, random restarts without understanding the root cause, and the same incident recurring the next month.

Structured incident response: clear roles, focused investigation, stakeholders informed, postmortem prevents recurrence.

Severity | Definition | Response time | Resolve within
SEV1 | All users affected, production down | Immediate (24/7) | 1 hour
SEV2 | Major features unavailable, significant impact | 15 minutes | 4 hours
SEV3 | Minor degradation, workaround available | 1 hour (business hours) | 24 hours
SEV4 | Cosmetic issues, no user impact | Next sprint | Sprint cycle
Severity levels and team roles
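
The severity table can also be encoded as data so that tooling can compute deadlines from it. The following is a minimal sketch, not a definitive implementation: the names SeverityPolicy, POLICIES, classify and resolve_deadline are hypothetical, and the three impact flags are an assumed simplification of the "Definition" column. SEV4 has no clock-based target in the table, so it is modeled as None.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Optional


@dataclass(frozen=True)
class SeverityPolicy:
    name: str
    respond_within: Optional[timedelta]  # None = scheduled via sprint planning
    resolve_within: Optional[timedelta]  # None = resolved within the sprint cycle


# Targets copied from the severity table; SEV4 has no clock-based target.
POLICIES = {
    "SEV1": SeverityPolicy("SEV1", timedelta(0), timedelta(hours=1)),
    "SEV2": SeverityPolicy("SEV2", timedelta(minutes=15), timedelta(hours=4)),
    "SEV3": SeverityPolicy("SEV3", timedelta(hours=1), timedelta(hours=24)),
    "SEV4": SeverityPolicy("SEV4", None, None),
}


def classify(all_users_down: bool, major_feature_down: bool, user_impact: bool) -> str:
    """Map coarse impact flags to a severity level (assumed triage rule)."""
    if all_users_down:
        return "SEV1"
    if major_feature_down:
        return "SEV2"
    if user_impact:
        return "SEV3"
    return "SEV4"


def resolve_deadline(severity: str, opened_at: datetime) -> Optional[datetime]:
    """Return the resolution deadline, or None when the level has no fixed target."""
    policy = POLICIES[severity]
    if policy.resolve_within is None:
        return None
    return opened_at + policy.resolve_within
```

For example, resolve_deadline("SEV2", opened_at) returns opened_at plus four hours. The "business hours" qualifier on the SEV3 response time is not modeled here; a real implementation would need a working-hours calendar.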

📋 Incident Runbook

Detect → Assess → Mitigate → Resolve phases

📝 Blameless Postmortem

Postmortem template + 5 Whys + action items

🔧 Tools & Commands

Alerting tools + quick investigation commands + rollback

🎯 Interview Questions

INCIDENT · ENGINEER
What is the difference between an incident and a problem in ITSM?
In ITSM (IT Service Management, as defined by the ITIL framework), an Incident is an unplanned interruption or degradation of service: something is broken right now. The goal is to restore service as fast as possible; root cause analysis can wait. A Problem is the underlying cause of one or more incidents. Problem management investigates root causes to prevent future incidents.
Example: on Monday morning the payment service is down (Incident). The team restores service by restarting pods. Later that week, Problem management investigates why the pods crash and discovers a memory leak in a new library version. Fixing the library prevents future incidents.
In DevOps practice we use simpler terminology: Incident (acute, restore now) and Postmortem (root cause analysis, prevent recurrence). The ITSM distinction still matters at enterprise accounts (banks, telcos, HPE-scale organizations) where formal ITSM processes are required for compliance and change management.
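
One way to picture the distinction is as two record types, where a single Problem can own many Incidents. This is only an illustrative sketch: the class names, fields, and ID formats (INC-1, PRB-1) are hypothetical and do not mirror any specific ITSM tool's schema.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Incident:
    """An acute interruption: the record tracks how service was restored."""
    id: str
    summary: str
    mitigation: str
    problem_id: Optional[str] = None  # filled in once a root cause is being tracked


@dataclass
class Problem:
    """The underlying cause shared by one or more incidents."""
    id: str
    root_cause: str
    incident_ids: List[str] = field(default_factory=list)

    def link(self, incident: Incident) -> None:
        """Attach an incident to this problem without creating duplicates."""
        incident.problem_id = self.id
        if incident.id not in self.incident_ids:
            self.incident_ids.append(incident.id)


# Usage, following the payment-service example in the answer.
outage = Incident("INC-1", "payment service down", "restarted pods")
leak = Problem("PRB-1", "memory leak in a new library version")
leak.link(outage)
```

The incident can be closed as soon as service is restored, while the problem stays open until the library is fixed.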
INCIDENT · ARCHITECT
How do you build a blameless postmortem culture?
Blameless postmortems require a top-down commitment that the goal is system improvement, not punishment. The foundational principle: engineers make the best decisions possible with the information available at the time. If the system was designed so that a single engineer's mistake causes a major outage, that is a system design problem, not a human failure.
Practices: keep names out of the root cause analysis and write about the system, not the person. Use the passive voice: "the deployment was triggered", not "Vishnu triggered the deployment". Replace every instance of "could have" or "should have" with "the system lacked". Focus on what information was available at decision time, not on what we know now. Five Whys goes five levels deep into system failures and never stops at human error.
At HPE we had an incident where a developer accidentally deleted a production namespace. Blameless analysis found no RBAC preventing namespace deletion, no confirmation prompt for delete operations, and no backup to restore from. That gave us three system fixes. If we had blamed the developer, we would have fixed nothing, and the next developer in a stressful 2am situation would have made the same mistake with the same missing protections.
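
The wording rules above can be partially automated as a review aid. The sketch below flags phrases the answer singles out ("could have", "should have", "human error") in a postmortem draft. The function name find_blame_phrases and the pattern list are assumptions for illustration; a simple phrase check cannot detect named individuals or judge tone, so it supplements a human review rather than replacing it.

```python
import re
from typing import List, Tuple

# Phrases taken from the practices above; extend the list to suit your team.
BLAME_PATTERNS = [
    r"\bcould have\b",
    r"\bshould have\b",
    r"\bhuman error\b",
]


def find_blame_phrases(text: str) -> List[Tuple[int, str]]:
    """Return (line_number, matched_phrase) for each flagged phrase in the draft."""
    findings = []
    for line_no, line in enumerate(text.splitlines(), start=1):
        for pattern in BLAME_PATTERNS:
            for match in re.finditer(pattern, line, flags=re.IGNORECASE):
                findings.append((line_no, match.group(0).lower()))
    return findings


draft = (
    "The engineer should have checked the namespace.\n"
    "The system lacked RBAC rules preventing namespace deletion."
)
flagged = find_blame_phrases(draft)
```

Here only the first line of the draft is flagged; the second line already describes what the system lacked.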
INCIDENT · PRODUCTION
Production is down. Walk through your first 15 minutes.
Structured response, no panic.
Minute 0-2: acknowledge the alert (this stops duplicate responses). Make a quick assessment: is this real or a monitoring glitch? Check whether multiple signals correlate: alert firing, elevated error rate in Grafana, user reports in the support channel.
Minute 2-5: declare the incident in the incidents channel, stating what is happening, who is affected, and who is the Incident Commander. Start a shared document or Slack thread for the timeline.
Minute 5-10: understand the scope before acting. List pods that are not Running and review events in time order:
    kubectl get pods -A | grep -v Running
    kubectl get events --sort-by=.lastTimestamp
Check recent deployments: was there a deploy in the last hour? Check the monitoring dashboards: when exactly did the issue start? Correlate that with the deployment history.
Minute 10-15: identify the fastest path to service restoration, not the root cause. If there was a recent deployment, roll it back immediately, even if you are not sure it caused the issue. Rolling back is usually safe, and continuing to investigate while users are affected costs more than a premature rollback. If there was no recent deployment, check pod health, scale up replicas, and check database connectivity.
Communicate status to stakeholders every 15 minutes, even if there is no update. Silence during an incident is worse than bad news.
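
The minute 10-15 decision and the 15-minute update cadence can be written down as two small functions. This is a sketch under stated assumptions: the function names, the returned labels, and the constant names are hypothetical, and the one-hour window is simply the "deploy in the last hour" check from the answer.

```python
from datetime import datetime, timedelta
from typing import List, Optional

ROLLBACK_WINDOW = timedelta(hours=1)      # "was there a deploy in the last hour?"
UPDATE_INTERVAL = timedelta(minutes=15)   # stakeholder update cadence


def first_mitigation(now: datetime, last_deploy_at: Optional[datetime]) -> str:
    """Pick the first mitigation: roll back a recent deploy, otherwise check health."""
    if last_deploy_at is not None:
        age = now - last_deploy_at
        if timedelta(0) <= age <= ROLLBACK_WINDOW:
            return "rollback"
    # No recent deploy: check pod health, scale replicas, check database connectivity.
    return "check-health"


def update_times(declared_at: datetime, count: int) -> List[datetime]:
    """Times at which the next `count` stakeholder updates are due."""
    return [declared_at + UPDATE_INTERVAL * i for i in range(1, count + 1)]


declared = datetime(2024, 1, 1, 9, 0)
action = first_mitigation(declared, datetime(2024, 1, 1, 8, 30))
due = update_times(declared, 2)
```

With a deploy 30 minutes before the incident was declared, the sketch chooses a rollback, and the first two updates fall 15 and 30 minutes after declaration.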