📐 SLO & Error Budget
📐 SLA, SLO, SLI — Core Concepts
One of the most frequently asked senior-level interview topics in SRE and DevOps
These three terms are related but distinct. Getting them confused in an interview signals you have not worked at the architectural level.
| Term | What it is | Who sets it | Example |
|---|---|---|---|
| SLI — Indicator | The actual measured value | Engineering (what can we measure?) | 99.3% availability this month |
| SLO — Objective | Internal reliability target | Engineering + Product | Must maintain 99.9% availability |
| SLA — Agreement | External customer contract | Business + Legal | We guarantee 99.5% or give credits |
Error Budget — the key insight
Error budget = 100% - SLO. It is the amount of unreliability you are allowed. It transforms reliability from a vague aspiration into a quantitative resource teams manage consciously.
| SLO | Error budget | Allowed downtime per 30-day month |
|---|---|---|
| 99% | 1% | 7.2 hours |
| 99.9% | 0.1% | 43.2 minutes |
| 99.95% | 0.05% | 21.6 minutes |
| 99.99% | 0.01% | 4.3 minutes |
SLA/SLO/SLI concepts with real numbers
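The arithmetic behind the error budget table can be sketched in a few lines of Python. This is a minimal illustration, not part of any monitoring tool: the function name `error_budget` and the 30-day window are assumptions chosen to match the figures used in this section.

```python
from __future__ import annotations

from datetime import timedelta


def error_budget(slo_percent: float,
                 window: timedelta = timedelta(days=30)) -> tuple[float, timedelta]:
    """Return (budget as a fraction, allowed downtime) for an availability SLO.

    Assumes downtime is measured as wall-clock time over a fixed window
    (30 days by default, an assumption; calendar months vary in length).
    """
    budget_fraction = (100.0 - slo_percent) / 100.0
    return budget_fraction, window * budget_fraction


if __name__ == "__main__":
    for slo in (99.0, 99.9, 99.95, 99.99):
        fraction, downtime = error_budget(slo)
        minutes = downtime.total_seconds() / 60
        print(f"SLO {slo}%: budget {fraction:.2%}, about {minutes:.1f} minutes per 30 days")
```

Passing a different `window` (for example `timedelta(days=28)` for a rolling four-week SLO) changes the allowed downtime but not the budget fraction.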
📊 Implementing SLOs in Prometheus
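SLOs are not just a concept: implement them in code
Recording rules calculate your SLI continuously. Alerts fire when the burn rate is too high, not when you have already exhausted the budget. The burn rate is how fast the error budget is being consumed relative to the pace that would spend it exactly over the SLO window: a burn rate of 1 uses the whole budget by the end of the window, a burn rate of 2 uses it in half that time.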
Availability SLO recording rules + burn rate alerts
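The quantities that an availability recording rule and a burn rate alert work with can be sketched in plain Python. This is an illustration of the arithmetic only, not Prometheus configuration; the function names and the request counts in the example are hypothetical.

```python
def availability_sli(good_requests: int, total_requests: int) -> float:
    """Fraction of requests that succeeded in the measured window."""
    if total_requests == 0:
        # Assumption: with no traffic, treat the service as fully available.
        return 1.0
    return good_requests / total_requests


def burn_rate(sli: float, slo: float) -> float:
    """Observed error ratio divided by the error ratio the SLO allows.

    1.0 means the budget lasts exactly the SLO window; 2.0 means it is
    spent in half the window.
    """
    return (1.0 - sli) / (1.0 - slo)


def hours_until_exhausted(rate: float,
                          window_hours: float = 30 * 24,
                          budget_remaining: float = 1.0) -> float:
    """Hours until the remaining budget is gone if the burn rate holds."""
    if rate <= 0:
        return float("inf")
    return window_hours * budget_remaining / rate


if __name__ == "__main__":
    # Hypothetical counts: 998,000 good requests out of 1,000,000.
    sli = availability_sli(998_000, 1_000_000)
    rate = burn_rate(sli, slo=0.999)
    print(f"SLI {sli:.4f}, burn rate {rate:.2f}, "
          f"budget gone in about {hours_until_exhausted(rate):.0f} hours")
```

Alerting on the burn rate rather than on the remaining budget is what gives early warning: a high rate is visible within minutes, long before the budget reaches zero.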
⏱️ Latency SLOs
Latency SLO: P99 budget + Grafana panels
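A latency SLO is usually phrased as "N% of requests complete within a threshold", which makes the SLI a ratio just like availability. The sketch below shows both views on raw samples, assuming a hypothetical target of 99% of requests under 500 ms; the sample data and function names are invented for illustration, and a real setup would compute this from histogram metrics rather than raw lists.

```python
import math
from typing import Sequence


def percentile(samples: Sequence[float], pct: float) -> float:
    """Nearest-rank percentile (one of several common definitions)."""
    if not samples:
        raise ValueError("no samples")
    ordered = sorted(samples)
    rank = math.ceil(pct * len(ordered) / 100)
    return ordered[max(rank, 1) - 1]


def latency_sli(samples: Sequence[float], threshold_ms: float) -> float:
    """Fraction of requests that finished within the threshold."""
    if not samples:
        raise ValueError("no samples")
    return sum(1 for s in samples if s <= threshold_ms) / len(samples)


if __name__ == "__main__":
    # Hypothetical window: 985 fast requests and 15 slow ones.
    samples_ms = [120.0] * 985 + [900.0] * 15
    target = 0.99          # assumed SLO: 99% of requests ...
    threshold_ms = 500.0   # ... complete within 500 ms

    sli = latency_sli(samples_ms, threshold_ms)
    p99 = percentile(samples_ms, 99)
    print(f"P99 = {p99} ms, SLI = {sli:.3f}, SLO met: {sli >= target}")
```

Expressing the SLI as a good/total ratio means the same error budget and burn rate arithmetic used for availability applies to latency unchanged.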
📋 Error Budget Policy
How error budgets change team behaviour
Without an error budget policy, reliability is a vague concern and feature pressure always wins. With an error budget policy, reliability has a measurable impact on feature delivery, so teams balance the two as a matter of course.
Error budget policy + multi-burn-rate alerting
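One way to picture a policy combined with multi-burn-rate alerting is sketched below. The window pairs and thresholds (14.4 over 1h/5m, 6 over 6h/30m, 1 over 3d/6h) follow the values commonly cited from Google's SRE Workbook for a 30-day SLO. The policy tiers, the 25% cut-off, and all names are hypothetical and would be agreed per team.

```python
from __future__ import annotations

from dataclasses import dataclass
from typing import Mapping, Optional


@dataclass(frozen=True)
class BurnRule:
    long_window: str
    short_window: str
    threshold: float
    severity: str


# Thresholds as commonly cited for a 30-day SLO window.
RULES = (
    BurnRule("1h", "5m", 14.4, "page"),
    BurnRule("6h", "30m", 6.0, "page"),
    BurnRule("3d", "6h", 1.0, "ticket"),
)


def alert_severity(burn_rates: Mapping[str, float]) -> Optional[str]:
    """Return the first matching severity, or None.

    A rule fires only if BOTH its long and short windows exceed the
    threshold, so an alert clears soon after the problem stops.
    """
    for rule in RULES:
        if (burn_rates[rule.long_window] > rule.threshold
                and burn_rates[rule.short_window] > rule.threshold):
            return rule.severity
    return None


def policy_action(budget_remaining: float) -> str:
    """Map remaining budget (fraction, may be negative) to a hypothetical policy tier."""
    if budget_remaining <= 0:
        return "freeze non-critical deployments; reliability work only"
    if budget_remaining < 0.25:
        return "slow down; risky changes need review"
    return "ship normally"


if __name__ == "__main__":
    ongoing = {"5m": 20.0, "30m": 18.0, "1h": 16.0, "6h": 7.0, "3d": 1.5}
    recovered = {"5m": 0.2, "30m": 0.5, "1h": 16.0, "6h": 3.0, "3d": 0.8}
    print(alert_severity(ongoing), alert_severity(recovered))
    print(policy_action(1 - 50 / 43.2))  # 50 minutes used of a 43.2-minute budget
```

The short window is what keeps a finished incident from paging for another hour, while the long window keeps a brief blip from paging at all.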
🎯 Interview Questions
SLO · ENGINEER
Explain SLA, SLO, and SLI. Give a concrete example of each.
An SLI is the actual measurement: a number you can query from monitoring. Examples: availability (percentage of successful HTTP requests in a time period), latency (P99 request duration), error rate (percentage of 5xx responses).

An SLO is your internal reliability target: what the engineering team commits to maintaining. It should be stricter than the SLA to give a buffer. Example: an availability SLO of 99.9% means the team alerts and acts if availability drops below 99.9%, which allows 43.2 minutes of downtime in a 30-day month.

An SLA is the external customer contract with commercial consequences. Example: an availability SLA of 99.5% means that if a customer experiences more than 3.6 hours of downtime in a 30-day month, they receive service credits.

The key relationships: the SLI is what you measure, the SLO is what you aim for (internal), and the SLA is what you promise (external). The gap between SLO and SLA is your safety margin. If your SLO and SLA are the same, any incident that breaks the SLO immediately breaches the customer contract with no warning.
SLO · ARCHITECT
What is an error budget and how does it change engineering behaviour?
The error budget is the amount of unreliability you are allowed under your SLO. If your SLO is 99.9% availability, your error budget is 0.1%, which equals 43.2 minutes in a 30-day month. The error budget transforms reliability from a vague goal into a quantitative resource that teams manage consciously.

When the budget is healthy, the team has permission to move fast, take risks, do large deployments, and experiment. The error budget is the business saying it is acceptable to have occasional brief outages in exchange for faster feature delivery.

When the budget is nearly exhausted, the team must slow down, freeze risky changes, and focus on reliability. This is not management imposing a freeze; it is an automatic consequence of the team's own reliability target.

This changes engineering behaviour in three ways. Developers become invested in reliability, because reliability problems directly stop feature work. Reliability work gets prioritised automatically, without management intervention. Decisions about deployment timing become data-driven rather than opinion-driven.
SLO · PRODUCTION
Your service SLO is 99.9% availability. You have had 50 minutes of downtime this month with 5 days left. What do you do?
Error budget calculation: a 99.9% SLO over a 30-day month allows 0.001 × 43,200 minutes = 43.2 minutes. With 50 minutes used, the budget is overspent by 6.8 minutes.

Immediate actions:

One, freeze all non-critical deployments for the remaining 5 days of the month. Any deployment risks additional downtime that the budget cannot absorb.

Two, raise a reliability review to understand what caused each incident. Was it the same root cause multiple times? Was it deployment-related? Infrastructure?

Three, check the SLA. If the SLA is 99.5% (3.6 hours, or 216 minutes), 50 minutes is well within it. If the SLA had also been breached, initiate customer communication and the service credit process.

Four, write an incident postmortem for every incident that contributed to the downtime. Focus on systemic fixes, not blame.

Five, brief the engineering team: reliability work is the priority for the next 5 days. Review what automated testing, monitoring, or deployment safeguards could have prevented the incidents.

First of next month: if the SLO is measured per calendar month, the budget resets (with a rolling window, it recovers only as old incidents age out). Use the postmortem action items to improve reliability. Set up error budget burn rate alerts so that next time you get early warning before the budget is exhausted.