Do SLAs, Error Budgets, and Availability Metrics Include Maintenance Windows?
When it comes to service reliability, maintenance windows can be a gray area. Whether you're tracking uptime, setting SLOs, or managing customer expectations through SLAs, the question often comes up:
"Should scheduled maintenance count against our SLA? What about our error budget or availability metrics?"
Let's unpack how scheduled (and unscheduled) maintenance affects your SLAs, error budgets, and availability calculations, and what best practices look like.
SLA: Do Maintenance Windows Count?
Service Level Agreements (SLAs) are typically contractual commitments made to customers, promising a certain level of service availability (e.g., 99.9% uptime).
Planned Maintenance Is Usually Excluded
Most SLAs exclude scheduled and communicated maintenance windows from downtime calculations. That means if the maintenance was:
- Planned ahead of time,
- Properly communicated in advance (often 24-72 hours), and
- Done within agreed-upon time windows (e.g., off-peak hours),
...it usually does not count against the SLA.
Unscheduled or Overrun Maintenance May Count
However, if:
- The maintenance wasn't properly communicated,
- It ran longer than scheduled, or
- It was done during peak usage without approval,
...it can count as downtime and lead to SLA violations or service credits.
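The conditions above can be sketched as a small classification function. This is only an illustration: the `MaintenanceWindow` fields, the 24-hour `MIN_NOTICE`, and the choice to count only the overrun of an otherwise valid window are assumptions, and a real SLA's wording should take precedence.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta

MIN_NOTICE = timedelta(hours=24)  # assumed notice period; real SLAs vary


@dataclass
class MaintenanceWindow:
    announced_at: datetime     # when customers were notified
    scheduled_start: datetime
    scheduled_end: datetime
    actual_end: datetime
    in_agreed_window: bool     # e.g. inside contractually agreed off-peak hours


def sla_downtime(w: MaintenanceWindow) -> timedelta:
    """Return the part of a maintenance window that counts as SLA downtime."""
    notice = w.scheduled_start - w.announced_at
    if notice < MIN_NOTICE or not w.in_agreed_window:
        # Poorly communicated or outside agreed hours: the whole outage counts.
        return w.actual_end - w.scheduled_start
    # Properly planned: only time past the scheduled end counts (assumed policy).
    return max(w.actual_end - w.scheduled_end, timedelta(0))
```

Under these assumptions, a window announced 48 hours ahead that overruns by 20 minutes contributes 20 minutes of downtime, while the same window announced only 2 hours ahead contributes its full duration.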
Error Budgets: Are They Affected by Maintenance?
Error budgets represent the amount of failure or unreliability tolerated over a period, based on an SLO (Service Level Objective). If your SLO is 99.9% uptime per month, your error budget is about 43.2 minutes of allowed downtime (0.1% of a 30-day month).
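The arithmetic behind that figure is easy to reproduce. The helper below is a minimal sketch; the function name and the 30-day default are assumptions chosen for illustration.

```python
def error_budget_minutes(slo: float, days: int = 30) -> float:
    """Allowed downtime in minutes for an availability SLO over `days` days."""
    total_minutes = days * 24 * 60
    return total_minutes * (1 - slo)


print(round(error_budget_minutes(0.999), 1))      # 43.2 (30-day month)
print(round(error_budget_minutes(0.999, 31), 1))  # 44.6 (31-day month)
print(round(error_budget_minutes(0.9995), 1))     # 21.6
```

Note that the budget shifts slightly with the length of the measurement period, so it helps to state the period alongside the SLO.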
Planned Maintenance Usually Doesn't Burn Budget
If maintenance is pre-approved and doesn't disrupt users, it's typically excluded from the error budget, especially in SRE frameworks that prioritize user-perceived reliability.
User-Impacting Events Do Burn Budget
If users are affected, even during scheduled maintenance, some orgs choose to count it against the error budget. The key question is:
"Would a user notice or be blocked?"
If yes, it probably burns error budget. If no, it likely doesn't.
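One way to apply that rule of thumb is to tag each event with whether users were affected and subtract only the user-impacting ones from the budget. This is a sketch of one possible policy, not a standard; `remaining_budget` and the event tuple layout are hypothetical.

```python
def remaining_budget(budget_min: float, events: list[tuple[float, bool]]) -> float:
    """Subtract user-impacting downtime from the budget.

    Each event is (duration_minutes, user_impacting). Events that users
    would not notice are ignored, whether or not they were planned.
    """
    burned = sum(duration for duration, impacting in events if impacting)
    return budget_min - burned


events = [
    (30, False),  # planned maintenance, no user impact
    (10, True),   # planned maintenance that blocked logins
    (5, True),    # unplanned outage
]
print(round(remaining_budget(43.2, events), 1))  # 28.2
```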
Availability: Does Maintenance Affect It?
Availability is the actual measured uptime of your service over a specific period, typically expressed as a percentage like 99.95%.
Whether maintenance counts against it depends on how you define availability in your metrics.
User-Facing Availability
If your availability metric reflects user impact, planned maintenance that's properly communicated is often excluded.
System-Level (Strict) Availability
If you're measuring raw service uptime (e.g., pings, monitoring checks), all downtime, including planned maintenance, might be included.
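To see how much the definition matters, the two approaches can be compared on the same data. The sketch below assumes that excluded planned maintenance is removed from both the downtime and the measurement period, which is one common convention but not the only one; the function and parameter names are illustrative.

```python
def availability(period_min: float, unplanned_min: float,
                 planned_min: float, exclude_planned: bool) -> float:
    """Return availability as a percentage."""
    if exclude_planned:
        # User-facing view: planned windows are left out of the period entirely.
        total = period_min - planned_min
        down = unplanned_min
    else:
        # Strict view: every minute of downtime counts.
        total = period_min
        down = unplanned_min + planned_min
    return 100 * (total - down) / total


month = 30 * 24 * 60  # 43,200 minutes
print(round(availability(month, 20, 60, exclude_planned=True), 3))   # 99.954
print(round(availability(month, 20, 60, exclude_planned=False), 3))  # 99.815
```

With 20 minutes of unplanned and 60 minutes of planned downtime, the same month meets a 99.9% target under one definition and misses it under the other.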
Summary Table
Maintenance Type | Counts Toward SLA? | Burns Error Budget? | Affects Availability? |
---|---|---|---|
Planned & Communicated | Usually not | Usually not | Only if defined that way |
Unplanned or Overrun | Yes | Yes | Yes |
Poorly Communicated | Yes | Yes | Yes |
Best Practices
- Define everything explicitly: Make sure SLAs, SLOs, and availability metrics clearly state how maintenance is handled.
- Communicate proactively: Proper notification is key to excluding maintenance from SLAs and error budgets.
- Focus on user impact: Base decisions on whether users are affected, not just whether systems are up or down.
- Align across teams: Ensure engineering, product, and legal are aligned on how you track and report service health.