Your SLOs May Be Lying: Here's How to Fix Service Reliability

Dashboard showing SLO compliance breakdown by customer tier and region in New Relic observability platform

Your dashboard is green. Your uptime SLO sits at a healthy 99.95%. Your error budget looks fine. But are your users actually getting a reliable service? According to New Relic Senior Product Manager Mafalda Verde, the answer may be no, and the data you trust most might be the problem.

In a guide published May 18, 2026, Verde warns that a single top-level SLO works like a watermelon: green on the outside, hiding serious red within. A 99.95% global uptime score can easily mask 98% uptime for enterprise customers or for an entire geographic region, as long as that segment accounts for a small share of total traffic. Averages, in other words, bury outliers.
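The arithmetic behind the watermelon effect is easy to check. The following is a minimal sketch with made-up request counts (the segment names and numbers are assumptions chosen for illustration, not figures from Verde's guide). It shows a small segment at 98% disappearing inside a 99.95% global number.

```python
# Hypothetical request counts per customer segment: (good_requests, total_requests).
# These numbers are illustrative assumptions, not data from the guide.
segments = {
    "enterprise": (19_600, 20_000),       # 2% of all traffic
    "everyone_else": (979_900, 980_000),  # 98% of all traffic
}


def compliance(good, total):
    """Fraction of requests that met the SLO."""
    return good / total


# Per-segment compliance, then the single blended number a global SLO reports.
by_segment = {name: compliance(g, t) for name, (g, t) in segments.items()}
total_good = sum(g for g, _ in segments.values())
total_all = sum(t for _, t in segments.values())
global_compliance = compliance(total_good, total_all)

print(f"global: {global_compliance:.2%}")
for name, value in by_segment.items():
    print(f"{name}: {value:.2%}")
```

Because the enterprise segment carries only 2% of requests in this example, its 98% compliance barely moves the blended figure, which still rounds to 99.95%.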

The fix requires two strategies used together.

The first is isolating clean signal from planned noise. Maintenance windows (scheduled downtime for deployments or upgrades) routinely eat into error budgets without reflecting any real failure. This creates three distinct problems: alert fatigue sets in as alarms fire for expected downtime; reliability data becomes distorted, making it hard to separate real incidents from planned changes; and, worse still, engineering teams get penalised even when they did everything right. The solution is simple: instruct your observability platform to exclude those windows from SLO calculations entirely. In New Relic, teams can schedule one-time or recurring exclusion windows. This leaves the error budget as a true measure of unplanned incidents only.
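To make the effect of an exclusion window concrete, here is a rough sketch in plain Python. It does not use any New Relic API; the function names, the per-minute sample format, and the dates are all assumptions made for illustration. It compares compliance for one day with and without a planned 30-minute deployment window excluded.

```python
from datetime import datetime, timedelta


def in_any_window(ts, windows):
    """True if the timestamp falls inside any [start, end) exclusion window."""
    return any(start <= ts < end for start, end in windows)


def slo_compliance(samples, windows=()):
    """Share of good samples, ignoring samples inside exclusion windows.

    samples: iterable of (timestamp, ok) pairs, one per minute (hypothetical format).
    Returns None if every sample was excluded.
    """
    kept = [ok for ts, ok in samples if not in_any_window(ts, windows)]
    if not kept:
        return None
    return sum(kept) / len(kept)


# One hypothetical day of per-minute health checks.
day = datetime(2026, 5, 18)
samples = []
for minute in range(24 * 60):
    ts = day + timedelta(minutes=minute)
    planned_down = 120 <= minute < 150  # 02:00-02:30 scheduled deployment
    real_outage = minute == 900         # one unplanned bad minute at 15:00
    samples.append((ts, not (planned_down or real_outage)))

# The maintenance window to exclude from the calculation.
windows = [(day + timedelta(hours=2), day + timedelta(hours=2, minutes=30))]

raw = slo_compliance(samples)
adjusted = slo_compliance(samples, windows)
print(f"raw: {raw:.3%}  with exclusion: {adjusted:.3%}")
```

In this example the raw figure counts 31 bad minutes out of 1,440, while the adjusted figure counts only the single unplanned minute out of the 1,410 that remain, so the error budget reflects the real incident alone.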

The second strategy is breaking down, or faceting, the SLO to reveal what is actually happening beneath the global number. Rather than maintaining dozens of separate SLOs, teams can split a single SLO’s data by attributes already present in their telemetry. Relevant dimensions include infrastructure attributes like awsRegion or kubernetesClusterName, customer attributes like customerTier or subscriptionLevel, and technology attributes like deviceType or appVersion.

Enabling SLO faceting in New Relic immediately produces a compliance and error budget breakdown per segment. A team might discover, for instance, that their us-west-1 region is underperforming. Alternatively, users on the newest app version could be quietly experiencing a far worse service. That granular view makes it possible to fix problems before they grow. Engineering effort flows where it is most needed. Targeted alerts fire only when a specific, high-value segment, such as the Enterprise customer tier, is actually at risk.
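The idea behind faceting can be sketched outside any particular platform. The example below is an illustration only: the event format, the `facet_slo` helper, and the 99.95% target are assumptions, and the attribute name `awsRegion` is borrowed from the list above. It groups events by one attribute, then reports compliance and remaining error budget per segment.

```python
from collections import defaultdict


def facet_slo(events, attribute, target):
    """Break one SLO down by an event attribute.

    events: iterable of dicts with an "ok" flag plus telemetry attributes
            (hypothetical format). target must be below 1.0.
    Returns {facet: {"compliance": ..., "budget_remaining": ...}}.
    """
    counts = defaultdict(lambda: {"good": 0, "total": 0})
    for event in events:
        bucket = counts[event[attribute]]
        bucket["good"] += int(event["ok"])
        bucket["total"] += 1

    breakdown = {}
    for facet, c in counts.items():
        bad = c["total"] - c["good"]
        allowed_bad = (1 - target) * c["total"]  # error budget, in events
        breakdown[facet] = {
            "compliance": c["good"] / c["total"],
            # 1.0 means untouched; a negative value means the budget is overspent.
            "budget_remaining": 1 - bad / allowed_bad,
        }
    return breakdown


TARGET = 0.9995  # assumed SLO target

# Hypothetical telemetry: a large healthy region and a small struggling one.
events = (
    [{"awsRegion": "us-east-1", "ok": True}] * 9_800
    + [{"awsRegion": "us-west-1", "ok": True}] * 196
    + [{"awsRegion": "us-west-1", "ok": False}] * 4
)

global_compliance = sum(e["ok"] for e in events) / len(events)
breakdown = facet_slo(events, "awsRegion", TARGET)
at_risk = sorted(f for f, s in breakdown.items() if s["compliance"] < TARGET)

print(f"global: {global_compliance:.2%}")
print("segments below target:", at_risk)
```

Here the global figure stays above the target while us-west-1 sits at 98% with its budget heavily overspent, which is the kind of segment a targeted alert would key on.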

Used together, the two strategies sharpen the entire reliability management workflow. A team can use the faceted view to spot a struggling region, then apply a maintenance window to deploy a targeted fix, all without burning that region’s remaining error budget unnecessarily.

Verde describes this as what mature SLO management looks like: moving past the false comfort of a single green number toward an honest, actionable view of system performance, one where a green dashboard truly means green for everyone.

Engineering teams ready to improve their approach to SLOs can explore the New Relic Service Level Management documentation and the full New Relic observability platform.