The need for a new, more modern Reliability Stack

Google SRE Book

In this post, we consider why it may be time to abandon the service-level management approach strongly advocated in the Google Site Reliability Engineering book series. But before proceeding, let us consider some excerpts from the O’Reilly book Implementing Service Level Objectives to set the stage for a re-examination in the context of modern environments.

“…a proper level of reliability is the most important operational requirement of a Service…There is no universal answer to the question of where to draw the line between SLI and SLO, but it turns out this doesn’t really matter…SLIs are the single most important part…You may never get to the point of having reasonable SLO targets…The key is to have metrics that enable you to make statements that are either true or not.”

Service Level Indicators

At the bottom of the Google Reliability Stack are service level indicators (SLIs). These are measures of some aspect of service quality, such as availability, latency, or failure rate. As indicated above, the Google Site Reliability Engineering teaching treats them as, or transforms them into, binary values. The service is accessible or not. The service is unacceptably slow or not. The response is good or not. There is no grey area here, because such measures must be turned into a ratio of good events to total events.
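The binary judgment described above can be sketched in a few lines. This is an illustrative example, not the book's implementation; the latency threshold and function names are assumptions.

```python
# Hypothetical sketch: an SLI reduces raw measurements to binary
# good/bad events, then to a ratio of good events to total events.

LATENCY_THRESHOLD_MS = 100.0  # assumed "good" cut-off for this example

def is_good(latency_ms: float) -> bool:
    # Binary judgment: the request was fast enough, or it was not.
    return latency_ms < LATENCY_THRESHOLD_MS

def sli(latencies_ms: list[float]) -> float:
    # The SLI expressed as the ratio of good events to total events.
    if not latencies_ms:
        return 1.0  # no traffic observed, so nothing bad observed
    good = sum(1 for l in latencies_ms if is_good(l))
    return good / len(latencies_ms)

print(sli([12.0, 45.0, 250.0, 80.0]))  # 3 of 4 good -> 0.75
```

Note that all nuance (how slow, how often, for whom) is discarded at this first step; only the ratio survives upwards through the stack.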

Service Level Objectives

Much confusion arises around the next layer in the Reliability Stack, where targets are defined: service level objectives. The confusion occurs when SRE teams build an SLI measure around a goal that you would expect to be defined by an objective. In such cases, the difference between an SLI and an SLO pertains mostly to windowing: some aspect of the ratio distribution is compared with a target value over a specified time interval.
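The SLI/SLO distinction described above amounts to little more than attaching a target and a window to the ratio. A minimal sketch, with an assumed 99% target and 28-day window:

```python
# Hypothetical sketch: an SLO is an SLI ratio compared against a target
# over a compliance window. Names and values here are illustrative.
from dataclasses import dataclass

@dataclass
class SLO:
    target: float      # e.g. 0.99 -> "99% of events are good"
    window_days: int   # compliance period over which events accumulate

    def is_met(self, good_events: int, total_events: int) -> bool:
        # The windowing is implicit: the counts passed in are assumed
        # to have been accumulated over window_days.
        if total_events == 0:
            return True
        return good_events / total_events >= self.target

slo = SLO(target=0.99, window_days=28)
print(slo.is_met(good_events=989_000, total_events=1_000_000))  # False
```

The objective adds no new measurement of its own; it only rewindows and thresholds the indicator beneath it.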

Error Budgets

Layered on top of the SLOs are the error budgets, used to track the difference between the indicator and the objective, again over time, usually at a different resolution suitable for policing and steering change management. Measures on measures on measures, and all quantitative. We know you’re thinking this sounds like a job for machine learning. It is not. We need to keep humans in the loop more than ever, and offloading judgment is not a great strategy. That would be more akin to putting lipstick on a pig. The solution can’t be the problem.
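The "measure on a measure" nature of the budget is easy to see in code. A sketch, assuming a budget expressed as the fraction still unspent over the window:

```python
# Hypothetical sketch of an error budget: the objective implies an
# allowed number of bad events; the budget tracks how much is left.

def error_budget_remaining(target: float, good: int, total: int) -> float:
    """Fraction of the budget still unspent over the window (can go negative)."""
    allowed_bad = (1.0 - target) * total  # budget expressed in events
    actual_bad = total - good
    if allowed_bad == 0:
        return 0.0 if actual_bad == 0 else -1.0
    return 1.0 - actual_bad / allowed_bad

# A 99% target over 1,000,000 requests allows 10,000 bad events;
# 4,000 bad events observed leaves 60% of the budget unspent.
print(error_budget_remaining(0.99, good=996_000, total=1_000_000))  # 0.6
```

Every value here is derived from values two layers down, which is precisely the tight coupling complained about later in this post.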


Failing Fast

Unfortunately, little critical thinking enters engineers’ minds once Google is mentioned as the originator of something, even when it has serious design and operational issues. Google itself is open and honest about this, stating that many organizations adopting a service level objectives approach to reliability will never get beyond defining service level indicators. Even then, service level indicators and objectives are invariably limited to the edge entry points of a service or system of services. Many organizations don’t get off to a good start along this journey because beneath the service level indicator layer lies the messy and increasingly bloated world of metrics, traces, and logs. The notion of a service, and of a consumer of that service, is all but lost in this data fog.

Abandoning Simplicity

Once an organization manages to translate data into information suitable for indicators and objectives, it faces the most significant challenge of this whole model: for every service level indicator there is at least one service level objective, and for every service level objective there is at least one error budget. This is a stack of similarly sized layers. Things get completely out of control, and extremely costly (except maybe for the likes of Google), when multiple service level objectives are defined for a single service level indicator. The O’Reilly book runs to 581 pages, which should be warning enough that simplicity is not to be found here.

Scaling with Spreadsheets

Here is where site reliability engineering becomes a profession of glorified spreadsheet data entry clerks. And because there is so much tight coupling between the layers, any change in one layer ripples through the others, creating an unwieldy maintenance effort. Consolidating and compressing the model to reduce the management burden is complicated by the reliance on quantitative rather than qualitative measures. This does not scale unless you’re Google.

Dysfunctional Doctrine

What is strikingly odd about the whole site reliability engineering effort, in which the human user is placed front and center, is that there is nothing particularly humane about it: Google, and those who have adopted this approach, focus their doctrine exclusively on measuring and monitoring at the edge, and solely on human-to-machine interactions. System engineers need a new model that scales both up and down and can be applied effectively and efficiently both at the edge and within a system of services. Otherwise, there will be one reliability model for this and another model for that. We need to stop inadvertently building prisons of our own making.

Skipping with Signals

To scale, and realistically to extend reliability management outwards to other parties, the cost and complication of the lower layers need to be significantly reduced. The service level management language needs to be radically simplified: instead of talking about objectives in terms of four or five nines, operations’ attention should be on a standard set of signals (outcomes and operations) and an even smaller set of status values. The primary service management task for operations should be configuring the scoring of sequences of signals and status changes between services. Inter- and intra-service communication should relate almost exclusively to a subjective view of meaningful service states. Just that!
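The shape of such a signal-and-status model can be sketched as follows. This is a deliberately simplified, hypothetical illustration: the signal names, scores, and status values are assumptions for the sake of the example, not the OpenSignals API.

```python
# Hypothetical sketch of signal-based status inference: a service emits
# qualitative signals, and a scoring of the recent signal sequence is
# mapped to one of a small set of status values.
from collections import deque

# Illustrative signal weights (not the OpenSignals signal set).
SIGNAL_SCORES = {"SUCCEED": 0, "RETRY": 1, "DELAY": 1, "FAIL": 3}

class ServiceStatus:
    def __init__(self, window: int = 20):
        self.recent = deque(maxlen=window)  # sliding window of signals

    def emit(self, signal: str) -> None:
        self.recent.append(signal)

    def status(self) -> str:
        # Score the recent sequence and map it to a status value.
        score = sum(SIGNAL_SCORES.get(s, 0) for s in self.recent)
        if score == 0:
            return "OK"
        if score < 5:
            return "DEVIATING"
        return "DEGRADED"

svc = ServiceStatus()
for s in ["SUCCEED", "RETRY", "SUCCEED", "FAIL"]:
    svc.emit(s)
print(svc.status())  # score 4 -> "DEVIATING"
```

Note what is absent: no percentile targets, no windows of nines, no budget arithmetic. Configuration reduces to the scoring table and the status thresholds.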

Appendix A: Service Level Objective Examples

Google SRE
99% (averaged over 1 minute) of Get RPC calls will complete in less than 100 ms
90% of Get RPC calls will complete in less than 1 ms
99% of Get RPC calls will complete in less than 10 ms
99.9% of Get RPC calls will complete in less than 100 ms
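Checking the multi-threshold Get RPC objectives listed above already requires code of this shape, one comparison per objective. A sketch, assuming all thresholds apply to the same latency distribution:

```python
# Hypothetical sketch: verifying the three-tier Get RPC latency SLOs
# (90% < 1 ms, 99% < 10 ms, 99.9% < 100 ms) against observed latencies.

def fraction_under(latencies_ms: list[float], threshold_ms: float) -> float:
    # Fraction of calls completing faster than the threshold.
    return sum(1 for l in latencies_ms if l < threshold_ms) / len(latencies_ms)

SLOS = [(0.90, 1.0), (0.99, 10.0), (0.999, 100.0)]  # (target, threshold ms)

def all_met(latencies_ms: list[float]) -> bool:
    return all(fraction_under(latencies_ms, ms) >= target
               for target, ms in SLOS)
```

Three indicators, three objectives, and (once tracked over time) three budgets, for a single RPC method: the multiplication this post warns about.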

OpenSignals

Service A will have a subjective OK status 99% of the time.

OpenSignals has 16 built-in signals used to infer the status of a service, whereas the Google SLO specification above references only one signal. Typically there will be at least three service level indicators, and in turn three objectives and three error budgets.