The Need for a New, More Modern Reliability Stack

In this post, we consider why it may be time to abandon the approach to service level management that has been strongly advocated for in the Google Site Reliability Engineering book series. But before proceeding, let us consider some excerpts from the O’Reilly Implementing Service Level Objectives Book to set the stage for a re-examination in the context of modern environments.

“…a proper level of reliability is the most important operational requirement of a Service…There is no universal answer to the question of where to draw the line between SLI and SLO, but it turns out this doesn’t really matter as long as you’re using the same definitions throughout…SLIs are the single most important part of the Reliability Stack…You may never get to the point of having reasonable SLO targets…The key is to have metrics that enable you to make statements that are either true or not.”

At the bottom of the Google Reliability Stack are service level indicators (SLIs). These are measures of some aspect of service quality, such as availability, latency, or failures. As indicated above, the Google Site Reliability Engineering teaching likes to treat them as, or transform them into, binary values. The service is accessible or not. The service is unsatisfactorily slow or not. The response is good or not. There is no grey area here because such measures must be turned into a ratio of good events to total events. A significant amount of confusion arises at the next layer in the Reliability Stack, where targets are defined: service level objectives (SLOs). This confusion mainly occurs when SRE teams define an SLI as a measure of achieving a goal, which you would expect to be defined by an objective. In such cases, the difference between an SLI and an SLO pertains mostly to windowing: some aspect of the ratio distribution is compared with a target value over a specified time interval. Layered on top of the SLOs are the error budgets, used to track the difference between the indicator and the objective, again over time, and mostly at a different resolution suitable for policing and steering change management. Measures on measures on measures, and all quantitative.
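The three layers described above can be sketched in a few lines. This is only an illustration, assuming the common convention of an SLI as good events over total events; the `Window` type, the function names, and the sample counts are hypothetical and not taken from any Google tooling.

```python
from dataclasses import dataclass


@dataclass
class Window:
    """Event counts observed over one evaluation window (hypothetical type)."""
    good: int
    total: int


def sli(window: Window) -> float:
    """SLI: fraction of good events; an empty window is treated as fully good."""
    return window.good / window.total if window.total else 1.0


def slo_met(window: Window, target: float) -> bool:
    """SLO: the SLI compared against a target over the same window."""
    return sli(window) >= target


def error_budget_remaining(window: Window, target: float) -> float:
    """Fraction of the allowed bad events still unspent (negative if overspent)."""
    allowed_bad = (1.0 - target) * window.total
    actual_bad = window.total - window.good
    if allowed_bad == 0:
        return 1.0 if actual_bad == 0 else float("-inf")
    return 1.0 - actual_bad / allowed_bad


# Made-up counts: 50 bad events out of 10,000 against a 99% target.
window = Window(good=9_950, total=10_000)
print(sli(window))
print(slo_met(window, target=0.99))
print(error_budget_remaining(window, target=0.99))
```

Each function is a measure derived from the one before it, which is the "measures on measures on measures" layering in miniature.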

It is unfortunate that so little, if any, critical thinking enters engineers’ minds once Google is mentioned as the originator of something, because there are some serious design and operational issues here. Google itself is open and honest about this, stating that many organizations adopting a service level objectives approach to reliability will not even get beyond defining service level indicators. Even then, service level indicators and objectives are invariably limited to edge entry points to a service or system of services. Many organizations don’t get off to a good start along this journey because beneath the service level indicator layer is the messy and increasingly bloated world of metrics, traces, and logs. The notion of a Service and a Consumer of such is all but lost amidst this data fog.

Once an organization manages to get past translating data into information suitable for indicators and objectives, it faces the most significant challenge to this whole model: for every service level indicator, there is at least one service level objective, and for every service level objective, there is at least one error budget. This is a stack of similarly sized layers. Things get completely out of control and extremely costly (except maybe for the likes of Google) when multiple service level objectives are defined for a service level indicator. Here is where site reliability engineering becomes a profession of glorified spreadsheet data entry clerks. And because there is so much tight coupling between each layer, any change in one layer ripples throughout, creating an unwieldy maintenance effort. Consolidating and compressing the model in the form of systems, to reduce the management burden, is made complicated by the reliance on quantitative rather than qualitative measures. This just does not scale unless you’re Google. There are 581 pages in the O’Reilly book; that should be a sufficient warning in itself that simplicity is not to be found here.

What is strikingly odd about the whole site reliability engineering effort, where the human user is placed front and center, is that, on the whole, there is nothing particularly humane about it, except that Google and those who have adopted this approach focus exclusively in their doctrine on measuring and monitoring primarily at the edge and solely concerning human-to-machine interactions. System engineers need a new model that scales both up and down and can be applied effectively and efficiently at the edge and within a system of services. Otherwise, there will be a reliability model for this and another model for that. To realistically scale, the lower layers’ cost and complications need to be significantly reduced, moving up and outwards to other parties. The service level management language needs to be significantly simplified; instead of talking about objectives in terms of four or five nines, operations’ attention should be on a standard set of signals (outcomes and operations) and an even smaller set of status values. The primary service management task for operations should be configuring the scoring of sequences of signals and status changes between services. Inter- and intra-service communication should relate almost exclusively to the subjective view of meaningful service states. Just that!

Appendix A: Service Level Objective Examples

Google SRE
99% (averaged over 1 minute) of Get RPC calls will complete in less than 100 ms (measured across all the backend servers)

Google SRE
90% of Get RPC calls will complete in less than 1 ms
99% of Get RPC calls will complete in less than 10 ms
99.9% of Get RPC calls will complete in less than 100 ms
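
As a rough illustration of what the three-tier objective above implies operationally, the sketch below checks one latency indicator against all three targets. The function names and the sample latencies are made up for the example; only the thresholds and percentages come from the objective itself.

```python
# (required fraction of calls, latency threshold in ms), from the objective above.
TARGETS = [(0.90, 1.0), (0.99, 10.0), (0.999, 100.0)]


def fraction_within(latencies_ms, threshold_ms):
    """Fraction of calls that completed in strictly less than threshold_ms."""
    fast = sum(1 for latency in latencies_ms if latency < threshold_ms)
    return fast / len(latencies_ms)


def evaluate(latencies_ms, targets=TARGETS):
    """Map each latency threshold to whether its objective was met."""
    return {
        threshold_ms: fraction_within(latencies_ms, threshold_ms) >= required
        for required, threshold_ms in targets
    }


# Made-up sample of 1,000 Get RPC latencies, in milliseconds.
sample = [0.5] * 950 + [5.0] * 45 + [50.0] * 5
for threshold_ms, met in evaluate(sample).items():
    print(f"under {threshold_ms} ms: {'met' if met else 'missed'}")
```

One indicator already yields three pass/fail results here, each of which would carry its own error budget.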

OpenSignals
Service A will have a subjective OK status 99% of the time

OpenSignals has 16 built-in signals used to infer the status of a service, whereas, in the Google SLO specification above, there is only one signal referenced. Typically, there will be at least three service level indicators, and in turn, three objectives and error budgets.
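
For comparison, a status-based objective such as "OK 99% of the time" can be evaluated from nothing more than a timeline of status changes. The sketch below is a hypothetical illustration and does not use the OpenSignals API; the status labels, the function name, and the timeline are assumptions made for the example.

```python
def ok_fraction(changes, window_end, ok_status="OK"):
    """Fraction of the window spent in ok_status.

    changes: time-ordered list of (timestamp_seconds, status) pairs, where
    each status holds until the next change (or until window_end).
    """
    window_start = changes[0][0]
    # Pair each change with the start of the next one to get its duration.
    boundaries = changes[1:] + [(window_end, None)]
    ok_seconds = sum(
        next_start - start
        for (start, status), (next_start, _) in zip(changes, boundaries)
        if status == ok_status
    )
    return ok_seconds / (window_end - window_start)


# Made-up one-hour timeline with a 30-second excursion away from OK.
timeline = [(0, "OK"), (3000, "DEGRADED"), (3030, "OK")]
fraction = ok_fraction(timeline, window_end=3600)
print(fraction >= 0.99)
```

The objective is expressed against a single status value, however many signals feed the status inference underneath it.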