Author: William Louth

Observability: The OODA Loop

Today, companies face mounting pressure to demonstrate both speed and agility in an ever-changing and increasingly competitive environment. Information technology is seen as a critical enabler in adjusting to market shifts, threats, and rising customer expectations while evolving and improving the services on offer. For an organization to improve its capability to change, the underlying systems supporting it must also change, and generally at a much faster rate, connecting past, present, and predicted future coherently and with some degree of continuity. While the business focuses on charting a course from one change point to another on a timeline of services and market evolution, the computing infrastructure and its hapless engineering teams must deal with the only thing worse than change itself: the transition period between each of these points, moving from a discrete view of the world to one that is continuous and complicated. Not to worry, engineers believe they have everything pretty much figured out – make smaller changes but faster, much like a fighter pilot using the OODA loop to get inside the loop of the enemy. Here the opponent is the business, and engineering is proactively maneuvering, anticipating re-orientations, and dictating the engagement. That is, until the question is raised – how does one observe and orient with an ever-growing pile of low-level data hooked up to a bunch of dashboards?

Complexity and Collapse

The OODA loop places emphasis on two critical factors within an environment – time constraints and information uncertainty. The time factor is addressed by executing through the loop as fast as possible. Information uncertainty is tackled by acting accurately. The typical presentation of the model is popular because it closes the loop between sensing (observe and orient) and acting (decide and act). In the Observe phase, the focus is on data acquisition and information synthesis about the environment and the unfolding situation and interactions. The goal of the Orient phase, which follows the Observe phase, is to make sense of the collected observations from an operational viewpoint. This understanding of the situation, and of the potential scenarios that may follow on from this point in time, is highly dependent on the level of expertise and experience of the observers – situation assessors and decision-makers. The next step in the process is the Decide phase, in which information fed from the Orient phase determines the appropriate action(s). Finally, the Act phase is where the course of action decided upon earlier is implemented. The cycle repeats with further observations.
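
For the programmatically minded, the cycle can be caricatured as a simple control loop. The sketch below is purely illustrative – the four phase names are the only thing taken from the model; every type, name, and rule in it is a placeholder.

    // A minimal, hypothetical sketch of the OODA cycle as a control loop.
    // Only the four phase names come from the model; the types are placeholders.
    import java.util.List;

    interface Environment { List<String> events(); void apply(String action); }

    final class OodaLoop {
      private final Environment env;
      OodaLoop(Environment env) { this.env = env; }

      void cycle() {
        List<String> observations = observe();        // Observe: acquire raw data
        String assessment = orient(observations);     // Orient: make sense of it
        String action = decide(assessment);           // Decide: choose a course of action
        act(action);                                  // Act: implement it, then repeat
      }

      private List<String> observe()           { return env.events(); }
      private String orient(List<String> obs)  { return obs.isEmpty() ? "nominal" : "degraded"; }
      private String decide(String assessment) { return assessment.equals("degraded") ? "intervene" : "continue"; }
      private void act(String action)          { env.apply(action); }
    }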

However, there are problems with the OODA model. It does not detail how later phases steer and influence – more specifically, self-regulate – earlier phases, and vice versa; invariably, it is seen and described as sequential, without the ability to exit prematurely and then re-enter. It omits attention, memory, and the cognitive representation of world states and models. It also lacks any deliberate planning and learning phases. The OODA model is broad in its description of the decision-making process and, other than listing some of the factors pertinent to the Orient phase, it offers very little in the way of how to implement it. Mindless looping.

Seeing above the Data Clouds

The OODA model’s biggest issue is that it does not capture the encompassing goal and objectives, making the loop reactive rather than proactive. The model appeals to one of the worst trends within software engineering and service operations – big data addiction. Here, effective operations management, and the decision-making it entails, is seen as merely a problem of insufficient data collection and information construction. Unfortunately, expanding the capacity to transmit more and more data to the cloud has not improved situation awareness; in fact, it seems to have made it more difficult, if not impossible. It’s not just simple; it’s simplistic.

OODA and Big Data do not reflect much of how human perception and cognition work to direct attention and interpret sensory signals. OODA, and much of the work currently ongoing in the Observability space, incorrectly assumes that engineering is principally passively reacting to environment-sourced events – nowhere is this better exemplified than in the design of data-laden monitoring dashboards.

Successful service operations and management in dynamic environments enclosing highly complex systems depends on maintaining a focus on clear goals at various levels of composition and on planning how to achieve and maintain them. To some degree, OODA and Big Data approach human-and-machine cognition with a simplistic, mechanistic, and data-centric viewpoint, completely ignorant of situations and scenes, intentions and inferences, signals and sequences, services and states – devoid of patterns and models that could help to more effectively direct both human and computer attention in assessing current conditions, predicting future events, and tracking the results of scripted and curated system interventions. Data, data everywhere, and not a situation to be recognized.

Head down in the Data

Let’s stop for a moment and ask ourselves a question – and I am referring to our collective here. How does one even begin to imagine that a site reliability engineer (SRE) can go from hundreds, if not thousands, of distinct trace paths, metrics, and log patterns to a formulation of the current situation that can be compared with past prototypical, patterned situations? And if you are naive enough to believe that machine learning will solve this issue, then ask yourself how anyone, human or machine, can honestly and with some degree of certainty predict from such contextless data the transitions between situations and explain this to an engineer in a communicable form. Cognition is situated, and yet the industry keeps offering up solutions, with even bigger problems, that further disconnect us from the situation – beyond the fact that we’re slowly sinking in a quicksand pit of data, making it near impossible to act.

From Data to Dashboard – An Observability Anti-Pattern

Numerous initiatives around Observability, sometimes referred to as Visibility in the business domain, fail to meet expectations because engineers naively expect that once data is being collected, all that needs to be done is to put up a dashboard before sitting back to stare blankly at large monitoring screens, hoping for a signal to magically emerge from the rendered pixels. This is particularly so when users blindly adopt the Grafana and Prometheus projects, where data and charts have replaced, or circumvented, genuine understanding through patterns, structures, and models. This is an anti-pattern that seems to repeat consistently at organizations with insufficient expertise and experience in system dynamics, situation awareness, and resilience engineering. Once the first data-laden dashboard is rolled out to management for prominent display within an office, it seems the work is all but done, other than creating hundreds more derivatives of the same ineffective effort. Little regard is ever again given to the system, its dynamics, and the situations arising. Many projects fail in thinking, more so in acting, that they can leap from data to dashboard in one jump.

This is not helped by many niche vendors talking up “unknown unknowns” and “deep systems,” which is more akin to giving someone standing on the tip of an iceberg a shovel and asking them to dig away at the surface. There is nothing profound or fulfilling to be found in doing so, other than discovering detail after detail and never seeing the big picture of the system moving and changing below the surface – a surface of visibility built from event capture that is not guided by knowledge or wisdom. The industry has gone from being dominated by a game of blame to one of fear, which shuts off all (re)consideration of effectiveness.

Seeing is not Understanding

I suspect much of the continued failings in the Observability space centers on the customary referencing, and somewhat confused understanding, of the Knowledge (DIKW) Hierarchy. Many “next-generation” application performance monitoring (and observability) product pitches or roadmaps roll out the familiar pyramid graphic, explaining how they will first collect all this data, lots of it from numerous sources, and then whittle it down to knowledge over the course of the company’s remaining evolution and product development.

What invariably happens is that the engineering teams get swamped by maintenance effort around data and pipelines and the never-ceasing battle to keep instrumentation kits and extensions up-to-date with changes in platforms, frameworks, and libraries. By the time some small window of stability opens up, the team has lost sight of the higher purpose and bigger picture. In a moment of panic, the team slaps on a dashboard and advanced query capabilities in a declaration of defeat, delegating the effort to users. Naturally, this defeat is marketed as a win for users. This sad state of affairs comes about from seeing the hierarchy as a one-way ladder of understanding: from data, information will emerge; from information, knowledge will emerge; and so on. Instead of aiming for vision, all too often it is data straight to visualizations. The confusion is in thinking this is a bottom-up approach, whereas the layers above steer, condition, and constrain the layers below by way of a continuous, adaptive, and transforming process. Each layer frames the operational context of the lower layers – directly and indirectly. A vision for an “intelligent” solution comes from values and beliefs; this then contextualizes wisdom and, in turn, defines the goals that frame knowledge exploration and acquisition.

The Knowledge Hierarchy

For knowledge to spring forth from information, various (mental) models must be selected – a selection aligned to the overarching goals. It is here that I firmly believe we have lost our way as an engineering profession. Our models, if we can call them that, are too far removed from purpose, goal, and context. We have confused a data storage model of trace trees, metrics, log records, and events with a model of understanding. In the context of Observability, an example of a goal in deriving wisdom would be to obtain intelligent, near-real-time situation awareness over a large, connected, complex, and continually changing landscape of distributed services. Here, understanding via a situation model must be compatible with, and conducive to, cooperative work performed by both machines and humans. Ask any vendor to demonstrate the representation of a situation, and all you will get is a dashboard with various jagged lines automatically scrolling. Nowhere to be found are signals and states, essential components of a past, present, and unfolding situation.

There is never knowledge without a model acting as a lens and a filter, an augmentation of our senses and reasoning, defining what is of importance – the utility and relevance of information in context. There is never information without rules, shaped by knowledge, extracting, collecting, and categorizing data. Data and information are not surrogates for a model. Likewise, a model is not a dashboard, built lazily and naively on top of a lake of data and information. A dashboard, and the many metrics, traces, and logs that come with it, is not what constitutes a situation. A situation is formed and shaped by changing signals and states of structures and processes within an environment of nested contexts (observation points of assessment) – past, present, and predicted.

Models are critical when it comes to grasping at understanding in a world of increasing complexity. A model is a compact and abstract representation of a system under observation and control that facilitates conceptualization of, and communication about, its structure and, more importantly, its dynamics. Modeling is a simplification process that helps focus attention on what is of significance for higher-level reasoning, problem-solving, and prediction. Suitable models (of representation in structure and form) are designed and developed through abstraction and the need to view a system from multiple perspectives without creating a communication disconnect for all involved. Coherence is an essential characteristic of a model, as are conciseness and context. Unfortunately, introducing a model is not always as easy a task as it might look on paper if the abstraction does not pay off in terms of significant simplification and a shift in focus to higher levels of value. For example, Instana, where I was recently a Complexity Scientist, had some trouble convincing many of those coming from an OpenTelemetry background that its abstraction of a Call over a Span served a useful purpose.

This mismatch between what a developer conceptualizes at the level of instrumentation and what is presented within the tooling, visualizations, and interfaces is seen as an inconvenience – an inconvenient truth stemming from an industry that does far more selling of meme-like nonsense and yesteryear thinking and tooling than educating in theory and practice. A focus on systems and dynamics needs to win over data and details if we are to get back to designing and building agile, adaptive, and reliable enterprise systems.

Perception and Representation

In my current position at PostNL, I’m helping to design and develop a Control Tower for an ambitious Digital Supply Chain Platform. Depending on the domain and perspective taken, there are different possible models – objects (parcels), processes (sorting), and flows (transport). Still, it all comes down to transforming data at the sensory measurement level up into structures and then behavior, along with affordances, at increasing levels of abstraction, compression, and comprehension. While a Control Tower could readily track every small detail of a parcel’s movement, it would not thereby effectively and efficiently understand the dynamics that emerge at the system level across many cooperating agents within such a highly interconnected network, where promises related to resource exchanges are made, monitored, and adjusted in the event of disruptions. One agent’s (human or machine) model is another one’s raw data.

Aside: While it is hard not to see the importance of parcel tracking within a supply chain, at least from a customer perspective, I have yet to have someone offer up a valid justification for distributed tracing in its current form over the approach OpenSignals takes.

Hierarchies of Attention and Abstraction in Observability

When designing observability and controllability interfaces for systems of services, or any system for that matter, it is essential to consider how the interface connects the operator to the operational domain in terms of information content, structure, and (visual) form. What representation is most effective in immediately grounding an operator within a situation? How efficiently and accurately can the information be communicated, playing to perceptual and cognitive strengths? Here is where an abstraction and attention hierarchy can significantly facilitate the monitoring and management of a complex system, chiefly where situational awareness, an understanding of dynamics and constraints, and the ability to predict, to some small degree, into the future are of critical importance.

While an abstraction hierarchy consists of multiple levels (or layers), each level still relates to the same operational domain, much like situation awareness or service level management. At each level, a new model of terms, concepts, and relations is introduced. An observer of such a hierarchically organized system can decide which level of information and interaction is most suitable for the task (or goal) at hand, based on their level of expertise and knowledge of the system. Levels in the abstraction hierarchy regulate lower levels by way of constraints and conditions, whereas lower levels stimulate state changes in higher levels. In crossing the layered boundaries, an operator increases their understanding of the system. Moving from bottom to top, the operator obtains a greater awareness of what is significant in terms of the system’s functioning and goals. Moving from top to bottom, a more detailed explanation and exploration is provided of how a system brings about the higher levels’ goals and plans – dynamics surface at the top and data at the bottom. It is important that there is some degree of traceability and relatability in moving in either direction.

From Stimulus to Synthesis

In thinking about an abstraction level, several internal processes and functions come to mind that occur as information flows upwards from a lower level, passing through before being translated and transformed into something more significant and relevant at the level above. It starts with some stimulus deemed a happening, an event, originating within either the environment or a lower level if one is present. A recurring stimulus becomes a signal if it pertains to the model and has benefits in terms of constraint, correction, and conditioning. Then, depending on sensitivities, much of the noise is eliminated to prioritize what is significant, or signified, by signaling. Following on from significance, a process of sequence recording, recall, and recognition results in the synthesis of meaningful patterns. It starts with measurement, passes through the model, and ends with the memory of such. Finally, the new information produced is propagated upwards to the next level of abstraction in the hierarchy, typically in a far more condensed form that gives utmost attention to changes in patterned or predicted behavior relevant to the means employed in achieving the higher-level goal(s).
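
A rough sketch of that flow within a single level might look like the following. None of the names or rules here come from any real library; the filtering, sequencing, and synthesis rules are stand-ins chosen only to make the stages concrete.

    // Hypothetical sketch of the stimulus-to-synthesis flow within one abstraction level.
    // All names and rules are illustrative; nothing here is a real library API.
    import java.util.ArrayDeque;
    import java.util.Deque;
    import java.util.Optional;

    final class Level {
      private final Deque<String> memory = new ArrayDeque<>();   // memory of recent signals

      // A stimulus only becomes a signal if it pertains to the model (noise is filtered out).
      Optional<String> signify(String stimulus) {
        return stimulus.startsWith("svc:") ? Optional.of(stimulus) : Optional.empty();
      }

      // Recurring signals are recorded as a bounded sequence for later recognition.
      void sequence(String signal) {
        memory.addLast(signal);
        if (memory.size() > 16) memory.removeFirst();
      }

      // Synthesis condenses the sequence into something more significant for the level above.
      String synthesize() {
        long failures = memory.stream().filter(s -> s.endsWith("FAIL")).count();
        return failures > memory.size() / 2 ? "DEGRADED" : "OK";
      }
    }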

With OpenSignals for Services, there are a minimum of three levels of abstraction to the hierarchy, leaving out service level management, which can be layered on top. The first level of abstraction pertains to the machine world’s spatial and (meta) state aspects, where a stimulus is sensed – the referent layer. Here, services (or endpoints) within a distributed system of executing processes are mapped to a model consisting of Context, Environment, Service, and Name elements. The next level above, the signaling layer, is where executing instrumented code is transcribed into a series of Signal firings on references within the level below. In this second level, the complicated nature of code execution across service boundaries, entailing failure, retry, and fallback mechanisms, is translated into a standard set of tokens. The small set of signals is akin to a RISC (reduced instruction set computer) design, except that instead of a computer execution architecture it pertains to service-to-service communication across machine and process boundaries. Depending on the Service Provider implementation (SPI), the signals will be scored based on sensitivity and sequencing settings, which can in part be driven by the current state in the level above, and a status inferred for each signaling Service registered within a Context. For recording and diagnostic purposes, the signaling level also allows for the registration of a Subscriber and, in turn, a Callback.
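
The sketch below is not the OpenSignals API itself – it is a hypothetical rendering of the referent and signaling levels, using only the element names mentioned above (Context, Service, Signal, Subscriber/Callback) with invented types and an illustrative subset of signal tokens.

    // Not the actual OpenSignals API – a hypothetical sketch of the referent and signaling
    // levels described above. The signal tokens are an illustrative subset only.
    import java.util.ArrayList;
    import java.util.HashMap;
    import java.util.List;
    import java.util.Map;

    enum Signal { START, STOP, CALL, SUCCEED, FAIL, RETRY }

    interface Callback { void onSignal(String service, Signal signal); }

    final class Context {
      private final Map<String, List<Signal>> services = new HashMap<>();   // referent level: named services
      private final List<Callback> subscribers = new ArrayList<>();         // signaling level: subscriptions

      void register(String name)        { services.putIfAbsent(name, new ArrayList<>()); }
      void subscribe(Callback callback) { subscribers.add(callback); }

      // Instrumented code fires signals against a registered service reference.
      void signal(String name, Signal signal) {
        services.computeIfAbsent(name, k -> new ArrayList<>()).add(signal);
        subscribers.forEach(cb -> cb.onSignal(name, signal));
      }
    }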

Scaling to Perspectives

Finally, in the last of the minimum levels, the inference layer, the model is attuned to tracking changes in the Status value of registered and signaling services. Much like the signaling layer below, this level allows for event Subscription management. As an observer moves up the layers in the abstraction and attention hierarchy, the frequency of event emission is reduced significantly, and the meaning and relevance of each element and possibility in the model relates more closely to the overall goal. In this top level, observation, analysis, and (indirect) controllability of the model center on the switches in the transitions between Status values, the sequencing and scoring of such over periods, and the possibility of predicting near-future states of the system. The scale of analysis also changes, from the signaling and operational status inference of a single service to how changes in the status of a system of (interconnected) services cascade and ripple in unexpected ways through the network of contexts – both locally and globally. It would be near impossible to do so at the signaling level, for a human at least. At all levels of the hierarchy, the same processes are involved – seeing, sensing, signifying, sequencing, scaling, and synthesizing higher-order information for representation, reduction, recognition, recording, recollection, reasoning, retrospection, reinforcement, regulation, and rectification.
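
Continuing the previous sketch (and reusing its Signal type), a hypothetical inference level might score recent signals into a Status as follows; the status names, weights, and thresholds are invented for illustration and are not the OpenSignals specification.

    // Hypothetical sketch of the inference level, reusing the Signal type from the previous
    // sketch. Status names, weights, and thresholds are illustrative only.
    import java.util.List;

    enum Status { OK, DEGRADED, DOWN }

    final class StatusInference {
      // A simple decaying score: recent FAIL signals weigh more than older ones.
      static Status infer(List<Signal> recent) {
        double score = 0.0;
        double weight = 1.0;
        for (int i = recent.size() - 1; i >= 0; i--, weight *= 0.9) {
          if (recent.get(i) == Signal.FAIL) score += weight;
        }
        if (score > 3.0) return Status.DOWN;
        if (score > 1.0) return Status.DEGRADED;
        return Status.OK;
      }
    }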

Observability via (Verbal) Protocol Analysis

I’m always on the lookout for new ways to explain and relate to the design of the OpenSignals conceptual model of signals and states. So it was a pleasant surprise to stumble across (Verbal) Protocol Analysis during a recent certification in Design Thinking and some readings in situational awareness. VPA is a technique that has been used by researchers across many domains, including psychology, engineering, and architecture. The basic idea is that during a task, such as solving a problem, a subject will concurrently verbalize – think aloud – what is resident in their working memory: what they are thinking during the doing. Using Protocol Analysis, researchers can elicit the cognitive processes employed by a subject from the start to the completion of a task. After further processing, the information captured is then analyzed to provide insights that can, possibly, be used to improve performance.

An advantage of verbal protocol analysis over other cognitive investigation techniques is the richness of the data recorded. Unfortunately, this richness, unstructured and diverse in expression, can quickly become voluminous, requiring post-processing such as transcription and coding before being analyzed. Sound familiar? Yes, it is the same issue site reliability engineering (SRE) teams face when their primary data sources for monitoring and observability are event logging and its sibling, distributed tracing.

From Record to Analysis

The basic steps of Protocol Analysis are (1) record the verbalization, (2) transcribe the recording, (3) segment the transcription, (4) aggregate the segments into episodes, (5) encode the episodes, and finally (6) analyze the code sequencing patterns. During the transcription step, researchers interpret the recording in terms of a glossary of domain-relevant terms. The segmentation step aims to break the verbalization into text units, segments, where a segment expresses one idea or action statement. In the aggregation step, some segments are collapsed and combined into episodes to make further coding and data analysis more straightforward, especially when the recording volume is sufficiently large to require sampling to reduce human effort and cost. The most crucial step in this process, the one that dictates the success of the analysis, is the coding of statements. The coding scheme, where statements are mapped to processes of interest, is driven by the researchers’ question or the goal being pursued. In this regard, a coding scheme needs to be effective and reliable in translation and to express the aspects of concern for the investigation. Typically, a small fixed set of concept variables is encoded for each statement, with each variable having a predefined set of possible codes.

In the case of an investigation into how designers think, the variables might be "design step", "knowledge", "activity", and "object". A more abstract variable set would be "subject", "predicate", and "object". A coding scheme is reliable when ambiguity is kept to a minimum in taking a statement or event in the real world and mapping it to the appropriate code, across the different persons tasked with the coding.

A scheme is effective when the coding is focused on the proper aspects of the domain and at the right level of granularity to answer questions via sequencing patterns. In the last step, analysis, researchers perform script analysis, sometimes introducing further higher-level process groupings and categorizations that can then be sequenced and analyzed – a scaling up. An example of such scaling would be OpenSignals’ service status inference from signaling.
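
As a toy illustration of the encode step, the sketch below maps transcribed segments onto the abstract "subject", "predicate", "object" variable set mentioned above; the splitting rule is deliberately naive and exists only to show the shape of a coding scheme.

    // A toy sketch of the encode step: mapping transcribed segments onto the abstract
    // "subject", "predicate", "object" variable set. The splitting rule is deliberately
    // naive and only shows the shape of a coding scheme.
    import java.util.List;
    import java.util.stream.Collectors;

    record Code(String subject, String predicate, String object) {}

    final class Coder {
      static Code encode(String segment) {
        String[] words = segment.trim().split("\\s+");
        String subject   = words.length > 0 ? words[0] : "";
        String predicate = words.length > 1 ? words[1] : "";
        String object    = words.length > 2 ? words[words.length - 1] : "";
        return new Code(subject, predicate, object);
      }

      static List<Code> encodeAll(List<String> segments) {
        return segments.stream().map(Coder::encode).collect(Collectors.toList());
      }
    }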

If you have ever spent time working with vast amounts of logs, metrics, and distributed tracing data, you will immediately recognize some of the above steps in turning recordings into something reliable and effective for monitoring and managing applications and systems of services. These days, most site reliability engineers get stuck in the transcription phase, trying to bring some uniformity and meaning to the many different machine utterances, especially in logs and events. I’ve witnessed many an organization start an elaborate and ambitious initiative to remap all log records into something more relatable to service level management or situational awareness via various record-level rules and pattern matches, only to abandon the initiative when the true scale of the problem is recognized, along with the human effort involved not only to define but to maintain such things. These tasks only ever look good in vendor demonstrations, never reflecting the rate of change that all software is undergoing at present and into the future. You might ask how Protocol Analysis in practice has attempted to optimize the steps before coding. Well, by bringing the coding itself forward to some degree, in having transcribers already familiar with the coding scheme beforehand.

It should be noted that for many doing VPA in new domains, the coding scheme is defined much later in the process. Fortunately, in the Observability space of software systems, we are dealing with machines as opposed to humans, so it is far easier to introduce appropriate codes early into the coding process. That is precisely what OpenSignals offers – a fixed set of variables in the form of "service", "orientation", and "phenomenon", and a set of predefined codes for orientation and phenomenon (signals and status).
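
Expressed as a coding scheme, that fixed variable set might be sketched as follows; the orientation codes are the two named later in this piece, while the phenomenon values shown are an illustrative subset rather than the full predefined vocabulary of signals and status.

    // Hypothetical sketch of an OpenSignals-style coding scheme: a fixed variable set of
    // "service", "orientation", and "phenomenon". The phenomenon values are an
    // illustrative subset, not the full predefined vocabulary.
    enum Orientation { EMIT, RECEIPT }
    enum Phenomenon { START, STOP, CALL, SUCCEED, FAIL, RETRY }

    record ServiceEvent(String service, Orientation orientation, Phenomenon phenomenon) {}

    // e.g. a single coded "utterance" from an instrumented service (names are hypothetical):
    //   new ServiceEvent("payments", Orientation.EMIT, Phenomenon.FAIL)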

From Smart to Simple

OpenSignals for Services is a template for a protocol analysis model and coding scheme for understanding and reasoning about the processing and performance of microservices involved in the coordination and cooperation of distributed work.

It is time for software services to think aloud with OpenSignals and to abandon sending meaningless blobs of data to massive event data black holes in the cloud. It is time to standardize on a model that serves site reliability engineering and not some manufactured data addiction. Let’s have both machines and humans communicate in terms of service, signal, and status.

If you are interested in where we go from here, having managed to spend more time in analysis, then do yourself a favor and read Mark Burgess’s recent research paper – The Semantic Spacetime Hypothesis: A Guide to the Semantic Spacetime Project Work.

Orientating Observability

The orientation concept in OpenSignals seems to be somewhat novel, or at least not so familiar, amongst previous attempts at signaling. The reason for orientation, which has two values, EMIT and RECEIPT, is very straightforward once we take into account that there are two perspectives to any interaction or communication. Let’s imagine a signaling model that consists of three signals – CALL, TALK, and END. Below are two transcripts of the signals recorded at either end of a phone conversation between Alice and Bob. Before proceeding, it is important to note that both parties, Alice and Bob, are modeled as sources of signals within each context at either end of the communication channel. Bob has a model that includes Alice as well as Bob himself. Alice does likewise within her context.

Simple Signalling

Can you tell who initiated the call? We could probably infer who ended the call, maybe Bob, but not the initiator. Maybe if we had some global clock we could order events across both contexts, but then we run into all sorts of other problems related to clock synchronization, or we end up needing an omnipresent observer. An easy fix to the language would be to introduce a new signal named CALLED; we could do likewise with TALKED and ENDED. But we would then have doubled our small set of signals and somewhat broken the link between a signal emitted on one side and how it is received and interpreted on the other. With the concept of orientation, we don’t need to embed more meaning into a signal, a mere token, than is required.

When Alice decides to call Bob, she will record within her context the event – Bob EMIT CALL. This might initially look strange because Bob is not present here, and yet we see him emitting a signal. But EMIT here pertains to the locality of the operation, CALL, and not to the subject, Bob. Bob’s calling is happening within the context that Alice maintains of her social network and the status she infers for the members within it, including herself. When Bob picks up Alice’s call, within his context he will record – Alice RECEIPT CALL. Because the call originated outside of his context, it is a past signal that he has now received – RECEIPT always deals with the perception of something that has occurred in the past and elsewhere in terms of the context maintained.

The trick to understanding and reasoning about signal orientation is to separate the act of picking up the phone once it has rung, which Bob does, from the actual signal being communicated by the action. Signals are not trying to be a comprehensive representation of physical movements or events but of the underlying meanings behind them, or what transpires in doing so. Bob is a recipient of something that was previously recorded elsewhere. Try considering signaling to operate somewhat like a double-entry bookkeeping system: an EMIT on one side will, if the signal is delivered by whatever means, result in a corresponding RECEIPT on the other. Note: if Bob did not pick up, there would be no corresponding reverse entry.

Signalling with Orientation
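
To make the double-entry analogy concrete, here is a hypothetical sketch of the two transcripts with orientation recorded; only the Bob EMIT CALL and Alice RECEIPT CALL entries are taken from the description above, and the surrounding types and names are invented.

    // Hypothetical sketch of the Alice/Bob transcripts with orientation. Each party keeps
    // its own context; an EMIT on one side pairs with a RECEIPT on the other, if delivered.
    import java.util.ArrayList;
    import java.util.List;

    record Entry(String subject, String orientation, String signal) {}

    final class PhoneContext {
      final List<Entry> transcript = new ArrayList<>();
      void log(String subject, String orientation, String signal) {
        transcript.add(new Entry(subject, orientation, signal));
      }
    }

    final class TranscriptExample {
      public static void main(String[] args) {
        PhoneContext alice = new PhoneContext();
        PhoneContext bob   = new PhoneContext();

        alice.log("Bob", "EMIT", "CALL");      // Alice initiates; the operation is local to her context
        bob.log("Alice", "RECEIPT", "CALL");   // Bob perceives a call that originated elsewhere

        // Further TALK and END entries would follow the same double-entry pattern.
        // The initiator is now inferable from either transcript alone: the CALL entry
        // is an EMIT in Alice's context and a RECEIPT in Bob's.
        System.out.println("Alice: " + alice.transcript);
        System.out.println("Bob:   " + bob.transcript);
      }
    }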

Streamlining Observability Data Pipelines

The first generation of Observability instrumentation libraries, toolkits, and agents has taken the big data pipelines of both application performance monitoring (traces, metrics) and event collection (logs) and combined them into one enormously big pipe to the cloud.


Scaling Observability via Attention and Awareness

OpenSignals tackles the scaling of observability on many fronts – at the machine and human interfaces and along the data pipeline, starting at the source. Before discussing how OpenSignals addresses situational awareness in the context of a system of (micro)services (deferred to a future post), it is crucial to consider the foundational model elements independently of a domain.


The need for a new, more modern Reliability Stack

In this post, we consider why it may be time to abandon the approach to service level management that has been strongly advocated for in the Google Site Reliability Engineering book series. But before proceeding, let us consider some excerpts from the O’Reilly Implementing Service Level Objectives Book to set the stage for a re-examination in the context of modern environments.

“…a proper level of reliability is the most important operational requirement of a Service…There is no universal answer to the question of where to draw the line between SLI and SLO, but it turns out this doesn’t really matter as long as you’re using the same definitions throughout…SLIs are the single most important part of the Reliability Stack…You may never get to the point of having reasonable SLO targets…The key is to have metrics that enable you to make statements that are either true or not.”

The Old Way

At the bottom of the Google Reliability Stack are service level indicators (SLIs). These are measures of some aspect of service quality, such as availability, latency, or failures. As indicated above, the Google Site Reliability Engineering teaching likes to treat them as, or transform them into, binary values. The service is accessible or not. The service is unsatisfactorily slow or not. The response is good or not. There is no grey area here because of the need to turn such measures into a ratio of bad events to good events. A significant amount of confusion arises when it comes to the next layer in the Reliability Stack, where targets are defined – Service Level Objectives (SLOs). This confusion mainly occurs when SRE teams make an SLI a measure of achieving a goal, which you would expect to be defined by an objective. In such cases, the difference between an SLI and an SLO pertains mostly to windowing – some aspect of the ratio distribution is compared with a target value over a specified time interval. Layered on top of the SLOs are the Error Budgets, used to track the difference between the indicator and the objective, again over time, and mostly at a different resolution suitable for policing and steering change management. Measures on measures on measures – and all quantitative.
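
For the record, the quantitative plumbing being described amounts to something like the sketch below – a good/total ratio over a window compared with a target, with the remainder treated as an error budget; the numbers and the 99.9% target are arbitrary examples, not taken from any particular system.

    // A minimal sketch of the quantitative plumbing: a good/total ratio (SLI) over a window,
    // an SLO target, and the remaining error budget. All numbers are arbitrary examples.
    final class ReliabilityStack {
      public static void main(String[] args) {
        long goodEvents  = 99_870;                               // "good" requests in the window
        long totalEvents = 100_000;                              // all requests in the window

        double sli = (double) goodEvents / totalEvents;          // indicator as a ratio
        double slo = 0.999;                                      // objective: 99.9% over the window

        double allowedBad = (1.0 - slo) * totalEvents;           // error budget, in events
        double actualBad  = totalEvents - goodEvents;
        double budgetLeft = allowedBad - actualBad;              // negative means the budget is blown

        System.out.printf("SLI=%.4f SLO=%.3f budget remaining=%.0f events%n", sli, slo, budgetLeft);
      }
    }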

It is unfortunate that so little, if any, critical thinking enters engineers’ minds once Google is mentioned as the originator of something, because there are some serious design and operational issues here. Google itself is open and honest about this, stating that many organizations adopting a service level objectives approach to reliability will not even get beyond defining service level indicators. Even then, service level indicators and objectives are invariably limited to the edge entry points of a service or system of services. Many organizations don’t get off to a good start along this journey because beneath the service level indicator layer lies the messy and increasingly bloated world of metrics, traces, and logs. The notion of a Service, and of a Consumer of it, is all but lost amidst this data fog.

Once an organization manages to get past translating data into information suitable for indicators and objectives, it faces the most significant challenge to this whole model: for every service level indicator there is at least one service level objective, and for every service level objective there is at least one error budget. This is a stack of similarly sized layers. Things get completely out of control, and extremely costly (except maybe for the likes of Google), when multiple service level objectives are defined for a single service level indicator. Here is where site reliability engineering becomes a profession of glorified spreadsheet data entry clerks. And because there is so much tight coupling between the layers, any change in one layer ripples throughout, creating an unwieldy maintenance effort. Being able to consolidate and compress the model in the form of systems, to reduce the management burden, is made complicated by the reliance on quantitative measures rather than qualitative ones. This just does not scale unless you’re Google. There are 581 pages in the O’Reilly book; that should be sufficient warning in itself that simplicity is not to be found here.

What is strikingly odd about the whole site reliability engineering effort, where the human user is placed front and center, is that on the whole there is nothing particularly humane about it, except that Google and those who have adopted this approach focus exclusively in their doctrine on measuring and monitoring at the edge, and solely concerning human-to-machine interactions. System engineers need a new model that scales both up and down and can be applied effectively and efficiently at the edge and within a system of services. Otherwise, there will be a reliability model for this and another model for that. To scale realistically, the cost and complications of the lower layers need to be significantly reduced, moving up and outwards to other parties. The service level management language needs to be significantly simplified; instead of talking about objectives in terms of four or five nines, operations’ attention should be on a standard set of signals (outcomes and operations) and an even smaller set of status values. The primary service management task for operations should be configuring the scoring of sequences of signals and status changes between services. Inter- and intra-communication should relate near exclusively to the subjective view of meaningful service states. Just that!

A Modern Way

Appendix A: Service Level Objective Examples

Google SRE
99% (averaged over 1 minute) of Get RPC calls will complete in less than 100 ms (measured across all the backend servers)

Google SRE
90% of Get RPC calls will complete in less than 1 ms
99% of Get RPC calls will complete in less than 10 ms
99.9% of Get RPC calls will complete in less than 100 ms

OpenSignals
Service A will have a subjective OK status 99% of the time

OpenSignals has 16 built-in signals used to infer the status of a service, whereas in the Google SLO specification above only one signal is referenced. Typically, there will be at least three service level indicators and, in turn, three objectives and error budgets.