Improving Situational Awareness in Systems of Services

Ramping Information

Much of the initial motivation underlying the design of OpenSignals reflected a concern with the wanton proliferation of metric instrumentation and custom dashboard creations. Tooling like Prometheus and Grafana has created, and continues to create, an enormous, ever-growing gap between data and information, collection and analysis, and perception and projection; many early adopters of such tools are no better off. We can all remember walking around large office spaces and, as we passed from one squad to another, being bewildered by the significant differences across each of the giant television monitors that served to delineate each team boundary. It is unquestionably clear that Grafana and Prometheus have managed to break the circular feedback link between doing and knowing.

Disorder: Dashboard Myopia

Problem identification and resolution skills seem irrelevant here; instead, the individuals most prized by an organization’s operational side are those who know which of the thousands of dashboards to navigate to across squads when issues escalate across service boundaries. When incidents arise in production, the disarray is depressing to watch as engineers waste time interpreting the multitude of custom team dashboards. Inconsistency is everywhere: in the naming of metrics, data conversion, time resolution, aggregation, representation, visualization, placement, etc. For the most part, the situation itself is utterly absent from the minds of those involved. After a protracted period, most engineers opt to jump back and forth between logs and lines of source code, wandering down in the dungeons of data and logging details.

Illusion of Control

After witnessing one incident after another, what was clear was how little was known and understood during these events beyond the cue that brought the situation to engineering’s attention. It was no wonder that not much changed between incidents; the level of awareness made each production incident look unique. The data values were distinct, dashboards unique, rendered pixels different. The tooling failed spectacularly in assisting engineering in assessing the situation. The tooling’s design and usage were at odds with human cognitive capacities, never playing to a human’s strength in visual and contextual pattern matching. The overly large television monitors paraded by squads were displays of vanity built on an illusion of control. On reflection, the passing of the buck from one unit or microservice to another is a result of differences in situation awareness levels across teams and the lack of a shared mental model.


Awakening: Situation Awareness

What is awareness of the current situation when it comes to systems of services? One popular standard definition states that situation awareness is the perception of elements in the environment within a volume of time and space, the comprehension of their meaning, and the projection of their status in the near future (Endsley, 1987). Perception, level 1, involves sensing and collecting data, such as status and dynamics related to an environment’s elements. Comprehension, level 2, consists of integrating the collected data, transforming it into information and, in turn, understanding. Comprehension is essential in assessing the significance of the elements and events and acquiring a big picture perspective. Projection, level 3, consists of applying acquired knowledge and analytical capabilities to predict future states and possible interventions (if applicable). The accuracy of prediction at this level depends, for the most part, on the accuracy and quality achieved at the two lower levels. All processing at each level reflects current goals.
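The three levels can be pictured as stages in a small pipeline. What follows is only a minimal sketch under stated assumptions: the names Observation, perceive, comprehend, and project are hypothetical, the "ok" and "fail" labels are invented for illustration, and the linear extrapolation stands in for whatever projection technique a real system would use. It is not taken from OpenSignals or from Endsley's work.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class Observation:
    """One sensed fact about an element of the environment (hypothetical shape)."""
    service: str
    outcome: str  # assumed labels: "ok" or "fail"


def perceive(raw_events):
    """Level 1: sense and collect data about elements in the environment."""
    return [Observation(service, outcome) for service, outcome in raw_events]


def comprehend(observations):
    """Level 2: integrate data into information, here a failure ratio per service."""
    totals, failures = Counter(), Counter()
    for obs in observations:
        totals[obs.service] += 1
        if obs.outcome == "fail":
            failures[obs.service] += 1
    return {service: failures[service] / totals[service] for service in totals}


def project(previous, current):
    """Level 3: naive linear extrapolation of each ratio, clamped to [0, 1]."""
    projected = {}
    for service, ratio in current.items():
        trend = ratio - previous.get(service, ratio)
        projected[service] = min(1.0, max(0.0, ratio + trend))
    return projected


# Two consecutive observation windows for a hypothetical "checkout" service.
earlier = comprehend(perceive([("checkout", "ok")] * 3 + [("checkout", "fail")]))
later = comprehend(perceive([("checkout", "ok"), ("checkout", "fail")]))
outlook = project(earlier, later)
```

Note how the projection can only be as good as the ratios fed into it, which mirrors the dependency of level 3 on the two lower levels.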

Ground Zero

Situation awareness is knowing what is going on within an environment, and more importantly, what is essential. By developing internal mental models of the systems we manage, we significantly enhance situational awareness. Such models serve to direct attention efficiently and offer a means to integrate information effectively while providing a future state projection mechanism. Unfortunately, many of the solutions promoted in the Observability space, such as distributed tracing, metrics, and logging, have not offered a suitable mental model in any form whatsoever. The level of situation awareness is still sorely lacking in the vast majority of teams, who appear to be permanently stalled at ground zero and overtly preoccupied with data and details.

Bringing Significance into Focus

It is now widely recognized that more data does not equate to more information. The problem with today’s operational support systems, such as application performance monitoring, is not a lack of data but finding what is needed and significant when it is required. What something means is paramount both to awareness, which is a subjective interpretation, and to the construction of the situation, which carries objective significance. Engineering needs a suitable situational model.

Minds and Machines need Models

Working memory is the bottleneck within humans regarding situation awareness, particularly the prediction of future states; this is especially true for non-experts or novel situations. Mental models can circumvent such limitations by generating descriptions and explanations of systems, especially the status of their elements. A model acts as a schema for a plan, with a situation model representing the system model’s current state, much like a snapshot. A model provides a means of achieving a much higher level of situation awareness without overloading working memory. Models should play to humans’ superior abilities in pattern matching, facilitating the direction of attention (a precious cognitive resource), noting critical cues, allowing for projection of system states, and linking the current system state and its classification to an appropriate intervention.

Bidirectional Feedback

A model’s selection reflects an operator’s goals, plans, and tasks, much like a template or class. It must be populated or instantiated, like an object, with data captured within the operating environment. Goals facilitate a top-down process of decision-making and planning. In contrast, patterns and cues within an environment allow for bottom-up processing to change goals and plans and the system model employed. In OpenSignals, the signals fired by services and the status changes resulting from such firings aid situation awareness. Top-down processing is underpinned by the scoring of signal and status change patterns at various attended collective scales (of aggregation and event propagation) and by the setting and ongoing monitoring of service level objectives (goals).
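The class and object analogy can be made concrete. The sketch below is an illustration only and is not the OpenSignals API: the Signal and Status members, the ServiceModel and Objective classes, the window size, and the scoring thresholds are all assumptions chosen for brevity. Signals arriving from the environment drive the bottom-up path, while an objective expresses the top-down goal against which the scored status is checked.

```python
from dataclasses import dataclass, field
from enum import Enum


class Signal(Enum):
    """Hypothetical signals a service might fire."""
    SUCCEED = "succeed"
    FAIL = "fail"


class Status(Enum):
    """Hypothetical status values, ordered from best to worst."""
    OK = 0
    DEGRADED = 1
    DOWN = 2


@dataclass
class ServiceModel:
    """The template: instantiated per service and populated with fired signals."""
    name: str
    window: int = 10  # assumed number of recent signals that are scored
    recent: list = field(default_factory=list)

    def fire(self, signal: Signal) -> Status:
        """Bottom-up: record a signal from the environment and rescore status."""
        self.recent = (self.recent + [signal])[-self.window:]
        return self.status()

    def status(self) -> Status:
        """Score the recent signals; the thresholds are arbitrary assumptions."""
        if not self.recent:
            return Status.OK
        failure_ratio = self.recent.count(Signal.FAIL) / len(self.recent)
        if failure_ratio >= 0.5:
            return Status.DOWN
        if failure_ratio >= 0.25:
            return Status.DEGRADED
        return Status.OK


@dataclass(frozen=True)
class Objective:
    """Top-down: the goal, stated as the worst status that is still acceptable."""
    worst_acceptable: Status

    def met_by(self, model: ServiceModel) -> bool:
        return model.status().value <= self.worst_acceptable.value


# Instantiate the model for a hypothetical "payments" service and feed it signals.
payments = ServiceModel("payments")
for fired in [Signal.SUCCEED, Signal.SUCCEED, Signal.SUCCEED, Signal.FAIL]:
    payments.fire(fired)
goal = Objective(worst_acceptable=Status.OK)
goal_met = goal.met_by(payments)
```

The operator never inspects the individual signals here; attention goes to the status and to whether the goal is still met.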


Design Challenge

The presentation of information is a vital factor in how much information can be acquired accurately, effectively assessed and understood, and related to operational needs, goals, and plans. An optimal design seeks to convey as much information as possible without undue cognitive effort; attention and working memory must be carefully conserved. Much of what drives today’s observability dashboards is data collection-centric, with minimal consideration for orienting users to the current situation. Information about the current goal, such as reliability, is rarely presented directly, being lost amongst many metrics and charts haphazardly thrown together on limited screen real estate. Today, in determining a service’s operational status, an essential piece of information, an operator needs to combine multiple metrics, sometimes mistakenly called signals, within their internal model at great expense and with a high risk of error.

A Lost Signal

Most dashboards are a poor and inadequate proxy for an operational and situational model. Critical cues, such as signals and states, need to be perceptually prominent. Unfortunately, while much of the product and marketing literature around site reliability engineering mentions signals and states, these are not to be found or accessible in a manner suitable for pattern matching along space and time dimensions. Distributed tracing, metrics, and logging don’t lend themselves to the type of transformation, presentation, and communication needed here. These are measurement and data collection technologies, not situational models.

Streamlining Streams

Such yesteryear observability instrumentation techniques cannot be operationalized, save for when further diagnostics are required, which should always be guided by the situation’s assessment and awareness. The ever-growing problem of information overload in today’s application performance monitoring tooling needs to be tackled along the data pipeline, starting at the source with filtering and data reduction. Here OpenSignals excels in limiting the value domain to a small set of signals and an even smaller group of status values.
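To illustrate what reduction at the source might look like, here is a minimal sketch. It assumes a hypothetical call record with http_status and latency_ms fields, an invented three-member Signal set, and an arbitrary 500 ms threshold; none of these come from OpenSignals. Each verbose record collapses into one value from a small domain, and a second step forwards a value only when it changes.

```python
from enum import Enum


class Signal(Enum):
    """A deliberately small value domain (hypothetical members)."""
    SUCCEED = "succeed"
    FAIL = "fail"
    DELAY = "delay"


def to_signal(record, slow_ms=500):
    """Collapse a verbose call record into a single signal at the source."""
    if record["http_status"] >= 500:
        return Signal.FAIL
    if record["latency_ms"] > slow_ms:
        return Signal.DELAY
    return Signal.SUCCEED


def changes_only(values):
    """Forward a value only when it differs from the one before it."""
    previous = object()  # sentinel that never equals a real value
    for value in values:
        if value != previous:
            yield value
            previous = value


# Five detailed records shrink to three transitions.
records = [
    {"http_status": 200, "latency_ms": 40},
    {"http_status": 200, "latency_ms": 45},
    {"http_status": 200, "latency_ms": 900},
    {"http_status": 503, "latency_ms": 30},
    {"http_status": 503, "latency_ms": 20},
]
signals = [to_signal(record) for record in records]
transitions = list(changes_only(signals))
```

The raw records remain available for later diagnostics if the situation calls for them; what travels down the pipeline is the small, comparable set of values.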

Projecting the Future

Semantics and dynamics need to win over data and details; otherwise, there is very little hope of scaling to the increasing complexity and change rates. To achieve level 3 in situation awareness is difficult if the measurement models employed are not applicable or appropriate to projection into the near future. It is hard to imagine how a model consisting of data and details collected by tracing and logging can project future events and states. There is minimal compatibility and alignment here, whereas the model promoted by OpenSignals provides great immediacy concerning significance in seeing and understanding the past and present, predicting possible future transitions, and offering assisted interventions.

Less Data, More Signs

The only goal that would seem to be addressed by distributed tracing, metrics, and logging is secondary at best: collecting data instead of detecting cues, signals, and salient changes. Until we abandon the view that all data is equal and that more of it is better, it is unlikely that operators will ever move to the next level. If we organize signals into meaningful patterns of the current situation, we have a much better chance of projecting and predicting the future events and states of a system.
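One simple way to picture projection from patterns is to count which status has followed which in the past and to pick the most frequent successor. This is a sketch of a first-order transition count, a stand-in technique chosen for illustration; the function names and the status labels are hypothetical, and nothing here is claimed to be how OpenSignals performs prediction.

```python
from collections import Counter, defaultdict


def learn_transitions(statuses):
    """Count how often each status was followed by each other status."""
    transitions = defaultdict(Counter)
    for current, following in zip(statuses, statuses[1:]):
        transitions[current][following] += 1
    return transitions


def project_next(transitions, current):
    """Return the most frequently observed successor, or None if unseen."""
    followers = transitions.get(current)
    if not followers:
        return None
    return followers.most_common(1)[0][0]


# A short, invented status history for one service.
history = ["OK", "OK", "DEGRADED", "DOWN", "OK", "DEGRADED", "DOWN"]
learned = learn_transitions(history)
likely_after_degraded = project_next(learned, "DEGRADED")
```

Such counting is only feasible because the value domain is tiny; the same approach over raw trace or log records would have almost no repeated patterns to learn from.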

Control: The Origin of Observability

The term Observability originated in Control Theory, where analysis of system behavior rests on states (status) and goals (objective). Feedback control mechanisms are constructed and configured to reduce the difference between a goal (state or set of conditions) and the current state. Feedback control responds to sensory stimuli: state changes are inferred from signals, and interventions and actions are designed to correct divergence between the current state and the goal. Feed-forward, a form of proactive, anticipatory behavior, on the other hand, uses an extended model to predict actions to be executed. The goal is the state, the operational status of one or more systems of services. It is time to get back to the basics of awareness, attention, actioning, and adaptability within the operational management of systems of services: context, signals, services, (inferred or scored) status, scoring, sequencing, situation, and simulation. OpenSignals is the beginning of a complete overhaul of reliability engineering.
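The feedback idea, reducing the difference between a goal and the current state, can be shown with a textbook proportional controller. This is a minimal sketch: the function names, the gain of 0.5, and the numeric state are assumptions made for illustration and say nothing about how OpenSignals or any particular control system is implemented.

```python
def feedback_step(goal, current, gain=0.5):
    """Return a correction proportional to the gap between goal and current state."""
    return gain * (goal - current)


def run_loop(goal, current, steps, gain=0.5):
    """Apply the correction repeatedly and record the state after each step."""
    history = [current]
    for _ in range(steps):
        current += feedback_step(goal, current, gain)
        history.append(current)
    return history


# A state of 60.0 is steered toward a goal of 100.0 over three corrections.
trajectory = run_loop(goal=100.0, current=60.0, steps=3)
```

Each step halves the remaining divergence, so the state moves from 60.0 to 80.0, 90.0, and then 95.0. A feed-forward variant would instead compute the action from a model of the expected disturbance before any divergence is observed.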