Improving Situational Awareness in Systems of Services

Background

Much of the initial motivation underlying the design of OpenSignals reflected my concern with the wanton proliferation of metric instrumentation and custom dashboard creations. Tooling like Prometheus and Grafana has created, and continues to create, an enormous, ever-growing gap between data and information, collection and analysis, and perception and projection – many early adopters of such tools are no better off. I remember walking around large office spaces. As I passed from one squad to another, I was bewildered by the significant differences across the giant television monitors that served to delineate each team boundary. It was unquestionably clear that Grafana and Prometheus had somehow managed to break the circular feedback link between doing and knowing. My problem identification and resolution skills seemed irrelevant here; instead, the individuals most prized by the organization’s operational side were those who knew which of the thousands of dashboards to navigate to, across squads, when issues escalated across service boundaries. When incidents did arise in production, the disarray was depressing to watch as engineers wasted time trying to interpret what was being shown on each of the custom team dashboards. Inconsistency was everywhere: in the naming of metrics, data conversion, time resolution, aggregation, representation, visualization, placement, etc. For the most part, the situation itself was utterly absent from the minds of all those involved. After a protracted period, most of the engineers opted to jump back and forth between logs and lines of source code – wandering aimlessly down in the dungeons of data and details that are logging.

After witnessing one incident after another, what was clear was how little was known and understood during these events beyond the cue that brought the situation to engineering’s attention. It was no wonder that not much changed between incidents; the level at which attention was given made each incident in production look unique in itself. The data values were different. The dashboards were different. The pixels rendered were different. The tooling failed spectacularly in assisting engineering in assessing the situation. The tooling’s design and usage were at odds with human cognitive capacities, never playing to a human’s strength in visual and contextual pattern matching. The overly large television monitors paraded by squads were displays of vanity built on an illusion of control. On reflection, much of the passing of the blame (and ticket) from one squad to another, from one microservice to another, could be primarily attributed to the differences in situational awareness levels across teams and the lack of a shared standard mental model.

Situational Awareness

What is awareness of the current situation when it comes to systems of services? One popular standard definition states that situation awareness is the perception of elements in the environment within a volume of time and space, the comprehension of their meaning, and the projection of their status in the near future (Endsley, 1987). Perception, level 1, involves sensing and collecting data, such as status and dynamics, related to an environment’s elements. Comprehension, level 2, consists of the integration of collected data transformed into information and, in turn, understanding. Comprehension is essential in assessing the significance of the elements and events and acquiring a big picture perspective. Projection, level 3, consists of applying acquired knowledge and analytical capabilities to predict possible next states and interventions (if applicable). The accuracy of prediction at this level depends, for the most part, on the accuracy and quality of the two lower levels. All processing at each level reflects current goals.

Simply put, situational awareness is knowing what is going on within an environment, and more importantly, what is essential.

Situational awareness is significantly enhanced by the development of internal mental models of the systems being managed within an environment. Such models serve to direct attention efficiently and offer a means to integrate information effectively while providing a future state projection mechanism. Unfortunately, much of what has been promoted in the Observability space (distributed tracing, metrics, logging) has not offered a suitable mental model in any form whatsoever. The level of situational awareness is still sorely lacking in the vast majority of teams, who appear to be permanently stalled at ground zero and overtly preoccupied with data and details. It is now widely recognized by experts in the fields of controllability, operability, and situational awareness that more data does not equal more information. The problem with today’s operational support systems, such as application performance monitoring, is not data but finding what is needed and significant when required. What something means is paramount both to awareness (subjective interpretation) and to the construct of the situation (objective significance). Engineering needs a suitable situational model.

System Model

Working memory is the bottleneck within humans when it comes to situational awareness, particularly the prediction of future states – this is especially true for non-experts or novel situations. Mental models can circumvent such limitations in generating descriptions and explanations of systems, especially elements’ status. A model acts much like a schema for a system, with a situational model representing the system model’s current state, much like a snapshot. A model provides a means of achieving a much higher level of situational awareness without overloading working memory. Models should play to humans’ superior abilities in pattern matching, facilitating the direction of attention, a precious cognitive resource, in noting critical cues, allowing for projection of system states, and linking current system state and its classification to an appropriate intervention. A model’s selection reflects an operator’s goals, plans, and tasks, much like a template or class. It must be populated or instantiated, like an object, with data captured from within the operating environment. Goals facilitate a top-down process of decision making and planning. In contrast, patterns and cues within an environment allow for bottom-up processing to change goals and plans and the system model employed. In OpenSignals, the bottom-up direction to situational awareness is aided by signals fired by services and the status changes resulting from such firings. Top-down processing is underpinned by the scoring of signal and status change patterns at various attended collective scales (of aggregation and event propagation) and by the setting and ongoing monitoring of service level objectives, i.e., goals.
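
To make the template-and-instance idea above a little more concrete, here is a minimal Python sketch. It is an illustration only and not the OpenSignals API: the signal names, the status values, the penalty weights, the thresholds, and the helper meets_objective are all assumptions made up for this example. A Service instance is populated bottom-up by fired signals, its status is inferred by scoring a sliding window of those signals, and a goal is checked top-down across services.

```python
from collections import deque
from enum import Enum


class Signal(Enum):
    # Hypothetical signal vocabulary, for illustration only.
    SUCCEED = "succeed"
    RETRY = "retry"
    FAIL = "fail"


class Status(Enum):
    # Hypothetical status values, ordered from best (0) to worst (2).
    OK = 0
    DEGRADED = 1
    DOWN = 2


# Assumed penalty that each signal contributes to the score.
PENALTY = {Signal.SUCCEED: 0.0, Signal.RETRY: 0.5, Signal.FAIL: 1.0}


class Service:
    """The model (template); each instance is populated with fired signals."""

    def __init__(self, name, window=10):
        self.name = name
        self.recent = deque(maxlen=window)  # only the last `window` signals count
        self.status = Status.OK

    def fire(self, signal):
        """Bottom-up: record a signal and return True if the status changed."""
        self.recent.append(signal)
        previous = self.status
        self.status = self._score()
        return self.status is not previous

    def _score(self):
        # Mean penalty over the window, compared against assumed thresholds.
        mean = sum(PENALTY[s] for s in self.recent) / len(self.recent)
        if mean >= 0.6:
            return Status.DOWN
        if mean >= 0.2:
            return Status.DEGRADED
        return Status.OK


def meets_objective(services, worst_allowed=Status.DEGRADED):
    """Top-down: check every service status against a goal."""
    return all(s.status.value <= worst_allowed.value for s in services)


checkout = Service("checkout", window=4)
for signal in (Signal.SUCCEED, Signal.FAIL, Signal.FAIL):
    if checkout.fire(signal):
        print(checkout.name, "is now", checkout.status.name)
print("objective met:", meets_objective([checkout]))
```

The operator attends to a handful of status transitions rather than to every raw value; the sliding window is just one simple way of scoring and stands in for whatever scoring scheme is actually used.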

Design Challenges

The presentation of information is a vital factor in how much information can be acquired accurately, effectively assessed and understood, and related to operational needs, goals, and plans. An optimal design seeks to convey as much information as possible without undue cognitive effort – attention and working memory must be carefully conserved. Much of what drives today’s observability dashboards is data-collection-centric; minimal regard is given to orienting a user to the current situation. Information about the current goal, such as reliability, is rarely presented directly, being lost amongst many metrics and charts haphazardly thrown together on limited screen real estate. Today, in determining a service’s operational status, an essential piece of information, an operator needs to combine multiple metrics, sometimes mistakenly called signals, within their internal model at great expense and risk of error. Most dashboards are a poor and inadequate proxy for an operational and situational model. Critical cues, such as signals and states, need to be perceptually prominent; unfortunately, while much of the product and marketing literature around site reliability engineering does mention signals and states, these are not to be found or accessible in a manner suitable for pattern matching along space and time dimensions. Distributed tracing, metrics, and logging don’t lend themselves to the type of transformation, presentation, and communication needed here. These are measurement and data collection technologies, not situational models. Such yesteryear observability instrumentation techniques cannot be operationalized, save for when further diagnostics are required, which should always be guided by the situation’s assessment and awareness. The ever-growing problem of information overload in today’s application performance monitoring tooling needs to be tackled along the data pipeline, starting at the source with filtering and data reduction. Here OpenSignals excels in first limiting the value domain to a small set of signals and an even smaller set of status values.
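
The kind of reduction at the source described above can be sketched in a few lines of Python. This is a hedged illustration under stated assumptions, not how OpenSignals itself is implemented: the three signals, the two status values, the 500 ms threshold, and the function names to_signal and changes are all invented for the example. Raw measurements are collapsed into a small set of signals, signals into an even smaller set of status values, and only status changes are passed on.

```python
from enum import Enum


class Signal(Enum):
    # Hypothetical small signal set.
    SUCCEED = "succeed"
    DELAY = "delay"
    FAIL = "fail"


class Status(Enum):
    # Hypothetical, even smaller, status set.
    OK = "ok"
    DEGRADED = "degraded"


# Assumed mapping from a signal to the status it implies.
STATUS_OF = {
    Signal.SUCCEED: Status.OK,
    Signal.DELAY: Status.DEGRADED,
    Signal.FAIL: Status.DEGRADED,
}


def to_signal(http_status, latency_ms, slow_ms=500.0):
    """Collapse one raw measurement into a signal (assumed rules)."""
    if http_status >= 500:
        return Signal.FAIL
    if latency_ms > slow_ms:
        return Signal.DELAY
    return Signal.SUCCEED


def changes(values):
    """Yield a value only when it differs from the one before it."""
    previous = object()  # sentinel that is unequal to every real value
    for value in values:
        if value != previous:
            yield value
            previous = value


# Made-up raw measurements: (HTTP status code, latency in milliseconds).
raw = [(200, 120.0), (200, 90.0), (503, 30.0), (500, 45.0), (200, 800.0), (200, 100.0)]
signals = [to_signal(code, latency) for code, latency in raw]
transitions = list(changes(STATUS_OF[s] for s in signals))
print([status.name for status in transitions])
```

Six raw measurements become three status transitions, which is the only part an operator needs to attend to until further diagnostics are called for.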

Semantics and dynamics need to win over data and details; otherwise, there is very little hope of scaling to the increasing complexity and change rates. Achieving level 3 in situational awareness is difficult if the measurement models employed are not applicable or appropriate to projection into the near future. It is hard to imagine how a model consisting of data and details, collected by tracing and logging, could be used realistically to project future events and states. There is minimal compatibility and alignment here, whereas the model promoted by OpenSignals provides great immediacy with regard to significance in seeing and understanding the past and present, in predicting possible future transitions, and in offering assisted interventions. The only goal that would seem to be addressed by distributed tracing, metrics, and logging is secondary at best – collecting data instead of detecting cues, signals, and salient changes. Until we abandon the view that all data is equal and more is better, it is unlikely that operators will move to the next level in organizing signals into meaningful patterns of information about the current situation that can then be sorted, prioritized, and categorized to project and predict future events and states of a system by way of mental models and long-term episodic memory.

The term Observability originated in Control Theory, where analysis of system behavior rests on states (status) and goals (objective). Feedback control mechanisms are constructed and configured to reduce the difference between a goal (state or set of conditions) and the current state. Feedback control responds to sensory stimuli, signals from which state changes are inferred, with interventions and actions designed to correct divergence between the current state and goal. Feed-forward, a form of proactive anticipatory behavior, on the other hand, uses an extended model to predict what actions should be executed. The goal is the state, the operational status of one or more systems of services. Yet, much of what has been promoted and staged by charlatans, pretenders, and impostors in the monitoring (and now observability) domain is neither related to nor aligned with this. It is time to get back to the basics of awareness, attention, actioning, and adaptability within the operational management of systems of services: context, signals, services, (inferred or scored) status, scoring, sequencing, situation, and simulation. OpenSignals is the beginning of a complete overhaul of reliability engineering.