Observability: Signaling and Seeing the Situation

Situation Awareness

Let’s consider that the primary purpose of a monitoring dashboard is to bring immediate awareness of the situation to the viewer – to call attention to the current state of affairs and to support prediction and projection of the scenes in which situations exist and unfold over time. Looking at today’s dashboards built with Grafana, with a metric backend such as Prometheus sourcing the data, it is clear we are so far removed from the situation that it is no wonder things fail, and fail spectacularly, while being watched over by operators buried under incredible levels of data and detail.

More is Less

Engineers are drowning in data and in hundreds of data displays. It seems we are destined, for some time yet, to keep digging away at a mound of data that is growing at an unprecedented rate, unable to lift our heads above the fog and take stock of the dysfunctional system we have allowed to develop under our watch. We have lost all sense of system stability.

Seeing Sense

Now step back and consider what it means to observe, monitor, control, and manage services. Boiled down to the fundamentals, we are primarily concerned with the status of a service or a system (set) of services. The service status is something we infer from the sequence of signals emitted or received within a service-to-service interaction (exchange). In this article, we will not go into how the status is determined from signaling. It is far more important to consider how to reason about and visually represent the status of a service from different perspectives – those services (callers) dependent on the service (ingress), the service itself (transit), and those services (callees) the service depends on (egress). It is with this focus on status that we begin to glimpse situations.
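To make that vocabulary concrete, here is a minimal sketch in Go – the names are hypothetical, chosen only for illustration – of a status scale ordered from best to worst and the three perspectives from which that status can be assessed:

```go
package grid

// Status is an ordered assessment of service quality, from best to worst.
// Only OK, DEVIATING, and DOWN are named in the text; intermediate values
// could be added without changing the model.
type Status int

const (
	OK Status = iota
	DEVIATING
	DOWN
)

// Perspective identifies who is making the assessment.
type Perspective int

const (
	INGRESS Perspective = iota // the callers that depend on the service
	TRANSIT                    // the service itself, via its own signaling
	EGRESS                     // the callees the service depends on
)
```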

Service Level Grid

Below is a simple table model proposed for an effective and efficient rendering of a service or system of services. This should be the starting point for any service level monitoring and management dashboard. Each row in the table represents a possible service status value, starting with OK and ending with DOWN – from best to worst. The columns represent the three perspectives of service quality one generally takes in a service supply chain. In the first column, ingress, we look at the service under management from the callers’ perspective: how do the callers rate the quality of service delivered to them? The second column represents how the service itself, via its own signaling, views its service quality. Finally, the third column represents how the service views the quality of the services it depends on. In the table, we can see that the situation is stable (over some time period) in that all perspectives (columns) report their values against the OK status (row).

[Service level grid – rows: SERVICE STATUS values from OK (best) to DOWN (worst); columns: INGRESS (CALLERS), TRANSIT (SELF), EGRESS (CALLEES). All three perspectives report OK.]
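As a rough sketch of how compact this model is – extending the types above, and purely illustrative – the grid reduces to one tally per (perspective, status) cell:

```go
// Grid tallies endpoints per (perspective, status) cell.
type Grid struct {
	cells map[Perspective]map[Status]int
}

func NewGrid() *Grid {
	g := &Grid{cells: make(map[Perspective]map[Status]int)}
	for _, p := range []Perspective{INGRESS, TRANSIT, EGRESS} {
		g.cells[p] = make(map[Status]int)
	}
	return g
}

// Set and Get write and read a single cell of the grid.
func (g *Grid) Set(p Perspective, s Status, n int) { g.cells[p][s] = n }
func (g *Grid) Get(p Perspective, s Status) int    { return g.cells[p][s] }
```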

Quality Tracking

You might now ask what the numbers themselves represent. In most microservice architectures, a service has multiple endpoints or operations it offers to callers – typically CREATE (POST), READ (GET), UPDATE (PUT), and DELETE operations exposed per domain entity managed by the service. So the numbers could represent the number of (logical) endpoints that are interacted with. Though several dimensions could be tallied within the table cells, we will keep to this one: counting the endpoints exposed. The service in question, represented by the grid model, has 10 exposed endpoints. Each of these endpoints is deemed by the service itself to be operationally OK. From the perspective of callers, all 10 endpoints are also operationally OK. The number in the rightmost column represents the number of endpoints that the service itself interacts with. These egress endpoints could belong to just one service or to multiple services; services in the middle of a service supply chain will usually fan out requests to several other services.
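Continuing the sketch above, this stable scene could be populated as follows; the egress count is left as a parameter since the text does not pin it down:

```go
// stable reproduces the all-OK scene: 10 exposed endpoints OK from both the
// callers' and the service's own perspective, plus however many downstream
// endpoints this service calls out to.
func stable(egressEndpoints int) *Grid {
	g := NewGrid()
	g.Set(INGRESS, OK, 10)
	g.Set(TRANSIT, OK, 10)
	g.Set(EGRESS, OK, egressEndpoints)
	return g
}
```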

[Service level grid – INGRESS (CALLERS) and TRANSIT (SELF) remain OK; 5 EGRESS (CALLEES) endpoints are now assessed as DEVIATING.]

Ripple Effects

The strength of the simplicity introduced above really shines when it comes to disruptions that spread throughout a network (or chain) of interconnected services and components. In the table above, the situation has changed: 5 of the egress (callees) endpoints are now assessed as deviating from the perspective of the service under consideration. It is important to note that this does not mean these endpoints are seen as deviating by other services that also interact with them.
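In terms of the earlier sketch, such a disruption is nothing more than counts moving between rows within a single column – here 5 egress endpoints are re-assessed from OK to DEVIATING while the ingress and transit columns are untouched (an illustration, not a prescription):

```go
// Transition moves n endpoints between status rows within one perspective
// column, clamping so we never move more endpoints than are currently there.
func (g *Grid) Transition(p Perspective, from, to Status, n int) {
	if g.cells[p][from] < n {
		n = g.cells[p][from]
	}
	g.cells[p][from] -= n
	g.cells[p][to] += n
}

// The ripple-effects scene: 5 callee endpoints drop from OK to DEVIATING.
func ripple(g *Grid) {
	g.Transition(EGRESS, OK, DEVIATING, 5)
}
```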

[Service level grid – INGRESS (CALLERS) remains OK; 8 TRANSIT (SELF) endpoints and 5 EGRESS (CALLEES) endpoints are assessed as DEVIATING.]

Cascading Causes

The table above shows the unfolding situation spreading: the service itself now reports that 8 of its exposed endpoints are deviating from normal operation. At this point, the callers have not yet sensed it. The more resilient the service, the longer the window before the status change propagates. It does not matter whether the 8 deviating endpoints of the service do, in fact, call out to the 5 deviating egress endpoints, because the influence can be indirect, especially when working state (memory, files) and resources (pools, caches) are reused and shared across requests from multiple service clients.
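One plausible way to picture that widening window – an assumption for illustration, not something the article specifies – is an assessor that only downgrades an endpoint's own (transit) status after a run of consecutive deviating observations, so buffering, retries, and other resilience measures delay the moment the TRANSIT column changes:

```go
// assessor downgrades a status only after threshold consecutive deviating
// observations; a more resilient service effectively raises the threshold.
type assessor struct {
	threshold int
	run       int
	status    Status
}

func (a *assessor) observe(s Status) Status {
	if s == OK {
		a.run = 0
		a.status = OK
		return a.status
	}
	a.run++
	if a.run >= a.threshold {
		a.status = s // only now does the deviation surface in the column
	}
	return a.status
}
```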

[Service level grid – 8 INGRESS (CALLERS) endpoints, 8 TRANSIT (SELF) endpoints, and 5 EGRESS (CALLEES) endpoints are assessed as DEVIATING.]

Chain Reaction

After a period of time, the situation has spread further afield, with callers of the service now observing that 8 of the endpoints are indeed deviating. We have kept things rather simplistic by making the totals the same in the ingress and transit columns; the numbers need not be equal. It depends on how sensitive the callers are and how quickly the deviation is detected. If there were configuration differences across the caller population, we could easily have a value of 10 listed in both the OK and DEVIATING rows. For this reason, we should not over-emphasize what the values in the cells constitute. What is far more important is understanding what the grid representation reveals about the situation and its enclosing scene, particularly the dynamics of change across interactions.

Pattern Recognition

What if, instead of scanning hundreds of sparklines plotted across a data-board, engineers only needed to learn to recognize situations by the visual patterns exhibited by the service level grid model? This would be the beginning of a new direction for the observability and monitoring industry. We would finally observe system and service dynamics at a scale that is understandable, meaningful, and explainable to humans. From there, controllability becomes practical.
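As a closing sketch – the labels below are assumptions, chosen only to illustrate the idea – such patterns could even be named mechanically, by looking at which columns of the grid carry deviating counts:

```go
// situation names the pattern formed by deviation across the three columns,
// mirroring the scenes walked through above.
func situation(g *Grid) string {
	ingress := g.Get(INGRESS, DEVIATING) > 0
	transit := g.Get(TRANSIT, DEVIATING) > 0
	egress := g.Get(EGRESS, DEVIATING) > 0
	switch {
	case !ingress && !transit && !egress:
		return "stable"
	case !ingress && !transit && egress:
		return "downstream disruption, not yet felt by the service"
	case !ingress && transit && egress:
		return "disruption spreading into the service itself"
	case ingress && transit && egress:
		return "full cascade: callers now experience the deviation"
	default:
		return "mixed pattern, warrants closer inspection"
	}
}
```

Whatever the exact heuristics, the point stands: a handful of recognizable grid patterns can stand in for hundreds of individual charts.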