Observability: Signaling and Seeing the Situation

Let’s consider that the primary purpose of a monitoring dashboard is to bring immediate awareness of the situation to the viewer: to call attention to the current state of affairs, and to support prediction and projection of the scene in which situations exist and unfold over time. When you look at today’s dashboards built with Grafana, with a metric backend like Prometheus sourcing the data, it is clear we are so far removed from the situation that it is no wonder things fail, and fail spectacularly, while being watched over by operators awash in incredible levels of data and detail. Engineers are drowning in data and hundreds of data displays. It seems we are destined, for some time yet, to keep digging away at a mound of data growing at unprecedented levels, unable to lift our heads above the fog and take stock of the dysfunctional system we have allowed to develop under our watch. We have lost all sense of system stability.

Now step back and consider what it means to observe, monitor, control, and manage services. Boiling this down to the fundamentals, we are primarily concerned with the status of a service or a system (set) of services. The status of a service is something we infer from the sequence of signals emitted or received in a service-to-service interaction (exchange). In this article, we will not go into how the status is determined from signaling. It is far more important to consider how to reason about and visually represent the status of a service from different perspectives: the services (callers) that depend on the service (ingress), the service itself (transit), and the services (callees) that the service depends on (egress). It is with a focus on status that we begin to glimpse situations.
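To make the vocabulary concrete before going further, here is a minimal sketch in Go of the two axes just described. The names are illustrative, not taken from any particular library, and only the statuses named in this article are listed; a real model may define intermediate values.

```go
package grid

// Status is the condition inferred for a service or endpoint from its
// signaling, ordered from best case (OK) to worst case (DOWN).
// Only the statuses named in the article are listed here.
type Status int

const (
	OK Status = iota
	DEVIATING
	DOWN
)

// Perspective identifies who is making the assessment in a
// service-to-service exchange.
type Perspective int

const (
	INGRESS Perspective = iota // callers assessing the service
	TRANSIT                    // the service assessing itself
	EGRESS                     // the service assessing its callees
)
```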

Below is a simple table model proposed as an effective and efficient rendering of a service or system of services. It should be the starting point for any service level monitoring and management dashboard. Each row in the table represents a possible service status value, starting with OK and ending with DOWN: from best case to worst case. The columns represent the three perspectives on service status quality one generally takes in a service supply chain. The first column, ingress, looks at the service under management from the callers’ perspective: how do the callers rate the quality of service delivered to them? The second column, transit, represents how the service itself, via its own signaling, views its service quality. The third column, egress, represents how the service views the quality of service of the services it depends on. In the table, we can see that the situation is stable (over some time period): all perspectives (columns) report their values under the OK status (row).
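Continuing the sketch, the table reduces to a count per cell, indexed by perspective (column) and status (row). A hypothetical Grid type, reusing the values above, might look like this:

```go
// Grid is one snapshot of the table model: for each perspective
// (column) it holds a count of endpoints per status (row).
type Grid map[Perspective]map[Status]int

// Stable reports whether every perspective tallies all of its
// endpoints under OK, i.e. no cell outside the OK row is populated.
func (g Grid) Stable() bool {
	for _, counts := range g {
		for status, n := range counts {
			if status != OK && n > 0 {
				return false
			}
		}
	}
	return true
}
```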

You might now ask what the numbers themselves represent. Well, in most microservice architectures, a service has multiple endpoints or operations it offers to callers. You usually get the CREATE (POST), READ (GET), UPDATE (PUT), and DELETE operations exposed per domain entity managed by a service. So the numbers could represent the number of (logical) endpoints that are interacted with. Though several dimensions could be tallied within the table cells, we will keep to this one: counting endpoints. So the service in question, represented by the grid model, has 10 exposed endpoints. Each endpoint is deemed by the service itself to be operationally OK. From the perspective of callers, all 10 endpoints are also operationally OK. The number in the rightmost column represents the number of endpoints that the service itself calls out to. These egress endpoints could belong to a single service or to multiple services; a service in the middle of a service supply chain will usually fan out requests to other services.
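Using that type, the baseline just described could be captured as the snapshot below. The ingress and transit figures (10 endpoints, all OK) come from the walkthrough; the egress total of 8 is purely an assumed figure for illustration, since the text only says the service calls out to one or more other services.

```go
// Baseline from the walkthrough: every perspective reports OK.
var baseline = Grid{
	INGRESS: {OK: 10}, // callers see all 10 exposed endpoints as OK
	TRANSIT: {OK: 10}, // the service judges its own 10 endpoints OK
	EGRESS:  {OK: 8},  // assumed callee endpoint count, all OK
}
```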

The strength of the simplicity introduced above really shines when it comes to disruptions that spread throughout a network (or chain) of interconnected services and components. In the table above, the situation has changed: 5 of the egress (callee) endpoints are now assessed as deviating from the perspective of the service under consideration. It is important to note that this does not mean the same endpoints are seen as deviating by other services that also interact with them.
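In grid terms, still assuming an egress total of 8, the change might be recorded as:

```go
// Five callee endpoints are now assessed as DEVIATING by this service;
// the ingress and transit columns are, so far, unchanged.
var egressDeviation = Grid{
	INGRESS: {OK: 10},
	TRANSIT: {OK: 10},
	EGRESS:  {OK: 3, DEVIATING: 5},
}
```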

The table above shows the situation now spreading, with the service itself reporting that 8 of its exposed endpoints are considered deviating from normal operation. At this point, the callers have not yet sensed it. The more resilient the service, the longer the window before the status change propagates outward. It does not matter whether the 8 deviating endpoints of the service do, in fact, call out to the 5 deviating egress endpoints, because the influence can be indirect, especially when working state (memory, files) and resources (pools, caches) are reused and shared across requests from multiple service clients.
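The corresponding snapshot in the sketch, with the disruption now visible in the transit column as well:

```go
// The disruption has propagated inward: 8 of the service's own
// endpoints now deviate, while callers still report everything OK.
var transitDeviation = Grid{
	INGRESS: {OK: 10},
	TRANSIT: {OK: 2, DEVIATING: 8},
	EGRESS:  {OK: 3, DEVIATING: 5},
}
```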

After a period of time, the situation has spread further afield, with callers of the service now observing that 8 of its endpoints are indeed deviating. We have kept things rather simple by making the totals the same in both the ingress and transit columns; the numbers need not be equal. It depends on how sensitive the callers are and how quickly they detect the deviation. We could easily have a value of 10 listed in both the OK and DEVIATING rows if there were configuration differences across the caller population. For this reason, we should not place too much emphasis on what the values in the cells constitute. What is far more important is understanding what the grid representation reveals about the situation and its enclosing scene, particularly the dynamics of change across interactions.
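And the final snapshot of this walkthrough, with the deviation now visible from all three perspectives (the matching ingress and transit counts are, as noted, a simplification):

```go
// Callers now observe the deviation too. Caller sensitivity and
// configuration could make the ingress and transit counts differ.
var ingressDeviation = Grid{
	INGRESS: {OK: 2, DEVIATING: 8},
	TRANSIT: {OK: 2, DEVIATING: 8},
	EGRESS:  {OK: 3, DEVIATING: 5},
}
```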

What if, instead of scanning hundreds of sparklines plotted across a data board, engineers only needed to learn how to recognize situations by the visual patterns exhibited by the service level grid model? This would be the very beginning of a new direction for the observability and monitoring industry. Each of the grids below has something to show and inform, a snapshot at some point in time, and the picture can be extended further when such frames are sequenced. We would finally get back to observing system and service dynamics at a scale that is understandable, meaningful, and explainable to humans. From there, controllability becomes practical.
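As a closing sketch of what sequencing such frames could mean in practice, the hypothetical Diff function below, built on the Grid type from earlier, reduces two consecutive snapshots to the cells that changed; that delta is the raw material a pattern-recognizing operator, or tool, would work from.

```go
// CellChange records how one (perspective, status) cell moved
// between two consecutive snapshots.
type CellChange struct {
	Perspective Perspective
	Status      Status
	From, To    int
}

// Diff returns the cells whose counts changed between two frames,
// the raw material for recognizing an unfolding situation.
func Diff(prev, next Grid) []CellChange {
	var changes []CellChange
	for _, p := range []Perspective{INGRESS, TRANSIT, EGRESS} {
		for _, s := range []Status{OK, DEVIATING, DOWN} {
			from, to := prev[p][s], next[p][s]
			if from != to {
				changes = append(changes, CellChange{
					Perspective: p,
					Status:      s,
					From:        from,
					To:          to,
				})
			}
		}
	}
	return changes
}
```

Feeding it successive frames, baseline to egress deviation to transit deviation to ingress deviation, yields the frame-over-frame storyline the grid model is meant to make visible.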