Observability: Where is it heading?

Observability is a hot topic in software and system engineering circles even though it is poorly defined and largely misunderstood. At this point in time, it seems impossible to state what observability is and how it differs from monitoring. There is so much vendor noise, as well as nonsense and misinformation regarding pillars, platforms, pipelines, unknown unknowns, and deep systems, that any talk of signals, the beginnings of a path to complex systems seeing, sensing, and steering, is likely to be ignored or discarded.

It is far easier to state what observability was before the hype than to describe and demonstrate the mammoth beast it is today, one consisting of many yesteryear technology entanglements dragged along until there is no longer a market advantage in doing so. But we should only look back if it helps us see ahead from where we currently stand. While the current sliver of a present is undoubtedly shrouded in confusion and corruption, there is much to be gained in reflection and in thinking forward. But where does one begin, and what aspect of the past would help us reflect on where things went amiss and better direct any contemplation of future possibilities?

We could look at the environment-generated problems that presented themselves at various stages along the path taken. Or the products and technologies that disrupted and dominated the mindshare of the masses over the years. Then there are the industry standards that gained some traction in the area of instrumentation and measurement – from application response measurement to more open telemetry libraries. This would also need to be framed in terms of technologies and trends that occurred elsewhere in the industry; this aspect is crucial because much of the recent effort in the observability vendor solution space is directed not at innovation but at adapting existing techniques, technologies, and tools to the cloud, containers, and new code constructs.

Instead of digging deep down into the data and details, which many observability solutions readily encourage their users to do, I will attempt to explain the past and foresee the future of observability in terms of the two conceptual elements most essential to human experience and cognition – space (locality) and time (temporality). Let’s start with space, though it is intertwined with time.

In the beginning, the machine was monitored and managed almost exclusively in isolation. The operator installed software, and the operating system and other tools and packages helped the operator monitor and troubleshoot performance problems. Most of the operator’s monitoring (now relabeled as observability) data, rendered on a display attached to the hardware, was transient apart from some logs. Data rarely left the machine except by way of printed sheets or in the operator’s mind. Of course, this is a gross simplification, but it will serve our purpose for now. What is important to note here is that there was a direct space-time connection between the human operator and machine execution. The visual perception of data was the equivalent of observability, except that it was transmitted to the human in the present and rarely stored. There was a feedback loop between human action, running a job, and the operating system’s performance indicators – the processor and disk indicator lights visible on the hardware casing.

The next phase was the client-server era, with operators now having to connect remotely to machines to view and collect logs for analysis. Human and machine were no longer collocated, so much more of the observability data was stored on the machine, though still very limited in its time window. This was still manageable with a few servers. Over time, operators implemented automation to ping each of the black boxes and collect enough information to assess some degree of state. But with sampling came data gaps.

Then came the beginnings of the cloud, along with containers and microservices-like architectures. The machines were not only remote; they were ephemeral. The observability data could not remain remotely accessible, at least not on the source machine, which was now far more divorced from the physical hardware machine of the past. The observer of observability data needed to change with the times. No longer was it an operator. In moving from local to remote, the observer had become a bunch of scripts. But with observability data less accessible and far more transient, the solution was to pull or push the data into another, far more permanent space. But there was a problem. The maturity of operational processes had not kept pace with technology in monitoring and managing far more complex system architectures. When it came to deciding what to collect, there was a lot of uncertainty, so engineering erred on the side of caution, collecting everything and anything and pushing it all down a big data pipeline without much regard for cost or value. The observability data had moved from local, to many remotes, and now to one big centralized space. This is today. To be sure, distributed tracing is a centralized solution. Distributed tracing is not distributed computing. While distributed tracing helps correlate execution across process boundaries, at the heart of it is the movement of trace and span data to some central store.
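
To make that last point concrete, here is a minimal sketch, not any vendor’s or library’s actual API, of how a span is completed locally and then shipped in its entirety to a central collector; the collector URL and field names are hypothetical. Whatever the instrumentation framework, this push to a single store is what makes distributed tracing a centralized solution.

```python
# A minimal sketch (not any vendor's or library's actual API) of why
# distributed tracing is centralized: a span is completed locally, then the
# whole record is pushed to one central collector. The collector URL and
# field names below are hypothetical.
import json
import time
import uuid
from dataclasses import dataclass, asdict, field


@dataclass
class Span:
    trace_id: str          # shared across process boundaries for correlation
    name: str
    start: float
    end: float = 0.0
    span_id: str = field(default_factory=lambda: uuid.uuid4().hex[:16])


def export(span: Span, collector_url: str = "https://collector.example.com/v1/spans") -> None:
    """Push the completed span to the central store; nothing stays on the node."""
    payload = json.dumps(asdict(span))
    print(f"POST {collector_url} {payload}")   # stand-in for the real network call


# Service A starts the trace; Service B would reuse the trace_id it receives.
span = Span(trace_id=uuid.uuid4().hex, name="checkout", start=time.time())
time.sleep(0.01)                               # the work being traced
span.end = time.time()
export(span)                                   # the data always leaves the machine
```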

The current centralized approach to observability is not sustainable. The volume of useless data is growing by the day, while our ability and capacity to make sense of it is shrinking at an alarming rate. Sensibility and significance must come back into the fray; otherwise, we are destined to wander around in a fog of data, wondering how we ever got to this place and became so lost. We need to rethink our current approach of moving data and instead look to distribute the cognitive computation of the situation, an essential concept that has been lost in all of this, back to the machines, or at least to whatever constitutes the unit of execution today. We need to relearn how to focus on operational significance: stability, systems, signals, states, scenes, scenarios, and situations. Instead of moving data and details, we should enable the communication of a collective assessment of operational status based on behavioral signals and local contextual inferencing from each computing node. The rest is noise, and any attention given to it is a waste of time and counterproductive.
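
A hedged sketch of what such local contextual inferencing might look like on a single node follows; the class name, window sizes, thresholds, and status vocabulary are illustrative assumptions, not an existing framework. The raw measurements stay on the node, and only the inferred status signal is communicated outward.

```python
# A hedged sketch of local contextual inferencing on a single node; the class
# name, window sizes, thresholds, and status vocabulary are illustrative
# assumptions, not an existing framework. Raw measurements never leave the
# node; only the inferred status signal is communicated.
from collections import deque
from enum import Enum
from statistics import mean


class Status(Enum):
    STABLE = "stable"
    DEGRADED = "degraded"
    DEVIATING = "deviating"


class LocalAssessor:
    def __init__(self, window: int = 100):
        self.latencies = deque(maxlen=window)     # raw data stays local

    def record(self, latency_ms: float) -> None:
        self.latencies.append(latency_ms)

    def assess(self) -> Status:
        """Infer a behavioral signal from local history instead of exporting detail."""
        if len(self.latencies) < 8:
            return Status.STABLE
        samples = list(self.latencies)
        recent, history = mean(samples[-4:]), mean(samples[:-4])
        if recent > 2.0 * history:
            return Status.DEVIATING
        if recent > 1.2 * history:
            return Status.DEGRADED
        return Status.STABLE


node = LocalAssessor()
for ms in [12, 11, 14, 13, 12, 15, 13, 12, 40, 45, 50, 48]:
    node.record(ms)
print(node.assess().value)                        # "deviating" is all that is sent
```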

Let us now get to the crux of many of the problems we face in managing complex systems, and, I would argue, in safeguarding the future of our world and species – time. Sorry if that all sounds very dramatic, but I firmly believe that at the heart of the major problems facing humankind is our ability to conceive and perceive time (in passing) and project forward (time-travel mentally), while falling short in fully experiencing such projections or past recollections at the same cognitive level and emotional intensity as we do the present. We stole fire from the gods, but we have yet to wield it in a less destructive and far more conserving way. We need fire to shine a light into the dark tunnel in either direction of the arrow of time. Still, we have yet to fully appreciate and accept that any (in)sight we are offered in doing so is only a scanty shimmer of what lies ahead or behind us. We are always situated in the present, and the context of the past and the consequences of the future are invariably experienced diluted and distilled. We cannot step into the same river twice. We can look forward and backward, but neither is ever truly experienced in the way the present is. Our current observability tools have yet to address this omission in the mind’s cognitive development and evolution. There can be no time travel without memory.

The graphic above is not necessarily a timeline of progression, as observability initially started with the in-the-moment experience of direct human-to-machine communication of performance-related data, when both human operator and machine were spatially collocated. That said, there was a trend that moved from the past toward the present with the introduction of near-real-time telemetry data collection over yesteryear’s logging technology. Today even near-real-time is insufficient, with organizations moving from reactive to proactive in demanding predictive capabilities. Observability deals in the past; it measures, captures, records, and collects some executed activity or generated event representing an operation or outcome. When humans consider the past, they are not thinking about metrics or logs; instead, they recall (decaying) memories of experiences. When a human operator does recall watching a metric dashboard, they do not remember the data points but instead the experience of observing. An operator might be able to recall one or two facts about the data, but these will be wrapped in the context of the episodic memory. A machine is entirely different; the past is never reconstructed in the same manner as the original execution, and the historical data does not decay naturally, though it can be purged and its precision diminished over time. Instead, there is a log file or other historical store containing callouts, signposts, metrics, or messages that allude to what has happened. An operator must make sense of the past from a list of strings.

A challenge arises when there are multiple separate historical data sources. So, at the beginning of the evolution of monitoring and observability, much of the engineering effort went into fusing data, resulting in the marketing-generated requirement of a “single pane”. Unfortunately, the fusion was simplistic and superficial; there was hardly any semantic-level integration. Instead, the much-hyped data fusion capabilities manifested merely as the juxtaposition of data tiles laid out in a dashboard devoid of a situation representation.

Dealing with time becomes a far more complex matter when shifting from the past to the present. Again, there is never really a present when it comes to observability. The movement into the present is achieved by reducing the interval between recording an observation and rendering it in some form of visual communication to an operator. Once observability moved into the near-real-time space of the present, the visualizations and underlying models changed. Instead of the listing of logs or the charting of metric samples, observability tooling concentrated more on depicting structures of networks of nodes and services along with up-to-the-minute health indicators. But as engineering teams competed further to reduce the time lag from minutes to seconds and below, other problems started to surface, particularly the difference in speeds between pulled and pushed data collection. Nowadays, modern observability pipelines are entirely push-based, which is also necessary when dealing with cloud computing dynamics and elasticity.
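
The speed difference is easy to picture with a small sketch, illustrative only, in which the scrape interval, names, and print-based emit target are assumptions: a pulled value is only as fresh as its scrape interval, whereas a pushed event leaves the source the moment it closes.

```python
# Illustrative contrast between pulled and pushed collection; the scrape
# interval, event names, and print-based emit are assumptions for the sketch.
import time


class PulledGauge:
    """A value that a collector scrapes every `interval_s` seconds."""
    def __init__(self, interval_s: float = 15.0):
        self.interval_s = interval_s
        self.value = 0.0

    def worst_case_staleness(self) -> float:
        # A change made just after a scrape waits a full interval to be seen.
        return self.interval_s


def push(event: dict) -> None:
    """Pushed collection: the source emits as soon as the event closes."""
    event["emitted_at"] = time.time()
    print("push", event)


gauge = PulledGauge(interval_s=15.0)
print("pulled value may be up to", gauge.worst_case_staleness(), "seconds stale")
push({"name": "request.completed", "duration_ms": 42})
```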

But time is still an ever-present problem. The amount of measurement data collected for each unit of event instrumentation has increased, especially when employing distributed tracing instrumentation, so much so that it has become necessary to sample, buffer, batch, and drop payloads. Under heavy load, the bloated observability data pipelines cannot keep up in their dumb transmission of payloads.
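
Below is a minimal sketch of that buffering, batching, and dropping, in which the capacity, batch size, and payload names are illustrative assumptions: a bounded queue in the pipeline batches payloads for export and, once a burst exceeds its capacity, silently drops the rest.

```python
# A minimal sketch of the buffering, batching, and dropping described above;
# capacity, batch size, and payload names are illustrative assumptions.
from collections import deque


class PipelineBuffer:
    def __init__(self, capacity: int = 4, batch_size: int = 2):
        self.queue = deque()
        self.capacity = capacity
        self.batch_size = batch_size
        self.dropped = 0

    def offer(self, payload: str) -> None:
        if len(self.queue) >= self.capacity:
            self.dropped += 1                     # the pipeline loses data silently
            return
        self.queue.append(payload)

    def flush(self) -> list:
        """Export one batch; anything beyond the batch waits for the next flush."""
        count = min(self.batch_size, len(self.queue))
        return [self.queue.popleft() for _ in range(count)]


buf = PipelineBuffer()
for i in range(10):                               # a burst the exporter cannot absorb
    buf.offer(f"span-{i}")
print("exported:", buf.flush(), "dropped:", buf.dropped)
```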

The need to send everything and anything is incompatible with keeping the experience near-real-time. In the end, we have the worst possible scenario – uncertainty about the situation and uncertainty about the quality (completeness) of the data that is meant to help us recognize the situation. Not to mention that any engineering intervention at the data pipeline level only brings us back to dealing with even more significant latency variance. You cannot have the whole cake, consume it centrally, and keep it near-real-time.

When observability vendors talk up their sub-second monitoring, they are no longer describing latency but the resolution of the data displayed, which can be seconds or even minutes old before it catches the attention of an operator. It needs to be pointed out that events can only be counted or timed after completion or closure, so if you have a trace call lasting longer than a few seconds, it is not correct to consider the dashboard a near-real-time view, even if you were to somehow magically alter the physics elsewhere in the data and processing pipeline. If near-real-time is the thing most desired, then an event must be decomposed into smaller events.
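
One way to picture that decomposition is the sketch below, written under assumed names, with the print-based emit() as a stand-in for a live view: a long-running operation publishes started, progress, and completed events instead of a single event that only exists once it has closed.

```python
# A sketch of decomposing a long-running operation into smaller events, so a
# view can update before the whole operation closes. The event names and the
# print-based emit() target are hypothetical stand-ins for a live view.
import time
import uuid


def emit(event: dict) -> None:
    print(event)                                  # stand-in for sending to a live view


def long_operation(steps: int = 3) -> None:
    op_id = uuid.uuid4().hex[:8]
    emit({"op": op_id, "phase": "started", "at": time.time()})
    for i in range(steps):
        time.sleep(0.01)                          # one slice of the real work
        emit({"op": op_id, "phase": "progress", "step": i + 1, "at": time.time()})
    emit({"op": op_id, "phase": "completed", "at": time.time()})


long_operation()
```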

What do you do when it is impossible to experience the present in the present? You cheat by skipping ahead of time to predict what is coming next, a next that has probably already happened but has yet to be communicated to you. Here we anticipate a change in the current situation, possibly to one that is far more problematic. Unfortunately, this is a pipe dream with the current approach taken by observability, which is far too focused on data and detail in the form of traces, metrics, and (event) logs. These are not things that are easy to predict in themselves. No solution will predict the occurrence of one of these phenomena, and none should. Such phenomena will happen naturally and at scale in large quantities, but what does that tell us? Nothing, when the data we use for analysis is too far removed from what is of significance. By not solving the problem at the source with local sensing, signaling, and status inference, we made it impossible to experience the present in the moment. The natural workaround for such a time lag, prediction, is not suitable for the type of data being transmitted. That has not stopped vendors from claiming to offer machine learning and artificial intelligence. But in reality, and much like some current AI approaches, it is increasingly looking like a dead end as we try to scale cognitive capacities to rapidly rising system complexities. We can expect metric trendlines and thresholds to trigger escalating warnings, a lot of effort for not much reward. It is hard to imagine where we go from here.
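
For what it is worth, the kind of trendline-and-threshold prediction being questioned here amounts to little more than the following sketch, in which the sample values, threshold, and horizon are illustrative assumptions: fit a straight line to recent metric samples and warn when the extrapolation crosses a threshold.

```python
# A hedged sketch of trendline-and-threshold prediction; the samples,
# threshold, and horizon are illustrative assumptions, not a real workload.
def linear_trend(samples: list) -> tuple:
    """Least-squares slope and intercept over equally spaced samples."""
    n = len(samples)
    x_mean = (n - 1) / 2
    y_mean = sum(samples) / n
    num = sum((x - x_mean) * (y - y_mean) for x, y in enumerate(samples))
    den = sum((x - x_mean) ** 2 for x in range(n))
    slope = num / den
    return slope, y_mean - slope * x_mean


def projected_breach(samples: list, threshold: float, horizon: int) -> bool:
    """Extrapolate `horizon` steps ahead and warn if the threshold is crossed."""
    slope, intercept = linear_trend(samples)
    projected = slope * (len(samples) - 1 + horizon) + intercept
    return projected >= threshold


cpu = [55, 58, 61, 65, 70, 74]                    # percent utilization samples
print(projected_breach(cpu, threshold=90, horizon=5))   # True: the warning fires
```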

The low-level data being captured in volume by observability instruments has blinded us to salient change. We’ve built a giant wall of white noise. The human mind’s perception and prediction capabilities evolved to detect changes that had significance to our survival. Observability has no such steering mechanism to guide effective and efficient measurement, modeling, and memory processes.

Companies are gorging on ever-growing mounds of collected observability data that should be of secondary concern and far less costly. Perception, action, and attention are tightly integrated within the human mind. Yet we see no consideration of how controllability can be employed and cognition constructed when looking at what observability is today. It is a tall order to ask machine learning to deliver on the hype of AIOps by feeding a machine non-curated sensory data and expecting some informed prediction of value. Where are the prior beliefs to direct top-down inference when awareness and assessment of a situation are completely absent from the model? How can a machine of some intelligence readily communicate with a human when there is no standard conceptual model to support knowledge transfer in either direction? Suppose a prediction is to be made by artificial intelligence in support of human operators. In that case, we need the reasoning to be explained and, more importantly, the ability to continuously train the prediction (inference) engine when it misses the mark. There are no answers to these questions from the point we are at; they have not even been considered.

When I began writing this article, I intended to explore at length the concept of projection in observability, especially to make a much clearer distinction between it and prediction. That exploration will need to be put on hold and materialized in a future post dedicated to the topic. But before closing, I would like to state that in the past (2013) I claimed that simulation would eventually be the future of observability. Now I see it as being projection (of a situation), with simulation being a possible means of exploring scenarios.