Numerous initiatives around Observability, sometimes referred to as Visibility in the business domain, fail to meet expectations due to engineers naively expecting that once data is being collected, all that needs to be done is to put up a dashboard before sitting back to stare blankly at large monitoring screens hoping for a signal to emerge from the pixels rendered magically. This is particularly so when users blindly adopt Grafana and Prometheus projects where data and charts have replaced, or circumvented, genuine understanding through patterns, structures, and models. This is an anti-pattern that seems to repeat consistently at organizations with insufficient expertise and experience in systems dynamics, situation awareness, and resilience engineering. Once the first data-laden dashboard is rolled out to management for prominent display within an office, it seems like the work is all but done other than to keep creating hundreds of more derivatives of the same ineffective effort. Little regard is ever again given to the system, its dynamics, and situations arising. Many projects fail in thinking, more so acting, that they can leap from data to dashboard in one jump.
This is not helped by many niche vendors talking up “unknown unknowns” and “deep systems,” which is more akin to giving someone standing on the tip of an iceberg a shovel and asking them to dig away at the surface. There is nothing profound or fulfilling to be found doing so other than discovering detail after detail and never seeing the big picture that pertains to the system moving and changing below the surface of visibility that comes from event capture that is not guided by knowledge or wisdom. The industry has gone from being dominated by a game of blame to fear, which shuts off all (re)consideration of effectiveness.
I suspect much of the continued failings in the Observability space centers around the customary referencing and somewhat confused understanding of the Knowledge (DIKW) Hierarchy. Many “next-generation” application performance monitoring (and observability) product pitches or roadmaps roll out this pyramid graphic below, explaining how they will first collect all this data, lots of it from numerous sources, and then whittle it down to knowledge throughout the company’s remaining evolution and product development.
What invariably happens is that the engineering teams get swamped by maintenance effort around data and pipelines and the never-ceasing battle to keep instrumentation kits and extensions up-to-date with changes in platforms, frameworks, and libraries. By the time some small window of stability opens up, the team has lost sight of the higher purpose and bigger picture. In a moment of panic, the team slaps on a dashboard and advanced query capabilities in a declaration of defeat by delegating effort to users. Naturally, this defecating defeat is marketed as a win for users. This sad state of affairs comes about because of seeing the hierarchy as a one-way ladder of understanding. From data, the information will emerge; from information, the knowledge will emerge, etc. Instead of aiming for vision all too often, it is data straight to visualizations. The confusion is thinking this is a bottom-up approach, whereas the layers above steer, condition, constrain the layers below by way of the continuous adaptive and transforming process. Each layer here is framing the operational context of lower layers – direct and indirectly. A vision for an “intelligent” solution comes from values and beliefs; this then contextualizes wisdom and, in turn, defines the goals that frame knowledge exploration and acquisition processing.
For knowledge to spring forth from information, various (mental) models must be selected; a selection aligned to the overarching goals. It is here where I firmly believe we have lost our way as an engineering profession. If we can call them that, our models are too far removed from purpose, goal, and context. We have confused a data storage model of trace trees, metrics, log records, and events, as a model of understanding. In the context of Observability, an example of a goal in deriving wisdom would be to obtain intelligent near-real-time situation awareness over a large connected, complex, and continually changing landscape of distributed services. Here, understanding via a situation model must be compatible and conducive to cooperative work performed by both machines and humans. Ask any vendor to demonstrate a situation’s representation, and all you will get is a dashboard with various jagged lines automatically scrolling. Nowhere to be found are signals and states, essential components of a past, present, and unfolding situation.
There is never knowledge without a model acting as a lens and filter, an augmentation of our senses and reasoning, defining what it is that is of importance – the utility and relevance of information in context. There is never information without rules, shaped by knowledge, extracting, collecting, and categorizing data. Data and information are not surrogates for a model. Likewise, a model is not a Dashboard, one built lazily and naively on top of a lake of data and information. A dashboard and many metrics, traces, and logs that come with it are not what constitutes a situation. A situation is formed and shaped by changing signals and states of structures and processes within an environment of nested contexts (observation points of assessment) – past, present, and predicted.
Models are critical when it comes to grasping at understanding in a world of increasing complexity. A model is a compact and abstract representation of a system under observation and control that facilitates conceptualization and communication about its structure and, more importantly, dynamics. Modeling is a simplification process that helps focus attention on what is of significance for higher-level reasoning, problem-solving, and prediction. Suitable models (of representation in structure and form) are designed and developed through abstraction and the need to view a system from multiple perspectives without creating a communication disconnect for all involved. Coherence is an essential characteristic of a model, as is conciseness and context. Unfortunately, introducing a model is not always as easy as a task, as it might look on paper if the abstraction does not pay off in terms of significant simplification and a shift in focus to higher levels of value. For example, Instana, where I was recently a Complexity Scientist, had some troubles in convincing many of those coming from an OpenTelemetry background that their abstraction of a
Span served a useful purpose.
This mismatch between what a developer conceptualizes at the level of instrumentation and what is presented within the tooling, visualizations, and interfaces is seen as an inconvenience – an inconvenient truth stemming from an industry that does far too much selling of meme-like nonsense and yesteryear thinking and tooling than educating in theory and practice. A focus on systems and dynamics needs to win over data and details to get back to designing and building agile, adaptive, and reliable enterprise systems.
In my current position at PostNL, I’m helping to design and develop a Control Tower for an ambitious Digital Supply Chain Platform. Depending on the domain and perspective taken, there are different possible models – objects (parcels), processes (sorting), and flows (transport). Still, it all comes down to transforming data at the sensory measurement level up into structures and then behavior along with affordance at increasing levels of abstractions, compression, and comprehension. While a Control Tower could readily track every small detail of a parcel’s movement, it would not effectively and efficiently understand the dynamics that emerge at the system level across many cooperating agents within such a highly interconnected network where promises, related to resource exchanges, are made, monitored, and adjusted in the event of disruptions. One agent’s (human or machine) model is another one’s raw data.
Aside: While it is hard not to see the importance of parcel tracking within a supply chain, at least from a customer perspective, I’ve still to have someone offer up a valid justification for distributed tracing in its current form over the approach OpenSignals takes.