Let’s consider that the primary purpose of a monitoring dashboard is to bring immediate awareness of the situation to the viewer – to call attention to the current state of affairs, and to support prediction and projection of the scene in which situations exist and unfold over time. When you look at today’s dashboards built with Grafana, with a metric backend such as Prometheus sourcing the data, it is clear we are so far removed from the situation that it is no wonder things fail, and spectacularly, while being watched over by operators at incredible levels of data and detail. Engineers are drowning in data and hundreds of data displays. It seems we are destined for some time to continue digging away at a mound of data that is growing at unprecedented levels, unable to lift our heads above the fog and take stock of the dysfunctional system we have allowed to develop under our watch. We have lost all sense of system stability.
Now step back and consider what it means to observe, monitor, control, and manage services. Boiling this down to the fundamentals, we are primarily concerned with the status of a service or a system (set) of services. The status of a service is something we infer from the sequence of signals emitted or received in a service-to-service interaction (exchange). In this article, we will not go into how status is determined from signaling. It is far more important for us to consider how to reason about and visually represent the status of a service from different perspectives – the services (callers) dependent on the service (ingress), the service itself (transit), and the services (callees) the service depends on (egress). It is with a focus on status that we begin to glimpse situations.
Below is a simple table model proposed for effective and efficient rendering of a service or system of services. It should be the starting point for any service-level monitoring and management dashboard. Each row in the table represents a possible service status value, starting with OK and ending with DOWN – from best case to worst case. The columns represent the three perspectives on service status quality one generally takes in a service supply chain. In the first column, ingress, we look at the service under management from the callers’ perspective: how do the callers view the quality of service delivered from their end? The second column represents how the service itself, via its own signaling, views its service quality. Finally, the third column represents how the service views the quality of the services it depends on. In the table, we can see that the situation is stable (over some time period) in that all perspectives (columns) report their values with an OK status (row).
You might now ask what the numbers themselves represent. In most microservice architectures, a service has multiple endpoints or operations it offers to callers. You usually get the CREATE (POST), READ (GET), UPDATE (PUT), and DELETE operations exposed per domain entity managed by a service. So the numbers could represent the number of (logical) endpoints interacted with. Though several dimensions could be tallied within the table cells, we will keep to counting the endpoints exposed. The service in question, represented by the grid model, has 10 exposed endpoints. Each of these is deemed by the service itself to be operationally OK. From the perspective of callers, all 10 endpoints are also operationally OK. The number in the rightmost column represents the number of endpoints that the service itself interacts with. These egress endpoints could belong to just one service or to several. Services in the middle of a service supply chain will usually fan out requests to other services.
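To make the grid model concrete, here is a minimal sketch of it as a data structure. Everything here is illustrative: the `Status` and `Perspective` names are my own, and the egress endpoint count of 12 is an assumed value, since the article fixes only the 10 exposed endpoints.

```python
from enum import Enum

class Status(Enum):
    """Service status values, ordered best case to worst case."""
    OK = 0
    DEVIATING = 1
    DOWN = 2

class Perspective(Enum):
    """The three vantage points in a service supply chain."""
    INGRESS = "ingress"   # callers' view of the service
    TRANSIT = "transit"   # the service's view of itself
    EGRESS = "egress"     # the service's view of its callees

def empty_grid() -> dict:
    """A cell holds the number of endpoints at a given status, per perspective."""
    return {(s, p): 0 for s in Status for p in Perspective}

# The stable situation: all 10 exposed endpoints OK from both the callers'
# (ingress) and the service's own (transit) perspective; all callee
# endpoints OK from the egress perspective (12 is an invented count).
grid = empty_grid()
grid[(Status.OK, Perspective.INGRESS)] = 10
grid[(Status.OK, Perspective.TRANSIT)] = 10
grid[(Status.OK, Perspective.EGRESS)] = 12

def render(grid: dict) -> str:
    """Render the grid as a small status table, rows = status, columns = perspective."""
    header = f"{'':10}" + "".join(f"{p.value:>10}" for p in Perspective)
    rows = [header]
    for s in Status:
        cells = "".join(f"{grid[(s, p)]:>10}" for p in Perspective)
        rows.append(f"{s.name:10}" + cells)
    return "\n".join(rows)

print(render(grid))
```

The whole situational surface of a service fits in nine cells – that compactness, not the particular rendering, is the point.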
The strength of the simplicity introduced above really shines when it comes to disruptions that spread throughout a network (or chain) of interconnected services and components. In the table above, the situation has changed, with 5 of the egress (callee) endpoints now assessed as deviating from the perspective of the service under consideration. It is important to note that this does not mean the endpoints are also seen as deviating by other services that interact with the same endpoints.
The table above shows the unfolding situation now spreading, with the service itself reporting that 8 of its exposed endpoints are considered deviating from normal operations. At this point, the callers have not sensed it. The more resilient the service, the longer the window before the status change propagates. It does not matter whether the 8 deviating endpoints of the service do, in fact, call out to the 5 deviating egress endpoints, because the influence can be indirect, especially when working state (memory, files) and resources (pools, caches) are reused and shared across requests from multiple service clients.
After a period of time, the situation has spread further afield, with callers of the service now observing that 8 of the endpoints are indeed deviating. We have kept things rather simple in making the totals the same in both the ingress and transit columns. The numbers need not be equal; it depends on how sensitive the callers are and how quickly the deviation is detected. We could easily have a value of 10 listed in both the OK and DEVIATING rows if there were configuration differences in the caller population. For this reason, we should not place too much emphasis on what the values in the cells constitute. What is far more important is understanding what the grid representation reveals about the situation and enclosing scene, particularly the dynamics of change across interactions.
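Sequencing such grid snapshots as frames makes the propagation direction itself computable. A small illustrative sketch, using the counts from the narrative above (the frame timing and names are invented):

```python
# Each frame is a snapshot of the grid at a point in time,
# keyed by (status, perspective) -> endpoint count.
OK, DEVIATING = "OK", "DEVIATING"
INGRESS, TRANSIT, EGRESS = "ingress", "transit", "egress"

frames = [
    # t0: stable - every perspective reports OK.
    {(OK, INGRESS): 10, (OK, TRANSIT): 10, (OK, EGRESS): 12},
    # t1: 5 callee endpoints deviate in this service's egress view.
    {(OK, INGRESS): 10, (OK, TRANSIT): 10,
     (OK, EGRESS): 7, (DEVIATING, EGRESS): 5},
    # t2: the service itself now reports 8 of its endpoints deviating.
    {(OK, INGRESS): 10, (OK, TRANSIT): 2, (DEVIATING, TRANSIT): 8,
     (OK, EGRESS): 7, (DEVIATING, EGRESS): 5},
    # t3: callers finally observe 8 endpoints deviating.
    {(OK, INGRESS): 2, (DEVIATING, INGRESS): 8,
     (OK, TRANSIT): 2, (DEVIATING, TRANSIT): 8,
     (OK, EGRESS): 7, (DEVIATING, EGRESS): 5},
]

def spread(frames):
    """List perspectives in the order they first report a deviation -
    the direction the disruption propagates through the supply chain."""
    seen, order = set(), []
    for frame in frames:
        for (status, perspective), count in frame.items():
            if status == DEVIATING and count > 0 and perspective not in seen:
                seen.add(perspective)
                order.append(perspective)
    return order

print(spread(frames))  # egress first, then transit, then ingress
```

Read in sequence, the frames show the disruption flowing upstream, from callees through the service to its callers; no sparkline wall is needed to see that.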
What if, instead of scanning hundreds of sparklines plotted across a data-board, engineers only needed to learn how to recognize situations by the visual patterns exhibited by the service-level grid model? This would be the very beginning of a new direction for the observability and monitoring industry. Each of the grids below has something to show and inform, a snapshot at some point in time, which can then be further extended when such frames are sequenced. We would finally get back to observing system and service dynamics at a scale that is understandable, meaningful, and explainable to humans. From there, controllability becomes practical.
Observability is a hot topic in software and system engineering circles even though it is poorly defined and largely misunderstood. At this point in time, it would seem impossible to state what observability is and how it differs from monitoring. There is so much vendor noise, as well as nonsense and misinformation regarding pillars, platforms, pipelines, unknown unknowns, and deep systems, that any talk of signals, the beginnings of a path to complex systems seeing, sensing, and steering, is likely to be ignored or discarded.
It is far easier to state what observability was before the hype than to describe and demonstrate the mammoth beast it is today, consisting of many yesteryear technology entanglements dragged along until there is no longer a market advantage in doing so. But we should only look back if it helps us see ahead from where we currently stand. While the current sliver of a present is undoubtedly shrouded in confusion and corruption, there is much to be gained in reflection and thinking forward. But where does one begin, and what aspect of the past would help us reflect on where things went amiss and better direct any contemplation of future possibilities?
We could look at the environment-generated problems that presented themselves at various stages along the path taken. Or the products and technologies that disrupted and dominated the mindshare of the masses over the years. Then there are the industry standards that gained some traction in the area of instrumentation and measurement – from application response measurement to more open telemetry libraries. This would also need to be framed in terms of technologies and trends that occurred elsewhere in the industry; this aspect is crucial because much of the recent effort in the observability vendor solution space is directed not at innovation but at adapting existing techniques, technologies, and tools to the cloud, containers, and new code constructs.
Instead of digging ourselves deep down into the data and details, which many observability solutions readily encourage their users to do, I will attempt to explain the past and foresee the future of observability in terms of the two most essential conceptual elements to human experience and cognition – space (locality) and time (temporality). Let’s start with space, though it is intertwined with time.
In the beginning, the machine was monitored and managed nearly exclusively in isolation. The operator installed software, and the operating system and other tools and packages helped the operator monitor and troubleshoot performance problems. Most of the operator’s monitoring (now relabeled as observability) data, rendered on a display attached to the hardware, was transient, apart from some logs. Data rarely exited the machine except by way of printed sheets or in the operator’s mind. Of course, this is a gross simplification, but it will serve our purpose for now. What is important to note here is that there was a direct space-time connection between the human operator and machine execution. The visual perception of data was the equivalent of observability, except that it was transmitted to the human in the present and rarely stored. There was a feedback loop between human action, running a job, and the operating system’s performance indicators – the processor and disk lights visible on the hardware casing.
The next phase was the client-server era, with operators now having to connect remotely to machines to view and collect logs for analysis. Human and machine were no longer collocated, so much more of the observability data was stored on the machine, though still within a very limited time window. It was still manageable with a few servers. Over time, operators implemented automation to ping each of the black boxes and collect enough information to assess some degree of state. But with sampling came data gaps.
Then came the beginnings of the cloud, along with containers and microservice-like architectures. The machines were not only remote; they were ephemeral. The observability data could not remain remotely accessible, at least not on the source machine, which was now far more divorced from the physical hardware machine of the past. The observer of observability data needed to change with the times. No longer was it an operator; in moving from local to remote, the observer had become a bunch of scripts. But with observability data less accessible and far more transient, the solution was to pull or push the data into another, far more permanent space. But there was a problem. The maturity of operational processes had not kept pace with technology in monitoring and managing far more complex system architectures. So when it came to deciding what to collect, there was a lot of uncertainty, and engineers erred on the side of caution, collecting everything and anything and pushing it all down a big data pipeline without much regard for cost or value. The observability data had moved from local to many remotes and now to one big centralized space. This is today. To be sure, distributed tracing is a centralized solution. Distributed tracing is not distributed computing. While distributed tracing helps correlate execution across process boundaries, at the heart of it is the movement of trace and span data to some central store.
The current centralized approach to observability is not sustainable. The volume of useless data is growing by the day, while our ability and capacity to make sense of it is shrinking at an alarming rate. Sensibility and significance must come back into the fray; otherwise, we are destined to wander in a fog of data, wondering how we ever got to this place, so lost. We need to rethink our current approach to moving data and instead look to distribute the cognitive computation of the situation, an essential concept that has been lost in all of this, back to the machines, or at least to what constitutes the unit of execution today. We need to relearn how to focus on operational significance: stability, systems, signals, states, scenes, scenarios, and situations. Instead of moving data and details, we should enable the communication of a collective assessment of operational status based on behavioral signals and local contextual inferencing from each computing node. The rest is noise, and any attention given to it is a waste of time and counterproductive.
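What might communicating a collective assessment of operational status, rather than raw measurements, look like in code? A hypothetical sketch follows; the class name, the window size, and the thresholds are all invented for illustration. The point is that only the locally inferred status crosses the node boundary, not the underlying samples.

```python
from collections import deque

class LocalStatusAssessor:
    """Illustrative sketch: a node infers its own operational status from a
    sliding window of recent interaction outcomes (success/failure per call)
    and emits only that compact assessment, not the raw measurements."""

    def __init__(self, window: int = 100):
        # Bounded memory: only the most recent `window` signals are retained.
        self.signals = deque(maxlen=window)

    def record(self, succeeded: bool) -> None:
        self.signals.append(succeeded)

    def status(self) -> str:
        # Invented thresholds; real inference would be contextual and adaptive.
        if not self.signals:
            return "OK"
        failure_rate = 1 - sum(self.signals) / len(self.signals)
        if failure_rate > 0.5:
            return "DOWN"
        if failure_rate > 0.05:
            return "DEVIATING"
        return "OK"

node = LocalStatusAssessor()
for _ in range(90):
    node.record(True)
for _ in range(10):
    node.record(False)
print(node.status())  # one status word crosses the wire, not 100 samples
```

A hundred data points collapse into a single word of operational significance at the source; the central space then only needs to reason over statuses, not store samples.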
Let us now get to the crux of many problems we face in managing complex systems, and, I would argue, in safeguarding the future of our world and species – time. Sorry if that sounds dramatic, but I firmly believe that at the heart of the major problems facing humankind is our ability to conceive and perceive time (in passing) and to project forward (mentally time-travel), yet fall short in fully experiencing such projections or past recollections at the same cognitive level and emotional intensity as the present. We stole fire from the gods, but we have yet to wield it in a less destructive and far more conserving way. We need fire to shine a light into the dark tunnel in either direction of the arrow of time. Still, we have yet to fully appreciate and accept that any (in)sight offered in doing so is only a scanty shimmer of what lies ahead or behind us. We are always situated in the present, and the context of the past and the consequences of the future are invariably experienced diluted and distilled. We cannot step into the same river twice. We can look forward and backward, but neither is ever truly experienced in the way the present is. Our current observability tools have yet to address this omission in the mind’s cognitive development and evolution. There can be no time travel without memory.
The graphic above is not necessarily a timeline of progression, as observability initially started with the in-the-moment experience of direct human-to-machine communication of performance-related data, when both human operator and machine were spatially collocated. That said, there was a trend that moved from the past toward the present with the introduction of near-real-time telemetry data collection over yesteryear’s logging technology. Today even near-real-time is insufficient, with organizations moving from reactive to proactive and demanding predictive capabilities. Observability deals in the past; it measures, captures, records, and collects some executed activity or generated event representing an operation or outcome. When humans consider the past, they are not thinking about metrics or logs; instead, they recall (decaying) memories of experiences. When a human operator recalls watching a metric dashboard, they do not remember the data points but the experience of observing. An operator might be able to recall one or two facts about the data, but these will be wrapped in the context of the episodic memory. A machine is entirely different: the past is never reconstructed in the same manner as the original execution, and the historical data does not decay naturally, though it can be purged and its precision diminished over time. Instead, there is a log file or other historical store containing callouts, signposts, metrics, or messages that allude to what has happened. An operator must make sense of the past from a list of strings.
A challenge arises when there are multiple separate historical data sources. So, at the beginning of the evolution of monitoring and observability, much of the engineering effort went into fusing data, resulting in the marketing-generated requirement of a “single pane”. Unfortunately, the fusion was simplistic and superficial; there was hardly any semantic-level integration. Instead, the much-hyped data fusion capabilities manifested merely as the juxtaposition of data tiles laid out in a dashboard devoid of a situation representation.
Dealing with time becomes a far more complex matter when shifting from the past to the present. Again, there is never really a present when it comes to observability. The movement into the present is achieved by reducing the interval between recording an observation and rendering it in some form of visual communication to an operator. Once observability moved into the near-real-time space of the present, the visualizations and underlying models changed. Instead of listing logs or charting metric samples, observability tooling concentrated more on depicting structures of networks of nodes and services, along with up-to-the-minute health indicators. But as engineering teams competed to reduce the time lag from minutes to seconds and below, other problems started to surface, particularly the difference in speed between pulled and pushed data collection. Nowadays, modern observability pipelines are entirely push-based, which is also necessary when dealing with cloud computing dynamics and elasticity.
But time is still an ever-present problem. The amount of measurement data collected for each instrumented event has increased, especially when employing distributed tracing instrumentation, making it necessary to sample, buffer, batch, and drop payloads. Under heavy load, the bloated observability data pipelines cannot keep up in their dumb transmission of payloads.
The need to send everything and anything and the need to keep the experience near-real-time are incompatible. In the end, we have the worst possible scenario – uncertainty about the situation and uncertainty about the quality (completeness) of the data that is meant to help us recognize the situation. Not to mention that any engineering intervention at the data pipeline level only brings us back to dealing with even greater latency variance. You cannot have the whole cake, consume it centrally, and keep it near-real-time.
When observability solutions talk up their sub-second monitoring, they are not describing latency anymore but the resolution of the data displayed, which can be seconds or even minutes old before it catches the attention of an operator. It needs to be pointed out that events can only be counted or timed after completion or closure, so if a traced call lasts longer than a few seconds, it is not correct to consider the dashboard a near-real-time view, even if you were to somehow magically alter the physics elsewhere in the data and processing pipeline. If near-real-time is what is most desired, then an event must be decomposed into smaller events.
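To make that constraint concrete: an operation that is only measured at closure is invisible while in flight. A hedged sketch of decomposing it into smaller begin/end signals follows; the function and signal names are invented for illustration, not taken from any real instrumentation kit.

```python
import time
from typing import List, Tuple

# Hypothetical signal log: (timestamp, operation, phase).
emitted: List[Tuple[float, str, str]] = []

def signal(operation: str, phase: str) -> None:
    """Emit a small signal immediately rather than one timing after closure."""
    emitted.append((time.monotonic(), operation, phase))

def long_running_call() -> None:
    signal("checkout", "START")  # observable the moment work begins
    # ... work happens here; with only a completion-time metric, nothing
    # would be observable until this returns, however long that takes ...
    signal("checkout", "STOP")   # closure; a duration can now be derived

long_running_call()
phases = [phase for (_, _, phase) in emitted]
print(phases)
```

With a single duration metric, a call that hangs for minutes contributes nothing to the dashboard until it ends; with decomposed signals, the START is already there, and the absence of a STOP is itself informative.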
What do you do when it is impossible to experience the present in the present? You cheat by skipping ahead of time to predict what is coming next – a next that has probably already happened but has yet to be communicated to you. Here we anticipate the current situation changing, possibly into one that is far more problematic. Unfortunately, this is a pipe dream with the current approach taken by observability, which is far too focused on data and detail in the form of traces, metrics, and (event) logs. These are not things that are easy to predict in themselves. No solution will predict the occurrence of one of these phenomena, and none should. Such phenomena will happen naturally and at scale in large quantities, but what does that tell us? Nothing, when the data we use for analysis is too far removed from what is of significance. By not solving the problem at the source with local sensing, signaling, and status inference, we have made it impossible to experience the present in the moment. Prediction, the natural workaround for such a time lag, is not viable with the type of data being transmitted. That has not stopped vendors from claiming to offer machine learning and artificial intelligence. But in reality, and much like some current AI approaches, it increasingly looks like a dead end as we try to scale cognitive capacities to rapidly rising system complexities. We can expect metric trendlines and thresholds to trigger escalating warnings – a lot of effort for not much reward. It is hard to imagine where we go from here.
The low-level data being captured in volume by observability instruments has blinded us to salient change. We’ve built a giant wall of white noise. The human mind’s perception and prediction capabilities evolved to detect changes that had significance to our survival. Observability has no such steering mechanism to guide effective and efficient measurement, modeling, and memory processes.
Companies are gorging on ever-growing mounds of collected observability data that should be of secondary concern and far less costly. Perception, action, and attention are deeply integrated within the human mind. Yet we see no consideration of how controllability can be employed and cognition constructed when looking at what observability is today. It is a tall order to ask machine learning to deliver on the hype of AIOps by feeding a machine non-curated sensory data and expecting some informed prediction of value. Where are the prior beliefs to direct top-down inference when awareness and assessment of a situation are completely absent from the model? How can a machine of some intelligence readily communicate with a human when there is no common conceptual model to support knowledge transfer in either direction? Suppose a prediction is to be made by artificial intelligence in support of human operators. In that case, we need the reasoning to be explained and, more importantly, the ability to continuously retrain the prediction (inference) engine when it misses the mark. From where we stand, there are no answers to these questions; they have not even been considered.
When I began writing this article, I intended to explore at length the concept of projection in observability, especially to make a much clearer distinction between it and prediction. That exploration will need to be put on hold and materialize in a future post dedicated to the topic. But before closing, I would like to note that in the past (2013), I claimed that simulation would eventually be the future of observability. Now I see it as projection (of a situation), with simulation being a possible means of exploring scenarios.
Today, companies face mounting pressure to demonstrate both speed and agility in an ever-changing and increasingly competitive environment. Information technology is seen as a critical enabler in adjusting to market shifts, threats, and rising customer expectations while evolving and improving offered services. For an organization to improve its capability to change, the underlying systems supporting it must also change, and generally at a much faster rate, to connect the past, present, and predicted future coherently with some degree of continuity. While the business focuses on charting a course from one change point to another on a timeline of services and market evolution, the computing infrastructure and hapless engineering teams must deal with the only thing worse than change itself: the transition period between each of these points, moving from a discrete view of the world to one that is continuous and complicated. Not to worry – engineers believe they have everything pretty much figured out: make smaller changes but faster, much like a fighter pilot using the OODA loop to get inside the loop of the enemy. Here the opponent is the business, and engineering is proactively maneuvering, anticipating re-orientations, and dictating the engagement. That is, until the question is raised – how does one observe and orient with an ever-growing pile of low-level data hooked up to a bunch of dashboards?
The OODA loop places emphasis on two critical factors within an environment – time constraints and information uncertainty. The time factor is addressed by executing the loop as fast as possible. Information uncertainty is tackled by acting accurately. The typical presentation of the model, depicted below, is popular because it closes the loop between sensing (observe and orient) and acting (decide and act). In the Observe phase, the focus is on data acquisition and information synthesis about the environment and the unfolding situation and interactions. The goal of the Orient phase, which follows, is to make sense of the collected observations from an operational viewpoint. This understanding of the situation, and of the potential scenarios that may follow from this point in time, is highly dependent on the expertise and experience of the observers – situation assessors and decision-makers. The next step is the Decide phase, in which information fed from the Orient phase determines the appropriate action(s). Finally, the Act phase is where the course of action decided upon is implemented. The cycle repeats with further observations.
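The four phases can be caricatured in a few lines of code. This is a deliberately toy rendition – the functions, the error-rate threshold, and the `shed_load` action are all invented – but it shows the closed loop in which acting changes what the next cycle observes.

```python
def observe(environment: dict) -> dict:
    """Observe: acquire raw signals from the environment."""
    return {"error_rate": environment["error_rate"]}

def orient(observation: dict) -> str:
    """Orient: make operational sense of the observation - here a crude
    situation assessment; real orientation draws on experience and models."""
    return "DEGRADED" if observation["error_rate"] > 0.05 else "STABLE"

def decide(situation: str) -> str:
    """Decide: choose a course of action for the assessed situation."""
    return "shed_load" if situation == "DEGRADED" else "no_action"

def act(environment: dict, action: str) -> None:
    """Act: intervene, changing the environment the next cycle observes."""
    if action == "shed_load":
        environment["error_rate"] /= 2  # illustrative effect of the intervention

environment = {"error_rate": 0.2}
for _ in range(3):  # the cycle repeats with further observations
    action = decide(orient(observe(environment)))
    act(environment, action)

print(environment["error_rate"])
```

Even this toy makes the model's gaps visible: nothing here represents goals, memory, or attention, and the loop grinds on sequentially whether or not the situation warrants it – exactly the criticisms raised below.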
However, there are problems with the OODA model. It does not detail how later phases steer, influence, and, more specifically, regulate earlier phases, and vice versa – invariably, it is seen and described as sequential, without the ability to exit prematurely and then re-enter. It also omits attention and memory, and the cognitive representation of world states and models. It lacks any deliberate planning and learning phases. The OODA model is broad in its description of the decision-making process and, other than listing some of the factors pertinent to the Orient phase, offers very little in the way of how to implement it. Mindless looping.
The OODA model’s biggest issue is that it does not capture the encompassing goal and objectives, making the loop reactive rather than proactive. The model appeals to one of the worst trends within software engineering and service operations – big data addiction. Here, effective operations management and the decision-making it entails are seen as merely a problem of insufficient data collection and information construction. Unfortunately, expanding the capacity to transmit more and more data to the cloud has not improved situation awareness; in fact, it seems to have made it more difficult, if not impossible. It’s not just simple; it’s simplistic.
OODA and Big Data do not reflect much of how human perception and cognition work to direct attention and interpret sensory signals. OODA, and much of the work currently ongoing in the observability space, incorrectly assumes that engineering is principally passively reacting to environment-sourced events – nowhere is this more exemplified than in the design of data-laden monitoring dashboards.
Successful service operations and management in dynamic environments enclosing highly complex systems depend on maintaining a focus on clear goals at various levels of composition and on planning how to achieve and maintain them. To some degree, OODA and Big Data approach human-and-machine cognition with a simplistic, mechanistic, and data-centric viewpoint, completely ignorant of situations and scenes, intentions and inferences, signals and sequences, services and states – devoid of patterns and models that could help more effectively direct both human and computer attention in assessing current conditions, predicting future events, and tracking the results of scripted and curated system interventions. Data, data everywhere, and not a situation to be recognized.
Let’s stop for a moment and ask ourselves a question – and I am referring to our collective here. How does one even begin to imagine that a site reliability engineer (SRE) can go from hundreds, if not thousands, of distinct trace paths, metrics, and log patterns to a formulation of the current situation that can be compared with past prototypical patterned situations? And if you are naive enough to believe that machine learning will solve this issue, then ask yourself how anyone, human or machine, can honestly and with some degree of certainty predict from such contextless data the transitions between situations and explain them to an engineer in a communicable form. Cognition is situated, and yet the industry keeps offering up solutions, with even bigger problems, that further disconnect us from the situation – all while we slowly sink in a quicksand pit of data that makes it near impossible to act.
Numerous initiatives around observability, sometimes referred to as visibility in the business domain, fail to meet expectations because engineers naively expect that once data is being collected, all that needs to be done is to put up a dashboard before sitting back to stare blankly at large monitoring screens, hoping for a signal to magically emerge from the rendered pixels. This is particularly so when users blindly adopt the Grafana and Prometheus projects, where data and charts have replaced, or circumvented, genuine understanding through patterns, structures, and models. This is an anti-pattern that repeats consistently at organizations with insufficient expertise and experience in system dynamics, situation awareness, and resilience engineering. Once the first data-laden dashboard is rolled out to management for prominent display within an office, the work seems all but done, other than creating hundreds more derivatives of the same ineffective effort. Little regard is ever again given to the system, its dynamics, and the situations arising. Many projects fail by thinking, and more so acting, as if they can leap from data to dashboard in one jump.
This is not helped by many niche vendors talking up “unknown unknowns” and “deep systems,” which is more akin to giving someone standing on the tip of an iceberg a shovel and asking them to dig away at the surface. There is nothing profound or fulfilling to be found in doing so, other than discovering detail after detail and never seeing the big picture of the system moving and changing below the surface of visibility – a visibility that comes from event capture unguided by knowledge or wisdom. The industry has gone from being dominated by a game of blame to one of fear, which shuts off all (re)consideration of effectiveness.
I suspect much of the continued failing in the observability space centers on the customary referencing and somewhat confused understanding of the Knowledge (DIKW) Hierarchy. Many “next-generation” application performance monitoring (and observability) product pitches and roadmaps roll out the pyramid graphic below, explaining how they will first collect all this data, lots of it from numerous sources, and then whittle it down to knowledge over the remainder of the company’s evolution and product development.
What invariably happens is that the engineering teams get swamped by maintenance effort around data and pipelines and the never-ceasing battle to keep instrumentation kits and extensions up to date with changes in platforms, frameworks, and libraries. By the time some small window of stability opens up, the team has lost sight of the higher purpose and bigger picture. In a moment of panic, the team slaps on a dashboard and advanced query capabilities – a declaration of defeat that delegates effort to users. Naturally, this delegated defeat is marketed as a win for users. This sad state of affairs comes about from seeing the hierarchy as a one-way ladder of understanding: from data, information will emerge; from information, knowledge will emerge; and so on. Instead of aiming for vision, all too often it is data straight to visualizations. The confusion is in thinking this is a bottom-up approach, whereas the layers above steer, condition, and constrain the layers below by way of a continuous adaptive and transforming process. Each layer frames the operational context of the layers below it – directly and indirectly. A vision for an “intelligent” solution comes from values and beliefs; this then contextualizes wisdom and, in turn, defines the goals that frame knowledge exploration and acquisition.
For knowledge to spring forth from information, various (mental) models must be selected – a selection aligned to the overarching goals. It is here where I firmly believe we have lost our way as an engineering profession. Our models, if we can call them that, are too far removed from purpose, goal, and context. We have confused a data storage model of trace trees, metrics, log records, and events with a model of understanding. In the context of Observability, an example of a goal in deriving wisdom would be to obtain intelligent near-real-time situation awareness over a large, connected, complex, and continually changing landscape of distributed services. Here, understanding via a situation model must be compatible with and conducive to cooperative work performed by both machines and humans. Ask any vendor to demonstrate a situation’s representation, and all you will get is a dashboard with various jagged lines automatically scrolling. Nowhere to be found are signals and states, essential components of a past, present, and unfolding situation.
There is never knowledge without a model acting as a lens and filter, an augmentation of our senses and reasoning, defining what is of importance – the utility and relevance of information in context. There is never information without rules, shaped by knowledge, extracting, collecting, and categorizing data. Data and information are not surrogates for a model. Likewise, a model is not a Dashboard, one built lazily and naively on top of a lake of data and information. A dashboard and the many metrics, traces, and logs that come with it are not what constitutes a situation. A situation is formed and shaped by changing signals and states of structures and processes within an environment of nested contexts (observation points of assessment) – past, present, and predicted.
Models are critical when it comes to grasping at understanding in a world of increasing complexity. A model is a compact and abstract representation of a system under observation and control that facilitates conceptualization and communication about its structure and, more importantly, its dynamics. Modeling is a simplification process that helps focus attention on what is of significance for higher-level reasoning, problem-solving, and prediction. Suitable models (of representation in structure and form) are designed and developed through abstraction and the need to view a system from multiple perspectives without creating a communication disconnect for all involved. Coherence is an essential characteristic of a model, as are conciseness and context. Unfortunately, introducing a model is not always as easy a task as it might look on paper if the abstraction does not pay off in terms of significant simplification and a shift in focus to higher levels of value. For example, Instana, where I was recently a Complexity Scientist, had some trouble convincing many of those coming from an OpenTelemetry background that its abstraction of a Call over a Span served a useful purpose.
This mismatch between what a developer conceptualizes at the level of instrumentation and what is presented within the tooling, visualizations, and interfaces is seen as an inconvenience – an inconvenient truth stemming from an industry that does far more selling of meme-like nonsense, yesteryear thinking, and tooling than educating in theory and practice. A focus on systems and dynamics needs to win over data and details if we are to get back to designing and building agile, adaptive, and reliable enterprise systems.
In my current position at PostNL, I’m helping to design and develop a Control Tower for an ambitious Digital Supply Chain Platform. Depending on the domain and perspective taken, there are different possible models – objects (parcels), processes (sorting), and flows (transport). Still, it all comes down to transforming data at the sensory measurement level up into structures, and then into behavior along with affordance, at increasing levels of abstraction, compression, and comprehension. While a Control Tower could readily track every small detail of a parcel’s movement, it would not effectively and efficiently understand the dynamics that emerge at the system level across many cooperating agents within such a highly interconnected network, where promises related to resource exchanges are made, monitored, and adjusted in the event of disruptions. One agent’s (human or machine) model is another one’s raw data.
Aside: While it is hard not to see the importance of parcel tracking within a supply chain, at least from a customer perspective, I’ve yet to have someone offer up a valid justification for distributed tracing in its current form over the approach OpenSignals takes.
When designing observability and controllability interfaces for systems of services, or any system for that matter, it is essential to consider how it connects the operator to the operational domain in terms of the information content, structure, and (visual) forms. What representation is most effective in the immediate grounding of an operator within a situation? How efficiently and accurately can the information be communicated, playing to perceptual and cognitive strengths? Here is where an abstraction and attention hierarchy can significantly facilitate the monitoring and management of a complex system, chiefly where situational awareness, understanding of dynamics and constraints, and the ability to predict, to some small degree into the future, are of critical importance.
While an abstraction hierarchy consists of multiple levels (or layers), it still relates to the same operational domain, much like situation awareness or service level management. At each level, a new model of terms, concepts, and relations is introduced. An observer of such a hierarchically organized system can decide which particular level of information and interaction is most suitable for the task (or goal) at hand, based on their level of expertise and knowledge of the system. Higher levels in the abstraction hierarchy regulate lower levels by way of constraints and conditions, whereas lower levels stimulate state changes in higher levels. In crossing the layered boundaries, an operator increases their understanding of the system. Moving from bottom to top, the operator obtains a greater awareness of what is significant in terms of the system’s functioning and goals. Moving from top to bottom, a more detailed explanation and exploration are provided of how a system brings about the higher levels’ goals and plans – dynamics surface at the top and data at the bottom. It is important that there is some degree of traceability and relatability in moving in either direction.
In thinking about an abstraction level, several internal processes and functions come to mind that occur as information flows upwards from a lower level, passing through before being translated and transformed into something more significant and relevant at the level above. It starts with some stimulus deemed a happening, an event, originating within either the environment or a lower level if one is present. A recurring stimulus becomes a signal if it pertains to the model and has benefits in terms of constraint, correction, and conditioning. Then, depending on sensitivities, much of the noise is eliminated to prioritize what is significant or signified by signaling. Following on from significance, a process of sequence recording, recalling, and recognition results in the synthesis of meaningful patterns. It starts with measurement, passes through to the model, and ends with the memory of such. Finally, the new information produced is propagated upwards to the next level of abstraction in the hierarchy, typically in a far more condensed form that gives utmost attention to changes in patterned or predicted behavior relevant to the means employed in achieving the higher-level goal(s).
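These within-level processes can be caricatured in a few lines of code. The sketch below is purely illustrative – the `Level` class, its sensitivity threshold, and the stimulus names are assumptions for this article, not part of any real API – but it captures the flow from stimulus, through noise elimination by sensitivity, to the condensed summary propagated upwards:

```python
# Hypothetical sketch of one abstraction level's processing: stimuli
# become signals once they recur often enough to signify, noise is
# filtered by a sensitivity setting, and a condensed summary is
# propagated to the level above. All names here are illustrative.

from collections import Counter

class Level:
    def __init__(self, sensitivity=2):
        self.sensitivity = sensitivity  # minimum recurrence before a stimulus signifies
        self.memory = Counter()         # recall of past stimuli (the level's "memory")

    def sense(self, stimuli):
        """Record stimuli and keep only those recurring often enough to be signals."""
        self.memory.update(stimuli)
        return [s for s in stimuli if self.memory[s] >= self.sensitivity]

    def propagate(self, signals):
        """Condense recognized signals into a summary for the level above."""
        return Counter(signals)

level = Level(sensitivity=2)
# A one-off "blip" is noise and is dropped; the recurring "timeout" signifies.
signals = level.sense(["timeout", "blip", "timeout", "timeout"])
summary = level.propagate(signals)
```

The level above never sees the individual stimuli, only the condensed counts – a stand-in for the "far more condensed form" described above.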
With OpenSignals for Services, there are a minimum of three levels of abstraction in the hierarchy, leaving out service level management, which can be layered on top. The first level of abstraction pertains to the machine world’s spatial and (meta) state aspects, where a stimulus is sensed – the referent layer. Here, services (or endpoints) within a distributed system of executing processes are mapped to a model consisting of Context, Environment, Service, and Name elements. The next level above, the signaling layer, is where executing instrumented code is transcribed into a series of Signal firings on references within the level below. In this second level, the complicated nature of code execution across service boundaries, entailing failure, retry, and fallback mechanisms, is translated into a standard set of tokens. The small set of signals is akin to RISC (reduced instruction set computer) design, but instead of a computer execution architecture, it pertains to service-to-service communication across machine and process boundaries. Depending on the Service Provider Implementation (SPI), the signals will be scored based on sensitivity and sequencing settings, which can in part be driven by the current state in the level above, and a status inferred for each signaling Service registered within a Context. For recording and diagnostic purposes, the signaling level also allows for the registration of a Subscriber and, in turn, a Callback.
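A minimal sketch of how the referent and signaling layers might fit together. The element names (Context, Environment, Service, Subscriber, Callback) come from the description above, but the API shape, the method names, and the signal tokens used are assumptions made for illustration, not the actual OpenSignals interface:

```python
# Hypothetical sketch of the referent and signaling layers: a Context
# maps named Services (referent layer); instrumented code fires signals
# on a service reference, and registered Subscriber callbacks receive
# them (signaling layer). The signal tokens below are illustrative.

class Service:
    def __init__(self, context, name):
        self.context, self.name = context, name

    def signal(self, orientation, phenomenon):
        # Signaling layer: transcribe code execution into a token firing.
        self.context._fire(self, orientation, phenomenon)

class Context:
    def __init__(self, environment=None):
        self.environment = environment or {}
        self._services = {}
        self._subscribers = []

    def service(self, name):
        # Referent layer: map a named endpoint to a Service element.
        return self._services.setdefault(name, Service(self, name))

    def subscribe(self, callback):
        # Register a Subscriber callback for recording and diagnostics.
        self._subscribers.append(callback)

    def _fire(self, service, orientation, phenomenon):
        for cb in self._subscribers:
            cb(service.name, orientation, phenomenon)

context = Context(environment={"region": "eu-west-1"})
events = []
context.subscribe(lambda name, o, p: events.append((name, o, p)))

payments = context.service("payments")
payments.signal("EMIT", "CALL")        # locally originated operation
payments.signal("RECEIPT", "SUCCEED")  # outcome perceived from elsewhere
```

The point of the sketch is the separation of concerns: the referent layer only names things; meaning arrives with the signal firings in the layer above.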
Finally, in the last of the minimum levels, the inference layer, the model is attuned to tracking changes in registered and signaling services’ Status values. Much like the signaling layer below, this level allows for event Subscription management. As an observer moves up layers in the abstraction and attention hierarchy, the frequency of event emittance is reduced significantly, and the meaning and relevance of each element and possibility in the model relate closer to the overall goal. At this top level, observation, analysis, and (indirect) controllability of the model center on the switches in the transitions between Status values, the sequencing and scoring of such over periods, and the possibility of predicting near-future states of the system. The scale of analysis also changes, from the signaling and operational status inference of a single service to how changes in the status of a system of (interconnected) services cascade and ripple in unexpected ways through the network of contexts – both locally and globally. It would be near impossible, for a human at least, to do so at the signaling level. At all levels of the hierarchy, the same processes are involved – seeing, sensing, signifying, sequencing, scaling, and synthesizing higher-order information for representation, reduction, recognition, recording, recollection, reasoning, retrospection, reinforcement, regulation, and rectification.
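The inference layer's behavior might be sketched as follows. The scoring scheme, the threshold, the phenomenon names, and the two-value status set are all simplifying assumptions (the real model has more statuses than OK and DOWN, and real scoring is driven by SPI sensitivity and sequencing settings):

```python
# Hypothetical sketch of the inference layer: signal scores accumulate
# per service and a Status is inferred; subscribers are notified only on
# Status *transitions*, so event emittance drops as one moves up the
# abstraction and attention hierarchy. Scores and tokens are invented.

class StatusInference:
    SCORES = {"SUCCEED": +1, "FAIL": -2}  # illustrative scoring only

    def __init__(self, threshold=0):
        self.threshold = threshold
        self._scores = {}
        self._status = {}
        self._subscribers = []

    def subscribe(self, callback):
        # Subscription management, as in the signaling layer below.
        self._subscribers.append(callback)

    def signal(self, service, phenomenon):
        score = self._scores.get(service, 0) + self.SCORES.get(phenomenon, 0)
        self._scores[service] = score
        status = "OK" if score >= self.threshold else "DOWN"
        if self._status.get(service) != status:  # only transitions propagate
            self._status[service] = status
            for cb in self._subscribers:
                cb(service, status)

inference = StatusInference()
transitions = []
inference.subscribe(lambda svc, status: transitions.append((svc, status)))

for phenomenon in ["SUCCEED", "FAIL", "FAIL", "SUCCEED"]:
    inference.signal("payments", phenomenon)
```

Four signals yield only two status events here – a small instance of the emittance reduction the text describes.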
I’m always on the lookout for new ways to explain and relate to the design of the OpenSignals conceptual model of signals and states. So it was a pleasant surprise to stumble across (Verbal) Protocol Analysis during a recent certification in Design Thinking and some readings in situational awareness. VPA is a technique that has been used by researchers across many domains, including psychology, engineering, and architecture. The basic idea is that during a task, such as solving a problem, a subject will concurrently verbalize – think aloud – what is resident in their working memory: what they are thinking during the doing. Using Protocol Analysis, researchers can elicit the cognitive processes involved by a subject from start to completion of a task. After further processing, the information captured is then analyzed to provide insights that can possibly be used to improve performance.
An advantage of verbal protocol analysis over other cognitive investigation techniques is the richness of the data recorded. Unfortunately, this richness, unstructured and diverse in expression, can quickly become voluminous, requiring post-processing such as transcription and coding before being analyzed. Sound familiar? Yes, it is the same issue site reliability engineering (SRE) teams face when their primary data sources for monitoring and observability are event logging and its sibling, distributed tracing.
The basic steps to Protocol Analysis are (1) record the verbalization, (2) transcribe the recording, (3) segment the transcription, (4) aggregate the segments into episodes, (5) encode the episodes, and finally (6) analyze the code sequencing patterns. During the transcribing step, researchers will interpret the recording in terms of a glossary of domain-relevant terms. The segmentation step aims to break the verbalization into text units, segments, where each segment expresses one idea or action statement. In the aggregation step, some segments are collapsed and combined into episodes to make further coding and data analysis more straightforward, especially when the recording volume is sufficiently large, requiring sampling to reduce human effort and cost. The most crucial step in this process, the one that dictates the success of the analysis, comes down to the coding of statements. The coding scheme, where statements are mapped to processes of interest, is driven by the researchers’ question or the goal being pursued. In this regard, a coding scheme needs to be effective and reliable in translation and express the aspects of concern for the investigation. Typically, a small fixed set of concept variables is encoded for each statement, with each variable having a predefined set of possible codes.
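Steps (3) through (6) can be illustrated as a toy pipeline. The segmentation rule, the coding scheme, and the sample transcript below are deliberately simplistic stand-ins invented for this sketch, not drawn from any actual VPA study:

```python
# Toy sketch of Protocol Analysis steps (3)-(6): segment a transcript,
# aggregate segments into episodes, encode episodes against a coding
# scheme, and tally code-sequence patterns. Scheme and data are invented.

from collections import Counter

def segment(transcript):
    # Step 3: split the transcription into single idea/action statements.
    return [s.strip() for s in transcript.split(".") if s.strip()]

def aggregate(segments, size=2):
    # Step 4: collapse consecutive segments into episodes.
    return [segments[i:i + size] for i in range(0, len(segments), size)]

SCHEME = {"look": "OBSERVE", "sketch": "GENERATE", "check": "EVALUATE"}

def encode(episode):
    # Step 5: map each statement to a code via the coding scheme.
    return [code for stmt in episode
            for word, code in SCHEME.items() if word in stmt]

def analyze(episodes):
    # Step 6: count code bigrams to surface sequencing patterns.
    codes = [c for ep in episodes for c in encode(ep)]
    return Counter(zip(codes, codes[1:]))

transcript = "I look at the brief. I sketch a layout. I check it against the brief."
patterns = analyze(aggregate(segment(transcript)))
```

Even in this toy form, the success of the analysis clearly hinges on the `SCHEME` mapping – exactly the point the text makes about coding.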
In the case of an investigation into how designers think, the variables might be the "design step", "knowledge", "activity", and "object". A more abstract variable set would be "subject", "predicate", and "object". A coding scheme is reliable when ambiguity is kept to a minimum in taking a statement or event in the real world and mapping it to the appropriate code across different persons tasked with the coding.
A scheme is effective when the coding is focused on the proper aspects of the domain and at the right level of granularity to answer questions via sequencing patterns. In the last step, analysis, researchers perform script analysis, sometimes introducing further higher-level process groupings and categorizations that can then be sequenced and analyzed – a scaling up. An example of scaling would be OpenSignals service status inference from signaling.
If you have ever spent time working with vast amounts of logs, metrics, and distributed tracing data, you will immediately recognize some of the above steps in turning recordings into something reliable and effective for monitoring and managing applications and systems of services. These days, most site reliability engineers get stuck in the transcribing phase, trying to bring some uniformity and meaning to many different machine utterances, especially in logs and events. I’ve witnessed many an organization start an elaborate and ambitious initiative to remap all log records into something more relatable to service level management or situational awareness via various record-level rules and pattern matches, only to abandon the initiative once the true scale of the problem, and the human effort involved in not only defining but maintaining such mappings, is recognized. These tasks only ever look good in vendor demonstrations, never reflecting the rate of change that all software is undergoing at present and into the future. You might ask how Protocol Analysis in practice has attempted to optimize the steps before coding. Well, by bringing the coding forward to some degree, in having transcribers already familiar with the coding scheme beforehand.
It should be noted that for many doing VPA in new domains, the coding scheme is defined much later in the process. Fortunately, in the Observability space of software systems, we are dealing with machines as opposed to humans, so it is far easier to introduce an appropriate coding scheme early in the process. That is precisely what OpenSignals is offering – a fixed set of variables in the form of "service", "orientation", and "phenomenon", and a set of predefined codes for orientation and phenomenon (signals and status).
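A sketch of what that early-introduced coding might look like at the source. The translation rules, the phenomenon tokens, and the utterances below are invented for illustration; only the fixed variable set (service, orientation, phenomenon) is taken from the text:

```python
# Hedged sketch: encoding a raw machine "utterance" at source into the
# fixed variables named in the text - service, orientation, phenomenon.
# The rules and tokens below are illustrative assumptions.

CODES = {
    "request completed": ("EMIT", "SUCCEED"),
    "upstream timed out": ("RECEIPT", "FAIL"),
}

def encode(service, utterance):
    """Map a log-like utterance to a (service, orientation, phenomenon) statement."""
    for phrase, (orientation, phenomenon) in CODES.items():
        if phrase in utterance:
            return (service, orientation, phenomenon)
    return None  # noise: no code applies, so nothing is recorded

statement = encode("checkout", "2021-03-01 upstream timed out after 3 retries")
```

Note how uncodable utterances are simply dropped at source, instead of being shipped wholesale to a backend for later transcription.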
OpenSignals for Services is a template for a protocol analysis model and coding scheme for understanding and reasoning about the processing and performance of microservices involved in the coordination and cooperation of distributed work.
It is time for software services to think aloud with OpenSignals and to abandon sending meaningless blobs of data to massive event data black holes in the cloud. It is time to standardize on a model that serves site reliability engineering and not some manufactured data addiction. Let’s have both machines and humans communicate in terms of service, signal, and status.
The orientation concept in OpenSignals seems to be somewhat novel, or at least not so familiar, amongst previous attempts at signaling. The reason for orientation, which has two values, EMIT and RECEIPT, is very straightforward once we take into account that there are two perspectives to any interaction or communication. Let’s imagine a signaling model that consists of three signals – CALL, TALK, and END. Below are two transcripts of the signals recorded at either end of a phone conversation between Alice and Bob. Before proceeding, it is important to note that both parties, Alice and Bob, are modeled as sources of signals within each context at either end of the communication channel. Bob has a model that includes Alice as well as Bob himself. Alice does likewise within her context.
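The two transcripts appear to have been a figure in the original; the sketch below assumes plausible recordings, following the convention explained further on in which each context attributes signals to its counterpart. Every concrete detail here is an assumption made for illustration:

```python
# Assumed reconstruction of the two transcripts: plain (source, signal)
# pairs recorded within each party's context, with no orientation.
alice_context = [("Bob", "CALL"), ("Bob", "TALK"), ("Bob", "END")]
bob_context = [("Alice", "CALL"), ("Alice", "TALK"), ("Alice", "END")]

def attributed_to(signal, *contexts):
    """Across the given contexts, which parties is this signal attributed to?"""
    return {who for ctx in contexts for who, sig in ctx if sig == signal}

# Each context names only its counterpart, so across the pair of
# transcripts the CALL signal implicates both parties, and the
# initiator cannot be recovered from the recordings alone.
callers = attributed_to("CALL", alice_context, bob_context)
```

With both parties implicated, no amount of reading either transcript alone settles who dialed.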
Can you tell who initiated the call? We could probably infer who ended the call, maybe Bob, but not the initiator. Maybe if we had some global clock we could order events across both contexts, but then we run into all other types of problems related to clock synchronization or needing an omnipresent observer. An easy fix to the language would be introducing a new signal named CALLED; we could also do likewise with TALKED and ENDED. But we would then have doubled our small set of signals and somewhat broken the link between a signal emitted on one side and how it is received and interpreted on the other. With the concept of orientation, we don’t need to embed more meaning into a signal, a mere token, than is required. When Alice decides to call Bob, she will record within her context the event – Bob EMIT CALL. This might initially look strange because Bob is not present here, and yet we see him emitting a signal. But EMIT here pertains to the locality of the operation, CALL, and not the subject, Bob. Bob’s calling is happening within the context that Alice maintains of her social network and the status she infers on the members within, including herself. When Bob picks up Alice’s call, within his context, he will record – Alice RECEIPT CALL. Because the call originated outside of his context, it is a past signal that he has now received – RECEIPT always deals with the perception of something that has occurred in the past and elsewhere in terms of the context maintained. The trick to understanding and reasoning about signal orientation is to separate the act of picking up the phone once it has rung, which Bob does, from the actual signal being communicated by the action. Signals are not trying to be a comprehensive representation of physical movements or events but of the underlying meanings behind such, or what transpires in doing so. Bob is a recipient of something that was previously recorded elsewhere.
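Following the Alice and Bob narrative above, the ambiguity dissolves once orientation is recorded with each signal. A minimal sketch, in which the tuple layout is an assumption of mine:

```python
# With orientation, the same CALL token becomes unambiguous: Alice's
# locally originated EMIT pairs with Bob's RECEIPT of it, somewhat like
# matched entries in a double-entry ledger. Recordings follow the
# Alice/Bob narrative in the text; the tuple layout is assumed.

alice_context = [("Bob", "EMIT", "CALL")]     # Alice dials: the CALL originates locally
bob_context = [("Alice", "RECEIPT", "CALL")]  # Bob answers: a past signal from elsewhere

def initiated_here(context):
    """A context holding an EMIT CALL is the side where the call originated."""
    return any(o == "EMIT" and s == "CALL" for _, o, s in context)

alice_initiated = initiated_here(alice_context)
bob_initiated = initiated_here(bob_context)
```

The token set stays small; the extra meaning lives in the orientation, not in new signals such as CALLED.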
Try considering signaling to operate somewhat like a double-entry bookkeeping system. An EMIT on one side will, if the signal is delivered by whatever means, result in a corresponding RECEIPT on the other. Note: If Bob did not pick up, there would be no corresponding reverse entry.
The first generation of Observability instrumentation libraries, toolkits, and agents have taken the big data pipelines of both application performance monitoring (traces, metrics) and event collection (logs) and combined them into one enormously big pipe to the cloud.
OpenSignals tackles scaling of observability on many fronts – at the machine and human interfaces and along the data pipeline starting at the source. Before discussing how OpenSignals addresses situational awareness in the context of a system of (micro)services (deferred for a future post), it is crucial to consider the foundational model elements independently of a domain.