Author: William Louth

From Data to Dashboard – An Observability Anti-Pattern

Numerous initiatives around Observability, sometimes referred to as Visibility in the business domain, fail to meet expectations because engineers naively expect that once data is being collected, all that remains is to put up a dashboard and then sit back staring blankly at large monitoring screens, hoping for a signal to emerge magically from the rendered pixels. This is particularly so when users blindly adopt the Grafana and Prometheus projects, where data and charts have replaced, or circumvented, genuine understanding through patterns, structures, and models. This is an anti-pattern that repeats consistently at organizations with insufficient expertise and experience in systems dynamics, situational awareness, and resilience engineering. Once the first data-laden dashboard is rolled out to management for prominent display within an office, the work seems all but done other than to keep creating hundreds more derivatives of the same ineffective effort. Little regard is ever again given to the system, its dynamics, and the situations arising. So many projects fail in thinking, and more so in acting, as if they can simply leap from data to dashboard in a single jump. This is not helped by many niche vendors talking up “unknown unknowns” and “deep systems”, which is more akin to handing someone standing on the tip of an iceberg a shovel and asking them to dig away at the surface. There is nothing profound or fulfilling to be found in doing so, other than discovering detail after detail while never seeing the big picture – the system moving and changing below a surface of visibility built from event capture that is not guided by knowledge and, in turn, wisdom. The industry has gone from being dominated by a game of blame to one of fear, which shuts off all (re)consideration of effectiveness.

I suspect many of the continued failings in the Observability space center on the customary referencing and somewhat confused understanding of the Knowledge (DIKW) Hierarchy. Many “next-generation” application performance monitoring (and observability) product pitches or roadmaps roll out the familiar pyramid graphic, explaining how they will first collect all this data, lots of it from numerous sources, and then whittle it down to knowledge over the company’s remaining evolution and product development. What invariably happens is that the engineering teams get swamped by the maintenance effort around data and pipelines and the never-ceasing battle to keep instrumentation kits and extensions up-to-date with changes in platforms, frameworks, and libraries. By the time some small window of stability opens up, the team has lost sight of the higher purpose and bigger picture. In a moment of panic, the team slaps on a dashboard and advanced query capabilities – a declaration of defeat by way of delegating effort to users. Naturally, this defeat is marketed as a win for users. This sad state of affairs comes about from seeing the hierarchy as a one-way ladder of understanding: from data, the information will emerge; from information, the knowledge will emerge; and so on. All too often, instead of aiming for vision, it is data straight to visualizations. The confusion is in thinking this is a bottom-up approach, whereas the layers above steer, condition, and constrain the layers below by way of a continuous, adaptive, and transforming process. Each layer frames the operational context of the lower layers – directly and indirectly. A vision for an “intelligent” solution comes from values and beliefs; this then contextualizes wisdom and, in turn, defines the goals that frame knowledge exploration and acquisition.

For knowledge to spring forth from information, various (mental) models must be selected; a selection aligned to the overarching goals. It is here that I firmly believe we have lost our way as an engineering profession. Our models, if we can call them that, are too far removed from purpose, goal, and context. We have confused a data storage model of trace trees, metrics, log records, and events with a model of understanding. In the context of Observability, an example of a goal in deriving wisdom would be to obtain intelligent, near-real-time situational awareness over a large, connected, complex, and continually changing landscape of distributed services. Here, understanding via a model of a situation must be compatible with and conducive to cooperative work performed by both machine and human. Ask any vendor to demonstrate a situation’s representation, and all you will get is a dashboard with various jagged lines automatically scrolling. Nowhere to be found are signals and states, the essential components of a past, present, and unfolding situation.

There is never knowledge without a model acting as a lens and filter, an augmentation of our senses and reasoning, defining what is of importance – the utility and relevance of information in context. There is never information without rules, shaped by knowledge, extracting, collecting, and categorizing data. Data and information are not surrogates for a model. Likewise, a model is not a Dashboard, one built lazily and naively on top of a lake of data and information. A dashboard, and the many metrics, traces, and logs that come with it, is not what constitutes a situation. A situation is formed and shaped by changing signals and states of structures and processes within an environment of nested contexts (observation points of assessment) – past, present, and predicted.

Models are critical when it comes to grasping at understanding in a world of increasing complexity. A model is a compact and abstract representation of a system under observation and control that facilitates conceptualization and communication about its structure and, more importantly, its dynamics. Modeling is a simplification process that helps focus attention on what is of significance for higher-level reasoning, problem-solving, and prediction. Suitable models (of representation in structure and form) are designed and developed through abstraction and the need to view a system from multiple perspectives without creating a communication disconnect for all involved. Coherence is an essential characteristic of a model, as are conciseness and contextuality. Unfortunately, introducing a model is not always as easy a task as it might look on paper if the abstraction does not pay off in terms of significant simplification and a shift in focus to higher levels of value. For example, Instana, where I was recently a Complexity Scientist, had some trouble convincing many of those coming from an OpenTelemetry background that its abstraction of a Call over a Span served a useful purpose. This mismatch between what a developer conceptualizes at the level of instrumentation and what is presented within the tooling, visualizations, and interfaces is seen as an inconvenience – an inconvenient truth stemming from an industry that does far more selling of meme-like nonsense and yesteryear thinking and tooling than educating in theory and practice. A focus on systems and dynamics needs to win over data and details if we are to get back to designing and building agile, adaptive, and reliable enterprise systems.

In my current position at PostNL, I’m helping to design and develop a Control Tower for an ambitious Digital Supply Chain Platform. Depending on the domain and perspective taken, there are different possible models – objects (parcels), processes (sorting), and flows (transport). Still, it all comes down to transforming data at the sensory measurement level up into structures and then behavior, along with affordance, at increasing levels of abstraction, compression, and comprehension. While a Control Tower could readily track every small detail of a parcel’s movement, it would not thereby effectively and efficiently understand the dynamics that emerge at the system level across many cooperating agents within such a highly interconnected network, where promises related to resource exchanges are made, monitored, and adjusted in the event of disruptions. One agent’s (human or machine) model is another one’s raw data.

Aside: While it is hard not to see the importance of parcel tracking within a supply chain, at least from a customer perspective, I have yet to have someone offer up a valid justification for distributed tracing in its current form over the approach OpenSignals takes.

Hierarchies of Attention and Abstraction in Observability

When designing observability and controllability interfaces for systems of services, or any system for that matter, it is essential to consider how it connects the operator to the operational domain in terms of the information content, structure, and (visual) forms. What representation is most effective in the immediate grounding of an operator within a situation? How efficiently and accurately can the information be communicated, playing to perceptual and cognitive strengths? Here is where an abstraction and attention hierarchy can significantly facilitate the monitoring and management of a complex system, chiefly where situational awareness, understanding of dynamics and constraints, and the ability to predict, to some small degree into the future, are of critical importance.

While an abstraction hierarchy consists of multiple levels (or layers), it still relates to the same operational domain, much like situational awareness or service level management. At each level, a new model of terms, concepts, and relations is introduced. An observer of such a hierarchically organized system can decide which particular level of information and interaction is most suitable for the task (or goal) at hand, based on their level of expertise and knowledge of the system. Levels in the abstraction hierarchy regulate lower levels by way of constraints and conditions, whereas lower levels stimulate state changes in higher levels. In crossing the layered boundaries, an operator increases their understanding of the system. Moving from bottom to top, the operator obtains a greater awareness of what is significant in terms of the system’s functioning and goals. Moving from top to bottom, a more detailed explanation and exploration is provided of how a system brings about the higher levels’ goals and plans – dynamics surface at the top and data at the bottom. It is important that there is some degree of traceability and relatability in moving in either direction.

In thinking about an abstraction level, several internal processes and functions come to mind that occur as information flows upwards from a lower level, passing through before being translated and transformed into something more significant and relevant at the level above. It starts with some stimulus deemed a happening, an event, originating within either the environment or a lower level if one is present. A recurring stimulus becomes a signal if it pertains to the model and has benefit in terms of constraint, correction, and conditioning. Then, depending on sensitivities, much of the noise is eliminated to prioritize what is significant, or signified, by signaling. Following on from significance, a process of sequence recording, recalling, and recognition results in the synthesis of meaningful patterns. It starts with measurement, passes through the model, and ends with the memory of such. Finally, the new information produced is propagated upwards to the next level of abstraction in the hierarchy, typically in a far more condensed form that gives utmost attention to changes in patterned or predicted behavior relevant to the means employed in achieving the higher-level goal(s).
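To make the flow through a single level concrete, here is a minimal, illustrative sketch in Java. The class and names are invented for this post and are not OpenSignals code; it only shows the ordering of stimulus, signal, significance, sequence, and upward synthesis described above.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.function.Consumer;

// One abstraction level: stimuli arrive from below, only those that map onto the
// level's model become signals, sensitivity filters out noise, recent signals are
// sequenced, and a condensed summary is propagated upwards. All names are hypothetical.
final class AbstractionLevel {
  enum Signal { START, STOP, FAIL, SUCCEED }

  private final Deque<Signal> memory = new ArrayDeque<>();   // sequence recording
  private final int sensitivity;                             // repeats required before signifying
  private final Consumer<String> upward;                     // propagation to the level above
  private int repeats;
  private Signal last;

  AbstractionLevel(int sensitivity, Consumer<String> upward) {
    this.sensitivity = sensitivity;
    this.upward = upward;
  }

  // A stimulus only becomes a signal if it pertains to the level's model.
  void stimulus(String happening) {
    Signal signal;
    try {
      signal = Signal.valueOf(happening);                    // measurement -> model
    } catch (IllegalArgumentException notInModel) {
      return;                                                // noise, eliminated
    }
    repeats = (signal == last) ? repeats + 1 : 1;            // recurrence tracking
    last = signal;
    memory.addLast(signal);                                  // memory of the sequence
    if (memory.size() > 64) memory.removeFirst();
    if (repeats >= sensitivity) {                            // significance threshold
      upward.accept("pattern:" + signal + " x" + repeats);   // condensed form, far fewer events
      repeats = 0;
    }
  }

  public static void main(String[] args) {
    AbstractionLevel level =
        new AbstractionLevel(3, summary -> System.out.println("level above sees: " + summary));
    for (String s : new String[] { "START", "gc-pause", "FAIL", "FAIL", "FAIL", "STOP" }) {
      level.stimulus(s);
    }
  }
}
```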

With OpenSignals for Services, there is a minimum of three levels of abstraction to the hierarchy, leaving out service level management, which can be layered on top. The first level of abstraction pertains to the machine world’s spatial and (meta) state aspects, where a stimulus is sensed – the referent layer. Here, services (or endpoints) within a distributed system of executing processes are mapped to a model consisting of Context, Environment, Service, and Name elements. The next level above, the signaling layer, is where executing instrumented code is transcribed into a series of Signal firings on references within the level below. In this second level, the complicated nature of code execution across service boundaries, entailing failure, retry, and fallback mechanisms, is translated into a standard set of tokens. The small set of signals is akin to RISC (reduced instruction set computer) design, but instead of a computer execution architecture, it pertains to service-to-service communication across machine and process boundaries. Depending on the service provider interface (SPI) implementation, the signals will be scored based on sensitivity and sequencing settings, which can in part be driven by the current state in the level above, and a status inferred for each signaling Service registered within a Context. For recording and diagnostic purposes, the signaling level also allows for the registration of a Subscriber and, in turn, a Callback.
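A sketch of how the referent and signaling layers might look in code. The types (Context, Service, Signal, Subscriber) mirror the concepts named above, but the shapes and method signatures are illustrative assumptions rather than the actual OpenSignals for Services API.

```java
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.CopyOnWriteArrayList;

// Referent layer: names mapped to services within a context.
// Signaling layer: instrumented code transcribed into signal firings on those referents.
final class Sketch {
  enum Signal { START, CALL, FAIL, RETRY, SUCCEED, STOP }   // a small, RISC-like token set

  interface Subscriber { void accept(String service, Signal signal); }   // callback registration

  static final class Context {
    private final Map<String, Service> services = new ConcurrentHashMap<>();
    private final List<Subscriber> subscribers = new CopyOnWriteArrayList<>();

    Service service(String name) {
      return services.computeIfAbsent(name, n -> new Service(n, this));
    }
    void subscribe(Subscriber subscriber) { subscribers.add(subscriber); }
    void fired(Service service, Signal signal) {
      subscribers.forEach(s -> s.accept(service.name, signal));
    }
  }

  static final class Service {
    final String name;
    private final Context context;
    Service(String name, Context context) { this.name = name; this.context = context; }
    void signal(Signal signal) { context.fired(this, signal); }
  }

  public static void main(String[] args) {
    Context context = new Context();
    context.subscribe((service, signal) -> System.out.println(service + " -> " + signal));
    Service payments = context.service("payments");
    payments.signal(Signal.START);
    payments.signal(Signal.CALL);
    payments.signal(Signal.FAIL);
    payments.signal(Signal.RETRY);
    payments.signal(Signal.SUCCEED);
    payments.signal(Signal.STOP);
  }
}
```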

Finally, in the last of the minimum levels, the inference layer, the model is attuned to tracking changes in the Status value of registered and signaling services. Much like the signaling layer below, this level allows for event Subscription management. As an observer moves up the layers in the abstraction and attention hierarchy, the frequency of event emission is reduced significantly, and the meaning and relevance of each element and possibility in the model relate more closely to the overall goal. In this top level, observation, analysis, and (indirect) controllability of the model center around the switches in the transitions between Status values, the sequencing and scoring of such over periods, and the possibility of predicting near-future states of the system. The scale of analysis also changes, from the signaling and operational status inference of a single service to how changes in the status of a system of (interconnected) services cascade and ripple in unexpected ways through the network of contexts – both locally and globally. It would be near impossible to do so at the signaling level, for a human at least. At all levels of the hierarchy, the same processes are involved – seeing, sensing, signifying, sequencing, scaling, and synthesizing of higher-order information for the purpose of representation, reduction, recognition, recording, recollection, reasoning, retrospection, reinforcement, regulation, and rectification.
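And a sketch of the inference layer’s essence: signals scored into a status, with subscribers notified only on status transitions. The scoring rule and the status names here are assumptions for illustration, not the OpenSignals implementation.

```java
import java.util.EnumMap;
import java.util.Map;
import java.util.function.BiConsumer;

// Signals are scored, a status is inferred per service, and subscribers are notified
// only on status transitions - far fewer events than at the signaling level below.
final class StatusInference {
  enum Signal { SUCCEED, RETRY, FAIL }
  enum Status { OK, DEGRADED, DEFECTIVE }

  private final Map<Signal, Integer> scores = new EnumMap<>(Signal.class);
  private final BiConsumer<String, Status> onTransition;
  private int score;                          // running score over recent signals
  private Status status = Status.OK;

  StatusInference(BiConsumer<String, Status> onTransition) {
    scores.put(Signal.SUCCEED, -1);
    scores.put(Signal.RETRY, 1);
    scores.put(Signal.FAIL, 3);
    this.onTransition = onTransition;
  }

  void signal(String service, Signal signal) {
    score = Math.max(0, score + scores.get(signal));
    Status inferred = score >= 6 ? Status.DEFECTIVE : score >= 3 ? Status.DEGRADED : Status.OK;
    if (inferred != status) {                 // emit only on transitions, not on every signal
      status = inferred;
      onTransition.accept(service, inferred);
    }
  }

  public static void main(String[] args) {
    StatusInference inference =
        new StatusInference((service, status) -> System.out.println(service + " is now " + status));
    for (Signal s : new Signal[] { Signal.SUCCEED, Signal.FAIL, Signal.FAIL, Signal.RETRY, Signal.SUCCEED }) {
      inference.signal("payments", s);
    }
  }
}
```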

Observability via (Verbal) Protocol Analysis

I’m always on the lookout for new ways to explain and relate to the design of the OpenSignals conceptual model of signals and states. So it was a pleasant surprise to stumble across (Verbal) Protocol Analysis during a recent certification in Design Thinking and some readings in situational awareness. VPA is a technique that has been used by researchers across many domains, including psychology, engineering, and architecture. The basic idea is that during a task, such as solving a problem, a subject concurrently verbalizes – thinks aloud – what is resident in their working memory: what they are thinking during the doing. Using Protocol Analysis, researchers can elicit the cognitive processes a subject engages in from the start to the completion of a task. After further processing, the information captured is then analyzed to provide insights that can, possibly, be used to improve performance. An advantage of verbal protocol analysis over other cognitive investigation techniques is the richness of the data recorded. Unfortunately, this richness, unstructured and diverse in expression, can quickly become voluminous, requiring post-processing such as transcription and coding before being analyzed. Sound familiar? Yes, it is the same issue site reliability engineering (SRE) teams face when their primary data sources for monitoring and observability are event logging and its sibling, distributed tracing.

The basic steps of Protocol Analysis are (1) record the verbalization, (2) transcribe the recording, (3) segment the transcription, (4) aggregate the segments into episodes, (5) encode the episodes, and finally (6) analyze the code sequencing patterns. During the transcribing step, researchers interpret the recording in terms of a glossary of domain-relevant terms. The segmentation step aims to break the verbalization into text units, segments, where a segment expresses a statement of one idea or action. In the aggregation step, some segments are collapsed and combined into episodes to make further coding and data analysis more straightforward, especially when the recording volume is sufficiently large to require sampling to reduce human effort and cost. The most crucial step in this process, the one that dictates the success of the analysis, comes down to the coding of statements. The coding scheme, where statements are mapped to processes of interest, is driven by the question or goal the researchers are pursuing. In this regard, it is essential for a coding scheme to be effective and reliable in translation and to express the aspects of concern for the investigation. Typically, a small fixed set of concept variables is encoded for each statement, with each variable having a predefined set of possible codes. In the case of an investigation into how designers think, the variables might be the "design step", "knowledge", "activity", and "object". A more abstract variable set would be "subject", "predicate", and "object". A coding scheme is reliable when ambiguity is kept to a minimum in taking a statement, or event in the real world, and mapping it to the appropriate code across different persons tasked with the coding. A scheme is effective when the coding is focused on the proper aspects of the domain and at the right level of granularity to answer questions via sequencing patterns. In the last step, analysis, researchers perform script analysis, sometimes introducing further higher-level process groupings and categorizations that can then be sequenced and analyzed – a scaling up. An example of such scaling would be OpenSignals service status inference from signaling.
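As a worked illustration of a coding scheme, consider the abstract variable set above expressed as a handful of enumerations; the particular codes are invented for the example.

```java
// Each variable ("subject", "predicate", "object") has a small, predefined set of
// codes, and a transcribed segment is reduced to one code per variable.
final class CodingScheme {
  enum SubjectCode { DESIGNER, CLIENT, ARTIFACT }
  enum PredicateCode { PROPOSES, EVALUATES, REVISES }
  enum ObjectCode { REQUIREMENT, SKETCH, CONSTRAINT }

  record CodedSegment(SubjectCode subject, PredicateCode predicate, ObjectCode object) { }

  public static void main(String[] args) {
    // A segment such as "I think the entrance should face the street" might be coded as:
    CodedSegment segment =
        new CodedSegment(SubjectCode.DESIGNER, PredicateCode.PROPOSES, ObjectCode.SKETCH);
    System.out.println(segment);
  }
}
```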

If you have ever spent time working with vast amounts of logs, metrics, and distributed tracing data, you will immediately recognize some of the above steps in turning recordings into something reliable and effective for monitoring and managing applications and systems of services. These days, most site reliability engineers get stuck in the transcribing phase, trying to bring some uniformity and meaning to many different machine utterances, especially in logs and events. I’ve witnessed many an organization start an elaborate and ambitious initiative to remap all log records into something more relatable to service level management or situational awareness via various record-level rules and pattern matches, only to abandon the initiative when the true scale of the problem is recognized, along with the human effort involved in not only defining but maintaining such things. These tasks only ever look good in vendor demonstrations, never reflecting the rate of change that all software is undergoing at present and into the future. You might ask how Protocol Analysis in practice has attempted to optimize the steps before coding. Well, by bringing the coding forward to some degree, in having transcribers already familiar with the coding scheme beforehand. It should be noted that for many doing VPA in new domains, the coding scheme is defined much later in the process. Fortunately, in the Observability space of software systems, we are dealing with machines as opposed to humans, so it is far easier to introduce appropriate coding early into the process. That is precisely what OpenSignals offers – a fixed set of variables in the form of "service", "orientation", and "phenomenon", and a set of predefined codes for orientation and phenomenon (signals and status).
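Expressed in the same style, the coding scheme the text attributes to OpenSignals might be sketched as follows; the record shape and the particular codes listed are illustrative assumptions rather than the library’s actual types.

```java
// The fixed variables - service, orientation, and phenomenon - applied to a single event.
final class SignalCoding {
  enum Orientation { EMIT, RECEIPT }
  enum Phenomenon { START, CALL, FAIL, RETRY, SUCCEED, STOP }

  record CodedEvent(String service, Orientation orientation, Phenomenon phenomenon) { }

  public static void main(String[] args) {
    // Instead of transcribing and coding a free-form log line after the fact,
    // the machine "thinks aloud" directly in coded form:
    CodedEvent event = new CodedEvent("payment-gateway", Orientation.EMIT, Phenomenon.CALL);
    System.out.println(event);
  }
}
```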

OpenSignals for Services is a template for a protocol analysis model and coding scheme for understanding and reasoning about the processing and performance of microservices involved in the coordination and cooperation of distributed work.

It is time for software services to think aloud with OpenSignals and to abandon sending meaningless blobs of data to massive event data black holes in the cloud. It is time to standardize on a model that serves the purpose of site reliability engineering and not some manufactured data addiction. Let’s have both machines and humans communicate in terms of service, signal, and status.

If you are interested in where we go from here, having managed to spend more time in analysis, then do yourself a favor and read Mark Burgess’s recent research paper – The Semantic Spacetime Hypothesis: A Guide to the Semantic Spacetime Project Work.

Orientating Observability

The orientation concept in OpenSignals seems to be somewhat novel, or at least not so familiar, amongst previous attempts at signaling. The reason for orientation, which has two values, EMIT and RECEIPT, is very straightforward once we take into account that there are two perspectives to any interaction or communication. Let’s imagine a signaling model that consists of three signals – CALL, TALK, and END. Below are two transcripts of the signals recorded at either end of a phone conversation between Alice and Bob. Before proceeding, it is important to note that both parties, Alice and Bob, are modeled as sources of signals within each context at either end of the communication channel. Bob has a model that includes Alice as well as Bob himself. Alice does likewise within her context.
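As an illustration, such a pair of transcripts, recorded without any orientation concept, might read as follows (the ordering and attribution are merely one plausible rendering, consistent with the discussion that follows):

Alice’s transcript (her context): Bob CALL, Alice TALK, Bob TALK, Bob END
Bob’s transcript (his context): Alice CALL, Bob TALK, Alice TALK, Bob END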

Can you tell who initiated the call? We could probably infer who ended the call, maybe Bob, but not the initiator. Maybe if we had some global clock we could order events across both contexts, but then we run into all the other types of problems related to clock synchronization or needing an omnipresent observer. An easy fix to the language would be introducing a new signal named CALLED; we could do likewise with TALKED and ENDED. But we would have doubled our small set of signals and somewhat broken the link between a signal emitted on one side and how it is received and interpreted on the other. With the concept of orientation, we don’t need to embed more meaning into a signal, a mere token, than is required. When Alice decides to call Bob, she will record within her context the event – Bob EMIT CALL. This might initially look strange because Bob is not present here, and yet we see him emitting a signal. But EMIT here pertains to the locality of the operation, CALL, and not the subject, Bob. Bob’s calling is happening within the context that Alice maintains of her social network and the status she infers on the members within it, including herself. When Bob picks up Alice’s call, he will record within his context – Alice RECEIPT CALL. Because the call originated outside of his context, it is a past signal that he has now received – RECEIPT always deals with the perception of something that has occurred in the past and elsewhere in terms of the context maintained. The trick to understanding and reasoning about signal orientation is to separate the act of picking up the phone once it has rung, which Bob does, from the actual signal being communicated by the action. Signals are not trying to be a comprehensive representation of physical movements or events but of the underlying meanings behind them, or what transpires in doing so. Bob is a recipient of something that was previously recorded elsewhere. Try considering signaling to operate somewhat like a double-entry bookkeeping system. An EMIT on one side will, if the signal is delivered by whatever means, result in a corresponding RECEIPT on the other. Note: if Bob did not pick up, there would be no corresponding reverse entry.
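The double-entry analogy can be sketched in a few lines of Java. The types here (Party, Entry) are invented purely to illustrate the pairing of an EMIT in one context with a RECEIPT in the other; they are not OpenSignals types.

```java
import java.util.ArrayList;
import java.util.List;

// Each party keeps a ledger of (source, orientation, signal) entries within its own
// context. An EMIT recorded on one side results in a matching RECEIPT on the other
// only if the signal is actually delivered.
final class DoubleEntrySignaling {
  enum Orientation { EMIT, RECEIPT }
  enum Signal { CALL, TALK, END }

  record Entry(String source, Orientation orientation, Signal signal) { }

  static final class Party {
    final String name;
    final List<Entry> ledger = new ArrayList<>();
    Party(String name) { this.name = name; }

    // Record the emission locally, attributed to the remote party, and, if delivered,
    // the corresponding receipt within the remote party's context.
    void emit(Party remote, Signal signal, boolean delivered) {
      ledger.add(new Entry(remote.name, Orientation.EMIT, signal));
      if (delivered) {
        remote.ledger.add(new Entry(name, Orientation.RECEIPT, signal));
      }
    }
  }

  public static void main(String[] args) {
    Party alice = new Party("Alice");
    Party bob = new Party("Bob");

    alice.emit(bob, Signal.CALL, true);    // Alice calls Bob, and Bob picks up
    System.out.println("Alice's ledger: " + alice.ledger);  // [Bob EMIT CALL]
    System.out.println("Bob's ledger:   " + bob.ledger);    // [Alice RECEIPT CALL]

    alice.emit(bob, Signal.CALL, false);   // Bob does not pick up: no reverse entry
  }
}
```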

Streamlining Observability Data Pipelines

The first generation of Observability instrumentation libraries, toolkits, and agents have taken the big data pipelines of both application performance monitoring (traces, metrics) and event collection (logs) and combined them into one enormously big pipe to the cloud.


Scaling Observability via Attention and Awareness

OpenSignals tackles scaling of observability on many fronts – at the machine and human interfaces and along the data pipeline starting at the source. Before discussing how OpenSignals addresses situational awareness in the context of a system of (micro)services (deferred for a future post), it is crucial to consider the foundational model elements independently of a domain.


The Need for a New, More Modern Reliability Stack

In this post, we consider why it may be time to abandon the approach to service level management that has been strongly advocated for in the Google Site Reliability Engineering book series. But before proceeding, let us consider some excerpts from the O’Reilly Implementing Service Level Objectives Book to set the stage for a re-examination in the context of modern environments.

“…a proper level of reliability is the most important operational requirement of a Service…There is no universal answer to the question of where to draw the line between SLI and SLO, but it turns out this doesn’t really matter as long as you’re using the same definitions throughout…SLIs are the single most important part of the Reliability Stack…You may never get to the point of having reasonable SLO targets…The key is to have metrics that enable you to make statements that are either true or not.”

At the bottom of the Google Reliability Stack are service level indicators (SLIs). These are measures of some aspect of service quality, such as availability, latency, or failures. As indicated above, the Google Site Reliability Engineering teaching likes to treat them as, or transform them into, binary values. The service is accessible or not. The service is unacceptably slow or not. The response is good or not. There is no grey area here because of the need to turn such measures into a ratio of good events to total events. A significant amount of confusion arises when it comes to the next layer in the Reliability Stack, where targets are defined – Service Level Objectives. This confusion mainly occurs when SRE teams make an SLI a measure of achieving a goal, which you would expect to be defined by an objective. In such cases, the difference between an SLI and an SLO pertains mostly to windowing – some aspect of the ratio distribution is compared with some target value over a specified time interval. Layered on top of the SLOs are the Error Budgets, used to track the difference between the indicator and the objective, again over time, and mostly at a different resolution suitable for policing and steering change management. Measures on measures on measures – and all quantitative.
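A back-of-the-envelope sketch of the arithmetic in this stack, under the common SRE convention of an SLI as good events over total events; the numbers are invented for illustration.

```java
// Indicator, objective, and error budget, all derived from one another over a window.
final class ReliabilityStack {
  public static void main(String[] args) {
    long goodEvents = 986_543;           // e.g. requests answered under the latency threshold
    long totalEvents = 1_000_000;        // all valid requests in the SLO window

    double sli = (double) goodEvents / totalEvents;          // indicator: a ratio
    double slo = 0.99;                                       // objective: target over the window

    double errorBudget = 1.0 - slo;                          // allowed unreliability: 1%
    double errorsObserved = 1.0 - sli;                       // unreliability actually spent
    double budgetConsumed = errorsObserved / errorBudget;    // > 1.0 means the budget is blown

    System.out.printf("SLI = %.4f, SLO = %.2f, error budget consumed = %.0f%%%n",
        sli, slo, budgetConsumed * 100);
  }
}
```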

It is unfortunate that so little, if any, critical thinking enters engineers’ minds once Google is mentioned as the originator of something, because there are some serious design and operational issues here. Google itself is open and honest about this in stating that many organizations adopting a service level objectives approach to reliability will not even get beyond defining service level indicators. Even then, service level indicators and objectives are invariably limited to edge entry points to a service or system of services. Many organizations don’t get off to a good start along this journey because beneath the service level indicator layer lies the messy and increasingly bloated world of metrics, traces, and logs. The notion of a Service, and a Consumer of such, is all but lost amidst this data fog.

Once an organization manages to get past translating data into information suitable for indicators and objectives, it faces the most significant challenge to this whole model: for every service level indicator, there is at least one service level objective, and for every service level objective, there is at least one error budget. This is a stack of similarly sized layers – there is no compression moving up it. Things get completely out of control and extremely costly (except maybe for the likes of Google) when multiple service level objectives are defined for a service level indicator. Here is where site reliability engineering becomes a profession of glorified spreadsheet data entry clerks. And because there is so much tight coupling between the layers, any change in one layer ripples throughout, creating an unwieldy maintenance effort. Being able to consolidate and compress the model in the form of systems, to reduce the management burden, is made complicated by the reliance on quantitative measures rather than qualitative ones. This just does not scale unless you’re Google. There are 581 pages in the O’Reilly book; that should be sufficient warning in itself that simplicity is not to be found here.

What is strikingly odd about the whole site reliability engineering effort, where the human user is placed front and center, is that on the whole there is nothing particularly humane about it, except that Google and those who have adopted this approach focus their doctrine exclusively on measuring and monitoring at the edge and solely on human-to-machine interactions. System engineers need a new model that scales both up and down and can be applied effectively and efficiently at the edge and within a system of services. Otherwise, there will be a reliability model for this and another model for that. To realistically scale, the cost and complications of the lower layers need to be significantly reduced, moving up and outwards to other parties. The service level management language needs to be significantly simplified; instead of talking about objectives in terms of four or five nines, operations’ attention should be on a standard set of signals (outcomes and operations) and an even smaller set of status values. The primary service management task for operations should be configuring the scoring of sequences of signals and status changes between services. Inter- and intra-communication should relate near exclusively to the subjective view of meaningful service states. Just that!

Appendix A: Service Level Objective Examples

Google SRE
99% (averaged over 1 minute) of Get RPC calls will complete in less than 100 ms (measured across all the backend servers)

Google SRE
90% of Get RPC calls will complete in less than 1 ms
99% of Get RPC calls will complete in less than 10 ms
99.9% of Get RPC calls will complete in less than 100 ms

OpenSignals
Service A will have a subjective OK status 99% of the time

OpenSignals has 16 built-in signals used to infer the status of a service, whereas, in the Google SLO specification above, there is only one signal referenced. Typically, there will be at least three service level indicators, and in turn, three objectives and error budgets.

The Unchanging of Observability and Monitoring

Looking back over 20 years of building application performance monitoring and management tooling, it seems that not much has changed or been achieved beyond today’s tooling collecting data from far more sources and offering more attractive web-based interfaces and dashboards than before, all of which can be rendered in dark mode. That last one is not a joke; one application performance monitoring vendor, now turned observability platform vendor, just yesterday announced dark mode as a killer feature.

Why is observability so much like what monitoring was back in 2000, when I started designing and developing my first profiling tool, named JDBInsight? I get that there is a repetitive cycle to much of what happens in fashion, but this is systems engineering. I suspect that not much has fundamentally changed to the degree a past me had hoped because of everything else changing within the environment of the tooling. Tooling has changed, and yet nothing has changed. That does not make sense, or does it? Much of the change touted by product marketing departments relates to engineering efforts to keep tooling applicable in these new environments of containers, cloud, and microservices. Vendors like Instana, Dynatrace, AppDynamics, and New Relic spend a considerable amount of their engineering budget simply maintaining instrumentation extensions for the hundreds of platforms, products, projects, and programming languages. So when I say that not much has changed, I am referring to the positioning of tooling on a map of progress. Nearly all the vendors listed above are stuck within the environment segment, unable to deliver real breakthroughs that would effectively change the operational monitoring and management landscape for them and their customers.

Cognition, control, and communication are still largely deferred and delegated to humans, outside of tooling. Application performance monitoring vendors can keep talking up “intelligence” without ever having to deliver on what many, outside of the computing industry, consider intelligence to be – (re)action appropriate to the context, stimulus, and goal setting. There can never be real human-like intelligence delivered as a software service without, at minimum, the ability to link past and predicted observation to controllability – an intervention following awareness and reasoning of a situation. Today, it is next to impossible to automate the linking of observability to controllability because the shared communication model, internal and external to tooling and humans, does not exist.

Cognition and control will never emerge from data and details. Traces, metrics, and logs are just too low-level and noisy to serve as an effective and efficient model for tracking, predicting, and learning from human and machine interventions within a system. Regardless, such yesteryear approaches are not sustainable. In the end, observability and controllability need to be embedded directly within the application software itself. The imbuing of software with self-reflection and self-adaptability has not occurred because observability instrumentation rarely considers the need for local decision-making and steering through control valves or other similar control theory technologies and techniques. Instead of thinking about data, pipelines, and sinks, engineers need to refocus on the significance of signals and how they should be scored to infer a state; otherwise, the next 20 years will be much the same.

OpenSignals is the right kind of Domain-Oriented Observability

For many engineers new to observability, and to application performance monitoring for that matter, domain-oriented observability looks very much like adding interceptors and callbacks into an application codebase to simplify calling out to a generic data collection library such as distributed tracing, metrics, or even logging as a last resort. Unfortunately, the domain-oriented approach rarely reflects the needs of a useful service level management framework because it becomes lost within the business domain’s details. The pendulum seems to swing between two extremes: highly domain-specific named callbacks, or generic non-descriptive data sinks.

A highly domain-specific approach to observability ends up with hundreds of <Task>Instrumentation interceptor-like interfaces containing various lifecycle hook methods that mirror the business function, such as startCreateOrder, failedCreateOrder, and completedCreateOrder – a significant step backward from what was hoped for with Aspect-Oriented Programming (AOP). This approach lacks the level of abstraction and (conceptual) model needed for the effective operational management of a system of services, which is pretty much what software is and has always been to some extent. The concept of a managed service is all but absent here, as is the standard set of signals one could easily imagine being captured in observing the service interactions.
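For illustration, the anti-pattern described here might look like the following, multiplied across every business task in a codebase; the interface is invented for the example.

```java
// One bespoke interceptor interface per business task, with lifecycle hooks named
// after the business function rather than any shared operational model.
interface CreateOrderInstrumentation {
  void startCreateOrder(String orderId);
  void completedCreateOrder(String orderId);
  void failedCreateOrder(String orderId, Exception cause);
}

// ...and another, and another: ShipOrderInstrumentation, CancelOrderInstrumentation,
// RefundPaymentInstrumentation, each with its own start/completed/failed trio, none of
// which share a common notion of a Service, a Signal, or a Status.
```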

At the other end of the scale, we have the yesteryear generic data-centric approaches being dragged kicking and screaming into the modern world of microservices architecture and the observability requirements that come with it. Again, common service level management concepts such as Service, Signal, and Status are wholly missing here. Instead, the instrumentation library interfaces consist of calls to start and stop a Trace or Span, increment and decrement a Counter, or log a LogRecord to a Logger.
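A sketch of this extreme, using the OpenTelemetry Java tracing and metrics API and java.util.logging as representative examples of the generic calls named above.

```java
import io.opentelemetry.api.GlobalOpenTelemetry;
import io.opentelemetry.api.metrics.LongCounter;
import io.opentelemetry.api.trace.Span;
import io.opentelemetry.api.trace.Tracer;
import java.util.logging.Level;
import java.util.logging.LogRecord;
import java.util.logging.Logger;

// Everything is expressed in terms of spans, counters, and log records;
// nowhere does a Service, Signal, or Status appear.
final class DataCentricInstrumentation {
  private static final Tracer tracer = GlobalOpenTelemetry.getTracer("orders");
  private static final LongCounter failures =
      GlobalOpenTelemetry.getMeter("orders").counterBuilder("order.failures").build();
  private static final Logger logger = Logger.getLogger("orders");

  void createOrder(Runnable work) {
    Span span = tracer.spanBuilder("createOrder").startSpan();   // start a span
    try {
      work.run();
    } catch (RuntimeException e) {
      failures.add(1);                                           // increment a counter
      logger.log(new LogRecord(Level.WARNING, "createOrder failed: " + e.getMessage()));
      throw e;
    } finally {
      span.end();                                                // stop the span
    }
  }
}
```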

OpenSignals for Services, on the other hand, strikes the perfect balance between these two extreme views of observability. Its model is specifically designed for service level management needs with the introduction of a set of fifteen Signal values and five Status values that can be fired by and associated with a Service, respectively. The domain is not the business. The domain is not the data. The domain is the system of services and the interactions that take place. Here methods on the Service interface capture what is essential to observability, including start, stop, call, fail, succeed, recourse, retry, reject, elapse, drop, delay, etc.
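By way of contrast, a sketch of the service-centric shape just described. The method names follow those listed above, but the interfaces are an illustrative approximation rather than the actual OpenSignals for Services API.

```java
// A single, small operational vocabulary shared by every service, whatever the business task.
interface Service {
  void start();
  void stop();
  void call();
  void succeed();
  void fail();
  void retry();
  void recourse();
  void reject();
  void elapse();
  void drop();
  void delay();
}

final class CreateOrder {
  private final Service orders;   // obtained elsewhere from a context keyed by environment and name

  CreateOrder(Service orders) { this.orders = orders; }

  void execute(Runnable work) {
    orders.start();
    try {
      work.run();
      orders.succeed();           // the same signals, regardless of the business function
    } catch (RuntimeException e) {
      orders.fail();
      throw e;
    } finally {
      orders.stop();
    }
  }
}
```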