Streamlining Observability Data Pipelines

When more is less, and less is more

The first generation of Observability instrumentation libraries, toolkits, and agents has taken the big data pipelines of both application performance monitoring (traces, metrics) and event collection (logs) and combined them into one enormous pipe to the cloud. Beyond some primitive sampling and buffer management, these measurement and data collection components are near exclusively focused on transmitting bloated data payloads over the network. There is very little intelligence, in the way of adaptivity or selectivity with respect to context and situation, baked into the software. In some cases, where an agent is deployed alongside, there is automatic discovery and instrumentation of processes, but little operational runtime regulation. You might imagine the reason for making such components and pipelines as simplistic as possible is to reduce the consumption of processor power at the source, but this would be somewhat naive. Most of the time, the overriding principle is to move the data away from the origin and into the space of the service provider, keeping the components, connectors, and channels as dumb as possible, with all the magic happening in backend processes and services under the operational and change management of the service provider. The engineers building such primitive data collection components are not thinking in terms of services, situation, signals, status, significance, or other forms of synthesis of information. Because of this, much of the data collected is redundant or of little real value or importance. The amount of data duplication in transmission, and the inefficiency of converting from one encoding to another as data moves from a service to a library to an agent to an endpoint to a queue and then onto a store, is extraordinarily high by any enterprise standard. This is why data is invariably sampled or dropped precisely when it most needs to be collected: under unexpectedly high workload volumes.

Instrumentation libraries such as distributed tracing and logging are, at the core of their design, just data sinks or black holes where much of the data goes off never to be seen again. Not much regard is given to meaning or relevance with respect to some operational goal or evolving situation. Value is something that happens in the future, when the data is required and accessed, if at all. The attempted reconstruction of the situation is meant to happen after data transmission to the cloud, if the situation is even considered within reporting tooling. Because the primary consumer of observability is painted as a human aimlessly wandering in a dataverse, waiting for an unknown of unknowns to plop right in front of their face, there is simply no way to curate this data at the source before transfer. Garbage in, garbage out at the other end. And what do data pipelines do with such waste? Much like waste depots in the real world, they use batching and compression, adding layers and layers of data on top of an ever-growing wasteland of computation and storage that heavily taxes human cognition. Well, this is how it has appeared to someone like myself, coming from a low-latency, high-frequency computing background. It is both shocking and amazing how convoluted and complicated any type of “intelligent” computational processing is to achieve at the backend in making all this seem somewhat coherent, consistent, and credible when clearly it is not and cannot ever be. Invariably, most vendors and customers give up on offering or expecting automatic expert advice and assistive operations beyond the HelloWorld-like low-hanging fruit, resorting to ad hoc querying and custom dashboard capabilities, effectively pushing the problem elsewhere and then claiming this dereliction of service is a brand new thing: a platform. The community accepts this because, for most, the act of data collection and dashboard creation is rewarding, and no one seems to know any better.
When someone does challenge the effectiveness and efficiency of such efforts, the hundreds, if not thousands, of dashboards are wheeled out to spray-paint a picture of just how complex reality is, not how simple things could and should be. Today’s site reliability engineers (SREs) are fearful of not collecting every piece of data, yet at the same time they willingly accept that their situational awareness is still at ground zero and that they are running blind to what truly matters: signals and states.

The next generation of Observability technologies and tooling will most likely take two distinctly different trajectories away from the ever-faltering middle ground that distributed tracing and event logging currently represent. The first trajectory, the high-value road, will introduce new techniques and models to address complex and coordinated system dynamics in a collective social context, rebuilding a proper foundation geared to helping both humans and artificial agents assess and understand the current and predicted (projected) state of a system of services, resources, and schedulers in terms of operational goals such as systems service level management. Simultaneously, there will be a strong push for capturing and reconstructing software execution flow within and across services via episodic machine memory mirroring. This is the high-fidelity road of near-real-time simulated playback, activated on demand by situational awareness tooling following signal and status changes that direct both human and machine operator attention. This second trajectory will not be bloated like today’s pipelines because it will focus near exclusively on replicating fine-grained method executions, not invocation parameters, request payloads, and whatever baggage items developers like to package in with tracing and logging records without much thought for utility and overhead. The goal here is to mimic behavior, not to be yet another failed attempt at journaling data across boundaries. We need to restore the balance between collection, cognition, and control.

OpenSignals offers a solution that vastly streamlines the Observability data pipeline by shifting much of the computation from the backend to the source (systems and services) while being substantially more efficient than current data collection components. Instead of collecting many arbitrary data values alongside the events (log records or trace spans) captured, OpenSignals concentrates on what is significant and relevant to the situation and operational goal, in the form of services, signals, and inferred states. Inference processing, where a service’s status is derived from the sequencing of emitted signals (drawn from a set of sixteen behavioral codes), is performed locally. No data transmission needs to occur at the signaling level, though OpenSignals allows for such cases via callbacks if need be. Even when transmitted remotely, a signal event need only consist of three fields: the service name, the orientation token, and the signal token. This is a tiny fraction of what is involved in sending a log record or distributed trace, especially when stack traces, tags, labels, events, and fields are factored in. OpenSignals deals in single-digit byte-sized events, whereas yesteryear's instrumentation libraries size their events in kilobytes. You would be mistaken to believe that observability is significantly reduced by focusing on signals and status tokens. It is quite the opposite. Because of the efficient design of OpenSignals, a service need not be just an endpoint or exit point in some distributed workflow. A service can be as small as a block of code within a method executed millions of times a second within a process or runtime. OpenSignals allows you to decompose a microservice into hundreds of sub-services without the significant perturbation of processing times that would come with distributed tracing or event logging.
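
To make the three-field event and the local inference step concrete, here is a minimal Python sketch. It is not the OpenSignals API: the `Orientation`, `Signal`, and `Status` names, the small subset of signal codes, the packing layout, and the failure-ratio thresholds are all assumptions made up for illustration. It assumes the service name has been interned to a 16-bit identifier so that a whole event fits in three bytes.

```python
import struct
from collections import deque
from enum import IntEnum


class Orientation(IntEnum):
    # Assumed meaning: self-reported versus observed from a peer.
    EMIT = 0
    RECEIPT = 1


class Signal(IntEnum):
    # Illustrative subset only; the sixteen codes are not listed in the article.
    START = 0
    STOP = 1
    CALL = 2
    SUCCEED = 3
    FAIL = 4
    RETRY = 5


class Status(IntEnum):
    # Hypothetical status values, ordered by severity.
    OK = 0
    DEGRADED = 1
    DEFECTIVE = 2


def encode(service_id: int, orientation: Orientation, signal: Signal) -> bytes:
    """Pack an event into 3 bytes: 16-bit service id, then both tokens in one byte."""
    return struct.pack(">HB", service_id, (orientation << 4) | signal)


def decode(payload: bytes):
    """Reverse of encode()."""
    service_id, tokens = struct.unpack(">HB", payload)
    return service_id, Orientation(tokens >> 4), Signal(tokens & 0x0F)


class LocalService:
    """Infers status at the source from a sliding window of outcome signals."""

    def __init__(self, name, window=10, on_status_change=None):
        self.name = name
        self.status = Status.OK
        self._outcomes = deque(maxlen=window)
        self._on_status_change = on_status_change

    def emit(self, signal: Signal) -> None:
        if signal in (Signal.SUCCEED, Signal.FAIL):
            self._outcomes.append(signal)
        new_status = self._infer()
        if new_status != self.status:
            self.status = new_status
            # Only a status change leaves the service, and only via callback.
            if self._on_status_change is not None:
                self._on_status_change(self.name, new_status)

    def _infer(self) -> Status:
        if not self._outcomes:
            return Status.OK
        failures = sum(1 for s in self._outcomes if s is Signal.FAIL)
        ratio = failures / len(self._outcomes)
        # Assumed thresholds, chosen only for the example.
        if ratio >= 0.5:
            return Status.DEFECTIVE
        if ratio >= 0.2:
            return Status.DEGRADED
        return Status.OK
```

In this sketch every call to `emit` is an in-process update of a small window, and nothing is serialized unless the inferred status actually changes, which is the property the paragraph above describes.
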
Of course, the degree of new code coverage will depend on the underlying service provider implementation of OpenSignals deployed and the configuration of the plugins installed. It is expected that, for most large-scale systems, OpenSignals will only transmit status changes across the network for collective intelligence, where the inferred status values for a service, taken from multiple sources, will be aggregated by way of set policies. Further to this, status changes will be propagated into higher system contexts and then forwarded on to more sophisticated supervision and control routines for profiling and prediction purposes: fast, cheap, scalable, and effective.
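
One way to picture the aggregation of status values from multiple sources is the following Python sketch. The `StatusAggregator` class, the `worst_of` and `majority` policies, and the `Status` values are hypothetical names invented for this example, not anything defined by OpenSignals; the only ideas taken from the text above are that each source reports its locally inferred status, a policy combines them, and only changes in the combined status are forwarded.

```python
from collections import Counter
from enum import IntEnum


class Status(IntEnum):
    # Hypothetical status values, ordered by severity.
    OK = 0
    DEGRADED = 1
    DEFECTIVE = 2
    DOWN = 3


def worst_of(statuses):
    """Policy: the collective status is the most severe one reported."""
    return max(statuses)


def majority(statuses):
    """Policy: the most common status wins; ties go to the more severe."""
    counts = Counter(statuses)
    return max(counts, key=lambda s: (counts[s], s))


class StatusAggregator:
    """Combines per-source statuses and forwards only changes."""

    def __init__(self, policy, publish):
        self._policy = policy
        self._publish = publish   # called as publish(service, status)
        self._reports = {}        # service -> {source: latest status}
        self._aggregate = {}      # service -> last published status

    def report(self, service, source, status):
        by_source = self._reports.setdefault(service, {})
        by_source[source] = status
        combined = self._policy(by_source.values())
        if self._aggregate.get(service) != combined:
            self._aggregate[service] = combined
            # Propagate upward only when the collective view changes.
            self._publish(service, combined)


# Usage: three hosts report on the same service under a majority policy.
forwarded = []
aggregator = StatusAggregator(majority, lambda svc, st: forwarded.append((svc, st)))
aggregator.report("checkout", "host-a", Status.OK)
aggregator.report("checkout", "host-b", Status.DEGRADED)
aggregator.report("checkout", "host-c", Status.OK)
aggregator.report("checkout", "host-c", Status.OK)  # no change, nothing forwarded
```

The `publish` callback stands in for whatever carries a status change into a higher system context; swapping `majority` for `worst_of` gives a more pessimistic collective view without touching the sources.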

If you are interested in a far more thorough and objective viewpoint, you cannot do any better than reading Mark Burgess’s paper.