Scaling Observability to Service Level Management

Scaling by way of abstraction, aggregation, and compression is critical to the effective and efficient service level management of large-scale, highly connected systems. Scaling here does not mean merely storing vast amounts of observability data, often of questionable value. We have seen that play out in the application monitoring space with the move to cloud hosting, without much improvement in our day-to-day ability to monitor and manage. No, scaling here pertains to the operators’ (human or machine) ability to quickly observe and assess the level of service quality across hundreds, if not thousands, of interconnected systems, services, and endpoints. The primary reason distributed tracing, metrics, and event logging fail to deliver much-needed capabilities and benefits to systems engineering teams rests squarely with the underlying observability model. That model offers no natural or inherent way to transform and scale the analysis of collected observability data into the generation of signals and the inference of states.

An Operational Matrix

Current observability technologies and techniques make the fundamental mistake of assuming that someone, somewhere, will spend an enormous amount of time and money wading through vast amounts of quantitative data and somehow churn out, after the fact, the many aggregations and rules needed to generate signals and alerts. Of course, this all assumes it is indeed possible to combine and relate metrics, logs, or traces into something that a classification function can evaluate. Well, sorry to say, but this is impractical, if not impossible, to do and maintain, even if it were possible to merely aggregate quantitative measurements such as response time across different systems, services, and endpoints. The service level management and alerting demonstrations that vendors often tout as smart or intelligent only ever work with toy systems and applications, because the model is woefully inadequate for this kind of qualitative performance analysis. A diagnostic-like model is not going to transform into a useful service level model, no matter how much data or engineering effort you throw at it. Stop. Rethink.

Aggregating Matrix Models

OpenSignals, on the other hand, has been designed with the end goal in mind: a service level model that scales by way of qualitative measures (signals), composition (naming), scoring (status), and scoping (context). There are no expensive intermediary steps in getting to a signal from a mixed bag of structured, semi-structured, and unstructured data and data types. OpenSignals is primarily a signaling library. It starts with the creation of a named service within a context. From the perspective of a library user (application developer or observability engineer), it then concerns itself with the emission or receipt of signals. It is the responsibility of the OpenSignals service provider implementation (SPI), and of those tasked with configuring its runtime, to score and propagate signals and inferred status over a namespace and/or across contexts within a distributed environment. Tooling and solutions built on OpenSignals concern themselves primarily with policies, such as how the statuses of multiple endpoints merge at the service and system level, or which consensus mechanism to employ across a collection of process (or execution unit) instances housing services and endpoints. The OpenSignals model caters to both logical (name) and physical (context) structuring.
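To make the flow from signal to status more concrete, here is a minimal sketch in Python. It is not the OpenSignals API: every name in it (Signal, Status, Endpoint, Context), the two-signal vocabulary, the window size, the failure-ratio thresholds, and the worst-case merge policy are assumptions chosen purely for illustration. It shows endpoints named within a context emitting qualitative signals, a scoring step that infers a status from recent signals, and a policy that merges endpoint statuses up the dotted namespace.

```python
from collections import deque
from enum import Enum, IntEnum


class Signal(Enum):
    """Qualitative outcome of a call (hypothetical two-value vocabulary)."""
    SUCCEED = "succeed"
    FAIL = "fail"


class Status(IntEnum):
    """Inferred service level, ordered so that max() selects the worst."""
    OK = 0
    DEGRADED = 1
    DOWN = 2


class Endpoint:
    """A named emitter of signals that keeps only a small recent window."""

    def __init__(self, name, window=10):
        self.name = name
        self.recent = deque(maxlen=window)

    def emit(self, signal):
        self.recent.append(signal)

    def status(self):
        # Scoring policy (assumed): share of FAIL signals in the window.
        if not self.recent:
            return Status.OK
        failures = sum(1 for s in self.recent if s is Signal.FAIL)
        ratio = failures / len(self.recent)
        if ratio >= 0.5:
            return Status.DOWN
        if ratio >= 0.25:
            return Status.DEGRADED
        return Status.OK


class Context:
    """Physical scope (for example, one process) holding named endpoints."""

    def __init__(self, name):
        self.name = name
        self.endpoints = {}

    def endpoint(self, name):
        if name not in self.endpoints:
            self.endpoints[name] = Endpoint(name)
        return self.endpoints[name]

    def status(self, prefix=""):
        # Merge policy (assumed): worst status among names under the prefix.
        statuses = [
            e.status()
            for n, e in self.endpoints.items()
            if not prefix or n == prefix or n.startswith(prefix + ".")
        ]
        return max(statuses, default=Status.OK)


ctx = Context("process-1")
pay = ctx.endpoint("shop.checkout.pay")
search = ctx.endpoint("shop.search.query")

for s in [Signal.SUCCEED] * 7 + [Signal.FAIL] * 3:
    pay.emit(s)
for s in [Signal.SUCCEED] * 10:
    search.emit(s)

print(ctx.status("shop.checkout").name)  # DEGRADED
print(ctx.status("shop.search").name)    # OK
print(ctx.status("shop").name)           # DEGRADED
```

Note that nothing quantitative leaves the endpoint. What travels up the namespace is a small, ordered status value, which is why the merge at the service and system level stays cheap regardless of how many endpoints sit underneath. A real deployment would swap in its own scoring and merge policies, and would need a further step to reconcile status across contexts.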