Scaling Observability to Service Level Management

Scaling: Abstract • Aggregate • Compress

Scaling by way of abstraction, aggregation, and compression is critical to the effective and efficient service level management of large-scale, highly connected systems. Scaling here does not mean merely storing vast amounts of observability data, much of it of questionable value. We have already seen that play out in the application monitoring space with the move to cloud hosting, without much improvement in our day-to-day ability to monitor and manage. No, scaling here refers to the ability of operators (human or machine) to quickly observe and assess the level of service quality across hundreds, if not thousands, of interconnected systems, services, and endpoints.

Tracing: Failing at Scaling

The underlying observability model is the primary reason distributed tracing, metrics, and event logging fail to deliver much-needed capabilities and benefits to systems engineering teams. There is no natural or inherent way to transform and scale the collection and analysis of such observability data into the generation of signals and the inference of states.

Passing the Buck(et)

Current observability technologies and techniques make the fundamental mistake of assuming that someone, somewhere, will spend an enormous amount of time and money wading through vast amounts of quantitative data, and somehow, miraculously, churn out a large number of aggregations and rules that generate signals and alerts after the fact. This, of course, assumes it is even possible to combine and relate metrics, logs, or traces into something a classification function can evaluate. Sorry to say, but in practice this is impractical, if not impossible, to do and maintain, even for something as basic as aggregating a quantitative measurement such as response time across different systems, services, and endpoints.

Data Driven vs Model Managed

The service level management and alerting demonstrations that vendors tout as smart or intelligent only ever work with toy systems and applications, because the underlying model is woefully inadequate for this kind of qualitative performance analysis. A diagnostics-oriented model will not transform into a useful service level model, no matter how much data or engineering effort you wastefully throw at it. Stop. Rethink.

A Goal Oriented Model

OpenSignals, on the other hand, has been designed with the end goal in mind: a service level model that scales by way of qualitative measures (signals), composition (naming), scoring (status), and scoping (context). There are no expensive intermediary steps in getting to a signal from a mixed bag of structured, semi-structured, and unstructured data and data types. OpenSignals is, first and foremost, a signaling library. It starts with the creation of a named service within a context and then concerns itself with the emission or receipt of signals from the perspective of a library user, whether an application developer or an observability engineer.
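To make that flow concrete, here is a minimal sketch in Java of what creating a named service within a context and emitting signals could look like. The type and method names (Context, Service, Signal, emit) are illustrative placeholders chosen for this sketch, not the actual OpenSignals API.

```java
// Illustrative sketch only: Context, Service, and Signal are hypothetical
// placeholders, not the actual OpenSignals API.
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;

enum Signal { START, SUCCEED, FAIL }

final class Service {
  private final String name;

  Service(String name) { this.name = name; }

  // The library user only emits (or records receipt of) signals; scoring and
  // propagation are delegated to whichever provider implementation is configured.
  void emit(Signal signal) {
    System.out.printf("%s -> %s%n", name, signal);
  }
}

final class Context {
  private final ConcurrentMap<String, Service> services = new ConcurrentHashMap<>();

  // Services are created (or looked up) by name within a context.
  Service service(String name) {
    return services.computeIfAbsent(name, Service::new);
  }
}

final class Example {
  public static void main(String[] args) {
    Context context = new Context();
    Service payments = context.service("shop.checkout.payments");

    payments.emit(Signal.START);    // work has begun
    payments.emit(Signal.SUCCEED);  // work completed successfully
  }
}
```

Note that the library user's vocabulary stops at naming and signaling; everything downstream of the emit call belongs to the provider, which is what keeps the model cheap at the point of instrumentation.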

An Open Service Provider Interface

It is the responsibility of the OpenSignals service provider interface (SPI) implementation, and of those tasked with configuring its runtime, to perform the scoring and propagation of signals and inferred status over a namespace and across contexts within a distributed environment. Tooling and solutions built on OpenSignals concern themselves primarily with policies, such as how the statuses of multiple endpoints merge at the service and system level, or which consensus mechanism to employ across a collection of process (or execution unit) instances housing services and endpoints. The OpenSignals model caters to both logical (name) and physical (context) structuring.
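A hedged sketch of what such a provider contract might look like follows, reusing the hypothetical Signal enum from the earlier sketch. SignalsProvider and Status are illustrative names, not the actual OpenSignals SPI; the point is only to show where a provider plugs in its scoring of signals into status and its policy for merging statuses across endpoints or process instances.

```java
// Hypothetical SPI sketch: SignalsProvider and Status are illustrative names,
// not the actual OpenSignals SPI. Signal is the enum from the previous sketch.
import java.util.Collection;
import java.util.Comparator;

// Ordered from least to most severe so that "max" means "worst"; UNKNOWN sorts lowest.
enum Status { UNKNOWN, OK, DEVIATING, DEGRADED, DEFECTIVE, DOWN }

interface SignalsProvider {
  // Score the signals emitted by (or received for) a named service into an
  // inferred status within the current context.
  Status score(String serviceName, Iterable<Signal> signals);

  // Policy hook: how the statuses of multiple endpoints merge at the service or
  // system level, or how consensus is reached across process instances.
  default Status merge(Collection<Status> statuses) {
    // Naive placeholder policy: report the worst observed status.
    return statuses.stream()
                   .max(Comparator.naturalOrder())
                   .orElse(Status.UNKNOWN);
  }
}
```

A real provider would replace the worst-of merge with whatever policy the operator configures, such as quorum-based consensus across instances or weighting by endpoint criticality.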

[Diagram: a SERVICE and its STATUS, assessed across INGRESS (CALLERS), TRANSIT (SELF), and EGRESS (CALLEES).]
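One way to read the diagram's vocabulary is as three vantage points on a service: inbound work arriving from callers, the service's own processing, and outbound work dispatched to callees. The Flow, Party, and Vantage types below are hypothetical illustrations of that reading, not part of OpenSignals.

```java
// Illustrative reading of the diagram labels; Flow, Party, and Vantage are
// hypothetical types, not part of the OpenSignals API.
enum Flow { INGRESS, TRANSIT, EGRESS }

enum Party { CALLERS, SELF, CALLEES }

// A service's status can be assessed per vantage point: inbound work from
// callers, the service's own processing, and outbound work to callees.
record Vantage(Flow flow, Party party) {
  static final Vantage INBOUND  = new Vantage(Flow.INGRESS, Party.CALLERS);
  static final Vantage INTERNAL = new Vantage(Flow.TRANSIT, Party.SELF);
  static final Vantage OUTBOUND = new Vantage(Flow.EGRESS,  Party.CALLEES);
}
```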