The Origins of Observability Signals

Profiling a Profiler

Around 2012 and 2013, we started looking at ways to optimize the recording and playback of episodic machine memories – a challenge in tuning an instrument already used to profile and optimize many other low-latency software systems. How do you profile the best profiler for a particular programming language without dropping down to a lower layer in the computing stack? You shift away from the measuring techniques of the existing instrument – moving from timing to a cheaper form of measurement: counting. Timing the timer is neither practical nor very productive in the long run. Invariably you end up changing code at particular points in the source and hoping the benchmark results look better – very much trial and error.

Performance Predictors

The alternative is to look for predictors of performance in the code and then figure out how to capture such predictors cheaply, without perturbing the performance of the system under observation. At the time, we chose to refer to these predictors of performance as a Signal, and to the functions – the hierarchical scopes of execution they occurred within – as a Boundary. Eventually, the range of what a signal referred to came to include both Operation and Outcome – not just any Phenomenon, only those of particular significance.
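
To make these terms a little more concrete, here is a minimal sketch of how such concepts might be modeled in code. The type names, enum members, and methods below are illustrative assumptions for this article, not an actual library API.

```java
// Illustrative sketch only: these types are assumptions made for this
// article, not an existing API.

// A signal names a phenomenon of particular significance: an operation
// performed or an outcome observed.
enum Operation { GET, PUT, RESIZE }
enum Outcome { SUCCEED, FAIL }

// A boundary is the hierarchical scope of execution a signal occurs within,
// typically a function or an enclosing call.
interface Boundary {
  // Record that a signal fired within this scope, optionally with a strength.
  void signal(Operation operation, long strength);
  void signal(Outcome outcome);

  // Boundaries nest, much like frames on a call stack.
  Boundary enter(String name);
  void exit();
}
```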

Counting Probes

To explain what we initially deemed a signal, we assume you know of a Map – a popular programming data structure. A Map has two necessary operations – Get and Put. The Get function takes a key as an argument and returns a value mapped within the data structure by a previous associative Put operation. Unless a Map has a backing store with some form of distributed memory cache, both operations will execute extremely fast, well below the resolution of a clock timer. Timing the execution here is pretty much ineffective unless either operation is performed a large number of times within a single period. Counting an operation on a Map is far cheaper than timing it, but such a count only makes sense within some other enclosing execution scope. What other things can we count? Well, we could count the outcomes. The outcome of a Map::Get operation can be either SUCCEED or FAIL. Tracking the outcome of a Map::Get function can be helpful and is commonly done when a Map is, in effect, a cache. But again, such a count would be far more useful within the scope of another operation. We can’t be sure that a particular outcome changes the lookup operation’s performance, but it is still helpful to know the outcome and to associate it with other counts.
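
Below is a minimal sketch of what counting outcomes on a cache-like Map might look like; the class and counter names are assumptions made for illustration.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.LongAdder;

// A hypothetical cache wrapper that counts Get outcomes rather than timing
// them; counting is far cheaper than timing at this resolution.
final class CountingCache<K, V> {
  private final Map<K, V> store = new ConcurrentHashMap<>();
  private final LongAdder succeeded = new LongAdder(); // Get found a value
  private final LongAdder failed = new LongAdder();    // Get found nothing

  V get(K key) {
    V value = store.get(key);
    if (value != null) {
      succeeded.increment(); // outcome: SUCCEED
    } else {
      failed.increment();    // outcome: FAIL
    }
    return value;
  }

  void put(K key, V value) {
    store.put(key, value);
  }

  long succeedCount() { return succeeded.sum(); }
  long failCount()    { return failed.sum(); }
}
```

On their own, these two counters say little; scoped within an enclosing operation – say, a request handler – they become far more telling.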

Under the Hood

We need to peek below the surface of the Map::Get function. In what way can the performance of the Map::Get function deviate from expectations? The hashing of the key parameter within the function call is something that is always done. Variance only enters the cost equation when the hash lands at an address with multiple possibilities, and each one needs to be evaluated. The number of collisions, and the position of our key within the list, dictate the number of relatively expensive comparison operations performed. The signal we are looking for is the count of key comparisons; anything above one is unexpected (in most cases). Not finding a matching key, and in turn a value, after scanning a list of possibilities merely adds salt to the wound. If a Map uses open addressing, then for the Map::Put operation the cost is determined by the number of slots probed before a free one is found. A far more expensive outcome of a Map::Put operation would be resizing the underlying storage array; here, the number of existing elements dictates the cost. So it is not enough to have a signal, say RESIZE; we also need a signal strength representing the size of the Map. Signaling resizing here with the existing capacity as the strength is as cheap as observability gets.
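
Here is a simplified, linear-probing sketch showing where such signals would fire; the emit call and signal names are assumptions made for illustration, not a real instrumentation API.

```java
// A simplified open-addressing (linear probing) Map; only Put is shown.
final class ProbingMap<K, V> {
  private Object[] keys = new Object[16];
  private Object[] values = new Object[16];
  private int size;

  void put(K key, V value) {
    if ((size + 1) * 4 > keys.length * 3) {
      // Resizing is the expensive path: signal RESIZE with the existing
      // capacity as the signal strength – as cheap as observability gets.
      emit("RESIZE", keys.length);
      resize(keys.length * 2);
    }
    int index = Math.floorMod(key.hashCode(), keys.length);
    long probes = 0;
    while (keys[index] != null && !keys[index].equals(key)) {
      probes++;                          // every extra slot probed adds cost
      index = (index + 1) % keys.length;
    }
    if (keys[index] == null) {
      size++;
    }
    keys[index] = key;
    values[index] = value;
    // The probe (key comparison) count is the signal we are after; anything
    // above zero is a deviation from the expected constant-time path.
    emit("PROBE", probes);
  }

  @SuppressWarnings("unchecked")
  private void resize(int newCapacity) {
    Object[] oldKeys = keys;
    Object[] oldValues = values;
    keys = new Object[newCapacity];
    values = new Object[newCapacity];
    size = 0;
    for (int i = 0; i < oldKeys.length; i++) {
      if (oldKeys[i] != null) {
        put((K) oldKeys[i], (V) oldValues[i]); // rehash existing entries
      }
    }
  }

  private void emit(String signal, long strength) {
    // Placeholder: hand the signal and its strength to the enclosing boundary.
  }
}
```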

A Million Meaningless Metrics

An observability engineer might consider turning the RESIZE signal and strength into a counter metric. Still, considering how prevalent instances of Map are within most enterprise application spaces, such a metric would be meaningless and noisy at the aggregated process level. Here is where scoping by way of nested call boundaries helps, much like a trace. But let’s suppose our naive engineer ignored this and decided to give a unique name to every Map instance and, in turn, every signal metric – this is pretty much where we are with metrics today. We have millions of uniquely named metrics, some of which share similar suffixes (or shapes). The signal is now completely buried within a namespace. It is impossible to know whether a metric whose name includes the term “resize” pertains to an underlying Map data structure. We have effectively untyped the source, signal, and strength. And because of this, we are now unable to abstract the type into more meaningful semantic models of such things as Service, Resource, or Scheduler.
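
As a sketch of the naming anti-pattern described above (the registry and naming scheme here are illustrative assumptions):

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.LongAdder;

// Every Map instance gets its own uniquely named counter, burying the
// RESIZE signal inside a namespace.
final class MetricRegistry {
  private final Map<String, LongAdder> counters = new ConcurrentHashMap<>();

  void increment(String name, long amount) {
    counters.computeIfAbsent(name, k -> new LongAdder()).add(amount);
  }
}

// Somewhere deep in the codebase, per-instance names proliferate:
//   registry.increment("orders.pending.cache.resize", capacity);
//   registry.increment("users.session.lookup.resize", capacity);
//   registry.increment("inventory.sku.index.resize", capacity);
// The shared "resize" suffix is the only hint that a Map sits underneath;
// the source, signal, and strength have all been untyped into a string.
```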

Head Down in Details

Our current approaches to observability, such as metrics, logging, and tracing, have resulted in semantic blindness through a down-in-the-details, big-data mindset, wherein an architect believes s/he can simplify all business workflow code by treating (abstracting) every entity and process interface as just another glorified Map interface, with put and get operations acting as both property and function operators. With such framing, a method dispatch becomes just another Map operation, with the get method used for functions and the put method for operations. In the observability space, far more operationally meaningful concepts such as Context, Environment, Name, Service, Resource, Scheduler, Signal, and Status are being lost to a generation of site reliability engineers (SREs) in favor of arbitrary instances of Number (counter, gauge, timestamp) and String (name, trace id, span id).
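
A tiny sketch of the framing being criticized here (the class and member names are illustrative assumptions):

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.Function;

// Every entity and process interface collapses into a glorified Map:
// properties hide behind put/get, and even method dispatch becomes a
// lookup of a function by name.
final class Entity {
  private final Map<String, Object> properties = new HashMap<>();
  private final Map<String, Function<Object, Object>> methods = new HashMap<>();

  Object get(String property)             { return properties.get(property); }
  void put(String property, Object value) { properties.put(property, value); }

  Object invoke(String method, Object argument) {
    // Dispatch is just another Map::Get followed by a call.
    return methods.get(method).apply(argument);
  }
}
```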

Rethinking Observability

We’re not advocating the complete abandonment of metrics and event toolkits. What we believe to be an industry-wide failure is the bottom-up approach we have taken with such kits. Instead, such yesteryear observability collectors should become consumers, via a plugin interface, of more semantically rich toolkits like OpenSignals. Instrumentation code added to an application codebase should not interact with metrics, traces, or log events, but instead with services, resources, schedulers, and their related signals and states.
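
The sketch below illustrates the direction being advocated; the interfaces shown are assumptions made for this article and should not be read as the actual OpenSignals API.

```java
// Hypothetical shapes only – not the OpenSignals API.

// Instrumentation speaks in terms of services and the signals they emit...
enum Signal { START, SUCCEED, FAIL, RETRY }

interface Service {
  void emit(Signal signal);
}

interface Services {
  Service service(String name);

  // ...while yesteryear collectors (metrics, traces, logs) subscribe as
  // plugins rather than being called directly from application code.
  void subscribe(Subscriber subscriber);
}

interface Subscriber {
  void accept(String service, Signal signal);
}
```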