The Origins of Observability Signals

Around 2012 and 2013, I started looking at ways to optimize the recording and playback of episodic machine memories – the challenge of tuning an instrument that was itself used to profile and optimize many other low-latency software systems. How do you profile the best profiler for a particular programming language without dropping down to a lower layer in the computing stack? You shift away from the measuring techniques of the existing instrument – moving from timing to counting, a far cheaper form of measurement. Timing the timer is neither practical nor productive in the long run. Invariably you end up changing code at particular points in the source, hoping the benchmark results look better – very much trial and error.

The alternative is to look for predictors of performance in the code and then figure out how to capture them cheaply without perturbing the performance of the system under observation. At the time, I chose to refer to these predictors of performance as a Signal and to the functions – the hierarchical scopes of execution they occurred within – as a Boundary. Eventually, the scope of what a signal referred to came to include both Operation and Outcome – not just any phenomenon, only those of special significance.

To explain what I initially deemed a signal, I am going to assume you know of a Map – a popular programming data structure. A Map has two essential operations – get and put. The get function takes a key as an argument and returns the value associated with that key by a previous put operation. Unless the map is backed by some form of distributed memory cache, both operations are going to execute extremely fast, well below the resolution of a clock timer. Timing the execution here is pretty much ineffective unless either operation is performed a large number of times within a single period. Counting an operation on a Map is far cheaper than timing it, but it only makes sense within some other enclosing execution scope. What else can we count? Well, we could count the outcomes. The outcome of a Map.get operation can be either SUCCEED or FAIL. Tracking the outcome of a Map.get call can be useful and is commonly done when a Map is in effect a cache. But again, such a count would be far more useful within the scope of another operation. We can’t be sure that a particular outcome changes the lookup operation’s performance, but that is not to say it is not useful to know and to associate with other counts.
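As a rough illustration, here is a minimal sketch in plain Java – not any particular toolkit's API, and the class and field names are mine – of counting the outcomes of a cache-style lookup rather than timing it:

```java
import java.util.HashMap;
import java.util.Map;

// Counting outcomes of a cache lookup instead of timing it: incrementing a
// long is far cheaper than reading a high-resolution clock around a call
// that completes well below the timer's resolution.
final class CachedLookup {
    private final Map<String, String> cache = new HashMap<>();
    private long succeeded;   // outcome: SUCCEED
    private long failed;      // outcome: FAIL

    String lookup(String key) {
        String value = cache.get(key);
        if (value != null) {
            succeeded++;
        } else {
            failed++;
        }
        return value;
    }

    // The counts only become meaningful when reported against the enclosing
    // operation (the boundary) that drove the lookups.
    void report(String boundary) {
        System.out.printf("%s: succeeded=%d failed=%d%n", boundary, succeeded, failed);
    }
}
```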

We need to peek below the surface of the Map.get function. In what way can the performance of the Map.get function deviate from expectation? The hashing of the key parameter within the function call is something that is always done. Variance only enters the cost equation when the hash lands at an address where multiple possibilities exist, and each one needs to be evaluated. The number of collisions and the position of our key within the list dictate the number of relatively expensive comparison operations performed. The signal we are looking for is the count of key comparisons; anything above one is unexpected (in the majority of cases). Not finding a matching key, and in turn a value, after scanning a list of possibilities merely rubs salt into the wound. If a Map uses open addressing, then for the Map.put operation the variation would be determined by the number of slots probed before an empty slot is found. A far more expensive outcome of a Map.put operation would be resizing the underlying storage array; here the number of existing elements dictates the cost. So it is not enough to have a signal, say RESIZE; we also need a signal strength representing the map’s size. Signaling a resize along with the existing capacity is about as cheap as observability gets.
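To make this concrete, the sketch below wraps a HashMap and fires a RESIZE signal whose strength is the element count at the point the backing storage would need to grow. The Signals.emit(name, strength) sink and the threshold arithmetic are hypothetical simplifications for illustration, not how any real map or toolkit is implemented:

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical signal sink – a stand-in for whatever ultimately consumes signals.
final class Signals {
    static void emit(String signal, long strength) {
        System.out.printf("signal=%s strength=%d%n", signal, strength);
    }
}

// A wrapper that signals RESIZE, with the current size as strength, whenever
// the map crosses an assumed load threshold. One branch and one call: about
// as cheap as observability gets.
final class SignallingMap<K, V> {
    private final Map<K, V> delegate = new HashMap<>();
    private int capacity = 16;  // assumed initial capacity, for illustration

    V put(K key, V value) {
        if (delegate.size() + 1 > capacity * 0.75) {
            Signals.emit("RESIZE", delegate.size());  // strength = existing elements
            capacity *= 2;
        }
        return delegate.put(key, value);
    }

    V get(K key) {
        return delegate.get(key);
    }
}
```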

An observability engineer might consider turning the RESIZE signal and strength into a counter metric. But considering how prevalent instances of Map are within most enterprise applications, such a metric would be meaningless and noisy at the aggregated process level. This is where scoping by way of nested call boundaries helps, much like a trace. But let’s suppose our naive engineer ignored this and decided to give a unique name to every map instance and, in turn, every signal metric – this is pretty much where we are with metrics today. We have thousands, in some cases millions, of uniquely named metrics that share similar suffixes (or shapes). The signal is now completely buried within a namespace. It is impossible to know whether the term “resize” within a metric name pertains to an underlying Map data structure. We have effectively untyped the source, signal, and strength. And because of this, we are unable to abstract the type into more meaningful semantic models of such things as Service, Resource, or Scheduler.
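A rough sketch of the alternative: attribute the signal to the stack of enclosing call boundaries instead of minting a unique metric name per Map instance. The Boundaries class and its methods below are hypothetical, purely to illustrate the idea:

```java
import java.util.ArrayDeque;
import java.util.Deque;

// Signals are attributed to the enclosing execution scope rather than to a
// globally unique per-instance metric name.
final class Boundaries {
    private static final Deque<String> STACK = new ArrayDeque<>();

    static void enter(String boundary) { STACK.push(boundary); }

    static void exit() { STACK.pop(); }

    static void signal(String name, long strength) {
        StringBuilder path = new StringBuilder();
        // Walk from the outermost boundary to the innermost.
        STACK.descendingIterator().forEachRemaining(b -> {
            if (path.length() > 0) path.append('/');
            path.append(b);
        });
        System.out.printf("%s :: %s(%d)%n", path, name, strength);
    }
}
```

With such scoping, a RESIZE fired deep inside a pricing lookup reports against something like checkout/price-lookup rather than adding one more uniquely named counter to a flat namespace.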

Our current approaches to observability, such as metrics, logging, and tracing, have resulted in a kind of semantic blindness by way of a down-in-the-details, big-data mindset. It reminds me of one of those overly generic business application frameworks, where the architect, after tearing away the very essence of the domain semantics, thought he could simplify all business workflow code by treating (abstracting) every entity and process interface as just another glorified Map interface, with put and get acting as both property and function operators. With such framing, a method dispatch becomes just another Map operation, with the get method used for functions and the put method for operations. In the observability space, far more operationally meaningful concepts such as Context, Environment, Name, Service, Resource, Scheduler, Signal, and Status are being lost to a generation of site reliability engineers (SREs) in favor of arbitrary instances of Number (counter, gauge, timestamp) and String (name, trace id, span id).

I’m not advocating for the complete abandonment of metrics and events toolkits. What I believe to be an industry-wide failure is the bottom-up approach we have taken with such kits. Instead, such yesteryear observability collectors should become consumers, via a plugin interface, of more semantically rich toolkits like OpenSignals. Instrumentation code added to an application codebase should not interact with a metric, trace, or log event, but instead with a service, resource, or scheduler, as well as their related signals and states.
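To close, a purely illustrative sketch of what instrumentation against such semantic types might look like. The Service and Signal definitions below are hypothetical stand-ins, not the actual OpenSignals API:

```java
// Hypothetical semantic types; a real toolkit would define far richer ones.
enum Signal { START, SUCCEED, FAIL }

interface Service {
    void emit(Signal signal);
}

// Application code speaks in terms of a service and its signals; metric,
// trace, and log collectors would subscribe behind the Service, via a
// plugin interface, without ever appearing in the instrumented code.
final class PricingClient {
    private final Service service;

    PricingClient(Service service) {
        this.service = service;
    }

    double price(String sku) {
        service.emit(Signal.START);
        try {
            double value = 42.0;  // stand-in for the real remote call
            service.emit(Signal.SUCCEED);
            return value;
        } catch (RuntimeException e) {
            service.emit(Signal.FAIL);
            throw e;
        }
    }
}
```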