Distributed Tracing – A Path to be Traced Lightly

Much of what is posted in the Observability space these days makes a claim, without much in the way of independent cost-benefit analysis, that to monitor highly interconnected systems effectively, one needs to trace and correlate every request across every hop (span) in the network. Deep and extensive data collection is actively encouraged, which one could argue is more a reflection of a lack of an appropriate model of perception than the utility of the data or the ability of a solution to transform such collections into information and, in turn, insight.

Just because it [distributed tracing] can be done, at great expense, and visually represented does not mean it should be employed routinely. Science and technology have made it possible to observe the motion of atoms, yet humans don’t go around actively watching such movements as they navigate physical spaces. Our perception, attention, and cognition have scaled evolutionarily to a model that is effective for us in the vast majority of situations. Distributed tracing paths and the data items attached to them are the atoms of Observability when the primary purpose is to monitor and manage service levels, as opposed to debugging code or drilling through data tags, items, values, records, and sets.

Many years of experience in designing, developing, and deploying distributed tracing systems and solutions suggest that the reason it gets far more attention than it deserves, given the engineering effort involved and the conceptual complexity introduced for both customer and vendor, is that (local) code call tracing can in itself be useful; most developers can understand a profile consisting of a tree of nodes representing chains of method calls. But with increasing decoupling in terms of time and space, extending this approach across process boundaries is questionable unless one is utterly blind to everything else. That could very well be the crux of the problem: distributed tracing has been engineered and deployed without much critical reflection on, or assessment of, its effectiveness and efficiency.

The emphasis on distributed tracing contexts, paths, and data payload captures is misguided. It runs counter to much of the touted encapsulation engineers attempt to introduce into their system, service, and library designs. Again, local tracing can be useful in some cases, but distributed tracing far less so. Distribution should not be the default case; code should not be separated by process boundaries so that engineering can obtain some degree of observability into a service. Don’t deploy hundreds of microservices or introduce a service mesh infrastructure for the sole purpose of profiling code execution over and via sockets. Observability is not a reason for distributing.

It is crucial that the engineering community consider changing course now to one that better reflects the original purpose and definition of observability – to infer the process state and from there to introduce control measures where and when necessary to stabilize systems. Exploration of data is mostly a superficial stop-gap measure when engineering can’t steer systems and services to a stable state within the course of execution flow or transitioning between change points. SRE and DevOps teams have got to step back from a dark data abyss if observability initiatives are to achieve success beyond the painting of pretty but mostly meaningless charts on dashboards.

The more data you collect, the more you realize how much you don’t know, not because the data has shown this to be the case but because it has hijacked precious attention, overloaded cognitive capacities, and delayed decisive actions. Deciding to collect everything, while seemingly a simple decision, is anything but simple. Simplicity, sensibility, and significance must return if there is to be greater awareness and a more in-depth understanding of a situation and, from there, timely and intelligent intervention. Observability is there to assist in the operational monitoring and management of services by way of clear and concise communication centered around change and controllability.

The situation, a state-space, alone should dictate what other forms of observability should be enabled dynamically. Deep and detailed data collection, such as tracing, logging, and events, should follow and be framed by the situation, one that is derived from and described in terms of services, signals, and states. The situation cannot be found easily at the atomic level of data. Higher-order thinking, reasoning, and modeling that is focused on the dynamics, as opposed to the data payloads, is a mandatory requirement and a foundational framework for effectiveness, efficiency, and execution at scale in terms of collection, communication, change, and control.

Only when there is a divergence from an expectation (past behavior) or prediction (planned change) should retention of detailed diagnostics be activated. Tracing is invariably transient data, not a model suitable for managing a process or a system of processes. At the atomic level, the (data) differences are everywhere, and yet there is no divergence to be seen and responded to in any practical sense. A largely quantitative measurement model like tracing should not be the starting point for any enterprise observability or operational initiative. The model of communication between machines and humans scales best with qualitative analysis and modeling. Data obesity and addiction must be fought with a renewed focus on abstraction, communication, and dynamics.
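To make the idea of divergence-triggered retention concrete, here is a minimal sketch in Python. It is not taken from any tracing product or from OpenSignals; the class name `DivergenceGate`, the latency-based expectation, and the `tolerance` and `window` values are all assumptions chosen purely for illustration. Detailed records live only in a small transient buffer and are retained solely while the recent behavior diverges from what was expected.

```python
from collections import deque
from enum import Enum


class Status(Enum):
    OK = "ok"
    DIVERGING = "diverging"


class DivergenceGate:
    """Hypothetical gate: detail stays transient until behavior diverges."""

    def __init__(self, expected_latency_ms, tolerance=0.5, window=5):
        self.expected = expected_latency_ms   # expectation from past behavior (assumed known)
        self.tolerance = tolerance            # allowed relative deviation (assumed value)
        self.recent = deque(maxlen=window)    # transient buffer; old records fall out
        self.retained = []                    # detailed records kept for diagnosis
        self.status = Status.OK

    def _assess(self):
        # Withhold judgment until the window is full.
        if len(self.recent) < self.recent.maxlen:
            return Status.OK
        mean = sum(r["latency_ms"] for r in self.recent) / len(self.recent)
        deviation = abs(mean - self.expected) / self.expected
        return Status.DIVERGING if deviation > self.tolerance else Status.OK

    def observe(self, record):
        """Record one interaction and return the qualitative status."""
        self.recent.append(record)
        previous, self.status = self.status, self._assess()
        if self.status is Status.DIVERGING:
            if previous is Status.OK:
                # On the transition, keep the buffered context (includes this record).
                self.retained.extend(self.recent)
            else:
                self.retained.append(record)
        return self.status


gate = DivergenceGate(expected_latency_ms=100.0, window=3)
for latency in (95, 105, 100, 240, 260, 250):
    gate.observe({"latency_ms": latency, "detail": "span payload"})
```

In this sketch the first three interactions are discarded as they age out of the buffer; only once the windowed mean drifts beyond the tolerance does the gate begin to keep detail. What the operator sees first is the qualitative status, with the data following only when the situation calls for it.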

With distributed tracing, the request and the relationship of multiple chained requests get center stage; services are an afterthought, so much so that the concept of a service name was a post-release addition and is still not readily supported by all tracing providers today. With OpenSignals, it is the rich, contextual, and dynamic nature of the conversation between one service and another, over multiple interactions, that is given priority in the model. Distributed tracing is, for the most part, oblivious, if not blind, to how services differ in their sensitivity to service levels and in the resilience mechanisms employed when expectations are not met. In contrast, OpenSignals focuses on capturing the locality of assessment and representing the levels of service quality that exist.