Distributed Tracing – A Path to be Traced Lightly

Data Dungeons

Much of what is posted in the Observability space these days claims, without much in the way of independent cost-benefit analysis, that to monitor highly interconnected systems effectively, one needs to trace and correlate every request across every hop (span) in the network. Deep and extensive data collection is actively encouraged, which one could argue reflects the lack of an appropriate model of perception more than it reflects the utility of the data or the ability of a solution to transform such collections into information and, in turn, insight.

Redundant Reductionism

Just because it [distributed tracing] can be done, at great expense, and visually represented does not mean it should be employed routinely. Science and technology have made it possible to observe the motion of atoms, yet humans don’t go around actively watching such movements as they navigate physical spaces. Our perception, attention, and cognition have evolved into a model that serves us effectively in most situations. Distributed tracing paths and the data items attached are the atoms of Observability. The primary purpose of Observability is to monitor and manage service levels, not to debug code or drill through data tags, items, values, records, and sets.

Distributed Tracing

Based on many years of experience designing, developing, and deploying distributed tracing systems and solutions, the reason it gets far more attention than it deserves, given the engineering effort involved and the conceptual complexity introduced for both customer and vendor, is that (local) code call tracing can in itself be useful; most developers can understand a profile consisting of a tree of nodes representing chains of method calls. But with increasing decoupling in time and space, this is a questionable approach to undertake unless one is utterly blind to everything else. That could very well be the crux of the problem: distributed tracing has been engineered and deployed without much critical reflection on, or assessment of, its effectiveness and efficiency.

Misguided Practice

The emphasis on distributed tracing contexts, paths, and data payload captures is misguided. It runs counter to the touted encapsulation engineers introduce into their system, service, and library designs. Again, local tracing can be useful in some cases, but distributed tracing far less so. Distribution should not be the default case; code should not be separated by process boundaries just so that engineering can obtain some degree of observability into a service. Don’t deploy hundreds of microservices or introduce a service mesh infrastructure for the sole purpose of profiling code execution over and via sockets. Observability is not a reason for distributing.

System Steering

The engineering community must consider changing course to reflect the original purpose and definition of observability – to infer the process state and introduce control measures where and when necessary to stabilize systems. Data exploration is mostly a superficial stop-gap measure for when engineering can’t steer systems and services to a stable state within the course of execution flow or when transitioning between change points. SRE and DevOps teams must step back from the dark data abyss if observability initiatives are to achieve success beyond the painting of pretty but mostly meaningless charts on dashboards.

Effective Signposts

The more data you collect, the more you realize how much you don’t know, not because the data has shown this to be the case but because it has hijacked precious attention, overloaded cognitive capacities, and delayed decisive actions. Deciding to collect everything may look like a simple decision, but its consequences are anything but simple. Simplicity, sensibility, and significance must return if there is to be greater awareness, a more in-depth understanding of a situation, and timely, intelligent intervention. Observability assists in the operational monitoring and management of services through clear and concise communication centered around change and controllability.

Situation: A State-Space

The situation, a state-space, alone should dictate which other forms of observability are enabled dynamically. Deep and detailed data collection, such as tracing, logging, and events, should follow and be framed by the situation – one that is derived from and described in terms of services, signals, and states. The situation we seek cannot easily be found at the atomic level of data. Higher-order thinking, reasoning, and modeling focused on the dynamics instead of the data payloads is a mandatory requirement and a foundational framework for effectiveness, efficiency, and execution at scale in terms of collection, communication, change, and control.

Divergence Detection

Only when there is a divergence from an expectation (past behavior) or prediction (planned change) should retention of detailed diagnostics be activated. Tracing is invariably transient data – not a model suitable for managing a process or a system of processes. At the atomic level, the (data) differences are everywhere, yet there is no divergence to be seen and responded to in any practical sense. A largely quantitative measurement like tracing should not be the starting point for any enterprise observability or operational initiative. The model of communication between machines and humans scales best with qualitative analysis and modeling. Data obesity and addiction must be fought with a renewed focus on abstraction, communication, and dynamics.
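To make the idea of divergence-gated retention concrete, here is a minimal sketch in Python. It is not taken from any tracing product or from OpenSignals; the class name `DivergenceGate`, its parameters, and the choice of a rolling mean and standard deviation as the "expectation" are all assumptions made purely for illustration. A real system would more likely work with qualitative states than with a raw numeric threshold.

```python
from collections import deque
from statistics import mean, pstdev


class DivergenceGate:
    """Hypothetical gate: keep detailed diagnostics only on divergence.

    The expectation is modeled (as an assumption) by a rolling window of
    recent non-divergent values of one service-level signal.
    """

    def __init__(self, window=50, threshold=3.0, min_samples=10):
        self.baseline = deque(maxlen=window)  # past behavior
        self.threshold = threshold            # allowed spread multiples
        self.min_samples = min_samples        # warm-up before judging
        self.retained = []                    # diagnostics actually kept

    def diverges(self, value):
        """Return True if value departs from the rolling expectation."""
        if len(self.baseline) < self.min_samples:
            return False  # not enough history to form an expectation
        expected = mean(self.baseline)
        spread = max(pstdev(self.baseline), 1e-9)  # avoid divide-by-zero
        return abs(value - expected) / spread > self.threshold

    def observe(self, value, detail):
        """Record one observation; retain its detail only on divergence."""
        diverged = self.diverges(value)
        if diverged:
            self.retained.append(detail)
        else:
            # Only normal behavior updates the expectation, so a burst of
            # outliers does not quietly become the new baseline.
            self.baseline.append(value)
        return diverged


gate = DivergenceGate(window=20, threshold=3.0, min_samples=5)
for latency in (100, 102, 98, 101, 99, 100, 101, 99):
    gate.observe(latency, detail=f"trace-{latency}")  # discarded
gate.observe(500, detail="trace-500")                 # retained
```

In this sketch the detailed payload is simply dropped while the signal behaves as expected, and only the single divergent observation is kept for diagnosis. The threshold and window size are arbitrary values chosen for the example.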

Conversations in Context

With distributed tracing, the request and the relationship of multiple chained requests take center stage – services are an afterthought, so much so that the concept of a service name was a post-release addition and is still not readily supported by all tracing providers today. With OpenSignals, the rich, contextual, dynamic nature of the conversation between one service and another, over multiple interactions, is given priority in the model. Distributed tracing is, for the most part, oblivious, if not blind, to how services differ in their sensitivity to service levels and in the resilience mechanisms they employ when expectations are not met. In contrast, OpenSignals focuses on capturing the locality of assessment and representing the levels of service quality that exist.