Observability via (Verbal) Protocol Analysis

I’m always on the lookout for new ways to explain and relate to the design of the OpenSignals conceptual model of signals and states. So it was a pleasant surprise to stumble across (Verbal) Protocol Analysis during a recent certification in Design Thinking and some readings in situational awareness. VPA is a technique that has been used by researchers across many domains, including psychology, engineering, and architecture. The basic idea is that during a task, such as solving a problem, a subject concurrently verbalizes, thinking aloud, whatever is resident in their working memory: what they are thinking during the doing. Using Protocol Analysis, researchers can elicit the cognitive processes a subject employs from the start to the completion of a task. After further processing, the captured information is analyzed to provide insights that might be used to improve performance.

An advantage of verbal protocol analysis over other cognitive investigation techniques is the richness of the data it records. Unfortunately, this richness, unstructured and diverse in expression, can quickly become voluminous, requiring post-processing such as transcription and coding before analysis can begin. Sound familiar? Yes, it is the same issue site reliability engineering (SRE) teams face when their primary data sources for monitoring and observability are event logging and its sibling, distributed tracing.

From Record to Analysis

The basic steps of Protocol Analysis are (1) record the verbalization, (2) transcribe the recording, (3) segment the transcription, (4) aggregate the segments into episodes, (5) encode the episodes, and finally (6) analyze the code sequencing patterns. During the transcription step, researchers interpret the recording in terms of a glossary of domain-relevant terms. The segmentation step breaks the verbalization into text units, or segments, where each segment expresses a single idea or action statement. In the aggregation step, some segments are collapsed and combined into episodes to make the subsequent coding and data analysis more straightforward, especially when the recording volume is large enough to require sampling to reduce human effort and cost. The step most crucial to the success of the analysis is the coding of statements. The coding scheme, in which statements are mapped to processes of interest, is driven by the question or goal the researchers are pursuing. In this regard, a coding scheme needs to be reliable and effective in translation and to express the aspects of concern for the investigation. Typically, a small fixed set of concept variables is encoded for each statement, with each variable having a predefined set of possible codes.
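
To make the flow from recording to codes concrete, here is a minimal, purely illustrative Python sketch of the middle stages of that pipeline. The class and function names are my own invention for this post, not part of any protocol-analysis tooling.

```python
from dataclasses import dataclass, field

@dataclass
class Segment:
    """One text unit expressing a single idea or action statement."""
    text: str

@dataclass
class Episode:
    """A group of related segments collapsed together for easier coding."""
    segments: list = field(default_factory=list)

@dataclass
class CodedEpisode:
    """An episode mapped onto the coding scheme's variables and codes."""
    episode: Episode
    codes: dict = field(default_factory=dict)

def segment(transcription: list) -> list:
    # Step 3: break the transcription into one-statement segments.
    return [Segment(t.strip()) for t in transcription if t.strip()]

def aggregate(segments: list, size: int = 2) -> list:
    # Step 4: naive grouping of consecutive segments into episodes.
    return [Episode(segments[i:i + size]) for i in range(0, len(segments), size)]

def encode(episodes: list, scheme) -> list:
    # Step 5: apply a coding scheme (any callable mapping episode -> codes).
    return [CodedEpisode(e, scheme(e)) for e in episodes]
```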

In the case of an investigation into how designers think, the variables might be “design step”, “knowledge”, “activity”, and “object”. A more abstract variable set would be “subject”, “predicate”, and “object”. A coding scheme is reliable when ambiguity is kept to a minimum in taking a statement or event in the real world and mapping it to the appropriate code, consistently across the different persons tasked with the coding.
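
As a toy illustration of such a scheme (the variables and codes below are invented for the example, not drawn from any published study), reliability comes from constraining each variable to a closed set of codes, so that two coders are likely to land on the same mapping:

```python
# A hypothetical coding scheme: each variable has a closed set of codes.
DESIGN_SCHEME = {
    "design_step": {"analyze", "synthesize", "evaluate"},
    "knowledge":   {"domain", "procedural", "experiential"},
    "activity":    {"propose", "compare", "reject"},
    "object":      {"requirement", "sketch", "constraint"},
}

def validate(codes: dict) -> bool:
    """A coded statement is accepted only if every value is a predefined code."""
    return all(value in DESIGN_SCHEME.get(variable, set())
               for variable, value in codes.items())

# Example: one statement coded against the scheme.
print(validate({"design_step": "evaluate", "object": "sketch"}))  # True
print(validate({"design_step": "guess"}))                          # False
```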

A scheme is effective when the coding focuses on the proper aspects of the domain, at the right level of granularity, to answer questions via sequencing patterns. In the last step, analysis, researchers perform script analysis, sometimes introducing further higher-level process groupings and categorizations that can then be sequenced and analyzed: a scaling up. An example of such scaling would be OpenSignals’ inference of service status from signaling.
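
As a rough sketch of what that scaling up could look like for services, low-level signal codes can be rolled up into a higher-level status. This is a toy heuristic for illustration only, not the actual OpenSignals inference algorithm, and the signal and status names are assumptions for the example.

```python
from collections import Counter

# Hypothetical status vocabulary, for illustration only.
OK, DEGRADED, DEFECTIVE = "OK", "DEGRADED", "DEFECTIVE"

def infer_status(signals: list) -> str:
    """Toy roll-up: classify a service by the share of failure signals."""
    counts = Counter(signals)
    total = sum(counts.values()) or 1
    failure_ratio = counts.get("FAIL", 0) / total
    if failure_ratio > 0.5:
        return DEFECTIVE
    if failure_ratio > 0.1:
        return DEGRADED
    return OK

print(infer_status(["SUCCEED", "SUCCEED", "FAIL", "SUCCEED"]))  # DEGRADED
```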

If you have ever spent time working with vast amounts of logs, metrics, and distributed tracing data, you will immediately recognize some of the above steps in turning recordings into something reliable and effective for monitoring and managing applications and systems of services. These days, most site reliability engineers get stuck in the transcribing phase, trying to bring some uniformity and meaning to many different machine utterances, especially in logs and events. I’ve witnessed many an organization start an elaborate and ambitious initiative to remap all log records into something more relatable to service level management or situational awareness via various record-level rules and pattern matches, only to abandon it once the true scale of the problem is recognized, along with the human effort involved in not only defining but also maintaining such mappings. These undertakings only ever look good in vendor demonstrations; they never reflect the rate of change that all software is undergoing now and into the future. You might ask how Protocol Analysis in practice has attempted to optimize the steps before coding. The answer: by bringing the coding forward to some degree, having transcribers already familiar with the coding scheme beforehand.

It should be noted that for many doing VPA in new domains, the coding scheme is defined much later in the process. Fortunately, in the Observability space of software systems, we are dealing with machines as opposed to humans, so it is far easier to introduce an appropriate coding scheme early in the process. That is precisely what OpenSignals offers: a fixed set of variables in the form of “service”, “orientation”, and “phenomenon”, and a set of predefined codes for orientation and phenomenon (signals and status).
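
A minimal sketch of what a record in that shape might look like follows. The specific orientation and signal names below are illustrative assumptions on my part, not the normative OpenSignals vocabulary or API.

```python
from dataclasses import dataclass
from enum import Enum

class Orientation(Enum):
    # Whether the phenomenon is self-reported by the service or observed
    # about it by a caller (names assumed for illustration).
    EMIT = "emit"
    RECEIPT = "receipt"

class Signal(Enum):
    # A small, fixed vocabulary of phenomena (illustrative subset).
    START = "start"
    SUCCEED = "succeed"
    FAIL = "fail"
    RETRY = "retry"

@dataclass
class Record:
    service: str              # variable 1: which service is speaking
    orientation: Orientation  # variable 2: perspective of the statement
    phenomenon: Signal        # variable 3: the predefined code

# The "thinking aloud" of a payment service, already in coded form.
trace = [
    Record("payment", Orientation.EMIT, Signal.START),
    Record("payment", Orientation.EMIT, Signal.FAIL),
    Record("payment", Orientation.EMIT, Signal.RETRY),
    Record("payment", Orientation.EMIT, Signal.SUCCEED),
]
```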

From Smart to Simple

OpenSignals for Services is, in effect, a template protocol analysis model and coding scheme for understanding and reasoning about the processing and performance of microservices involved in the coordination and cooperation of distributed work.
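
In practice, that might mean both sides of a call coding the same interaction with the same fixed vocabulary, so the analysis step needs no transcription or cleanup. The sketch below is again illustrative only, with assumed service, orientation, and signal names, not the OpenSignals API.

```python
# Two perspectives on the same "checkout" service, coded with one vocabulary:
# the caller records what it observed (RECEIPT), the service reports on
# itself (EMIT).
caller_view = [("checkout", "RECEIPT", "CALL"),
               ("checkout", "RECEIPT", "FAIL")]
callee_view = [("checkout", "EMIT", "START"),
               ("checkout", "EMIT", "FAIL")]

def signals_about(service: str, records: list) -> list:
    """Collect every phenomenon coded against one service, any orientation."""
    return [phenomenon for svc, _, phenomenon in records if svc == service]

print(signals_about("checkout", caller_view + callee_view))
# ['CALL', 'FAIL', 'START', 'FAIL']
```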

It is time for software services to think aloud with OpenSignals and to abandon sending meaningless blobs of data to massive event data black holes in the cloud. It is time to standardize on a model that serves site reliability engineering, not some manufactured data addiction. Let’s have both machines and humans communicate in terms of service, signal, and status.

If you are interested in where we go from here, once we can spend more of our time in analysis, then do yourself a favor and read Mark Burgess’s recent research paper, The Semantic Spacetime Hypothesis: A Guide to the Semantic Spacetime Project Work.