Observability via (Verbal) Protocol Analysis

Think Aloud

We are always looking for new ways to explain and relate to the OpenSignals conceptual model of signals and states. So it was a pleasant surprise to stumble across (Verbal) Protocol Analysis (VPA) during a recent certification in Design Thinking and some readings in situational awareness. VPA is a technique that has been used by researchers across many domains, including psychology, engineering, and architecture. The basic idea is that during a task, such as solving a problem, a subject concurrently verbalizes, or thinks aloud, what is resident in their working memory – what they are thinking during the doing. Using Protocol Analysis, researchers can elicit the cognitive processes at work from the start to the completion of a task. After further processing, the captured information is analyzed to provide insights that can improve performance.

Voluminous Data

An advantage of verbal protocol analysis over other methods of cognitive investigation is the richness of the data recorded. Unfortunately, this richness, unstructured and diverse in expression, can quickly become voluminous, requiring post-processing such as transcription and coding before it can be analyzed. Sound familiar? Site reliability engineering (SRE) teams face the same issue when their primary data sources for monitoring and observability are event logging and its sibling, distributed tracing.

Protocol Analysis: The Steps

The basic steps of Protocol Analysis are (1) recording the verbalization, (2) transcribing the recording, (3) segmenting the transcription, (4) aggregating the segments into episodes, (5) encoding the episodes, and finally (6) analyzing the code sequencing patterns. During the transcribing step, researchers interpret the recording in terms of a glossary of domain-relevant terms. The segmentation step breaks the verbalization into text units, or segments, where each segment expresses one idea or action statement. In the aggregation step, some segments are collapsed and combined into episodes to make further coding and data analysis more straightforward, especially when the recording volume is large enough to require sampling to reduce human effort and cost.
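Steps (3) to (6) can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the transcript format (a timestamp in seconds followed by a statement), the rule of grouping segments by time gap, the keyword-based SCHEME, and all function names are hypothetical and are not part of any published protocol.

```python
from collections import Counter

# Hypothetical coding scheme: keyword found in an episode -> code.
SCHEME = {"read": "READ", "sketch": "DRAW", "check": "EVALUATE"}

def segment(transcript):
    """Step 3: one (seconds, statement) segment per transcript line."""
    segments = []
    for line in transcript.strip().splitlines():
        seconds, text = line.split(" ", 1)
        segments.append((float(seconds), text.strip()))
    return segments

def aggregate(segments, max_gap=5.0):
    """Step 4: merge segments separated by less than max_gap seconds."""
    episodes = []
    last_time = None
    for seconds, text in segments:
        if last_time is not None and seconds - last_time < max_gap:
            episodes[-1].append(text)
        else:
            episodes.append([text])
        last_time = seconds
    return episodes

def encode(episode):
    """Step 5: assign the first code whose keyword appears in the episode."""
    words = " ".join(episode).lower()
    for keyword, code in SCHEME.items():
        if keyword in words:
            return code
    return "OTHER"

def analyze(codes):
    """Step 6: count adjacent code pairs as a simple sequencing pattern."""
    return Counter(zip(codes, codes[1:]))

# Assumed transcript: "<seconds> <statement>" per line.
TRANSCRIPT = """
0 I read the brief first
2 still reading the constraints
10 now I sketch a rough layout
12 sketching the second option
20 let me check it against the brief
"""

codes = [encode(episode) for episode in aggregate(segment(TRANSCRIPT))]
patterns = analyze(codes)
print(codes)
print(patterns)
```

Five segments collapse into three episodes here, which is the point of the aggregation step: fewer units to code and a shorter sequence to analyze.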


Coding Scheme

The most crucial step in this process, the one that dictates the success of the analysis, is the coding of statements. The coding scheme, where statements are mapped to processes of interest, is driven by the question or goal the researchers are pursuing. In this regard, a coding scheme needs to be effective and reliable in translation and to express the aspects of concern for the investigation. Typically, a small fixed set of concept variables is encoded for each statement, with each variable having a predefined set of possible codes. In the case of an investigation into how designers think, the variables might be the “design step”, “knowledge”, “activity”, and “object”. A more abstract variable set would be “subject”, “predicate”, and “object”.
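A coding scheme of this shape, a fixed set of variables each with a closed set of codes, is easy to express as data. The sketch below uses the four designer variables named above; the individual codes under each variable and the code_statement helper are invented for illustration only.

```python
# Hypothetical scheme: each variable has a closed set of allowed codes.
SCHEME = {
    "design step": {"analysis", "synthesis", "evaluation"},
    "knowledge": {"domain", "procedural"},
    "activity": {"read", "draw", "compare"},
    "object": {"brief", "sketch", "constraint"},
}

def code_statement(codes):
    """Validate one coded statement against the scheme and return a copy.

    Every variable must be present, and every code must come from the
    predefined set for its variable.
    """
    missing = SCHEME.keys() - codes.keys()
    unknown = codes.keys() - SCHEME.keys()
    if missing or unknown:
        raise ValueError(f"missing={sorted(missing)} unknown={sorted(unknown)}")
    for variable, code in codes.items():
        if code not in SCHEME[variable]:
            raise ValueError(f"{code!r} is not a valid code for {variable!r}")
    return dict(codes)

# "Now I sketch a rough layout" might be coded as:
coded = code_statement({
    "design step": "synthesis",
    "knowledge": "domain",
    "activity": "draw",
    "object": "sketch",
})
print(coded)
```

Rejecting anything outside the predefined sets is what keeps the later sequencing analysis tractable: there is a bounded vocabulary to count and compare.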

Reliability and Effectiveness

A coding scheme is reliable when ambiguity is kept to a minimum in taking a statement or event in the real world and mapping it to the appropriate code across different persons tasked with the coding. A scheme is effective when the coding is focused on the proper aspects of the domain and at the right level of granularity to answer questions via sequencing patterns. In the last step, analysis, researchers perform script analysis, sometimes introducing further higher-level process groupings and categorizations that can then be sequenced and analyzed – a scaling up. An example of scaling would be OpenSignals service status inference from signaling.
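One simple way to probe reliability is to have two people code the same statements and measure how often they agree. The sketch below computes plain percent agreement; the function name and the sample codes are hypothetical, and nothing here is prescribed by Protocol Analysis or OpenSignals. Note that percent agreement does not correct for agreement that would occur by chance.

```python
def percent_agreement(coder_a, coder_b):
    """Fraction of statements to which two coders assigned the same code."""
    if len(coder_a) != len(coder_b):
        raise ValueError("both coders must code the same statements")
    if not coder_a:
        raise ValueError("at least one coded statement is required")
    matches = sum(a == b for a, b in zip(coder_a, coder_b))
    return matches / len(coder_a)

# Two coders, four statements, one disagreement on the third statement.
coder_a = ["READ", "DRAW", "EVALUATE", "DRAW"]
coder_b = ["READ", "DRAW", "DRAW", "DRAW"]
print(percent_agreement(coder_a, coder_b))  # 0.75
```

A low score on a sample like this suggests the scheme leaves too much room for interpretation and needs tighter code definitions.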

Logging: Tribulations in Transcribing

If you have ever spent time working with vast amounts of logs, metrics, and distributed tracing data, you will immediately recognize some of the above steps in turning recordings into something reliable and effective for monitoring and managing applications and systems of services. These days, most site reliability engineers get stuck in the transcribing phase, trying to bring uniformity and meaning to many different machine utterances, especially in logs and events. We’ve witnessed many an organization start an elaborate and ambitious initiative to remap all log records into something more relatable to service level management or situational awareness via various record-level rules and pattern matches, only to abandon the initiative once the true scale of the problem is recognized, along with the human effort required not only to define but also to maintain such rules.
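To make the maintenance burden concrete, here is a minimal sketch of record-level remapping with pattern matches. The log lines, the rules, and the output codes are all made up for illustration; real initiatives involve far more rules than this.

```python
import re

# Hypothetical record-level rules: pattern -> (service, code).
RULES = [
    (re.compile(r"payment .* declined"), ("payment", "FAIL")),
    (re.compile(r"order \d+ placed"), ("order", "SUCCEED")),
]

def remap(line):
    """Return the (service, code) of the first matching rule, else None."""
    for pattern, coded in RULES:
        if pattern.search(line):
            return coded
    return None

logs = [
    "order 1432 placed",
    "payment for order 1432 declined",
    "Order #1433 has been placed",  # same event, reworded in a later release
]

for line in logs:
    print(line, "->", remap(line))
```

The third line describes the same kind of event as the first, yet it falls through every rule because the message was reworded. Each such change means another rule to write and maintain.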

Smoke and Mirrors

These tasks only ever look good in vendor demonstrations, which never reflect the rate of change that all software is undergoing at present and will undergo in the future. You might ask how Protocol Analysis has, in practice, attempted to optimize the steps before coding. The answer is by bringing the coding itself forward to some degree, using transcribers who are already familiar with the coding scheme. It should be noted that for many doing VPA in new domains, the coding scheme is defined much later in the process. Fortunately, in the Observability space of software systems, we deal with machines instead of humans, so it is far easier to introduce appropriate coding into the coding process. That is what OpenSignals is offering: a fixed set of variables in the form of “service”, “orientation”, and “phenomenon”, and a set of predefined codes for orientation and phenomenon (signals and status).
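The idea of coding at the source can be sketched as follows. This is not the OpenSignals API: the class names, the specific orientation, signal, and status codes, and the failure-rate thresholds are all illustrative assumptions, chosen only to show the shape of a record with the three variables and a naive scaling up from signals to a status.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional, Union

# Illustrative codes only; not taken from the OpenSignals specification.
class Orientation(Enum):
    EMIT = "emit"
    RECEIPT = "receipt"

class Signal(Enum):
    CALL = "call"
    SUCCEED = "succeed"
    FAIL = "fail"

class Status(Enum):
    OK = "ok"
    DEGRADED = "degraded"
    DOWN = "down"

@dataclass(frozen=True)
class Coded:
    """One machine utterance, already coded with the three variables."""
    service: str
    orientation: Orientation
    phenomenon: Union[Signal, Status]

def infer_status(events, service) -> Optional[Status]:
    """Naive scaling up: derive a status from the share of FAIL outcomes.

    The 0.5 threshold is an arbitrary assumption for this sketch.
    Returns None when no outcome signals were observed for the service.
    """
    outcomes = [
        e.phenomenon
        for e in events
        if e.service == service and e.phenomenon in (Signal.SUCCEED, Signal.FAIL)
    ]
    if not outcomes:
        return None
    failure_rate = outcomes.count(Signal.FAIL) / len(outcomes)
    if failure_rate == 0:
        return Status.OK
    return Status.DEGRADED if failure_rate < 0.5 else Status.DOWN

events = [
    Coded("checkout", Orientation.EMIT, Signal.CALL),
    Coded("checkout", Orientation.EMIT, Signal.SUCCEED),
    Coded("checkout", Orientation.EMIT, Signal.CALL),
    Coded("checkout", Orientation.EMIT, Signal.FAIL),
    Coded("checkout", Orientation.EMIT, Signal.CALL),
    Coded("checkout", Orientation.EMIT, Signal.SUCCEED),
]

print(infer_status(events, "checkout"))  # Status.DEGRADED
```

Because every record already carries a code from a closed set, there is no transcription or pattern matching step between the machine's utterance and the analysis.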


OpenSignals for Services is a template for a protocol analysis model and coding scheme for understanding and reasoning about the processing and performance of microservices involved in the coordination and cooperation of distributed work. It is time for software services to think aloud with OpenSignals and to stop sending meaningless blobs of data to massive event data black holes in the cloud. It is time to standardize a model that serves site reliability engineering and not some manufactured data addiction. Let’s have both machines and humans communicate in terms of service, signal, and status.