How to Solve Incidents - a research proposal

How to Solve It, but for operational incidents.

Engineers with site reliability responsibilities often face operational issues (incidents) that have an unknown cause and an uncertain solution. How to Solve It is a classic work describing heuristics for solving mathematical problems. The book has been adapted for other domains, but not yet for software-intensive operations. In this study, we aim to build a taxonomy or categorization of the techniques and heuristics used in practice to understand and overcome operational issues. The catalog will be based on a survey of experienced engineers and on real incident logs and postmortems from a variety of organizations. Readers will be able to use the catalog to solve tricky and novel production issues.

Research Plan

This project will use both survey and field study methods to catalog “in practice” incident mitigation techniques and heuristics. We expect the project to take between 6 and 9 months (to the point of a submission-ready article), although the timeline may be extended due to data access issues since operational data is often sensitive and negotiations may be prolonged.

Data Collection

We will reach out to organizations (primarily SaaS or compute-intensive companies) for:

  1. Interviews with senior engineers: How do they approach an unfamiliar production problem? Do they use any explicit mental models, heuristics, or existing processes?
  2. Access to incident logs, chat/ticket threads, and/or root cause analyses (RCAs), that is, documents describing how the team worked to resolve incidents, preferably “raw”. The research team will then:
    1. Document problem-solving threads by incident, i.e., faced with situation x, the team attempted y or sought to determine status z. Was this successful?
    2. Document which mental models (if any) were being used. Were these mental models correct?
    3. Note: Raw incident logs are preferred to RCAs. For example, Slack’s DNSSEC RCA is very detailed on the cause of the problem, the prior attempts to prevent an outage, and actions to correct the outage once it happened, but is light on details about specific troubleshooting steps and unsuccessful threads. RCAs can provide useful data, but usually gloss over the troubleshooting activity.
  3. Access to “Try if everything else fails”-style runbooks (if they exist).

A suitable organization should have at least 30 relevant incidents. We are targeting at least 10 organizations within the study.

A sample of incidents will be excluded from the coding effort to form an “out-of-band” pool. We will also augment the out-of-band pool with publicly documented incident RCAs.
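
The hold-out step could look like the following minimal sketch. The 20% hold-out fraction, the fixed seed, the function name `split_out_of_band`, and the incident identifiers are all assumptions made for illustration; the proposal does not specify them.

```python
import random


def split_out_of_band(incident_ids, holdout_fraction=0.2, seed=0):
    """Split incidents into a pool to be coded and an out-of-band pool.

    The fraction and seed are illustrative assumptions, not study parameters.
    """
    ids = sorted(incident_ids)  # stable order so the split is reproducible
    rng = random.Random(seed)   # local generator; does not touch global state
    k = round(len(ids) * holdout_fraction)
    held_out = set(rng.sample(ids, k))
    coded = [i for i in ids if i not in held_out]
    return coded, sorted(held_out)


# Hypothetical incident identifiers.
coded, out_of_band = split_out_of_band([f"INC-{n}" for n in range(30)])
print(len(coded), len(out_of_band))
```

Fixing the seed keeps the split reproducible, so the same incidents stay out of the coding effort across reruns.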

Data Protection

We anticipate that all participating organizations will be acknowledged in the paper and that interviewed individuals may be quoted with their consent. The paper will not link heuristics to specific companies or incidents; heuristics and techniques will be generalized and anonymized. A heuristic must be practiced by at least two organizations to be referenced.
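
The two-organization threshold can be expressed as a simple filter. This is a minimal sketch, assuming coded observations arrive as (heuristic, organization) pairs; the function name `referenceable_heuristics` and the sample data are hypothetical.

```python
from collections import defaultdict

MIN_ORGANIZATIONS = 2  # inclusion threshold stated in the proposal


def referenceable_heuristics(observations, min_orgs=MIN_ORGANIZATIONS):
    """Return heuristics observed at `min_orgs` or more distinct organizations.

    `observations` is an iterable of (heuristic, organization) pairs.
    """
    orgs_by_heuristic = defaultdict(set)
    for heuristic, organization in observations:
        orgs_by_heuristic[heuristic].add(organization)
    return sorted(
        h for h, orgs in orgs_by_heuristic.items() if len(orgs) >= min_orgs
    )


# Hypothetical observations, for illustration only.
sample = [
    ("restart service", "org-a"),
    ("restart service", "org-b"),
    ("rollback", "org-a"),
    ("rollback", "org-a"),  # repeated within one organization: counts once
]
print(referenceable_heuristics(sample))
```

Counting distinct organizations rather than raw occurrences keeps a single prolific organization from meeting the threshold on its own.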

We expect to sign NDAs or similar legal agreements to protect sensitive data and may accept reasonable constraints on access and storage of any shared data.

As members of the Association for Computing Machinery (ACM), we are bound by the ACM Code of Ethics, in particular Section 1.7, Honor Confidentiality:

Computing professionals are often entrusted with confidential information such as trade secrets, client data, nonpublic business strategies, financial information, research data, pre-publication scholarly articles, and patent applications. Computing professionals should protect confidentiality except in cases where it is evidence of the violation of law, of organizational regulations, or of the Code. In these cases, the nature or contents of that information should not be disclosed except to appropriate authorities. A computing professional should consider thoughtfully whether such disclosures are consistent with the Code.

Create taxonomy/categorization

We expect to code the collected data using a faceted coding approach. An example of faceted coding is Ranganathan’s five core categories, typically labeled PMEST. P is the personality or most specific subject, M is the material or component, E is the energy or activity, operation, or process, S is the space or location, and T is the time. A specific heuristic may not have all five facets.

As an example, let’s say that within an incident the engaging team has detected a drop in production rate from a distributed group of queue-based workers. For coding, the personality is the queue-based job workers. The material could be the workers, the queue, the underlying compute infrastructure, a worker dependency, or even ancillary items like the logging and monitoring subsystem. The energy will be the specific action, such as restarting workers, checking connections to the queue, or verifying health of dependencies.
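
A coded step from this example could be recorded as follows. This is a minimal sketch, assuming each coded step is a single record with optional PMEST facets; the class name `FacetCode` and its methods are hypothetical and not part of any existing coding tool.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class FacetCode:
    """One coded troubleshooting step; any facet may be absent (None)."""

    personality: Optional[str] = None  # P: most specific subject
    material: Optional[str] = None     # M: material or component
    energy: Optional[str] = None       # E: activity, operation, or process
    space: Optional[str] = None        # S: space or location
    time: Optional[str] = None         # T: time

    def facets_present(self):
        """Return the PMEST letters of the facets that are filled in."""
        letters = {
            "personality": "P",
            "material": "M",
            "energy": "E",
            "space": "S",
            "time": "T",
        }
        return "".join(
            letter
            for name, letter in letters.items()
            if getattr(self, name) is not None
        )


# The queue-worker example from the text, coded with three of five facets.
step = FacetCode(
    personality="queue-based job workers",
    material="queue",
    energy="check connections to the queue",
)
print(step.facets_present())
```

Making every facet optional reflects the point above that a specific heuristic may not have all five facets.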

The coded values can then be organized into a taxonomy by grouping similar items according to their facets and frequency.
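
A first-pass grouping might look like the sketch below. It assumes coded steps are reduced to (material, energy) pairs and groups them by the energy facet; the choice of grouping facet, the function names, and the sample data are illustrative assumptions, and the actual grouping would be decided during analysis.

```python
from collections import Counter, defaultdict

# Hypothetical coded steps as (material, energy) pairs, one per observed action.
coded_steps = [
    ("workers", "restart"),
    ("workers", "restart"),
    ("queue", "check connections"),
    ("dependency", "verify health"),
    ("queue", "check connections"),
    ("workers", "restart"),
]


def group_by_energy(steps):
    """Group steps by the energy facet, counting how often each material appears."""
    groups = defaultdict(Counter)
    for material, energy in steps:
        groups[energy][material] += 1
    return groups


def ranked(groups):
    """Order groups by total frequency, most common first."""
    return sorted(
        groups.items(),
        key=lambda item: sum(item[1].values()),
        reverse=True,
    )


for energy, materials in ranked(group_by_energy(coded_steps)):
    print(energy, dict(materials))
```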

Why go through the effort to create a taxonomy? How to Solve It organizes its heuristics alphabetically, How to Solve It: Modern Heuristics organizes items hierarchically by academic discipline, and Discussion of the Method: Conducting the Engineer’s Approach to Problem Solving lacks any referenceable organization. There are two reasons: construction and reference. Construction-wise, a taxonomy of heuristics can reveal gaps in the catalog (much as gaps in the periodic table pointed to undiscovered elements). Reference-wise, a good organization will make the catalog easier for readers to apply to incidents.

Typology versus Taxonomy. Both are terms for categorization, but in some domains a typology is conceptual while a taxonomy is empirical. Typologies are organized by dimensions that a human specifies, while a taxonomy is created via cluster analysis based on natural attributes or dimensions selected by humans. Since heuristics lack natural attributes, the more correct term for this activity is typology, albeit more obscure.

Testing the Taxonomy

Ideally, the true test of efficacy would compare repair times with and without the catalog. However, this would require controlled experiments or a large statistical study, both of which are studies in their own right. Instead, the validity of the catalog will be evaluated for (practical) completeness and likelihood of utility.

Completeness will be evaluated via out-of-band examples. For the incidents and public RCAs marked as out-of-band during the data collection phase, we will map the incident’s narrative to the coded catalog to check for any absences in the catalog or difficulties in following the organization. Out-of-band examples may be added to the catalog (assuming they meet our inclusion threshold), but their initial absence is a useful statistical indicator of completeness.
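
One way to quantify this check is a simple coverage fraction. This is a sketch under stated assumptions: the proposal does not define a specific metric, and the function name `coverage`, the catalog contents, and the held-out steps below are hypothetical.

```python
def coverage(out_of_band_steps, catalog):
    """Return (fraction of steps found in the catalog, sorted missing steps).

    `out_of_band_steps` lists the techniques observed in held-out incidents;
    `catalog` is the set of techniques already in the coded catalog.
    """
    steps = list(out_of_band_steps)
    if not steps:
        return 0.0, []
    missing = sorted({s for s in steps if s not in catalog})
    covered = sum(1 for s in steps if s in catalog)
    return covered / len(steps), missing


# Hypothetical catalog and held-out observations.
catalog = {"restart service", "rollback", "throttle traffic source"}
held_out = ["rollback", "restart service", "flush dns cache", "rollback"]

fraction, missing = coverage(held_out, catalog)
print(fraction, missing)
```

The list of missing steps is as informative as the fraction, since each missing step is a candidate addition if it meets the inclusion threshold.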

Likelihood of utility will be evaluated via an outside group of both inexperienced and experienced engineers who will be asked to use the catalog on prior incidents (of their choosing) and current incidents and record their experiences. (See Use testing for more context.) Their experiences will be recorded via survey and aggregated.

Hypothetical Categorization

To provide an idea of what the categorization might look like:

  1. Have you tried turning it off and on again? (Blind luck strategies)
    1. Restarting the complaining service
    2. Restarting the complaining service’s dependencies
    3. Redeploying
    4. Moving to another host / cluster
    5. Rollback
    6. Search internet for similar error messages
  2. Configuration management
    1. Restore previous configuration
    2. Attempt to isolate problem to certain instances / shard
    3. Verify configuration same between instances
    4. Restore previous configuration (dependency)
  3. Capacity management
    1. Increase allocation
    2. Decrease allocation
    3. Rebalance among shards
  4. Traffic management
    1. Throttle traffic source
    2. Eliminate traffic source
    3. Route traffic to another shard
    4. Verify traffic along route
    5. Eliminate traffic magnification
  5. Broaden scope
    1. Orchestration issue
    2. Operating System/Host Issue
    3. DNS / Service Lookup Issue
    4. Network Issue

Potentially, the categorization could become a decision tree, depending on the statistical strength of the data.

Each entry would contain a description, including any known failure modes and additional context.
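
To make the shape of an entry concrete, the sketch below models one branch of the hypothetical categorization as a tree. The class name `CatalogEntry` and its fields are assumptions for illustration; descriptions and failure modes are left empty because they would come from the collected data.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class CatalogEntry:
    """A category or technique, with optional nested entries."""

    name: str
    description: str = ""  # to be filled in from coded data
    failure_modes: List[str] = field(default_factory=list)
    children: List["CatalogEntry"] = field(default_factory=list)

    def walk(self, depth=0):
        """Yield (depth, entry) pairs in outline order."""
        yield depth, self
        for child in self.children:
            yield from child.walk(depth + 1)


# The "Capacity management" branch from the hypothetical categorization.
capacity = CatalogEntry(
    name="Capacity management",
    children=[
        CatalogEntry("Increase allocation"),
        CatalogEntry("Decrease allocation"),
        CatalogEntry("Rebalance among shards"),
    ],
)

for depth, entry in capacity.walk():
    print("  " * depth + entry.name)
```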

Prior Art (and why this is novel)

How to Solve It by G. Polya

This is a classic work on mathematical methods to solve problems. Although some of the heuristics described (e.g. Do you know a related problem, Draw a figure, Generalization) are applicable beyond mathematical problems, most of the heuristics fall within the mathematics domain. In contrast, the proposed project targets the operation of software-intensive systems and also seeks to find quantitatively which heuristics seem to work most often.

How to Solve It: Modern Heuristics by Z. Michalewicz and D. B. Fogel

This book follows a similar approach to How to Solve It, but changes the domain to computer science algorithms by discussing approaches like constraint solving, search, and evolution. None of the heuristics are particularly relevant to operations.

Effective Troubleshooting by Chris Jones (Google’s SRE Book); Atlassian Incident Management Handbook

Both focus on the management of incidents (e.g. communication, maintaining focus, leadership) and the general process of solving problems rather than specific actions or techniques the engineering group can perform, particularly if the cause is novel. Although a scientific or structured approach certainly aids in mitigating issues, our work will be focused on specific techniques to be used within that greater framework.

How to mitigate the incident? An effective troubleshooting guide recommendation technique for online service systems [1]

This paper studies “troubleshooting guides” (TSGs) at Microsoft and develops a recommendation system that suggests specific actions operators can take based on an incident description. Examining incident records, the authors find that at least 27.2% of incidents had an existing TSG and that a large proportion (~60%) of incidents involved novel failures (no existing or relevant guide). The proposed work is not focused on making it easier to find existing runbooks for recurring or known failure modes, but on shortening the time to restore for novel or ambiguous failure modes, which the paper suggests are the majority of cases within its sample. [2]

Publication Venue

We plan to publish in a peer-reviewed venue with free access to the final paper (e.g. Open Access). Potential venues include SREcon, DevOpsDays, and magazines with broader scope like Communications of the ACM (Contributed Articles section).

If there is a large amount of material, this could become a book or long-form website, but we would prefer to start with an article.

  1. Jiajun Jiang, Weihai Lu, Junjie Chen, Qingwei Lin, Pu Zhao, Yu Kang, Hongyu Zhang, Yingfei Xiong, Feng Gao, Zhangwei Xu, Yingnong Dang, and Dongmei Zhang. 2020. How to mitigate the incident? An effective troubleshooting guide recommendation technique for online service systems. In Proceedings of the 28th ACM Joint Meeting on European Software Engineering Conference and Symposium on the Foundations of Software Engineering (ESEC/FSE 2020). Association for Computing Machinery, New York, NY, USA, 1410–1420.

  2. As noted in the paper, Microsoft does not require TSGs to be developed or maintained. We expect the availability of runbooks to vary widely between organizations based on their culture.