Application-level Chaos Engineering

Time: Tue 2022-11-29 09.00

Location: F3, Lindstedtsvägen 26 & 28, Stockholm

Language: English

Subject area: Computer Science

Doctoral student: Long Zhang, Teoretisk datalogi, TCS

Opponent: Professor Leon Moonen, Simula Research Laboratory, Oslo, Norway

Supervisor: Professor Martin Monperrus, Teoretisk datalogi, TCS; Professor Benoit Baudry, Programvaruteknik och datorsystem, SCS

QC 20221102


Modern software systems are highly complex. To keep them as reliable as possible, developers design various error-handling mechanisms. Because error-handling code must work properly in production, it should be tested offline and also verified in production after the system is deployed. Chaos engineering is a technique that assesses a software system's error-handling mechanisms directly in production. To apply it, developers first monitor the target system and identify its steady state. They then inject specific failures in a controlled manner so that the system's error-handling code is triggered and can be analyzed. By comparing the behavior observed during a chaos engineering experiment with the steady state, developers confirm whether the error-handling mechanisms work as expected.
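The experiment loop described above (observe the steady state, inject a failure, compare) can be illustrated with a minimal sketch. It is not taken from the thesis: the service, the `success_rate` metric, the injected `ConnectionError`, and the tolerance are all hypothetical choices made for illustration, and the "production system" is simulated in-process.

```python
import random


def handle_request(fetch):
    """Hypothetical service endpoint with an error-handling fallback."""
    try:
        return fetch()
    except ConnectionError:
        # Error-handling code under test: serve a degraded but valid answer.
        return "cached"


def success_rate(fetch, n=200):
    """Fraction of n requests that return a usable (non-None) response."""
    ok = sum(1 for _ in range(n) if handle_request(fetch) is not None)
    return ok / n


def healthy_fetch():
    """Hypothetical dependency call that works under normal conditions."""
    return "fresh"


def make_faulty_fetch(failure_probability, rng):
    """Wrap the dependency so that it fails with the given probability."""
    def faulty_fetch():
        if rng.random() < failure_probability:
            raise ConnectionError("injected failure")
        return healthy_fetch()
    return faulty_fetch


def run_experiment(failure_probability=0.3, tolerance=0.01, seed=0):
    rng = random.Random(seed)
    # 1. Observe the steady state without any injected failure.
    steady_state = success_rate(healthy_fetch)
    # 2. Inject a controlled failure into the dependency.
    faulty = make_faulty_fetch(failure_probability, rng)
    # 3. Observe the same metric while the failure is active.
    observed = success_rate(faulty)
    # 4. Compare the observed behavior with the steady state.
    resilient = abs(steady_state - observed) <= tolerance
    return steady_state, observed, resilient
```

In this sketch the fallback absorbs every injected `ConnectionError`, so the metric stays at its steady-state value. An exception type that `handle_request` does not catch would instead propagate, which is the kind of weakness such an experiment is meant to expose.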

Technical challenges remain that limit the effectiveness of chaos engineering. This thesis contributes to three of these open challenges.

First, because chaos engineering experiments run in production, it is important to improve their efficiency. To reduce the number of unrealistic experiments, we propose a new approach that synthesizes chaos engineering fault models from errors that occur naturally in production.
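One way to picture the idea of deriving fault models from naturally occurring errors is sketched below. This is an assumption-laden illustration, not the approach proposed in the thesis: the log format, the `min_count` threshold, and all service and exception names are hypothetical.

```python
from collections import Counter


def synthesize_fault_model(observed_errors, min_count=2):
    """Keep only (location, exception) pairs seen at least min_count times.

    observed_errors is a list of (code location, exception type) pairs,
    a hypothetical stand-in for errors collected from production monitoring.
    """
    counts = Counter(observed_errors)
    total = sum(counts.values())
    return [
        {"location": loc, "exception": exc, "observed_share": n / total}
        for (loc, exc), n in counts.most_common()
        if n >= min_count
    ]


# Hypothetical production error log.
log = [
    ("OrderService.submit", "SocketTimeoutException"),
    ("OrderService.submit", "SocketTimeoutException"),
    ("OrderService.submit", "SocketTimeoutException"),
    ("CartService.load", "IOException"),
    ("CartService.load", "IOException"),
    ("UserService.find", "NullPointerException"),
]
model = synthesize_fault_model(log)
```

Only failures that were actually observed, and observed repeatedly, end up as injection candidates. Restricting the fault model this way is one simple means of avoiding experiments that inject failures the system would never meet in practice.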

Second, sufficient observability is key to analyzing a system's steady state and detecting abnormal behavior during chaos engineering experiments. We propose a multi-layer observability improvement solution for Dockerized Java applications. With it, developers can improve an application's observability at the operating system level, the runtime environment level, and the application level with limited effort.

Last, chaos engineering should help locate the actual places where resilience can be improved. We propose three fault injection approaches that apply chaos engineering at the application level in order to take domain-specific knowledge into account.