Finding Faults in Distributed Systems (really)

06/18/2019 - 12:20 to 13:00
long talk (40 min)

Session abstract: 

Distributed systems are hard to troubleshoot. The reasons are simple. These systems are complex and, well, distributed. With containers, Kubernetes, and IoT, services are spread across hardware for redundancy, but that hardware is also shared with other services. Embedding machine learning into systems can make things worse because we often don't even know precisely what a system should be doing.

There are some very simple techniques, however, that can help detect and localize faults in complex systems, even when these faults affect only a tiny fraction of the system's operations and may not involve any outright failed requests. Moreover, even while detecting these faint signals, we have to avoid being flooded with false positives.

Simple enough to run on tiny hardware, but efficient enough to run at scale, the half dozen or so techniques I will describe belong in your toolbox. I will explain how they work, where you can get the code, and how you can embed them into your system. I will also give a tour of how they work in an existing complex system.