Context-Aware Consensus for Efficient State Machine Replication

Time: Mon 2026-06-08 14.00

Location: Kollegiesalen, Brinellvägen 8, Stockholm

Video link: https://kth-se.zoom.us/j/67910663588

Language: English

Doctoral student: Harald Ng, Datatekniska och lärande system

Opponent: Professor Khuzaima Daudjee, Cheriton School of Computer Science, University of Waterloo, Waterloo, ON, Canada

Supervisor: Associate professor Paris Carbone, Datatekniska och lärande system; Professor Sarunas Girdzijauskas

QC 20260513

Abstract

Consensus protocols are a cornerstone of distributed systems. They enable multiple, independent nodes to reach an agreement that cannot be overturned. This is fundamental in building replicated services that appear as a single system with strong consistency and high availability guarantees. Today, consensus protocols are central to many critical services, from cluster orchestration systems to global-scale distributed databases. However, despite their widespread adoption across diverse systems, the consensus protocols used in practice typically adhere to a general-purpose design that is agnostic to the execution environment in which they operate. As a result, significant optimization opportunities are left unexploited across the network, workload, and storage layers. These opportunities have become increasingly important to leverage as modern applications and infrastructures demand higher performance, stronger resilience, and more flexibility than a one-size-fits-all design can provide.

This dissertation explores how consensus protocols can be made context-aware to derive optimizations from their execution environment that improve resilience, performance, and adaptability. We present four such mechanisms, each addressing a different layer of the consensus stack. Omni-Paxos provides a principled approach to handling partial connectivity failures at the network layer. UniCache reduces redundant communication by learning from recurring patterns in the application workload. Metronome leverages the characteristics of persistent storage to expose a fine-grained trade-off between runtime and recovery performance. AutoQ continuously adapts critical configuration parameters to sustain high performance under changing workloads in multi-region deployments. Crucially, these improvements are realized as bolt-on mechanisms that attach to protocols commonly used in practice, avoiding the burden of designing a new protocol from scratch. Together, these contributions demonstrate that embedding context-awareness into consensus allows established protocols to effectively meet the performance and resilience demands of modern distributed systems.