Date Posted: 6/21/2019

CS Associate Professor Nate Foster and colleagues from Princeton, Johns Hopkins, Berkeley, and other institutions will present their award-winning paper “NetChain: Scale-Free Sub-RTT Coordination” at a SIGMETRICS plenary talk, “Highlights Beyond SIGMETRICS,” this June at the ACM FCRC (Association for Computing Machinery, Federated Computing Research Conference) in Phoenix. SIGMETRICS is a conference focused on treating computer systems as objects of empirical study, for example by measuring performance and building quantitative models. The “highlights” track seeks to bring in engaging ideas from other areas of CS that may be of interest to the SIGMETRICS community.

The NetChain paper, by Xin Jin (Johns Hopkins University), Xiaozhou Li (Barefoot Networks), Haoyu Zhang (Princeton University), Nate Foster (Cornell University), Jeongkeun Lee (Barefoot Networks), Robert Soulé (Università della Svizzera italiana), Changhoon Kim (Barefoot Networks), and Ion Stoica (UC Berkeley), was awarded Best Paper at the USENIX Symposium on Networked Systems Design and Implementation (NSDI) in 2018.

The algorithm behind NetChain, chain replication, was introduced by Fred Schneider and Robbert van Renesse in a 2004 Operating Systems Design and Implementation (OSDI) paper on “Chain Replication.”

As noted in The Morning Paper, Jin et al. “have demonstrated how to build a coordination service (think Apache ZooKeeper) with incredibly low latency and high throughput. We’re talking 9.7 microseconds for both reads and writes, with scalability on the order of tens of billions of operations per second.” For this reason, “[b]y using NetChain as a lock server, the system can achieve orders of magnitude higher transaction throughput than ZooKeeper.”

Here is the abstract of the NetChain paper:

  • Coordination services are a fundamental building block of modern cloud systems, providing critical functionalities like configuration management and distributed locking. The major challenge is to achieve low latency and high throughput while providing strong consistency and fault-tolerance. Traditional server-based solutions require multiple round-trip times (RTTs) to process a query. This paper presents NetChain, a new approach that provides scale-free sub-RTT coordination in datacenters. NetChain exploits recent advances in programmable switches to store data and process queries entirely in the network data plane. This eliminates the query processing at coordination servers and cuts the end-to-end latency to as little as half of an RTT—clients only experience processing delay from their own software stack plus network delay, which in a datacenter setting is typically much smaller. We design new protocols and algorithms based on chain replication to guarantee strong consistency and to efficiently handle switch failures. We implement a prototype with four Barefoot Tofino switches and four commodity servers. Evaluation results show that compared to traditional server-based solutions like ZooKeeper, our prototype provides orders of magnitude higher throughput and lower latency, and handles failures gracefully.
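For readers unfamiliar with chain replication, the idea the abstract builds on can be sketched in a few lines of Python. This is a simplified, in-memory illustration and not NetChain's switch implementation: the class and method names (`ChainNode`, `Chain`, `write`, `read`, `fail`) are hypothetical, nodes are plain dictionaries rather than programmable switches, propagation is a synchronous loop rather than packets traversing the network, and failure handling only removes a node without the recovery step that would add a replacement.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class ChainNode:
    """One replica in the chain (a switch in NetChain; a dict here)."""
    name: str
    store: Dict[str, str] = field(default_factory=dict)


class Chain:
    """Toy chain replication: writes flow head -> tail, reads go to the tail."""

    def __init__(self, names: List[str]) -> None:
        if not names:
            raise ValueError("a chain needs at least one node")
        self.nodes = [ChainNode(n) for n in names]

    def write(self, key: str, value: str) -> str:
        # A write enters at the head and is applied node by node.
        # It is acknowledged only after the tail has applied it.
        for node in self.nodes:
            node.store[key] = value
        return self.nodes[-1].name  # name of the node that acknowledges

    def read(self, key: str) -> Optional[str]:
        # The tail only holds writes that every node has applied,
        # so reading from it never returns a partially replicated value.
        return self.nodes[-1].store.get(key)

    def fail(self, name: str) -> None:
        # Simplified failover: drop the failed node and splice its
        # neighbours together. Recovery with a new node is omitted.
        survivors = [n for n in self.nodes if n.name != name]
        if not survivors:
            raise RuntimeError("all replicas have failed")
        self.nodes = survivors
```

Used as a lock table, `Chain(["s0", "s1", "s2"]).write("lock", "client-a")` is acknowledged by `s2`, and a later `read("lock")` still returns `"client-a"` after `fail("s2")`, because the surviving nodes had already applied the write.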