A Scalable Failure Recovery Model for Tree-based Overlay Networks
Dorian Arnold, Barton Miller
We present a scalable failure recovery model for data aggregations in large-scale tree-based overlay networks (TBONs). A TBON is a network of hierarchically organized processes that exploits the logarithmic scaling properties of trees to provide scalable data multicast, gather, and in-network aggregation. TBONs are commonly used in debugging and performance tools, system monitoring, information management systems, stream processing, and mobile ad hoc networks. Our recovery model leverages the information redundancies inherent in TBON computations. This redundant information is gathered from non-failed processes to compensate for computation and communication state lost due to failures. This state compensation strategy is attractive because: (1) it avoids the time and resource overheads of previous reliability approaches, which rely on explicit replication; (2) recovery is rapid and involves only a small subset of the network; and (3) it applies to many useful, complex computations. In this paper, we formalize the TBON model and its fundamental properties to prove that our state compensation model properly preserves computational semantics across TBON process failures. These properties lead to an efficient implementation of state compensation, which we use to empirically validate and evaluate recovery performance. We show that state compensation can recover from failures in extremely large TBONs in milliseconds, causing practically no interruption of application service.
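
As a rough illustration of in-network aggregation and of why an aggregation can survive the loss of an internal process, the following Python sketch may help. It is not the report's algorithm: the Node class, the aggregate and remove_failed functions, the choice of integer sum as the aggregation, and the policy of having the failed process's parent adopt its orphaned children are all assumptions made for this example. The sketch also assumes that internal processes hold no input of their own, so only leaves contribute data.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Node:
    """One process in a hypothetical TBON; only leaves carry input here."""
    name: str
    value: int = 0
    children: List["Node"] = field(default_factory=list)


def aggregate(node: Node) -> int:
    """In-network aggregation: each node sums its own value and its children's results."""
    return node.value + sum(aggregate(child) for child in node.children)


def remove_failed(root: Node, failed_name: str) -> bool:
    """Assumed recovery policy: the failed node's parent adopts its orphans.

    Returns True if a node with the given name was found and removed.
    The root itself is never removed by this function.
    """
    for i, child in enumerate(root.children):
        if child.name == failed_name:
            # Splice the orphaned children into the parent's child list.
            root.children[i:i + 1] = child.children
            return True
        if remove_failed(child, failed_name):
            return True
    return False


def build_tree() -> Node:
    """Build a small two-level tree with four leaves holding 1, 2, 3, 4."""
    leaves = [Node(f"leaf{i}", value=i + 1) for i in range(4)]
    a = Node("a", children=leaves[:2])
    b = Node("b", children=leaves[2:])
    return Node("root", children=[a, b])


tree = build_tree()
before = aggregate(tree)      # sum over all leaves with the tree intact
remove_failed(tree, "a")      # internal process "a" fails
after = aggregate(tree)       # same leaves, now reached through a shorter path
print(before, after)
```

Because the sum is associative and commutative, regrouping the leaves under a different parent leaves the result unchanged. Only the failed process's parent and children are touched, which is in the spirit of the claim that recovery involves a small subset of the network, although the state compensation described in the report also covers in-flight communication state that this sketch does not model.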