Memory Efficient Systems for the Modern Data Processing Stack

Yifan Dai
Department of Computer Sciences
University of Wisconsin-Madison

Abstract:

The world has witnessed exponential growth of data. People interact with data every day through social media, e-commerce, and so on.

People have built software stacks to efficiently work with data. A modern data processing stack includes layers that are responsible for ingesting, storing, transforming, and utilizing data. With the data becoming larger, the data processing stack requires better memory efficiency. There has been extensive study from the systems community to optimize each layer of the stack, but communication between them is often overlooked, leading to hidden overheads across layers.

In this thesis, we introduce three aspects of study on memory efficiency of the data processing stack, both on optimizations within a single layer and those across layers. The first part of the thesis introduces a learned index for Log-structured Merge (LSM) Trees. The learned index is 0.5x to 0.75x smaller than the original index and improves the in-memory workload performance by 1.5x on average.

In the second part, we focus on the caching problem between the layer of storage engines and the layer of the underlying kernel. Both storage engines and the underlying kernel use data caching and they share the same memory quota, implicitly forming a two-level cache structure of which storage engines are often unaware. We introduce a framework that automatically optimizes cache allocation of storage engines by online cache simulation and dynamically adapts to different workloads. We incorporate our system into 3 popular key-value storage systems and provide a 1.5x gain on average.

The third part focuses on removing data copying and duplication in data pipelines on a single machine. Data communication between different nodes in a data pipeline currently requires full copying even if they are located on the same machine due to limited kernel support. We build a pipeline execution engine with co-design of new kernel mechanisms and container runtime. Our new system provides a 2x gain for real-world data pipelines.

We design and implement our solutions into real systems and yield benefits on real data processing workloads. We believe that our work will inspire the future development of the data processing stack.

Full Paper: PDF BibTeX

Publications