The 5th Workshop on Data Management for End-to-End Machine Learning, Sunday, 20th of June, 2021

Held in conjunction with ACM SIGMOD/PODS 2021
Xi'an, Shaanxi, China, June 20th - June 25th, 2021


Applying Machine Learning (ML) in real-world scenarios is a challenging task. In recent years, the main focus of the database community has been on creating systems and abstractions for the efficient training of ML models on large datasets. However, model training is only one of many steps in an end-to-end ML application, and a number of orthogonal data management problems arise from the large-scale use of ML, which require the attention of the data management community.

For example, data preprocessing and feature extraction workloads may be complicated and require simultaneous execution of relational and linear algebraic operations. Next, model selection may involve searching many combinations of model architectures, features, and hyperparameters to find the best-performing model. After model training, the resulting model may have to be deployed and integrated into business workflows and require lifecycle management using metadata and lineage. As a further complication, the resulting system may have to take into account a heterogeneous audience, ranging from domain experts without programming skills to data engineers and statisticians who develop custom algorithms.

Additionally, the importance of incorporating ethics and legal compliance into machine-assisted decision-making is being broadly recognized. Critical opportunities for improving data quality and representativeness, controlling for bias, and allowing humans to oversee and impact computational processes are missed if we do not consider the lifecycle stages upstream from model training and deployment. DEEM welcomes research on providing system-level support to data scientists who wish to develop and deploy responsible machine learning methods.

DEEM aims to bring together researchers and practitioners at the intersection of applied machine learning, data management, and systems research, with the goal of discussing the data management issues that arise in ML application scenarios. The workshop solicits regular research papers describing preliminary and ongoing research results. In addition, the workshop encourages the submission of industrial experience reports of end-to-end ML deployments.

Areas of particular interest for the workshop include (but are not limited to):

 - Data Management in Machine Learning Applications
 - Definition, Execution and Optimization of Complex Machine Learning Pipelines
 - Systems for Managing the Lifecycle of Machine Learning Models
 - Systems for Efficient Hyperparameter Search and Feature Selection
 - Machine Learning Services in the Cloud
 - Modeling, Storage and Provenance of Machine Learning Artifacts
 - Integration of Machine Learning and Dataflow Systems
 - Integration of Machine Learning and ETL Processing
 - Definition and Execution of Complex Ensemble Predictors
 - Sourcing, Labeling, Integrating, and Cleaning Data for Machine Learning
 - Data Validation and Model Debugging Techniques
 - Privacy-preserving Machine Learning
 - Benchmarking of Machine Learning Applications
 - Responsible Data Management
 - Transparency and Accountability of Machine-Assisted Decision Making
 - Impact of Data Quality and Data Preprocessing on the Fairness of ML Predictions


Paper submission deadline:              15th of March
Author notification:                    19th of April
Deadline for camera-ready copy:         24th of May
Workshop:                               Sunday, 20th of June


Submissions can be short papers (4 pages) or long papers (up to 10 pages). Authors are requested to prepare submissions following the ACM proceedings format. Please use the latest ACM paper format (last update 11/2020). DEEM is a single-blind workshop; authors must include their names and affiliations on the manuscript cover page.

Submission Website:
Inclusion and Diversity in Writing:


The workshop proceedings will be published in ACM DL.


Workshop Chairs:

 - Matthias Boehm (TU Graz)
 - Julia Stoyanovich (NYU)
 - Steven Whang (KAIST)


Program Committee:

 - Alekh Jindal (Microsoft)
 - Alex Ratner (Stanford)
 - Andrey Gubichev (Google)
 - Arash Termehchy (Oregon State University)
 - Arun Kumar (University of California, San Diego)
 - Bolin Ding (Alibaba Group)
 - Doris Xin (UC Berkeley)
 - Georgia Koutrika (Athena Research Center)
 - Guoliang Li (Tsinghua University)
 - Jae-Gil Lee (KAIST)
 - Ke Yang (New York University)
 - Maya Ramanath (IIT Delhi)
 - Meihui Zhang (Beijing Institute of Technology)
 - Neoklis Polyzotis (Google)
 - Nesime Tatbul (Intel Labs and MIT)
 - Rainer Gemulla (Universität Mannheim)
 - Rajesh Bordawekar (IBM T. J. Watson Research Center)
 - Srikanta Bedathur (IIT Delhi)
 - Sudip Roy (Google)
 - Uwe Roehm (The University of Sydney)