CMU-CS-20-140
Computer Science Department, School of Computer Science, Carnegie Mellon University

Checkpoint-Free Fault Tolerance for
Kaige Liu
M.S. Thesis, December 2020
Deep-learning-based recommendation models (DLRMs) are widely deployed to serve personalized content to users. DLRMs are large due to their use of embedding tables, and are trained by distributing the model across the memory of tens or hundreds of servers. Checkpointing is the predominant approach to fault tolerance in these systems, but it incurs significant training-time overhead both during normal operation and when recovering from failures. Because these overheads grow with DLRM size, checkpointing is slated to become an even larger burden for future DLRMs.

In this thesis, we present ECRM, a DLRM training system that achieves efficient fault tolerance using erasure coding. ECRM chooses which DLRM parameters to encode and where to place them in a training cluster, correctly and efficiently updates parities during normal operation, and recovers from failures without pausing training while maintaining consistency of the recovered parameters. ECRM's design enables training to proceed without any pauses both during normal operation and during recovery.

We implement ECRM atop XDL, an open-source, industrial-scale DLRM training system. Compared to checkpointing, ECRM reduces training-time overhead by up to 88%, recovers from failures significantly faster, and allows training to proceed during recovery. These results show the promise of erasure coding for imparting efficient fault tolerance to the training of current and future DLRMs.
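The core idea the abstract describes, encoding parameters so that a failed server's shard can be reconstructed from the surviving shards plus a parity, and updating the parity incrementally as parameters change during training, can be illustrated with a minimal single-parity sketch. This is a hypothetical illustration of the general technique, not ECRM's actual code or coding scheme; the shard layout and function names are assumed for exposition:

```python
# Minimal single-parity erasure-coding sketch (illustrative only, not ECRM's
# actual scheme). Each "shard" is the slice of model parameters held by one
# server, represented here as a list of floats. A sum parity over the reals
# lets any single lost shard be recovered as parity minus the survivors.

def make_parity(shards):
    # Encode: parity is the elementwise sum across all data shards.
    return [sum(vals) for vals in zip(*shards)]

def update_parity(parity, old_row, new_row):
    # Incremental parity update when a parameter row changes during training:
    # apply only the delta, rather than re-encoding every shard.
    return [p + (n - o) for p, o, n in zip(parity, old_row, new_row)]

def recover(parity, surviving_shards):
    # Reconstruct the single failed shard from the parity and the survivors.
    return [p - sum(vals) for p, vals in zip(parity, zip(*surviving_shards))]
```

For example, with three parameter shards and one parity, a training step that updates shard 0 also applies the same delta to the parity; if the server holding shard 0 then fails, its contents are recovered from the parity and the two surviving shards, without reloading a checkpoint.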
50 pages
Thesis Committee:
Srinivasan Seshan, Head, Computer Science Department