CMU-CS-18-127
Computer Science Department
School of Computer Science, Carnegie Mellon University




Framework Design for Improving Computational Efficiency
and Programming Productivity for Distributed Machine Learning

Jin Kyu Kim

Ph.D. Thesis

December 2018

CMU-CS-18-127.pdf


Keywords: Distributed Systems, Large-Scale Machine Learning, Programming Framework, Computer Science

Machine learning (ML) methods are used to analyze data in a wide range of areas, such as finance, e-commerce, medicine, science, and engineering. In the era of big data, ML problems have grown rapidly in both data size and model size. This trend drives industry and academia toward distributed machine learning, which scales out ML training across a distributed system so that it completes in a reasonable amount of time. Implementing distributed machine learning poses two challenges: computational efficiency and programming productivity. The traditional data-parallel approach often yields suboptimal training performance in distributed ML because of data dependencies among model parameter updates and nonuniform convergence rates across model parameters. From an ML programmer's perspective, distributed ML programming incurs substantial development overhead even with high-level frameworks, because those frameworks force the programmer to abandon the familiar sequential programming model for a different mental model.

The goal of my thesis is to improve the computational efficiency and programming productivity of distributed machine learning. In an efficiency study, I explore model update scheduling schemes that take into account data dependencies and the nonuniform convergence speeds of model parameters in order to maximize convergence per iteration, and I present STRADS, a runtime system that efficiently executes scheduled model updates for ML applications in a distributed system. In a productivity study, I present a familiar, sequential-like programming API that simplifies the conversion of a sequential ML program into a distributed program without requiring the programmer to switch to a different mental model, and I implement STRADS-Automatic Parallelization (STRADS-AP), a new runtime system that efficiently executes ML applications written in this API on a distributed system.
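
To make the efficiency idea concrete, here is a minimal sketch of dependency-aware, priority-based update scheduling. All names here (Update, schedule_round, the progress field) are illustrative assumptions, not STRADS's actual interfaces: the sketch visits the least-converged parameters first and greedily builds a batch of non-conflicting updates.

```cpp
#include <algorithm>
#include <cstddef>
#include <unordered_set>
#include <vector>

// Illustrative stand-in: each candidate update touches a set of
// parameter indices and carries an estimate of how far those
// parameters are from convergence (their recent change magnitude).
struct Update {
  std::unordered_set<int> params;  // parameters this update reads/writes
  double progress;                 // larger = farther from converged
};

// One hypothetical scheduling round: visit updates in order of
// convergence progress (nonuniform convergence rates), then greedily
// pick a batch whose parameter sets do not overlap, so the batch can
// run in parallel without violating data dependencies.
std::vector<std::size_t> schedule_round(const std::vector<Update>& updates,
                                        std::size_t budget) {
  std::vector<std::size_t> order(updates.size());
  for (std::size_t i = 0; i < order.size(); ++i) order[i] = i;
  std::sort(order.begin(), order.end(), [&](std::size_t a, std::size_t b) {
    return updates[a].progress > updates[b].progress;
  });

  std::vector<std::size_t> batch;
  std::unordered_set<int> taken;  // parameters already claimed this round
  for (std::size_t i : order) {
    if (batch.size() == budget) break;
    bool conflict = false;
    for (int p : updates[i].params)
      if (taken.count(p)) { conflict = true; break; }
    if (conflict) continue;  // skip dependent updates this round
    taken.insert(updates[i].params.begin(), updates[i].params.end());
    batch.push_back(i);
  }
  return batch;  // indices of updates safe to execute concurrently
}
```

Similarly, for the productivity idea, a minimal sketch in the spirit of the sequential-like API: a sequential training loop is converted by swapping the container type and the loop operator. The names dvector and parallel_for are illustrative stand-ins (backed here by a sequential fallback), not the actual STRADS-AP API.

```cpp
#include <cstddef>
#include <vector>

// Minimal stand-in types so the sketch is self-contained.
struct Sample { double x, y; };

struct Model {
  std::vector<double> w;
  void update(const Sample& s) {
    // one gradient-style step; the details do not matter for the sketch
    for (double& wi : w) wi += 1e-3 * s.x * s.y;
  }
};

// Illustrative stand-ins for a distributed container and a parallel
// loop operator; here they fall back to sequential execution.
template <typename T>
using dvector = std::vector<T>;

template <typename T, typename F>
void parallel_for(dvector<T>& data, F body) {
  for (auto& elem : data) body(elem);
}

// The sequential training loop an ML programmer starts from.
void sequential_train(std::vector<Sample>& data, Model& model) {
  for (std::size_t i = 0; i < data.size(); ++i)
    model.update(data[i]);
}

// The same loop after conversion: only the container type and the loop
// operator change; the loop body and the mental model stay sequential.
void distributed_train(dvector<Sample>& data, Model& model) {
  parallel_for(data, [&](Sample& s) { model.update(s); });
}
```

In a real runtime, the distributed container would partition its elements across machines and the parallel loop operator would dispatch the loop body to workers while handling conflicting parameter updates; the programmer's loop body stays unchanged.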

Thesis Committee:
Garth A. Gibson (Co-Chair)
Eric P. Xing (Co-Chair)
Phillip Gibbons
Joseph E. Gonzalez (University of California, Berkeley)

Srinivasan Seshan, Head, Computer Science Department
Tom M. Mitchell, Interim Dean, School of Computer Science


138 pages


