CMU-CS-22-123
Computer Science Department
School of Computer Science, Carnegie Mellon University

Elevating Jupyter Notebook Maintenance Tooling
Yuan Jiang
M.S. Thesis
August 2022
Data analysis is an exploratory, interactive, and often collaborative process. Computational notebooks have become a popular tool to support this process, in part because of their ability to interleave code, narrative text, and results. The exploratory nature of computational notebooks allows their users to edit and execute parts of their program in any order. However, notebooks in practice are often criticized as hard to maintain and of low code quality, with problems such as unused or duplicated code and out-of-order code execution. Data scientists can benefit from better tool support when maintaining and evolving notebooks. We argue that identifying the structure of notebooks is central to such tool support. We present a lightweight and accurate approach to extract notebook structure and outline several ways such structure can be used to improve maintenance tooling for notebooks, including navigation and finding common structural patterns. In addition, we investigate the history of notebooks and extend our approach to visualize how notebooks evolve over multiple revisions. We measure statistics of changed, added, and removed cells in Kaggle notebooks with version histories. Our formative study shows that our visualizations can be useful for tracing and understanding changes as notebooks evolve and for identifying alternatives explored in specific stages of a data analysis pipeline over notebook histories.
43 pages
Thesis Committee:
Srinivasan Seshan, Head, Computer Science Department