CMU-CS-20-109
Computer Science Department
School of Computer Science, Carnegie Mellon University

Monaural Source Separation in the Wild
Tianjun Ma
M.S. Thesis
May 2020
Monaural source separation refers to the process of extracting individual components from a mixture, where the mixture is a single-channel audio recording of multiple sources emitting sounds simultaneously, and the individual components are the constituent sounds emitted by each source. In recent years, data-driven approaches using deep neural network-based models for monaural source separation have been shown to outperform their non-data-driven counterparts. However, these approaches are designed using specialized datasets in which the sources belong to a constrained set of categories and the mixtures are not very representative of audio mixtures in the real world. Consequently, whether existing models can generalize to more complex source separation settings remains an open question. In this work, we study and formalize the notion of monaural source separation in real-world scenarios and explore model designs that adapt to such complex settings. Specifically, we present the Wild-Mix Dataset, a synthetic dataset in which mixtures consist of sources belonging to a variety of sound categories and are synthesized in dynamic ways. We also present ASTNet, the first supervised learning model to utilize multi-headed attention to tackle monaural source separation. We show that the Wild-Mix Dataset is a challenging benchmark for evaluating model performance in complex real-world scenarios and that ASTNet achieves state-of-the-art performance on the Wild-Mix Dataset.
24 pages
Thesis Committee
Srinivasan Seshan, Head, Computer Science Department