CMU-CS-06-126
Computer Science Department
School of Computer Science, Carnegie Mellon University
Advanced Tools for Video and Multimedia Mining
Jia-Yu Pan
May 2006
Ph.D. Thesis
CMU-CS-06-126.ps.gz
How do we automatically find patterns and mine data in large multimedia databases, to make these databases useful and accessible? We focus on two problems: (1) mining "uni-modal patterns" that summarize the characteristics of a single data modality, and (2) mining "cross-modal correlations" among multiple modalities. Uni-modal patterns such as "news videos have static scenes and speech-like sounds" and cross-modal correlations like "the blue region at the upper part of a natural scene image is likely to be the `sky'" provide insight into multimedia content and have many applications.

For uni-modal pattern discovery, we propose the method "AutoSplit." AutoSplit provides a framework for mining meaningful "independent components" in multimedia data, and finds patterns in a wide variety of data modalities (e.g., video, audio, text, and time sequences). For example, in video clips, AutoSplit finds characteristic visual/auditory patterns and classifies news and commercial clips with 81% accuracy. In time sequences such as stock prices, AutoSplit finds hidden variables like "general growth trend" and "Internet bubble," and detects outliers (e.g., lackluster stocks). Based on AutoSplit, we design ViVo, a system for mining biomedical images. ViVo automatically constructs a biologically meaningful visual vocabulary and classifies 9 biological conditions with 84% accuracy. Moreover, ViVo supports data mining tasks for biomedical research, such as highlighting biologically interesting image regions.

For cross-modal correlation discovery, we propose "MAGIC," a graph-based framework for multimedia correlation mining. Applied to news video databases, MAGIC identifies relevant video shots and transcript words for event summarization. On the task of automatic image captioning, MAGIC achieves a relative improvement of 58% in captioning accuracy over recent machine learning techniques.

212 pages
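The thesis gives AutoSplit's exact formulation; as a rough illustration of the general idea of mining "independent components" from time sequences, the following minimal sketch applies off-the-shelf FastICA (scikit-learn) to toy mixtures of two hypothetical hidden variables, a slow growth trend and a bubble-like burst. The data, variable names, and use of FastICA are assumptions for illustration only, not the thesis's actual method.

    # Minimal sketch of independent-component mining in the spirit of
    # AutoSplit; uses off-the-shelf FastICA on synthetic data, NOT the
    # thesis's actual algorithm or data.
    import numpy as np
    from sklearn.decomposition import FastICA

    rng = np.random.default_rng(0)

    # Toy "stock price" sequences: linear mixtures of two hypothetical
    # hidden variables, a general growth trend and a bubble-like burst.
    t = np.linspace(0.0, 1.0, 500)
    trend = t                                    # slow general growth
    bubble = np.exp(-((t - 0.6) ** 2) / 0.005)   # short-lived spike
    sources = np.c_[trend, bubble]               # shape (500, 2)
    mixing = rng.normal(size=(2, 10))            # 10 observed sequences
    observed = sources @ mixing + 0.01 * rng.normal(size=(500, 10))

    # Recover candidate hidden components; each column of `hidden`
    # is one estimated hidden variable behind the observed sequences.
    ica = FastICA(n_components=2, random_state=0)
    hidden = ica.fit_transform(observed)         # shape (500, 2)
    print(hidden.shape)

On data like this, the recovered components (up to sign and scale, which ICA cannot fix) should resemble the trend and the bubble, which is the kind of hidden-variable discovery the abstract describes.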
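The abstract describes MAGIC only as graph-based. One common way to score cross-modal correlations on a mixed-media graph is a random walk with restarts (RWR); the sketch below builds a toy graph of image, region, and word nodes and ranks caption words for a region. The graph, node names, and restart parameter are all hypothetical and stand in for, rather than reproduce, MAGIC's actual construction.

    # Hedged sketch of graph-based cross-modal scoring via random walk
    # with restarts (RWR); the node set and edges are toy assumptions,
    # not MAGIC's actual graph.
    import numpy as np

    # Toy mixed-media graph: image nodes, region nodes, word nodes.
    nodes = ["img1", "img2", "region_blue", "region_grass", "sky", "grass"]
    edges = [("img1", "region_blue"), ("img1", "sky"),
             ("img2", "region_blue"), ("img2", "region_grass"),
             ("img2", "grass")]

    idx = {n: i for i, n in enumerate(nodes)}
    A = np.zeros((len(nodes), len(nodes)))
    for u, v in edges:
        A[idx[u], idx[v]] = A[idx[v], idx[u]] = 1.0

    # Column-normalize the adjacency to get transition probabilities.
    W = A / A.sum(axis=0, keepdims=True)

    def rwr(start, restart=0.35, iters=100):
        # Iterate p <- (1-c) W p + c e until (approximately) steady state.
        e = np.zeros(len(nodes))
        e[idx[start]] = 1.0
        p = e.copy()
        for _ in range(iters):
            p = (1 - restart) * W @ p + restart * e
        return p

    # Words with high RWR score relative to a query region are
    # candidate caption terms for images containing that region.
    scores = rwr("region_blue")
    for word in ("sky", "grass"):
        print(word, round(scores[idx[word]], 3))

The appeal of such a graph formulation, and plausibly of MAGIC's, is that any mix of modalities can be scored the same way: everything becomes nodes and edges, and correlation strength falls out of walk proximity.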