CMU-ML-09-101
Machine Learning Department
School of Computer Science, Carnegie Mellon University



Detecting Patterns of Anomalies

Kaustav Das

March 2009

Ph.D. Thesis


Keywords: Machine learning, anomaly detection, pattern detection, Bayesian network, biosurveillance

An anomaly is an observation that does not conform to the expected normal behavior. With the ever-increasing amount of data being collected universally, automatic surveillance systems are becoming more popular and are increasingly using data mining methods to detect patterns of anomalies. Detecting anomalies can provide useful and actionable information in a variety of real-world scenarios. For example, in disease monitoring, timely detection of an epidemic can potentially save many lives.

The diverse nature of real-world datasets and the difficulty of obtaining labeled training data make it challenging to develop a universal framework for anomaly detection. We focus on a key feature of most real-world scenarios: multiple anomalous records are usually generated by a common anomalous process. In this thesis we develop methods that exploit the similarity between records in these groups, or patterns, of anomalies to improve detection. We also investigate new methods for detecting individual record anomalies, which we then incorporate into the group detection methods. A recurring feature of our methods is combinatorial search over some space (e.g., over all subsets of attributes or over all subsets of records). We use a variety of computational speedups and approximation techniques to make these methods scalable to large datasets. Since most of our motivating problems involve datasets with categorical or symbolic values, we focus on categorical-valued datasets. Apart from this, we make few assumptions about the data, so our methods are general and applicable to a wide variety of domains.

Additionally, we investigate anomaly pattern detection in data structured by space and time. Our method generalizes the popular method of spatiotemporal scan statistics to learn and detect specific, time-varying spatial patterns in the data. Finally, we present an efficient and easily interpretable technique for anomaly detection in multivariate time series data. We evaluate our methods on a variety of real-world data sets including both real and synthetic anomalies.

174 pages

