CMU-CS-19-101
Computer Science Department
School of Computer Science, Carnegie Mellon University



CMU-CS-19-101

Mining Large Dynamic Graphs and Tensors

Kijung Shin

Ph.D. Thesis

February 2019

CMU-CS-19-101.pdf


Keywords: Data mining, graph mining, tensor mining, stream mining, graph stream, edge stream, tensor stream, streaming algorithms, approximation algorithms, distributed algorithms, out-of-core algorithms, external-memory algorithms, distributed streaming algorithms, MapReduce, Hadoop, triangle counting, graph summarization, graph compression, tensor decomposition, tucker decomposition, anomaly detection, fraud detection, k-cores, degeneracy, dense subtensor detection, behavior modeling, network game, sharable good game, progressive stages, WRS, TRI-FLY, COCOS, THINKD, SWEG, S-HOT, CORESCOPE, M-ZOOM, D-CUBE, DENSESTREAM, DENSEALERT, SGG, SGG-AC, SGGNASH, SWATT, SWATTFIT

Graphs are ubiquitous, representing a variety of information, ranging from who follows whom on online social networks to who reviews what on e-commerce sites. Many of these graphs are large (e.g., online social networks with over two billion active users) and dynamic (i.e., nodes and edges can be added and removed over time). Moreover, they are with rich side information (e.g., e-commerce reviews with timestamps, ratings, and text) and thus naturally modeled as tensors (i.e., multi-dimensional arrays).

Given large dynamic graphs and tensors, how can we analyze their structure? How can we detect interesting anomalies? Lastly, how can we model behaviors of individuals in the data? My thesis focuses on these closely related questions, all of which are fundamental to understand massive evolving data on user behavior. That is, we develop scalable algorithms for mining large dynamic graphs and tensors, with a focus on three tasks:

  1. Structure Analysis: We build one-pass, sublinear-space algorithms that incrementally estimate the triangle count, which is an important connectivity measure, in large dynamic graphs. In particular, our distributed algorithm yields up to 39 × more accurate estimates faster than a baseline. We also develop distributed and out-of-core algorithms for succinctly but accurately summarizing the structure of large graphs and tensors. They summarize over 25 × larger data without quality loss than their best competitors.

  2. Anomaly Detection: We develop near-linear time approximation algorithms for detecting unusually dense subgraphs and subtensors, which signal notable anomalies such as "edit wars" on Wikipedia and fake followers on Twitter. Especially, our tensor algorithm is up to 114 × faster without accuracy loss than the previously best heuristic. We also extend it for distributed or dynamic data with the same approximation guarantee.

  3. Behavior Modeling: We design game-theoretic models for purchases of individuals in social networks and a fast algorithm for finding Nash equilibria of the models. In addition, we develop a stage model for the progression of individuals and a distributed optimization algorithm for fitting the model to behavior logs with trillions of records. Using our tools, we measure social inefficiency regarding purchase of sharable goods and discover progression patterns of LinkeIn users.

To achieve the highest performance and scalability, our algorithms for the above tasks employ mathematical techniques (e.g., approximation and sampling), use distributed computing frameworks (e.g., MAPREDUCE and MPI), and/or exploit pervasive patterns in realworld data (e.g., power-law degree distribution). We successfully apply them to massive datasets, including 20:6 billion social connections on LinkedIn, 1:47 billion follow relations on Twitter, 783 million hyperlinks between web pages, 483 million edits on Wikipedia, and a synthetic tensor with 1 trillion non-zero entries

302 pages

Thesis Committee:
Christos Faloutsos (Chair)
Tom M. Mitchell
Memon Akoglu
Philip S. Yu (University of Illinois at Chicago)

Srinivasan Seshan, Head, Computer Science Department
Tom M. Mitchell, Interim Dean, School of Computer Science


Return to: SCS Technical Report Collection
School of Computer Science

This page maintained by [email protected]