MACHINE LEARNING TECHNICAL REPORT ABSTRACTS

CMU-ML-10-107
Machine Learning Department
School of Computer Science, Carnegie Mellon University

Tractable Algorithms for Proximity Search on Large Graphs
Purnamrita Sarkar
July 2010
Ph.D. Thesis

CMU-ML-10-107.pdf

Keywords: Random walks, proximity measures, graphs, nearest neighbors, link prediction, algorithms

Identifying the nearest neighbors of a node in a graph is a key ingredient in a diverse set of ranking problems, e.g., friend suggestion in social networks, keyword search in databases, and web-spam detection. To find these "near" neighbors, we need graph-theoretic measures of similarity or proximity. Most popular graph-based similarity measures, e.g., the length of the shortest path or the number of common neighbors, look at the paths between two nodes in a graph. One such class of similarity measures arises from random walks. In the context of using these measures, we identify and address two important problems. First, we note that, while random-walk-based measures are useful, they are often hard to compute. Hence we focus on designing tractable algorithms for faster and better ranking using random-walk-based proximity measures in large graphs. Second, we theoretically justify why path-based similarity measures work so well in practice.

For the first problem, we focus on improving the quality and speed of nearest neighbor search in real-world graphs. This work consists of three main components. First, we present an algorithmic framework for computing nearest neighbors in truncated hitting and commute times, which are proximity measures based on short-term random walks. Second, we improve upon this ranking by incorporating user feedback, which can counteract ambiguities in queries and data. Third, we address the problem of nearest neighbor search when the underlying graph is too large to fit in main memory.
We also prove a number of interesting theoretical properties of these measures, which have been key to designing most of the algorithms in this thesis.

We address the second problem by bringing together a well-known generative model for link formation and geometric intuitions. As a measure of the quality of ranking, we examine link prediction, which has been the primary tool for evaluating the algorithms in this thesis. Link prediction has been extensively studied in prior empirical surveys. Our work helps us better understand some common trends in the predictive performance of different measures seen across these empirical results.

193 pages
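The truncated hitting time named in the abstract can be illustrated with a small sketch. The abstract itself gives no formula, so the code below assumes a commonly used recursive definition: the T-truncated hitting time from node i to node j is 0 when i = j or T = 0, and otherwise 1 plus the average of the (T-1)-truncated hitting times from the neighbors of i. The function name `truncated_hitting_times`, the adjacency-dictionary input, and the unweighted graph are illustrative assumptions, not the thesis's algorithm, which targets far larger graphs than this brute-force version could handle.

```python
def truncated_hitting_times(adj, target, horizon):
    """Truncated hitting time from every node to `target`.

    adj:     dict mapping each node to a list of its neighbors
             (assumed unweighted; every node needs at least one neighbor).
    target:  the node the random walk is trying to reach.
    horizon: maximum walk length T; paths longer than T are cut off.
    """
    # With a horizon of 0 the hitting time is 0 for every node.
    h = {node: 0.0 for node in adj}
    for _ in range(horizon):
        nxt = {}
        for node, nbrs in adj.items():
            if node == target:
                # A walk that starts at the target has already arrived.
                nxt[node] = 0.0
            else:
                # Take one step, then average over a uniformly random neighbor.
                nxt[node] = 1.0 + sum(h[k] for k in nbrs) / len(nbrs)
        h = nxt
    return h


# Hypothetical toy graph: a path 0 - 1 - 2.
path_graph = {0: [1], 1: [0, 2], 2: [1]}
times = truncated_hitting_times(path_graph, target=2, horizon=2)

# Rank the other nodes by proximity to the target (smaller is closer).
ranking = sorted((n for n in times if n != 2), key=lambda n: times[n])
```

Under this definition no value can exceed the horizon T, so nodes that cannot reach the target within T steps all receive similar scores near T. A truncated commute time can be obtained in the same way by adding the hitting times in both directions.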
