COMPUTER SCIENCE TECHNICAL REPORT ABSTRACTS

CMU-CS-02-116
Computer Science Department
School of Computer Science, Carnegie Mellon University

Using Tarjan's Red Rule for Fast Dependency Tree Construction

Dan Pelleg, Andrew Moore

February 2002

Keywords: Machine learning, Bayes' networks, dependency trees, Hoeffding races, scalable data-mining

We focus on the problem of efficiently learning dependency trees. It is well known that, given the pairwise mutual information coefficients, a minimum-weight spanning tree algorithm solves this problem exactly and in polynomial time. For large data sets, however, it is the construction of the correlation matrix that dominates the running time. We have developed a new spanning-tree algorithm that can exploit partial knowledge about edge weights. The partial knowledge we maintain is a probabilistic confidence interval on each coefficient, which we derive by examining just a small sample of the data. The algorithm can flag the need to shrink an interval, which translates to inspecting more data for that particular attribute pair. Experimental results show a significant improvement in running time, without loss in accuracy of the generated trees. Interestingly, our spanning-tree algorithm is based solely on Tarjan's red-edge rule, which is generally considered a guaranteed recipe for bad performance.
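To make the idea concrete, here is a minimal Python sketch of a spanning-tree construction that uses only the red rule (on any cycle, a heaviest edge can be discarded) with interval-valued edge weights. It is an illustration under stated assumptions, not the report's implementation: the names `tree_path`, `provably_heaviest`, `red_rule_tree`, and the `refine` callback are hypothetical, the graph is assumed to be simple (no self-loops or parallel edges), and `refine` is assumed to return ever narrower intervals that eventually collapse to a point, standing in for "inspect more data for this attribute pair".

```python
from collections import defaultdict

def tree_path(adj, src, dst):
    """Return the edges on the src-dst path in the forest `adj`, or None."""
    parent = {src: None}
    stack = [src]
    while stack:
        node = stack.pop()
        for nb in adj[node]:
            if nb not in parent:
                parent[nb] = node
                stack.append(nb)
    if dst not in parent:
        return None
    path = []
    while parent[dst] is not None:
        path.append(frozenset((parent[dst], dst)))
        dst = parent[dst]
    return path

def provably_heaviest(cycle, bounds):
    """Red rule under uncertainty: an edge whose lower bound is at least
    every other upper bound on the cycle is a heaviest edge of that cycle."""
    for e in cycle:
        if all(bounds[e][0] >= bounds[f][1] for f in cycle if f != e):
            return e
    return None

def red_rule_tree(bounds, refine):
    """bounds: {frozenset({u, v}): (lo, hi)} interval on each edge weight.
    refine(e): hypothetical callback returning a narrower (lo, hi) for e.
    For dependency trees the weights could be negated mutual-information
    estimates (an assumption of this sketch)."""
    adj = defaultdict(set)  # current forest
    for e in list(bounds):
        u, v = tuple(e)
        cycle = tree_path(adj, u, v)
        if cycle is not None:
            cycle.append(e)  # the new edge closes a cycle
            red = provably_heaviest(cycle, bounds)
            while red is None:
                # Intervals overlap: shrink the widest one and retry.
                widest = max(cycle, key=lambda f: bounds[f][1] - bounds[f][0])
                bounds[widest] = refine(widest)
                red = provably_heaviest(cycle, bounds)
            if red == e:
                continue  # the new edge itself is red; drop it
            a, b = tuple(red)
            adj[a].discard(b)
            adj[b].discard(a)
        adj[u].add(v)
        adj[v].add(u)
    return {frozenset((a, b)) for a in adj for b in adj[a]}
```

In this sketch, `refine` is called only for edges on a cycle whose intervals still overlap, so an edge whose interval is already separated from its neighbours is never examined more closely.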

10 pages
