CMU-CS-16-130 Computer Science Department School of Computer Science, Carnegie Mellon University
Minimizing Queries for Active Labeling with Sequential Analysis Jack Paparian October 2016 M.S. Thesis
When building datasets for supervised machine learning problems, data is often labelled manually by human annotators. In domains like medical imaging, acquiring labels can be prohibitively expensive. Both active learning and crowdsourcing have emerged as ways to frugally label datasets. In active learning, there has been recent interest in algorithms that exploit the data's structure to direct querying. learning from crowds, one must balance the accuracy and cost of different oracles when gathering labels; weak oracles are assumed to be most accurate when labeling samples from label-homogeneous regions of space. In this thesis, we explore how the data's structure can be leveraged for both of these techniques. The sequential probability ratio test (SPRT) provides the backbone for our contributions. Using the SPRT, we provide a cluster-based active learning algorithm to find a small, homogeneous partitioning of the data. We also use the SPRT to measure the confidence of a weak oracle's label by analyzing its estimates on neighboring labels. The optimality of the SPRT allows the algorithms to inherently minimize the average number of queries required before their termination.
56 pages
Frank Pfenning, Head, Computer Science Department
| |
Return to:
SCS Technical Report Collection |