CMU-CS-00-171
Computer Science Department
School of Computer Science, Carnegie Mellon University
Improving Trigram Language Modeling with the World Wide Web
Xiaojin Zhu, Ronald Rosenfeld
November 2000
CMU-CS-00-171.ps
CMU-CS-00-171.pdf
Keywords: Language models, speech recognition and synthesis,
Web-based services
We propose a novel method for using the World Wide Web to acquire
trigram estimates for statistical language modeling. We submit
an N-gram as a phrase query to web search engines. The search
engines return the number of web pages containing the phrase,
from which the N-gram count is estimated. The N-gram counts are
then used to form web-based trigram probability estimates. We
discuss the properties of such estimates, and methods to
interpolate them with traditional corpus-based trigram estimates.
We show that the interpolated models significantly reduce speech
recognition word error rate on a small test set.
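
The pipeline described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the `page_count` lookup is a hypothetical stand-in for a search-engine phrase query (backed here by a toy dictionary, not a real API), page counts are used directly as N-gram counts, the cap at 1.0 is an added safeguard, and simple linear interpolation with a fixed weight `lam` is assumed as one possible way to combine the two estimates.

```python
from typing import Callable, Optional


def web_trigram_prob(w1: str, w2: str, w3: str,
                     page_count: Callable[[str], int]) -> Optional[float]:
    """Relative-frequency estimate of P(w3 | w1 w2) from phrase hit counts.

    `page_count` is a hypothetical function returning the number of web
    pages that contain the given phrase.
    """
    history_hits = page_count(f"{w1} {w2}")
    if history_hits == 0:
        return None  # no web evidence for this history
    trigram_hits = page_count(f"{w1} {w2} {w3}")
    # Assumption: cap at 1.0, since reported page counts may be inconsistent.
    return min(1.0, trigram_hits / history_hits)


def interpolate(p_web: Optional[float], p_corpus: float,
                lam: float = 0.5) -> float:
    """Linear interpolation of web and corpus estimates (assumed scheme)."""
    if p_web is None:
        return p_corpus  # fall back to the corpus-based estimate
    return lam * p_web + (1.0 - lam) * p_corpus


# Toy hit counts standing in for search-engine results (made-up numbers).
toy_hits = {"new york": 1000, "new york city": 250}


def page_count(phrase: str) -> int:
    return toy_hits.get(phrase, 0)


p_web = web_trigram_prob("new", "york", "city", page_count)
p = interpolate(p_web, p_corpus=0.15, lam=0.5)
```

In practice the interpolation weight would be tuned on held-out data, and the corpus-based estimate would come from a smoothed trigram model rather than a constant.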
17 pages