CMU-ISR-17-109
Institute for Software Research
School of Computer Science, Carnegie Mellon University




The Utility of Corpora Comparison
for Generating Delete Lists

Geoffrey P. Morgan

July 2017

Center for the Computational Analysis of Social and Organizational Systems
CASOS Technical Report

CMU-ISR-17-109.pdf

Keywords: Text Analysis, Corpus Comparison, Delete Lists, TF-IDF

Delete lists are lists of words that have been determined to have little useful meaning for textual analysis. One subset of words that is frequently deleted is stop words. Stop words are textual tokens, such as "and", "a", or "the", that play a structural or grammatical role in a sentence but carry little inherent meaning of their own. Identifying stop words is a routine step in most text-cleaning applications, but it is frequently done via user-maintained word lists. I suggest that the corpora comparison technique I devised for word-score polarization can be used to identify low-value words while preserving the bulk of the text tokens. I use both known and random-draw corpora comparisons for this process. "Known" corpora are drawn from explicit data sources, for example the emails of one company and the emails of another. "Random-draw" corpora are created by drawing document sets at random, so this technique could be applied to any sufficiently large text corpus of interest. I use the ability to identify stop words as a proxy for performance in generating useful delete lists. Both the random-draw and known corpora comparison techniques outperform an iteration of TF-IDF (Term Frequency-Inverse Document Frequency), which performs quite poorly on this email data.
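
As a rough illustration of the random-draw idea described above, the following Python sketch repeatedly splits a document set into two random halves, scores how strongly each token leans toward one half, and proposes for deletion the tokens that never lean either way. The scoring formula, the thresholds (`max_abs_score`, `min_share`), the whitespace tokenizer, and all function names are assumptions made for this example. They are not taken from the report, whose actual word-score polarization method may differ.

```python
import random
from collections import Counter


def tokenize(doc):
    # Naive whitespace tokenizer; an assumption, not the report's tokenizer.
    return doc.lower().split()


def relative_freqs(docs):
    # Share of all tokens in `docs` accounted for by each token.
    counts = Counter(tok for doc in docs for tok in tokenize(doc))
    total = sum(counts.values())
    return {tok: n / total for tok, n in counts.items()}


def polarization(freq_a, freq_b):
    # Hypothetical score in [-1, 1]; 0 means equally common in both corpora.
    scores = {}
    for tok in set(freq_a) | set(freq_b):
        a, b = freq_a.get(tok, 0.0), freq_b.get(tok, 0.0)
        scores[tok] = (a - b) / (a + b)
    return scores


def random_draw_delete_list(docs, n_draws=20, max_abs_score=0.2,
                            min_share=0.01, seed=0):
    # Keep tokens that are frequent overall and unpolarized in every draw.
    rng = random.Random(seed)
    overall = relative_freqs(docs)
    votes = Counter()
    for _ in range(n_draws):
        shuffled = list(docs)
        rng.shuffle(shuffled)
        half = len(shuffled) // 2
        scores = polarization(relative_freqs(shuffled[:half]),
                              relative_freqs(shuffled[half:]))
        for tok, score in scores.items():
            if abs(score) <= max_abs_score and overall[tok] >= min_share:
                votes[tok] += 1
    return sorted(tok for tok, n in votes.items() if n == n_draws)


docs = [
    "the budget and the forecast",
    "the meeting and the agenda",
    "the server and the outage",
    "the contract and the lawyer",
]
print(random_draw_delete_list(docs))  # prints ['and', 'the']
```

In this toy corpus, "the" and "and" occur equally often in every document, so they score zero in every random split, while each content word occurs in only one document and is always fully polarized. A known-corpora comparison could reuse `polarization` directly by passing the relative frequencies of two explicitly sourced corpora instead of two random halves.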

16 pages

