CMU-ISR-17-118 Institute for Software Research School of Computer Science, Carnegie Mellon University
Towards Automatic Classification of Privacy Policy Text Fredrick Liu, Shomir Wilson, Peter Story, Sebastian Zimmeck, Norman Sadeh December 2017
Superseded by Institute for Software Research
Privacy policies notify Internet users about the privacy practices of websites, mobile apps, and other products and services. However, users rarely read them and struggle to understand their contents. Moreover, the entities that provide these policies are sometimes unmotivated to make them comprehensible. Recently, annotated corpora of privacy policies have been introduced to the research community. They open the door to the development of machine learning and natural language processing techniques to automate the annotation of these documents. In turn, these annotations can be passed on to interfaces (e.g., web browser plugins) that help users quickly identify and understand relevant privacy statements. We present advances in extracting privacy policy paragraphs (termed segments in this paper) and individual sentences that relate to expert-identified categories of policy contents, using supervised learning methods. In particular, we show that relevant segments and sentences can be classified with average micro-F1 scores of 0.79 and 0.70, respectively, improving over prior work. We discuss how the techniques introduced in this paper have been used to automatically annotate the text of about 7,000 privacy policies. Our discussion highlights opportunities as well as limitations associated with our classification approach.
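For readers unfamiliar with the metric quoted in the abstract, the following is a minimal sketch of how a micro-averaged F1 score can be computed for multi-label segment classification. It is an illustration only, not the authors' evaluation code: the function name `micro_f1`, the category labels, and the toy data are all hypothetical, and the abstract does not say how the reported scores were averaged.

```python
def micro_f1(gold, predicted):
    """Micro-averaged F1 over multi-label predictions.

    gold, predicted: lists of sets of category labels, one set per segment.
    Counts are pooled over all segments and labels before computing
    precision and recall, which is what "micro" averaging means.
    """
    tp = fp = fn = 0
    for g, p in zip(gold, predicted):
        tp += len(g & p)   # labels predicted and present in gold
        fp += len(p - g)   # labels predicted but absent from gold
        fn += len(g - p)   # gold labels that were missed
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Hypothetical toy data: three segments with illustrative category names.
gold = [{"collection"}, {"retention", "security"}, {"sharing"}]
predicted = [{"collection"}, {"security"}, {"sharing", "retention"}]

score = micro_f1(gold, predicted)
print(round(score, 2))  # 0.75
```

In this toy example there are 3 true positives, 1 false positive, and 1 false negative, so precision and recall are both 0.75 and the micro-F1 is 0.75.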
11 pages