CMU-CS-02-189
Computer Science Department
School of Computer Science, Carnegie Mellon University




Compromising Privacy in Distributed Population-Based
Databases with Trail Matching: A DNA Example

Bradley Malin, Latanya Sweeney

December 2002



Keywords: Data privacy, anonymity, security, re-identification algorithms, databases


This paper is concerned with the privacy of person-specific data collected across multiple institutions. In particular, we focus on an example of person-specific DNA sequences collected and stored at various hospitals in a defined geographic region. The applications of human genetics and genomic analysis have generated much discussion of privacy and confidentiality as ethical, legal, and social issues. For the most part, previous analysis has concentrated on the direct application and disclosure of an individual's genetic information. Much less attention has been devoted to the computational challenges to privacy that arise in the secondary sharing of de-identified databases (i.e., databases released in a format devoid of directly identifying information, such as name, address, or phone number). We introduce methods for determining the re-identifiability of such DNA data and, in doing so, prove that removing identifying information from DNA data does not sufficiently protect the privacy of the entities from which the data was derived. We demonstrate, through several novel re-identification algorithms, that despite a lack of personal demographic information, such database entries can be re-identified through linkage to other publicly available databases, such as hospital discharge information. The linkage exploits hospital visit and data collection patterns, which we refer to as data trails and which are iteratively discovered from released data collections. Using real-world data, we are able to determine when identifiable linkages can occur for a substantial number of individuals with particular gene-based disorders. Furthermore, we provide empirical analysis of the re-identification algorithms with respect to population-institution visit distributions and data trails.
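
The idea of linking records through data trails can be illustrated with a minimal sketch. This is not the report's algorithm; it assumes the simplest possible setting, in which every hospital releases both a de-identified DNA list and an identified discharge list, and a link is declared only when exactly one DNA record and exactly one named patient share the same set of hospitals. The function names, hospital labels, and toy records below are all hypothetical.

```python
from collections import defaultdict


def build_trails(releases):
    """Map each record key to the set of institutions that released it.

    `releases` is a dict: institution name -> iterable of record keys.
    """
    trails = defaultdict(set)
    for institution, keys in releases.items():
        for key in keys:
            trails[key].add(institution)
    return {key: frozenset(visited) for key, visited in trails.items()}


def group_by_trail(trails):
    """Invert a key -> trail mapping into trail -> list of keys."""
    groups = defaultdict(list)
    for key, trail in trails.items():
        groups[trail].append(key)
    return groups


def match_unique_trails(dna_releases, identified_releases):
    """Link a DNA record to a name only when their shared trail is unique
    on both sides (an illustrative exact-match rule, assumed here)."""
    dna_groups = group_by_trail(build_trails(dna_releases))
    id_groups = group_by_trail(build_trails(identified_releases))
    links = {}
    for trail, dna_keys in dna_groups.items():
        names = id_groups.get(trail, [])
        if len(dna_keys) == 1 and len(names) == 1:
            links[dna_keys[0]] = names[0]
    return links


# Hypothetical toy releases from three hospitals.
dna_releases = {
    "H1": ["d1", "d2"],
    "H2": ["d1", "d3"],
    "H3": ["d3"],
}
identified_releases = {
    "H1": ["Alice", "Bob"],
    "H2": ["Alice", "Carol"],
    "H3": ["Carol"],
}

links = match_unique_trails(dna_releases, identified_releases)
```

In this toy data every trail is unique, so each DNA record is linked to one name even though no release contains demographic fields alongside the DNA. When two patients share the same trail, the rule above declines to link either of them.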

23 pages

