CMU-CS-02-189
Computer Science Department
School of Computer Science, Carnegie Mellon University
Compromising Privacy in Distributed Population-Based
Databases with Trail Matching: A DNA Example
Bradley Malin, Latanya Sweeney
December 2002
CMU-CS-02-189.ps
CMU-CS-02-189.pdf
Keywords: Data privacy, anonymity, security, re-identification
algorithms, databases
This paper is concerned with the privacy of person-specific data
collected over multiple institutions. In particular, we focus on an
example of person-specific DNA sequences collected and stored at
various hospitals in a defined geographic region. The applications of
human genetics and genomic analysis have generated much discussion with
respect to privacy and confidentiality in ethical, legal, and social issues.
For the most part, previous analysis has concentrated on the direct
application and disclosure of an individual's genetic information;
however, much less attention has been devoted to the
computational challenges to privacy in the secondary sharing of
de-identified databases (i.e., released in a format devoid of directly
identifying information, such as name, address, or phone number). We
introduce methods for determining the re-identifiability of such DNA
data and, in doing so, prove that the removal of
identifying information from DNA data does not sufficiently protect the
privacy of the entities from which the data were derived. We
demonstrate, through several novel re-identification algorithms, that
despite a lack of personal demographic information, such database
entries can be re-identified through linkage to other publicly
available databases, such as hospital discharge information. The linkage
relies on hospital visit and data collection patterns, which we
refer to as data trails and which are iteratively discovered from
released data collections. Using real-world data, we are able to
determine when identifiable linkages can occur for a substantial
number of individuals with particular gene-based disorders. Furthermore,
we provide empirical analysis of the re-identification algorithms with
respect to population-institution visit distributions and data trails.
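
To make the notion of a data trail concrete, the following is a minimal
sketch of exact trail matching. It is a simplified illustration and not
one of the algorithms from the report. It assumes that each hospital
releases two collections, one of de-identified DNA records and one of
identified discharge records, and that a record's trail is the set of
hospitals at which it appears. The function names, hospital labels, and
sample data are all hypothetical.

```python
from collections import defaultdict


def build_trails(releases):
    """Map each record key to the frozenset of institutions that released it.

    `releases` maps an institution name to the keys it released
    (hypothetical input format chosen for this sketch).
    """
    trails = defaultdict(set)
    for institution, keys in releases.items():
        for key in keys:
            trails[key].add(institution)
    return {key: frozenset(places) for key, places in trails.items()}


def group_by_trail(trails):
    """Invert a key -> trail mapping into trail -> list of keys."""
    groups = defaultdict(list)
    for key, trail in trails.items():
        groups[trail].append(key)
    return groups


def match_unique_trails(dna_releases, identified_releases):
    """Link a DNA record to a name when both have the same trail and
    that trail is unique within each collection."""
    dna_groups = group_by_trail(build_trails(dna_releases))
    id_groups = group_by_trail(build_trails(identified_releases))

    links = {}
    for trail, dna_keys in dna_groups.items():
        names = id_groups.get(trail, [])
        # Only an unambiguous one-to-one match counts as a re-identification.
        if len(dna_keys) == 1 and len(names) == 1:
            links[dna_keys[0]] = names[0]
    return links


# Hypothetical releases: three hospitals, four patients.
dna_releases = {
    "H1": ["seqA", "seqB", "seqE"],
    "H2": ["seqA", "seqC"],
    "H3": ["seqC"],
}
identified_releases = {
    "H1": ["Alice", "Bob", "Eve"],
    "H2": ["Alice", "Carol"],
    "H3": ["Carol"],
}

links = match_unique_trails(dna_releases, identified_releases)
print(links)
```

In this toy data, seqA and seqC have trails that no other record shares,
so each links to a single name, while seqB and seqE share the trail
{H1} with two patients and remain ambiguous. The sketch assumes every
hospital releases both collections for every patient; incomplete or
unequal releases would require a more tolerant matching rule.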
23 pages