CMU-CB-12-103
Lane Center for Computational Biology
School of Computer Science, Carnegie Mellon University




On the Identification and Investigation of
Homologous Gene Families, with Particular Emphasis on
the Accuracy of Multidomain Families

Jacob M. Joseph

August 2012

Ph.D. Thesis



Keywords: Genomics, gene family, homology, gene duplication, multidomain, network rewiring, neighborhood correlation, homology network, domain mutual information, gene family classification

This dissertation addresses the identification and characterization of homologous gene families in large-scale genomic data. Particular emphasis is placed on multidomain gene families: multidomain sequences represent at least half of the sequence universe, yet they present an especially challenging case for family classification. These sequences are often excluded from analyses because they tend to interfere with classification performed by existing methods. This thesis develops the theoretical context for family classification of datasets that contain multidomain sequences, and demonstrates the implementation necessary for performing classification on large datasets.

Five primary results are presented in this work. First, a definition of homology is formulated that encompasses the evolutionary scenarios giving rise to multidomain families. Second, the techniques and implementation of family classification are presented. The methodology takes protein sequence data as input and, by explicitly considering the evolutionary signal of gene duplication inherent in a sequence similarity network, derives a network that is an accurate estimate of homology. Third, the structure of this network is examined and compared to the theoretical construct of a network of homology. Fourth, an approach for predicting families from this network is developed. Importantly, a statistical framework is presented for evaluating the result using a limited set of curated families. Finally, the interplay between domains and the clustering result is examined using an information-theoretic approach.

218 pages


