Document Type
Thesis
First Faculty Advisor
TingTing Zhao
Second Faculty Advisor
Brian Blais
Keywords
knockoffs; significant gene expression; perturbation experiment; heterogeneity
Publisher
Bryant University
Rights Management
CC - BY
Abstract
Machine learning methods have been widely applied in genomics and bioinformatics. In particular, using novel machine learning algorithms to study gene-drug interactions has the potential to make a major positive impact on new drug discovery. Heterogeneity may exist within Vorinostat drug perturbation experiments because the perturbations can affect gene expression differently across experiments. The challenge is therefore to identify the most important genes in a high-dimensional setting after first identifying subpopulations, so that population heterogeneity is accounted for. In this work, clustering techniques are first applied to identify subpopulation structure in the gene expression changes across multiple Vorinostat perturbations. Next, statistical knockoffs are applied to identify important gene expression changes within each subpopulation with theoretically guaranteed control of the false discovery rate. Gaussian mixture knockoff generation is used to construct negative controls, identify important genes in each subpopulation within the Vorinostat family, and compare the results across subpopulations. This research has the potential to aid future novel drug discoveries and to enhance the potential of drug repurposing within the field of pharmacoeconomics. Identifying such gene-drug interactions can facilitate a better understanding of disease mechanisms and help identify new drug targets. The results support the hypothesis of heterogeneity, as two distinct clusters were discovered. Cluster zero appears to consist mostly of genes that had positive coefficients after interacting with the Vorinostat perturbations, indicating up-regulation of those genes. Cluster one consisted largely of genes that had negative coefficients after interacting with the Vorinostat treatment, indicating down-regulation of those genes.
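As a rough illustration of how a knockoff procedure can control the false discovery rate, the sketch below implements the standard knockoff+ selection rule in plain Python. It is a minimal sketch under stated assumptions, not the thesis's implementation: the abstract does not say which feature statistic or threshold rule was used, so the function name `knockoff_plus_select`, the statistics `W` (one per gene, where a large positive value means the real gene outscored its knockoff copy), and the target level `q` are all hypothetical.

```python
def knockoff_plus_select(W, q=0.1):
    """Return indices selected by the knockoff+ rule at target FDR level q.

    W is a hypothetical list of knockoff feature statistics, one per gene.
    Positive values favor the real gene; negative values favor its knockoff.
    """
    # Candidate thresholds are the distinct nonzero magnitudes, smallest first.
    candidates = sorted({abs(w) for w in W if w != 0})
    for t in candidates:
        n_neg = sum(1 for w in W if w <= -t)  # knockoffs that "won" at level t
        n_pos = sum(1 for w in W if w >= t)   # real genes that "won" at level t
        # Conservative estimate of the false discovery proportion at t.
        fdp_hat = (1 + n_neg) / max(1, n_pos)
        if fdp_hat <= q:
            # The first (smallest) threshold meeting the bound gives the selection.
            return [j for j, w in enumerate(W) if w >= t]
    # No threshold satisfies the bound, so nothing is selected.
    return []


# Illustrative, made-up statistics for eleven genes.
example_W = [10, 9, 8, 7, 6, 5, 4, 3, 2, 1, -0.5]
selected = knockoff_plus_select(example_W, q=0.25)
```

In a pipeline like the one described, a rule of this kind would be applied separately within each cluster, so that each subpopulation gets its own list of selected genes.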
Included in
Bioinformatics Commons, Data Science Commons, Genomics Commons
Comments
The dataset used is publicly available on the LINCS Portal. LINCS is a National Institutes of Health (NIH) Common Fund program made up of more than 15 institutions, including Harvard Medical School and Stanford. The chosen dataset contains observations from perturbation experiments on gene expression changes.