CSGSA

GRaDS talk: A machine learning approach to identify novel antifungal targets in Candida albicans

by Xiang Zhang on 2021-10-22

Abstract

Candida albicans is an opportunistic fungal pathogen that can cause deadly infections in humans. Understanding which genes are essential for the growth of this organism would provide opportunities for developing more effective therapeutics. Compared with the model yeast Saccharomyces cerevisiae, constructing mutants in C. albicans is considerably more laborious. To prioritize efforts for mutant construction and identification of essential genes, we built a random forest-based machine learning model, using a set of 2,327 previously constructed C. albicans GRACE (gene replacement and conditional expression) strains as the basis for training. We identified several relevant features contributing unique information to the predictions. Through cross-validation analysis on our random forest model, we estimated an AUC of 0.92 and an average precision of 0.77. Given these strong results, we prioritized the construction of an additional set of >800 strains and discovered essential genes at a rate of ~64% among these new predictions, relative to an expected background rate of essentiality of ~20%. Our machine learning approach is an effective strategy for the efficient discovery of essential genes, and a similar approach may also be useful in other species.
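For readers unfamiliar with the evaluation described above, the following is a minimal sketch of cross-validating a random forest classifier and reporting AUC and average precision. It is not the speaker's code: the data are synthetic (generated to mimic 2,327 labeled genes with roughly 20% positives), and the feature count, number of trees, fold count, and the `top_k` cutoff are all assumptions chosen for illustration, using scikit-learn.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import average_precision_score, roc_auc_score
from sklearn.model_selection import StratifiedKFold, cross_val_predict

# Synthetic stand-in for the real data (assumption): 2,327 "genes",
# 20 made-up features, roughly 20% labeled essential (class 1).
X, y = make_classification(
    n_samples=2327,
    n_features=20,
    n_informative=8,
    weights=[0.8, 0.2],
    random_state=0,
)

# Hyperparameters here are illustrative, not those used in the talk.
model = RandomForestClassifier(n_estimators=100, random_state=0)
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)

# Out-of-fold predicted probability of the positive class for every sample.
scores = cross_val_predict(model, X, y, cv=cv, method="predict_proba")[:, 1]

auc = roc_auc_score(y, scores)
ap = average_precision_score(y, scores)

# Compare the positive rate among the top-ranked samples with the
# overall background rate, analogous to prioritizing strains to build.
top_k = 100  # hypothetical cutoff
top_idx = np.argsort(scores)[::-1][:top_k]
hit_rate = y[top_idx].mean()
background = y.mean()

print(f"AUC: {auc:.2f}, average precision: {ap:.2f}")
print(f"Top-{top_k} hit rate: {hit_rate:.2f} vs background: {background:.2f}")
```

The numbers printed by this sketch reflect only the synthetic data and say nothing about the results reported in the talk.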

About the speaker

Xiang Zhang is a second-year Ph.D. student in the Department of Computer Science & Engineering, advised by Dr. Chad Myers. His research focuses on statistical and machine learning approaches for integrating diverse genomic data to address biological questions, such as making inferences about biological networks and predicting gene functions. He also loves classical music and plays the violin.