How Google is Improving Genomic Discovery With Machine Learning

Updated: Jul 1, 2021

Each person’s genome, which collectively encodes the biochemical machinery they are born with, is composed of over 3 billion letters of DNA. However, only a small subset of the genome (~4-5 million positions) varies between two people. Nonetheless, each person’s unique genome interacts with the environment they experience to determine the majority of their health outcomes. A key method of understanding the relationship between genetic variants and traits is a genome-wide association study (GWAS), in which each genetic variant present in a cohort is individually examined for correlation with the trait of interest. GWAS results can be used to identify and prioritize potential therapeutic targets by identifying genes that are strongly associated with a disease of interest, and can also be used to build a polygenic risk score (PRS) to predict disease predisposition based on the combined influence of variants present in an individual. However, while accurate measurement of traits in an individual (called phenotyping) is essential to GWAS, it often requires painstaking expert curation and/or subjective judgment calls.
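The per-variant scan at the heart of a GWAS can be sketched in a few lines. This is a minimal illustration, not the pipeline used in the study: it assumes genotypes are coded as allele counts (0, 1, or 2) and uses a plain Pearson correlation. Real analyses typically fit regression models with covariates and apply a multiple-testing threshold. The variant IDs, the trait values, and the function names `pearson_r` and `association_scan` are all made up for the example.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation between two equal-length numeric sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0  # no variation means no measurable association
    return cov / math.sqrt(var_x * var_y)


def association_scan(genotypes, trait):
    """Examine each variant on its own for correlation with the trait.

    genotypes: dict mapping a variant ID to a list of allele counts (0, 1, 2),
               one per individual, in the same order as `trait`.
    Returns (variant_id, r) pairs, strongest absolute correlation first.
    """
    results = [(vid, pearson_r(counts, trait)) for vid, counts in genotypes.items()]
    return sorted(results, key=lambda item: abs(item[1]), reverse=True)


# Toy cohort of six individuals (hypothetical values).
trait = [0.30, 0.35, 0.42, 0.48, 0.55, 0.61]
genotypes = {
    "variant_A": [0, 0, 1, 1, 2, 2],  # tracks the trait closely
    "variant_B": [1, 0, 2, 0, 1, 2],  # weaker relationship
    "variant_C": [1, 1, 1, 1, 1, 1],  # no variation in this cohort
}

for variant_id, r in association_scan(genotypes, trait):
    print(f"{variant_id}: r = {r:+.3f}")
```

The variants at the top of the ranking are the candidates a study would follow up on. The quality of the `trait` column matters as much as the genotypes, which is why phenotyping accuracy is central to the rest of this article.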

In “Large-scale machine learning-based phenotyping significantly improves genomic discovery for optic nerve head morphology”, software developers from Google demonstrated how machine learning (ML) models that classify medical imaging data can be used to improve GWAS. They described how models can be trained to generate trait predictions for phenotypes and how these predictions are used to identify novel genetic associations. They then showed that the novel associations improve PRS accuracy and, using glaucoma as an example, that the improvements for anatomical eye traits relate to human disease.

Previous work has demonstrated that ML models can identify eye diseases, skin diseases, and abnormal mammogram results with accuracy approaching or exceeding that of state-of-the-art methods used by domain experts. Because identifying disease is a subset of phenotyping, Google reasoned that ML models could be broadly used to improve the speed and quality of phenotyping for GWAS.

To test this, the developers chose a model that uses a fundus image of the eye to accurately predict whether a patient should be referred for assessment for glaucoma. This model uses the fundus images to predict the diameters of the optic disc (the region where the optic nerve connects to the retina) and the optic cup (a whitish region in the center of the optic disc). The ratio of the diameters of these two anatomical features (called the vertical cup-to-disc ratio, or VCDR) correlates strongly with glaucoma risk.
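As a small worked example, the ratio itself is a single division once the two diameters are available. The sketch below assumes both diameters are measured along the vertical axis and in the same units. The function name and the sample measurements are hypothetical and do not reflect the model's actual output format.

```python
def vertical_cup_to_disc_ratio(cup_diameter, disc_diameter):
    """Return VCDR from vertical cup and disc diameters in the same units."""
    if disc_diameter <= 0:
        raise ValueError("disc diameter must be positive")
    if cup_diameter < 0:
        raise ValueError("cup diameter cannot be negative")
    return cup_diameter / disc_diameter


# Hypothetical measurements in millimetres.
vcdr = vertical_cup_to_disc_ratio(cup_diameter=0.9, disc_diameter=1.8)
print(f"VCDR = {vcdr:.2f}")
```

Because the units cancel, the ratio can be computed equally well from diameters expressed in pixels, provided both come from the same image.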

They applied this model to predict VCDR in all fundus images from individuals in the UK Biobank, which is the world’s largest dataset available to researchers worldwide for health-related research in the public interest, containing extensive phenotyping and genetic data for ~500,000 pseudonymized (the UK Biobank's standard for de-identification) individuals. The developers then performed GWAS in this dataset to identify genetic variants that are associated with the model-based predictions of VCDR.

The ML-based GWAS identified 156 distinct genomic regions associated with VCDR. These results were compared to a VCDR GWAS conducted by another group on the same UK Biobank data (Craig et al. 2020), in which experts had painstakingly labeled all images for VCDR. The ML-based GWAS replicated 62 of the 65 associations found in Craig et al., which indicates that the model accurately predicts VCDR in the UK Biobank images. Additionally, the ML-based GWAS discovered 93 novel associations.

To validate that the novel associations discovered in the ML-based GWAS are biologically relevant, Google developed independent PRSes using the Craig et al. and ML-based GWAS results, and tested their ability to predict human-expert-labeled VCDR in a subset of UK Biobank as well as a fully independent cohort (EPIC-Norfolk). The PRS developed from the ML-based GWAS showed greater predictive ability than the PRS built from the expert labeling approach in both datasets, providing strong evidence that the novel associations discovered by the ML-based method influence VCDR biology, and suggesting that the improved phenotyping accuracy (i.e., more accurate VCDR measurement) of the model translates into a more powerful GWAS.
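A PRS of this kind is commonly computed as a weighted sum of an individual's allele counts, with GWAS effect sizes as the weights, and then checked against measured trait values. The sketch below illustrates that idea under stated assumptions: the effect sizes, cohort, and measured values are invented, and Pearson correlation stands in for whatever evaluation metric the study used.

```python
import math


def polygenic_risk_score(allele_counts, effect_sizes):
    """Weighted sum of allele counts; weights are per-variant effect sizes."""
    return sum(
        effect_sizes[vid] * count
        for vid, count in allele_counts.items()
        if vid in effect_sizes  # ignore variants the GWAS did not report
    )


def pearson_r(xs, ys):
    """Pearson correlation between two equal-length numeric sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


# Hypothetical effect sizes from a GWAS (change in trait per allele copy).
effect_sizes = {"variant_A": 0.04, "variant_B": -0.02, "variant_C": 0.01}

# Hypothetical held-out cohort: allele counts plus expert-measured trait values.
cohort = [
    {"variant_A": 0, "variant_B": 2, "variant_C": 0},
    {"variant_A": 0, "variant_B": 1, "variant_C": 1},
    {"variant_A": 1, "variant_B": 1, "variant_C": 0},
    {"variant_A": 1, "variant_B": 0, "variant_C": 2},
    {"variant_A": 2, "variant_B": 0, "variant_C": 1},
]
measured_trait = [0.31, 0.38, 0.40, 0.52, 0.58]

scores = [polygenic_risk_score(person, effect_sizes) for person in cohort]
print(f"correlation with measured trait: {pearson_r(scores, measured_trait):.3f}")
```

Running the same evaluation with two different sets of effect sizes, one from each GWAS, on the same held-out individuals gives a like-for-like comparison of the two scores.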

As a second validation, because VCDR is known to be strongly correlated with glaucoma, the developers also investigated whether the ML-based PRS was associated with glaucoma status, defined as either a self-report of glaucoma or medical procedure codes suggestive of glaucoma or glaucoma treatment. They found that the PRS for VCDR derived from the model predictions was also predictive of the probability that an individual had indications of glaucoma. Individuals with a PRS 2.5 or more standard deviations higher than the mean were more than 3 times as likely to have glaucoma in this cohort. They also observed that the VCDR PRS from ML-based phenotypes was more predictive of glaucoma than the VCDR PRS produced from the extensive manual phenotyping.
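A comparison of this kind can be sketched by standardizing the scores, selecting the individuals at or above a cutoff, and dividing their disease prevalence by the prevalence in the comparison group. The sketch assumes the comparison group is everyone below the cutoff, which the text above does not specify. The toy scores and labels are invented and do not reproduce the study's figures.

```python
import statistics


def standardize(scores):
    """Convert raw scores to z-scores using the cohort mean and SD."""
    mean = statistics.fmean(scores)
    sd = statistics.pstdev(scores)
    return [(s - mean) / sd for s in scores]


def tail_relative_prevalence(scores, has_condition, z_cutoff=2.5):
    """Prevalence at or above the z cutoff divided by prevalence below it."""
    z_scores = standardize(scores)
    tail = [c for z, c in zip(z_scores, has_condition) if z >= z_cutoff]
    rest = [c for z, c in zip(z_scores, has_condition) if z < z_cutoff]
    if not tail or not rest:
        raise ValueError("both groups must be non-empty")
    prevalence_tail = sum(tail) / len(tail)
    prevalence_rest = sum(rest) / len(rest)
    if prevalence_rest == 0:
        raise ValueError("comparison group has no cases")
    return prevalence_tail / prevalence_rest


# Hypothetical cohort of 40: 38 typical scores and 2 extreme ones.
scores = [-0.1, 0.1] * 19 + [1.0, 1.0]
# 4 of the 38 typical individuals and 1 of the 2 extreme ones have the condition.
has_condition = [True] * 4 + [False] * 34 + [True, False]

ratio = tail_relative_prevalence(scores, has_condition)
print(f"relative prevalence in the high-PRS tail: {ratio:.2f}")
```

With so few individuals in the tail, a ratio like this is very noisy, which is one reason large cohorts such as UK Biobank are needed for this kind of analysis.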

The developers have shown that ML models can be used to quickly phenotype large cohorts for GWAS, and that these models can increase statistical power in such studies. Although these examples were shown for eye traits predicted from retinal imaging, the team looks forward to exploring how this concept could apply more generally to other diseases and data types.
