Machine Learning Models Could Help to Predict New Zoonotic Viruses - BioMed Advances

Most new human infectious diseases are caused by viruses that originated in animals and then made the jump to humans, with SARS-CoV-2 being the latest example. Identifying the next zoonotic virus early could give scientists time to study it and take steps to limit its spread, potentially preventing a global pandemic.

Once a virus has jumped from animals to humans it may well be too late to prevent a pandemic, but if viruses can be identified, assessed, and categorized based on their potential for making the jump from animals to humans, scientists could be given the time they need to conduct research and surveillance, and develop timely mitigations.

Researchers at the University of Glasgow believe machine learning models developed using viral genomes could be the solution. The models were developed by researchers Nardus Mollentze, PhD, Simon Babayan, PhD, and Daniel Streicker, PhD. In one application, the researchers claim the models could have identified SARS-CoV-2 as a high-risk coronavirus strain without any prior knowledge of zoonotic SARS-related coronaviruses.

“Given the increasing use of genomics in virus discovery and the otherwise sparse knowledge of the biology of newly discovered viruses, we developed machine learning models that identify candidate zoonoses solely using signatures of host range encoded in viral genomes,” explained the researchers in the paper.

Estimates suggest there are around 1.67 million animal viruses, but only a small number of them are capable of infecting humans. Determining at the time of discovery which animal viruses can infect humans is a major challenge. That makes it almost impossible to accurately prioritize the viruses that pose the highest risk to humans, investigate them, and prepare for potential outbreaks.

Since most viruses are discovered through untargeted genomic sequencing, limited phenotypic data is obtained. The best approach to take would be to quantify the relative risk of viruses infecting humans based on viral sequence data alone. This approach could be used to determine which viruses warrant further investigation. “Such predictions could alleviate the growing imbalance between the rapid pace of virus discovery and lower throughput field and laboratory research needed to comprehensively evaluate risk,” explained the researchers.
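As a rough illustration of how sequence-based risk scores could feed a prioritization step, the sketch below ranks viruses by a predicted probability of human infection and keeps those above a cutoff. The function name, the virus names, the scores, and the 0.5 threshold are all hypothetical placeholders, not values or methods from the paper.

```python
from typing import Dict, List, Tuple


def prioritize(scores: Dict[str, float], threshold: float = 0.5) -> List[Tuple[str, float]]:
    """Return viruses whose score meets the threshold, highest risk first.

    `scores` maps a virus name to a predicted probability of human
    infection. The default threshold is an arbitrary illustrative cutoff.
    """
    flagged = [(name, p) for name, p in scores.items() if p >= threshold]
    return sorted(flagged, key=lambda item: item[1], reverse=True)


# Made-up scores for illustration only; not results from the study.
hypothetical_scores = {"virus_A": 0.82, "virus_B": 0.12, "virus_C": 0.55}
shortlist = prioritize(hypothetical_scores)
```

In practice the cutoff would depend on how much follow-up laboratory and field capacity is available, since a lower threshold flags more viruses for investigation.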

Machine learning models could be developed by training algorithms on viruses closely related to well-characterized human-infecting viruses, but that approach has limits. If secondary characteristics of the viral genome linked to infection capability are omitted, such models are less likely to find signals of zoonotic status that generalize across viruses.

The researchers instead developed machine learning models that use features engineered from viral and human genome sequences to predict the probability of an animal-infecting virus being able to infect humans given biologically relevant exposure.
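To make "features engineered from genome sequences" more concrete, here is a minimal sketch of one simple kind of compositional feature: the relative frequency of each dinucleotide in a sequence. This is an assumed example of sequence-derived features in general; the article does not specify the researchers' feature set, and the function name and example sequence are illustrative only.

```python
from collections import Counter
from itertools import product
from typing import Dict

BASES = "ACGT"
# All 16 possible dinucleotides, in a fixed order.
DINUCLEOTIDES = ["".join(pair) for pair in product(BASES, repeat=2)]


def dinucleotide_frequencies(sequence: str) -> Dict[str, float]:
    """Return the relative frequency of each dinucleotide in `sequence`.

    Overlapping pairs are counted; pairs containing characters other
    than A, C, G, or T (for example N) are skipped.
    """
    seq = sequence.upper()
    counts = Counter(
        seq[i:i + 2]
        for i in range(len(seq) - 1)
        if seq[i] in BASES and seq[i + 1] in BASES
    )
    total = sum(counts.values())
    if total == 0:
        return {d: 0.0 for d in DINUCLEOTIDES}
    return {d: counts[d] / total for d in DINUCLEOTIDES}


# Toy sequence for illustration, not a real viral genome.
example = dinucleotide_frequencies("ATGCGCGATTA")
```

A vector like this, computed for each genome, could then serve as input to a classifier that outputs a probability of human infection.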

The researchers used a dataset of 861 viral species from 36 virus families with known zoonotic status and assigned each a probability of human infection based on virus taxonomy and relatedness to known human-infecting viruses. The model that performed best was used to analyze patterns in the predicted zoonotic potential of other virus genomes from a range of different animal species.

The researchers found their machine learning models significantly outperformed current alternatives that use phylogenetic relatedness of viruses to known human-infecting viruses. The researchers claim their models distinguished high-risk viruses within families that contain a minority of human-infecting species and identified putatively undetected or so far unrealized zoonoses.

“In requiring only a genome sequence, our approach has quantitative and qualitative advantages over alternative models for zoonotic risk assessment,” concluded the researchers.

You can read more about the research in the paper – Identifying and prioritizing potential human-infecting viruses from their genome sequences – which was recently published in PLOS Biology. DOI: 10.1371/journal.pbio.3001390