The emergence and rapid worldwide dissemination of antibiotic resistance threatens medical progress and calls for innovative approaches to managing multidrug-resistant infections. Phage therapy might represent such an alternative. This re-emerging therapy uses viruses that specifically infect and kill bacteria during their life cycle to reduce or eliminate the bacterial load and cure infections.
The success of phage therapy, however, relies on an exact match between the target pathogenic bacterium and the therapeutic phage. Access to a fully characterized phage library is therefore a prerequisite for phage therapy. An essential second step in designing personalized phage therapy treatments is the ability to predict interactions between the target pathogen and its potential phages. To address this, we aim to develop predictive in silico models of phage-bacteria infection networks, using genomic features from sequenced phages and bacteria and drawing on bioinformatics and machine learning techniques.
Using publicly available information from GenBank and phagesdb.org, we constructed a dataset containing more than 1,000 known phage-bacteria interactions with corresponding sequenced genomes. An equal number of putative negative interactions was added to the dataset, selected by taking the specificity of phage-bacteria interactions into account. We are currently extracting features from the genomes to build quantitative datasets for training machine learning models. These features include the distribution of predicted protein-protein interaction scores, as well as the amino-acid frequency and chemical composition of the proteins. Future work will focus on developing ensemble machine learning models to optimize the predictive power of our methodology.
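As an illustration of the amino-acid frequency features mentioned above, the following is a minimal sketch and not the pipeline used in this work. It assumes each genome has already been translated into a list of protein sequences; the function names and the choice to average per-protein frequencies into one vector per genome are hypothetical.

```python
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues


def amino_acid_frequencies(protein):
    """Return the relative frequency of each standard amino acid in one protein."""
    counts = Counter(protein.upper())
    # Non-standard symbols (e.g. X, *) are ignored in the total.
    total = sum(counts[aa] for aa in AMINO_ACIDS)
    if total == 0:
        return {aa: 0.0 for aa in AMINO_ACIDS}
    return {aa: counts[aa] / total for aa in AMINO_ACIDS}


def genome_feature_vector(proteins):
    """Average per-protein frequencies into one 20-value vector (assumed aggregation)."""
    per_protein = [amino_acid_frequencies(p) for p in proteins]
    if not per_protein:
        return [0.0] * len(AMINO_ACIDS)
    n = len(per_protein)
    return [sum(f[aa] for f in per_protein) / n for aa in AMINO_ACIDS]


def pair_features(phage_proteins, bacterium_proteins):
    """Concatenate phage and bacterium vectors into one 40-value feature row."""
    return genome_feature_vector(phage_proteins) + genome_feature_vector(bacterium_proteins)
```

Each phage-bacterium pair, positive or putative negative, would then contribute one such row to the training table, alongside any other features.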
Based on public data from GenBank and phagesdb.org, we collected more than a thousand positive phage-bacterium interactions with their complete genomes. In addition, we generated putative negative (i.e., non-interacting) pairs. From the collected genomes, we extracted a set of informative features based on the distribution of predicted protein-protein interaction scores and on the primary structure of the proteins (e.g., amino-acid frequency, molecular weight, and chemical composition of each protein). With these features, we generated multiple candidate datasets to train our algorithms. On this basis, we built predictive models that reach around 90% F1-score, sensitivity, specificity, and accuracy on the test set, estimated with 10-fold cross-validation.
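For reference, the four reported metrics can be computed from binary predictions as in the sketch below. This is a generic illustration under stated assumptions, not the evaluation code behind the results: labels are assumed to be 1 for interacting and 0 for non-interacting pairs, and the helper names and the unshuffled fold split are hypothetical.

```python
def classification_metrics(y_true, y_pred):
    """Compute sensitivity, specificity, accuracy, and F1 for 0/1 labels."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)

    sensitivity = tp / (tp + fn) if tp + fn else 0.0  # recall on interacting pairs
    specificity = tn / (tn + fp) if tn + fp else 0.0  # recall on non-interacting pairs
    precision = tp / (tp + fp) if tp + fp else 0.0
    accuracy = (tp + tn) / len(pairs) if pairs else 0.0
    denom = precision + sensitivity
    f1 = 2 * precision * sensitivity / denom if denom else 0.0
    return {"sensitivity": sensitivity, "specificity": specificity,
            "accuracy": accuracy, "f1": f1}


def k_fold_indices(n_samples, k=10):
    """Yield (train_idx, test_idx) for k contiguous folds, without shuffling."""
    sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in sizes:
        test = list(range(start, start + size))
        train = list(range(0, start)) + list(range(start + size, n_samples))
        yield train, test
        start += size
```

In a 10-fold setting, a model would be fitted on each `train` split, scored on the matching `test` split with `classification_metrics`, and the ten results averaged. In practice the samples would normally be shuffled or stratified before splitting.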
These promising results reinforce the hypothesis that machine learning techniques can produce highly predictive models that accelerate the search for interacting phage-bacteria pairs.
- Leite, D. M. C. et al. Exploration of multiclass and one-class learning methods for prediction of phage-bacteria interaction at strain level. In 2018 IEEE International Conference on Bioinformatics and Biomedicine (BIBM), 1818–1825 (2018). https://doi.org/10.1109/BIBM.2018.8621433
- Leite, D., Brochet, X., Resch, G. et al. Computational prediction of inter-species relationships through omics data analysis and machine learning. BMC Bioinformatics 19(Suppl 14), 420 (2018). https://doi.org/10.1186/s12859-018-2388-7