9 October, 2020
Author: Szymon Wojciechowski

Reading the genome – predictions of phenotypes


The availability of microbial genomes produced by high-throughput sequencing enables the application of a new class of tools to the problem of inferring phenotypic traits of bacterial strains. Machine learning algorithms that have been trained on thousands of genomes can replace or support laboratory experiments to determine the phenotypes of newly discovered bacteria. They can do this not only accurately but also at scale and at low cost. Consequently, these methods may lead to faster breakthrough discoveries within tight safety requirements.

Phenotypic traits – why could they be of importance?

Microbes are all around us. They constitute a considerable part of ourselves (with around 1 bacterial cell for each human cell in the body, as recent studies suggest [1]), they can be found in every environment, and so-called extremophiles even live in places that are extremely hostile to life [2]. Knowing more about each and every bacterial strain out there may seem like an intriguing task on its own, but knowledge of microorganisms’ phenotypes also has very practical uses. ‘Phenotype’ is a general term that is used to describe the collection of physiological features of a given organism, including its physical form, structure, biochemical characteristics or developmental processes [3]. One concrete phenotypic trait of bacteria is their resistance to antibiotics or lack thereof. With about 35,000 deaths caused each year by antibiotic-resistant bacteria in the US alone [4] and almost as many in Europe [5], it would be hugely beneficial to know how susceptible a given pathogenic strain is to the bactericidal drugs at our disposal. This knowledge could reduce the number of victims and help counter a potential pandemic as early as possible.

Protection against illness-causing microorganisms is one obvious reason to pose questions about phenotypes, but it is not the only one. The scientific community is finding more and more evidence that microbes can be used to our advantage in treating diseases [16, 17], as early indicators of ailments [18, 19, 21], or as probiotics [20, 21]. In order to study such interactions, it is crucial to have access to the strains that could be used in any of these areas. However, a strain needs to be cultivable; in other words, we need to have a way to multiply it in laboratories and deliver it to researchers for further experiments. Culturing (cultivation) of strains entails providing them with the optimal growth conditions for their phenotype. In some cases, determining these conditions can be an extremely time-consuming process. In other cases, animal cells may be indispensable, as they constitute a scaffolding for bacteria to grow on [6]. Knowing phenotypic traits upfront can accelerate the cultivation process and make it cheaper.

Finally, in rare cases, knowledge of phenotypic traits may save lives. Let us assume a scenario in which a person is infected with a bacterium, but we do not know the origin of the infection. From the pathogen’s phenotype we could nevertheless deduce the origin or significantly narrow down the possibilities. We might find that the bacterium most likely came from pork that the patient ate, from a beach that they visited the day before, or from a mosquito bite. We could then act to curb the spread of this pathogen.

In silico research as a means of supporting wet labs

Currently, phenotypic traits are discovered for each strain separately in the laboratory through numerous experiments of varying complexity (for instance [7]). Naturally, we also employ the genotype of a given microbe to determine its phenotypic traits in some relatively easy cases [8]. One such prominent case is the gene tet(W/N/W), which confers resistance to tetracycline [9]. This is an efficient method which unfortunately has one significant drawback: the effect of this gene has to be observed and confirmed beforehand in a classic experiment. Hence, we are merely extending what is already known to new cases; this approach does not lead to new insights. What if we could predict a phenotypic trait without assuming anything at all?

This is achievable by means of machine learning and, in a broader sense, artificial intelligence. These terms are often used interchangeably (mostly wrongly [10]) to denote a class of sophisticated algorithms whose aim is to find the connection (mapping) between a number of known characteristics of an object (in our case, a bacterium’s genotype) and an unknown trait that we seek to explain. In our problem, we want to discover a ‘recipe’ to determine phenotypic traits. Application of these methods is possible thanks to the rapid increase in the number of genome sequences we have at our disposal. This, in turn, was facilitated by improvements in high-throughput sequencing methods and DNA-extraction protocols. On top of these advances, researchers have managed to compile extensive microorganism databases such as NCBI [11] and BacDive [12]. These databases gather the genomes of bacteria alongside corresponding metadata such as phenotypic characteristics, which makes them very convenient for developing machine learning algorithms.
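To make the idea of a genotype-to-phenotype mapping more concrete, here is a minimal sketch in Python. It is an illustration under stated assumptions, not the method used at Ardigen: it assumes that a genome can be summarized by counts of short overlapping substrings (k-mers) and that a new strain receives the label of the most similar labelled genome. The function names, the toy sequences and the labels are all made up for this example.

```python
from collections import Counter


def kmer_counts(sequence, k=3):
    """Count every overlapping substring of length k in a nucleotide sequence."""
    return Counter(sequence[i:i + k] for i in range(len(sequence) - k + 1))


def distance(a, b):
    """Manhattan distance between two k-mer count profiles."""
    # Counter returns 0 for k-mers that are absent from a profile.
    return sum(abs(a[kmer] - b[kmer]) for kmer in set(a) | set(b))


def predict_trait(train, query_sequence, k=3):
    """Return the label of the training genome whose k-mer profile is closest."""
    query = kmer_counts(query_sequence, k)
    _, label = min(
        ((distance(kmer_counts(seq, k), query), lab) for seq, lab in train),
        key=lambda pair: pair[0],
    )
    return label


# Toy, made-up sequences and labels, for illustration only.
training_set = [
    ("ATGATGATGATG", "resistant"),
    ("CCGGCCGGCCGG", "susceptible"),
]

print(predict_trait(training_set, "ATGATGCTGATG"))
```

With these toy inputs the query differs from the first training sequence by a single nucleotide, so the sketch prints "resistant". Real genomes are millions of nucleotides long and real models are far more elaborate, but the principle of learning a mapping from labelled examples is the same.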

What can be achieved?

At Ardigen, we are trying to find microbiome-derived therapeutics that can be administered to patients so as to enhance their chances of successful treatment. However, this must be done with great care since the administration of a pathogenic microbe will not help, to put it mildly. Therefore, as a part of our Microbiome Translational Platform we have developed an in silico engine to determine the phenotypic characteristics of given strains. In order to achieve this, we amassed publicly available genomes from open databases and crunched some numbers to build predictive models for numerous traits. Currently, we are capable of predicting not only some very basic characteristics such as Gram staining and oxygen requirements for growth, but also some more challenging ones, such as biosafety of cultivation [13] and the temperature and pH required for culturing.

No collection of such models would be complete without an endeavor to predict antimicrobial resistance (AMR), i.e. whether a given strain is susceptible to a particular drug or not. As long ago as 2001, the Joint Food and Agriculture Organization of the United Nations/World Health Organization Expert Consultation on Evaluation of Health and Nutritional Properties of Probiotics published a set of guidelines regarding the acceptance of bacteria as probiotics. One of the key features of these guidelines was safety assessment comprising, among others, patterns of antimicrobial drug resistance [14]. It is therefore essential to be able to determine a strain’s resistance profile as early and accurately as possible.

We have trained over 40 hypothesis-free models (one for each antibiotic). By ‘hypothesis-free’, we mean that we did not feed the models any pre-existing knowledge on the possible links between genomes and phenotypes. Everything had to be learnt from scratch by just looking at pure sequences of nucleotides. In the end, we obtained very accurate predictions for all but one antibiotic. ROC AUC > 0.9 was reached in more than 30 cases. ROC AUC is a standard machine learning measure ranging from 0 to 1, with 0.5 denoting a non-informative model and values closer to 1 being better. For a binary problem (in our case, resistant vs susceptible), these outcomes are widely considered very good by researchers in this area [22–24]. In cases where the data is abundant enough, we can also successfully predict MIC (minimum inhibitory concentration) values.
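For readers unfamiliar with ROC AUC, the measure can be read as the probability that a randomly chosen resistant strain receives a higher model score than a randomly chosen susceptible one, with ties counted as half. The sketch below computes it directly from that definition. The function name, the labels and the scores are invented for illustration and are not results from our models.

```python
def roc_auc(labels, scores):
    """ROC AUC as the fraction of (positive, negative) pairs ranked correctly.

    labels: 1 for resistant, 0 for susceptible.
    scores: model outputs, higher meaning "more likely resistant".
    """
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    if not positives or not negatives:
        raise ValueError("ROC AUC needs at least one example of each class")

    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0   # pair ranked correctly
            elif p == n:
                wins += 0.5   # a tie counts as half
    return wins / (len(positives) * len(negatives))


# Made-up labels and scores for six strains.
labels = [1, 1, 1, 0, 0, 0]
scores = [0.9, 0.8, 0.4, 0.7, 0.3, 0.2]
print(roc_auc(labels, scores))
```

In this toy example 8 of the 9 resistant/susceptible pairs are ranked correctly, so the value is roughly 0.89. A model that gives every strain the same score yields 0.5, and a model that ranks every resistant strain above every susceptible one yields 1.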

The range of problems that can be solved by machine learning is not limited to the examples we have described. Experimentally, we are also developing models for the prediction of more intangible features such as engraftment to human tissue, which is a prerequisite for an effective microbiome-derived therapeutic that accompanies cancer therapies [15]. We are also working on a model that can reliably predict the composition of the medium on which a strain should be cultured.


Deciphering nature, while undoubtedly fascinating, has tangible benefits for the life and wellbeing of mankind. At Ardigen, we know that utilizing computational tools such as machine learning, which has already proved its value in numerous areas of research, could make these benefits easier to attain. We hope that the phenotypes of microbes will soon be demystified.



[1] Sender, Ron & Fuchs, Shai & Milo, Ron. (2016). Are We Really Vastly Outnumbered? Revisiting the Ratio of Bacterial to Host Cells in Humans. Cell. 164. 10.1016/j.cell.2016.01.013.

[2] Rampelotto P.H. Extremophiles and extreme environments. Life. 2013;3:482–485. doi: 10.3390/life3030482.

[3] Dawkins, Richard (12 January 1978). “Replicator Selection and the Extended Phenotype”. Ethology. 47 (1 January–December 1978): 61–76. doi:10.1111/j.1439-0310.1978.tb01823.x

[4] Centers for Disease Control and Prevention. ANTIBIOTIC RESISTANCE IN THE UNITED STATES 2019. Dec. 2019.

[5] Cassini, Alessandro, et al. “Attributable Deaths and Disability-Adjusted Life-Years Caused by Infections with Antibiotic-Resistant Bacteria in the EU and the European Economic Area in 2015: A Population-Level Modelling Analysis.” The Lancet Infectious Diseases, vol. 19, no. 1, Jan. 2019, pp. 56–66, 10.1016/s1473-3099(18)30605-4.

[6] Bochner, Barry R, et al. “Important Discoveries from Analysing Bacterial Phenotypes.” Molecular Microbiology, vol. 70, no. 2, 1 Oct. 2008, pp. 274–280, 10.1111/j.1365-2958.2008.06383.x.

[7] Nouioui, Imen, et al. “Two Novel Species of Rapidly Growing Mycobacteria: Mycobacterium Lehmannii Sp. Nov. and Mycobacterium Neumannii Sp. Nov.” International Journal of Systematic and Evolutionary Microbiology, vol. 67, no. 12, 1 Dec. 2017, pp. 4948–4955, 10.1099/ijsem.0.002350.

[8] Crofts, Terence S., et al. “Next-Generation Approaches to Understand and Combat the Antibiotic Resistome.” Nature Reviews Microbiology, vol. 15, no. 7, 10 Apr. 2017, pp. 422–434, 10.1038/nrmicro.2017.28.

[9] Leclercq, Sébastien Olivier, et al. “Diversity of the Tetracycline Mobilome within a Chinese Pig Manure Sample.” Applied and Environmental Microbiology, vol. 82, no. 21, 1 Nov. 2016, pp. 6454–6462, 10.1128/AEM.01754-16.

[10] Marr, Bernard. “What Is The Difference Between Artificial Intelligence And Machine Learning?” Forbes, 6 Dec. 2016. Accessed 29 Sept. 2020.

[11] National Center for Biotechnology Information; [1988] – [Accessed 29 Sept. 2020]. Available from:

[12] BacDive in 2019: bacterial phenotypic data for High-throughput biodiversity analysis Reimer, L. C., Vetcininova, A., Sardà Carbasse, J., Söhngen, C., Gleim, D., Ebeling, C., Overmann, J. Nucleic Acids Research; database issue 2019.

[13] Leibniz Institute DSMZ-German Collection of Microorganisms and Cell Cultures GmbH. “German Collection of Microorganisms and Cell Cultures GmbH: Safety Information.” Accessed 29 Sept. 2020.

[14] Venugopalan, Veena, et al. “Regulatory Oversight and Safety of Probiotic Use.” Emerging Infectious Diseases, vol. 16, no. 11, 2010, pp. 1661–5, 10.3201/eid1611.100574.

[15] Smillie, Christopher S., et al. “Strain Tracking Reveals the Determinants of Bacterial Engraftment in the Human Gut Following Fecal Microbiota Transplantation.” Cell Host & Microbe, vol. 23, no. 2, Feb. 2018, pp. 229-240.e5, 10.1016/j.chom.2018.01.003.

[16] Flickinger, John, et al. “Listeria Monocytogenes as a Vector for Cancer Immunotherapy: Current Understanding and Progress.” Vaccines, vol. 6, no. 3, 25 July 2018, p. 48, 10.3390/vaccines6030048.

[17] Lamm, D. L., et al. “A Randomized Trial of Intravesical Doxorubicin and Immunotherapy with Bacille Calmette-Guérin for Transitional-Cell Carcinoma of the Bladder.” The New England Journal of Medicine, vol. 325, no. 17, 24 Oct. 1991, pp. 1205–1209, 10.1056/NEJM199110243251703.

[18] Pezo, Rossanna C., et al. “Impact of the Gut Microbiota on Immune Checkpoint Inhibitor-Associated Toxicities.” Therapeutic Advances in Gastroenterology, vol. 12, 16 Sept. 2019, 10.1177/1756284819870911.

[19] Zeller, Georg, et al. “Potential of Fecal Microbiota for Early-Stage Detection of Colorectal Cancer.” Molecular Systems Biology, vol. 10, no. 11, 28 Nov. 2014, 10.15252/msb.20145645.

[20] Marcobal, A., and J.L. Sonnenburg. “Human Milk Oligosaccharide Consumption by Intestinal Microbiota.” Clinical Microbiology and Infection, vol. 18, July 2012, pp. 12–15, 10.1111/j.1469-0691.2012.03863.x.

[21] Frankel, Arthur E., et al. “Cancer Immune Checkpoint Inhibitor Therapy and the Gut Microbiota.” Integrative Cancer Therapies, vol. 18, 23 Apr. 2019, 10.1177/1534735419846379.

[22] Hosmer, David W, et al. Applied Logistic Regression. Hoboken, NJ, USA: John Wiley & Sons, Inc., 2013.

[23] Mandrekar, Jayawant N. “Receiver Operating Characteristic Curve in Diagnostic Test Assessment.” Journal of Thoracic Oncology, vol. 5, no. 9, Sept. 2010, pp. 1315–1316, 10.1097/jto.0b013e3181ec173d.

[24] Khouli, Riham H. El, et al. “The Relationship of Temporal Resolution to Diagnostic Performance for Dynamic Contrast Enhanced (DCE) MRI of the Breast.” Journal of Magnetic Resonance Imaging: JMRI, vol. 30, no. 5, 1 Nov. 2009, p. 999, 10.1002/jmri.21947.
