At Ardigen, our goal is to decode the microbiome for clinical success. Investigations into the role of the gut microbiota in human health are gaining momentum, with numerous research projects and commercial ventures underway – especially in cancer research. It all started with the discovery of correlations between microbial composition and therapy efficacy. Since then, modulation of the gut microbiome by fecal microbiota transplantation (FMT) has been shown to impact treatment outcomes in patients [1, 2]. This gives hope for the development of microbiota-derived biomarkers and, ultimately, reliable therapies. To enable this, scientists are looking to determine the Mode of Action (MoA) of the microbiome – the particular actionable features within the microbial haystack.
One key to this discovery may lie in data from high-throughput screens. One such screen is Shotgun Metagenome Sequencing (SMS), which provides data that enable an understanding of the functions performed by microorganisms and allow quantification of low-abundance ones, for which traditional 16S sequencing lacks depth. However, the tools traditionally used in bioinformatics cannot fully exploit this kind of complex data.
In microbiome research, just as in various other scientific domains, Machine Learning (ML) is gaining a foothold and raising hopes for the awaited breakthrough. But how common is it, and how is it applied to metagenomic data? In this article, we review the data science methods used in microbiome research, taking as examples recent papers that analyze SMS data from cancer patients undergoing immune checkpoint therapies [3-12].
Once microbial samples leave the sequencers and spectrometers, they typically end up as tables of numbers. Such data representations can easily be fed into analytical software, be it statistical or machine learning (ML). All of the articles we have gathered for this review apply one of these two approaches.
The distinction between statistics and ML is a blurry one, as both fall within the wider field of data science. For the sake of clarity, we define (supervised) ML as a set of algorithms that predict certain outcomes on new data. For example, an ML algorithm might predict a patient’s response to a therapy based on their taxonomic data. There are also unsupervised ML algorithms that find patterns in the data without predicting anything specific about the patient; these are often used for data visualization or quality control, but here we focus on the supervised methods.
The more traditional, statistical approach, on the other hand, relies on univariate analysis, where each patient feature (e.g. the abundance of a specific taxon or a specific function) is analyzed separately with a statistical test. The choice of test depends on the application, so let’s take a closer look at which tests are applied in the papers we consider most important to the field.
In cases where the patients can be divided into groups (e.g. responders and non-responders), one can use univariate statistical tests that compare the groups by looking at each feature (say, the relative abundance of a specific taxon) separately and outputting a p-value that describes the significance of the between-group difference. To decide whether a feature differs significantly between the groups, one must define a significance threshold, traditionally a p-value below 0.05.
The Mann-Whitney U test is a specific example, used in [3, 5, 7, 10] to compare the abundance of certain genera in stool samples. A similar test was used in [8] to assess differences between gene / MGS (metagenomic species) counts, although the authors do not name the test. Other variants of univariate tests appear in [3, 4, 7, 10]. Overall, the authors favor non-parametric tests (like the Mann-Whitney test), which rely only on the ordering of the values. This is a reasonable choice, as such tests make no assumptions about the distribution of the underlying data.
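To make this concrete, here is a minimal sketch of a per-taxon Mann-Whitney U test using SciPy. The abundance values are hypothetical stand-ins, not data from any of the cited papers:

```python
import numpy as np
from scipy.stats import mannwhitneyu

# Hypothetical relative abundances of one genus in two patient groups.
responders = np.array([0.12, 0.30, 0.21, 0.08, 0.25, 0.18])
non_responders = np.array([0.02, 0.05, 0.00, 0.07, 0.03, 0.01])

# The test ranks the pooled values, so it makes no distributional assumptions.
stat, p_value = mannwhitneyu(responders, non_responders, alternative="two-sided")
print(f"U = {stat}, p = {p_value:.4f}")
```

In a real analysis this test would be run once per taxon, producing one p-value per feature – which is exactly why the multiple-comparison corrections discussed below become necessary.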
Many authors [3, 7, 9, 10] also apply a correction for the multiple comparison problem (Holm, Bonferroni, or Benjamini-Hochberg), which makes rejecting the null hypothesis harder and results in fewer taxa or OTUs passing the significance threshold. These corrections push the reasoning in a more conservative direction, limiting the number of false positives, which, in our view, should be preferred and applied whenever required.
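As an illustration, the Benjamini-Hochberg correction can be applied with statsmodels over a set of hypothetical per-taxon p-values; swapping `method` to `"holm"` or `"bonferroni"` gives the other corrections mentioned above:

```python
from statsmodels.stats.multitest import multipletests

# Hypothetical raw p-values, one per tested taxon.
p_values = [0.001, 0.008, 0.039, 0.041, 0.20, 0.74]

# Benjamini-Hochberg controls the false discovery rate at alpha = 0.05.
reject, p_adjusted, _, _ = multipletests(p_values, alpha=0.05, method="fdr_bh")
for p, p_adj, keep in zip(p_values, p_adjusted, reject):
    print(f"raw p = {p:.3f}  adjusted p = {p_adj:.4f}  significant: {keep}")
```

Note how taxa with raw p-values just below 0.05 (0.039 and 0.041) no longer pass the threshold after adjustment – precisely the conservative behavior described above.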
Finally, in the more clinical areas of research, one of the most popular statistical tests is the log-rank (Mantel-Cox) test, used to assess the significance of differences between Kaplan-Meier survival curves [5, 6, 7, 8, 9, 12]. This test, however, is applicable only if survival data have been collected.
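The log-rank statistic itself is simple enough to sketch directly with NumPy: at each event time, it compares the observed number of events in one group with the number expected under the null hypothesis of identical survival. The survival times below are hypothetical; dedicated packages such as lifelines offer production-grade implementations:

```python
import numpy as np
from scipy.stats import chi2

def logrank_test(time_a, event_a, time_b, event_b):
    """Two-group log-rank test; `event` is 1 for an observed event, 0 if censored."""
    times = np.concatenate([time_a, time_b])
    events = np.concatenate([event_a, event_b])
    group_b = np.concatenate([np.zeros(len(time_a), bool), np.ones(len(time_b), bool)])
    o_minus_e, var = 0.0, 0.0
    for t in np.unique(times[events == 1]):        # each distinct event time
        at_risk = times >= t
        n = at_risk.sum()                          # patients still at risk
        n_b = (at_risk & group_b).sum()            # ... of which in group B
        d = ((times == t) & (events == 1)).sum()   # events at time t
        d_b = ((times == t) & (events == 1) & group_b).sum()
        o_minus_e += d_b - d * n_b / n             # observed minus expected in B
        if n > 1:                                  # hypergeometric variance term
            var += d * (n_b / n) * (1 - n_b / n) * (n - d) / (n - 1)
    stat = o_minus_e ** 2 / var
    return stat, chi2.sf(stat, df=1)

# Hypothetical survival times (months); all events observed, none censored.
time_a, event_a = np.array([2, 3, 4, 5, 6]), np.ones(5, int)
time_b, event_b = np.array([8, 9, 10, 12, 14]), np.ones(5, int)
stat, p_value = logrank_test(time_a, event_a, time_b, event_b)
print(f"chi2 = {stat:.2f}, p = {p_value:.4f}")
```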
While popular, univariate statistical tests have notable shortcomings. Because they treat each feature separately, they may miss a signal that manifests itself across multiple taxa but is too weak in any single taxon. This is where a machine learning approach shines. Even the simplest ML model, such as linear regression, accounts for the combined influence of all the factors when predicting the outcome. This way, we can deliver more reliable predictions and extract the relevant features more robustly. Once we have a trained model, we can also predict the outcome of interest for new patients. In classical statistics, the same variable appears as well, but rather than being predicted for new entities, it is used to stratify the observations in the dataset so that the subgroups can be characterized separately and, eventually, compared.
Compared with statistical methods, machine learning is not so commonly used across the reviewed articles; there are, however, some ML-based examples.
One such example is [7], where the authors trained a series of logistic regression models on binarized abundances (abundance = 0 vs. abundance > 0) of the taxa, with the (also binary) response status as the dependent variable. Another example is the application of random forests in [10] to find the factors that influence shifts in the gut microbiome. Finally, a series of machine learning algorithms was trained in [12]: the authors implemented random forest, extra trees, SVM, elastic net, and k-nearest neighbors.
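The binarize-then-classify approach described for [7] can be sketched in a few lines of scikit-learn. The abundance matrix and response labels below are randomly generated stand-ins, not the paper’s data:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
abundances = rng.exponential(scale=0.05, size=(40, 10))  # 40 patients, 10 taxa
abundances[abundances < 0.02] = 0.0                      # microbiome data has many zeros

X = (abundances > 0).astype(int)       # binarize: presence (1) vs. absence (0)
y = rng.integers(0, 2, size=40)        # hypothetical binary response status

model = LogisticRegression().fit(X, y)
print("predicted response for first 5 patients:", model.predict(X[:5]))
```

Binarizing sidesteps the noisy magnitudes of low-abundance measurements, at the cost of discarding quantitative information – a trade-off the model designer has to weigh.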
One crucial part of applying ML in any context is verifying how well the model performs. This is typically done by splitting the data into train and test sets (also called discovery and validation sets): the model trained on the first set makes predictions on the second, and its performance is evaluated. The procedure should be repeated multiple times with random splits to obtain a robust estimate of the performance. To some degree, this was done in [10, 12] to compare various models, but it is too often skipped in research papers.
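The repeated-splitting procedure can be sketched with scikit-learn as follows; the data are synthetic stand-ins in which only the first feature carries signal:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedShuffleSplit, cross_val_score

rng = np.random.default_rng(1)
X = rng.random((60, 20))                                   # 60 patients, 20 features
y = (X[:, 0] + rng.normal(0, 0.3, 60) > 0.5).astype(int)   # outcome tied to feature 0

# 20 random stratified 75/25 train/test splits; AUC evaluated on each held-out set.
splitter = StratifiedShuffleSplit(n_splits=20, test_size=0.25, random_state=0)
scores = cross_val_score(RandomForestClassifier(random_state=0), X, y,
                         cv=splitter, scoring="roc_auc")
print(f"mean AUC = {scores.mean():.2f} +/- {scores.std():.2f}")
```

Reporting the mean and spread over many splits, rather than a single lucky split, is what makes the performance estimate trustworthy.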
Both machine learning and statistical results are strongly influenced by the pre-processing of the data, which often prevents us from directly comparing the results presented in different papers. For example, many authors use MetaPhlAn2 to perform taxonomic profiling [3, 6, 9, 10], but the software options may differ from publication to publication, so some systematic deviations are to be expected; moreover, [10] also used a competing pipeline, MetaOMineR. In addition, authors tend to remove infrequent features (taxa, OTUs, pathways, etc.), yet the notion of “infrequency” is fuzzy: some researchers are more liberal and retain variables present in as little as 10% of samples [4], whereas others are much stricter and require presence in 25% of samples [6, 7].
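Prevalence-based filtering of this kind is a one-liner in pandas. The count table below is randomly generated, with four deliberately common and four deliberately rare hypothetical taxa; the 25% threshold mirrors the stricter convention cited above:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(2)
# 30 samples x 8 taxa: four common taxa and four rare ones (hypothetical counts).
rates = np.array([2.0, 2.0, 2.0, 2.0, 0.05, 0.05, 0.05, 0.05])
counts = pd.DataFrame(rng.poisson(rates, size=(30, 8)),
                      columns=[f"taxon_{i}" for i in range(8)])

prevalence = (counts > 0).mean(axis=0)        # fraction of samples containing each taxon
filtered = counts.loc[:, prevalence >= 0.25]  # keep taxa present in >= 25% of samples
print("kept:", list(filtered.columns))
```

Since the choice of threshold directly changes which features survive into the downstream analysis, it is one of the pre-processing decisions that should always be reported.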
Synthetic features are also added to improve analyses, the most common being sample diversity (alpha- or beta-diversity). However, even though these measures are widely used, there is no consensus among researchers on exactly what should be calculated. Taking alpha diversity as an example, some authors prefer Shannon diversity [7, 10, 11], while others choose its inverse [5, 12], simple richness [6, 8, 9, 10], or Simpson diversity [10].
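The common alpha-diversity variants are each a short NumPy expression over a sample’s taxon counts; the counts below are hypothetical:

```python
import numpy as np

counts = np.array([50, 30, 10, 5, 5, 0, 0])    # hypothetical taxon counts, one sample
p = counts[counts > 0] / counts.sum()          # relative abundances of observed taxa

richness = (counts > 0).sum()                  # number of observed taxa
shannon = -np.sum(p * np.log(p))               # Shannon diversity
simpson = 1 - np.sum(p ** 2)                   # (Gini-)Simpson diversity
inv_simpson = 1 / np.sum(p ** 2)               # inverse Simpson diversity

print(richness, round(shannon, 3), round(simpson, 3), round(inv_simpson, 3))
```

Because each index weights rare taxa differently, two studies computing “diversity” on the same samples can rank them differently – one reason cross-paper comparisons are fraught.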
The field of microbiome research is only just starting to apply even the basic ML methods. Many works stop at calculating p-values and miss the weaker signals that could shed more light on the fuller biological picture hidden in the data. These intricacies may be unraveled by re-analyzing the data with the skillful use of more robust approaches. This, in turn, may lead to a better understanding of therapies’ modes of action, the discovery of new biomarkers, and possibly even microbiome-related drugs supporting the therapy itself. That is why we at Ardigen believe that applying ML to microbial data will increase the quality of research and lead to breakthrough discoveries.