Understanding microbial proteins is essential to revealing the clinical potential of the microbiome as a whole. The application of novel high-throughput sequencing technologies allows the fast and inexpensive acquisition of a large number of potential protein sequences from a microbial community. However, many of these sequences do not resemble those of previously characterized proteins, so predicting their function through conventional alignment-based approaches remains challenging. Recent research has shown that deep learning – a collection of prominent Artificial Intelligence methods – may be the missing piece in solving this problem.
This is the second part in a series that aims to summarize recent advances in the application of deep learning in proteomics. To learn why deep learning and proteomics are crucial in microbiome research, read the first part here.
Proteins are sequences of amino acids, and their processing is computationally similar to problems from the Natural Language Processing (NLP) domain, where sequences of words are analyzed. Whereas deep learning is already the undisputed leader among NLP methods, the full extent of its advantages in proteomics has not yet been proven. However, the NLP breakthroughs of 2018 [1-3] triggered an avalanche of exciting new studies in proteomics that continues to this day and may lead to a new generation of deep learning-based methods in this field.
To clearly outline recent research on the topic, relevant studies will be split into two categories: protein annotation and protein representation.
Deep protein annotation (or protein classification) is a natural extension of traditional methods that aim to assign a label to newly sequenced proteins. This label is usually connected to an entry in a database of choice and may belong to human-curated ontologies (e.g., GO terms) or classification schemes (e.g., EC numbers). The choice of database depends on the researchers' interests, as each database captures different characteristics of proteins.
The novelty of deep learning lies in the training procedure. Deep neural networks are trained on all proteins at once, using the complete data set. This leads to improved predictions, as the models learn abstract relations between sequences even when they are annotated differently. In contrast, HMM-based methods start by grouping the sequences according to their annotations and then train multiple HMMs, one for each class. Each resulting HMM only ever sees the sequences specific to a single class, so it cannot leverage the cross-class patterns available to a deep learning model.
In deep protein annotation, labels are required to train and guide a neural network: this type of learning is called “supervised”. Accordingly, studies in the last decade show that deep learning can successfully predict EC numbers [4-5], GO terms [6-11], PFAM families [12-14], or – and this deserves highlighting – many labels at once [15]. This list is not exhaustive and only serves as a guide for further exploration. In fact, it can easily be extended, for instance, with works that aimed to predict more physical characteristics of proteins, such as their secondary structure.
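To make the “many labels at once” idea concrete, here is a deliberately tiny sketch of supervised multi-label training: a single linear layer over mean-pooled one-hot encodings of a few made-up sequences. Every sequence, label, and hyperparameter is invented for the demo; real models use deep architectures and thousands of labels, but the principle – one model, all labels, the whole data set – is the same.

```python
import numpy as np

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
AA_INDEX = {aa: i for i, aa in enumerate(AMINO_ACIDS)}

def featurize(seq):
    """Mean-pooled one-hot encoding: a crude fixed-length input vector."""
    onehot = np.zeros((len(seq), 20))
    for pos, aa in enumerate(seq):
        onehot[pos, AA_INDEX[aa]] = 1.0
    return onehot.mean(axis=0)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# Toy data: four sequences, two hypothetical function labels each.
seqs = ["ACDKKK", "ACDRRR", "GGGWWW", "GGGYYY"]
labels = np.array([[1, 0], [1, 0], [0, 1], [0, 1]], dtype=float)

X = np.stack([featurize(s) for s in seqs])  # shape (4, 20)
W = np.zeros((20, 2))                       # one linear layer, all labels jointly

# Gradient descent on binary cross-entropy over every label at once --
# the whole data set shapes every prediction, unlike per-class HMMs.
for _ in range(500):
    probs = sigmoid(X @ W)
    W -= 0.5 * X.T @ (probs - labels) / len(seqs)

pred = sigmoid(featurize("ACDKKR") @ W)  # an unseen sequence
```

Because the unseen query shares residues with the first label group, its predicted probability for label 0 ends up high and for label 1 low.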
All the aforementioned studies demonstrated the potential of deep learning in function prediction, but none of them truly transformed the field. The main reason for this is their dependence on labeled data, which are in short supply (see here) and strongly biased by human decisions and presumptions. It is we, scientists, who select the proteins we consider worth labeling and, likewise, it is we who create the label ontologies. To take full advantage of deep learning, we need to change the way we think about the functional understanding of proteins.
Learning a protein representation departs from the typical approach to functional annotation. The underlying idea is to learn a mapping from protein sequences to points in an abstract space, where a single point corresponds to one protein. This space is usually high-dimensional, so each point can be expressed as a vector of coordinates (e.g., “[0.1, -0.4, 7.5, …, -3.5, 0.7]”) that contains all the information extracted from the protein. This mapping is called a protein embedding. The fundamental property of this space is that it tends to group proteins that are similar not in terms of sequence but in terms of higher-level concepts like function or structure.
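As a deliberately crude stand-in for a learned embedding, the sketch below maps sequences to fixed-length vectors via hashed k-mer counts. The sequences are made up and no learning happens here; the only point is the interface (variable-length sequence in, fixed-length normalized vector out) and the tendency of related sequences to land closer together in the resulting space.

```python
import zlib
import numpy as np

def embed(seq, k=3, dim=256):
    """Toy protein embedding: hashed k-mer counts, L2-normalized.
    A trained neural network replaces this hand-crafted map, but the
    interface is the same: any-length sequence in, fixed vector out."""
    v = np.zeros(dim)
    for i in range(len(seq) - k + 1):
        v[zlib.crc32(seq[i:i + k].encode()) % dim] += 1.0
    return v / np.linalg.norm(v)

e1 = embed("MKTAYIAKQR")        # hypothetical sequence
e2 = embed("MKTAYIAKQRQISFVK")  # shares a long prefix with e1
e3 = embed("GGSGGSGGSGG")       # unrelated low-complexity sequence
# Shared k-mers pull e1 and e2 together; e3 stays far from both.
```

A learned embedding differs in that proximity reflects function or structure rather than literal k-mer overlap, but the downstream machinery (distances between vectors) is identical.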
These models are trained in an unsupervised fashion, so database labels are unnecessary (though they can be used to aid unsupervised training). This means that these models are no longer limited by our current understanding of proteins. In particular, they may use currently unknown protein features to improve the representation. Significantly, as the model learns from raw sequences, it can be fed with massive databases like UniRef or UniParc [16]. The leading studies in this domain [17-20] advocate unsupervised learning in proteomics [21].
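A minimal sketch of how raw sequences can supervise themselves, loosely in the spirit of BERT-style masked-language-model training [1]: some residues are hidden and become prediction targets, with no database label involved. The sequence, the "?" mask token, and the 30% masking rate are arbitrary demo choices (published models typically mask around 15% of tokens).

```python
import random

def masked_lm_examples(seq, mask_rate=0.3, seed=0):
    """Build a self-supervised training pair from a raw sequence alone:
    hide some residues and record what the model must recover.
    No annotation database is involved -- the sequence supervises itself."""
    rng = random.Random(seed)
    masked, targets = list(seq), {}
    for i, aa in enumerate(seq):
        if rng.random() < mask_rate:
            masked[i] = "?"      # mask token
            targets[i] = aa      # prediction target at this position
    return "".join(masked), targets

masked, targets = masked_lm_examples("MKTAYIAKQRQISFVKSHFSRQ")
```

A model trained to fill in these blanks across millions of sequences must internalize which residues plausibly co-occur, which is exactly the knowledge the embedding later exposes.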
Representing a protein with an embedding may seem less precise than assigning it a specific label. Nevertheless, an embedding vector is exceptionally information-rich and strikingly universal.
First of all, using embeddings to represent proteins as vectors allows us to use standard machine-learning algorithms based on distance functions (Euclidean, cosine, etc.). One possible application is the annotation of unknown proteins by looking at their nearest neighbors. This idea is known in NLP as Semantic Search, and Schwartz et al. [15] and Senter et al. [22] have shown that it delivers improved results for proteins with low sequence similarity.
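The nearest-neighbor idea can be sketched in a few lines, assuming we already have embedding vectors from some pre-trained model; the labels and coordinates below are entirely made up for illustration.

```python
import numpy as np

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def annotate_by_nearest(query_vec, reference):
    """Transfer the label of the closest reference protein in embedding
    space. `reference` maps hypothetical labels to embedding vectors a
    pre-trained model would produce; here they are made-up points."""
    return max(reference, key=lambda label: cosine(query_vec, reference[label]))

reference = {
    "hydrolase":   np.array([0.9, 0.1, 0.0]),
    "transporter": np.array([0.0, 0.8, 0.6]),
}
query = np.array([0.7, 0.2, 0.1])  # embedding of an unannotated protein
print(annotate_by_nearest(query, reference))  # -> hydrolase
```

Crucially, the comparison happens between vectors, not sequences, which is why it can succeed where alignment finds nothing to align.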
However, direct protein comparison is certainly not the most groundbreaking advantage of this approach. Vector representation allows further processing with machine learning algorithms. This way, protein annotation was enhanced by Elnaggar et al. [23], who added a classification layer on top of a network that had been pre-trained in an unsupervised manner. Moreover, Senter et al. [22] and Melidis et al. [24] emphasize that even if proteins do not resemble those in a given database, their embeddings can be clustered to create groups of proteins with unknown but consistent functions.
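The clustering idea can be sketched with a minimal k-means over hypothetical embedding vectors: proteins that match nothing in any database can still be grouped by embedding proximity, yielding candidate families of unknown but presumably shared function. The 2-D vectors and two-cluster setup are toy choices.

```python
import numpy as np

def kmeans(points, k=2, iters=20, seed=0):
    """Minimal k-means over embedding vectors: groups proteins whose
    embeddings lie close together, with no labels involved."""
    rng = np.random.default_rng(seed)
    centers = points[rng.choice(len(points), k, replace=False)]
    for _ in range(iters):
        # Assign each point to its nearest center, then recenter.
        labels = np.argmin(
            np.linalg.norm(points[:, None] - centers[None], axis=2), axis=1)
        for j in range(k):
            if np.any(labels == j):
                centers[j] = points[labels == j].mean(axis=0)
    return labels

# Hypothetical embeddings: two tight groups of unannotated proteins.
emb = np.array([[0.1, 0.0], [0.2, 0.1], [0.9, 1.0], [1.0, 0.9]])
groups = kmeans(emb, k=2)
```

Real pipelines would use higher-dimensional embeddings and more robust clustering, but the principle is unchanged: structure emerges from the vectors themselves.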
Furthermore, Rao et al. [17] and Rives et al. [19] show that physical characteristics of a protein, like secondary structure, residue contacts, stability, or fluorescence, can be accurately inferred from the same embedding vector.
Optimizing the protein engineering process is an even more creative example of employing protein embeddings. Alley et al. [18] demonstrate that a deep learning model capable of turning a raw amino acid sequence into a function-describing vector makes it possible to maximize the desired function by manipulating the sequence.
The usage of deep learning to predict the tertiary structure of a protein is deliberately omitted here, as this closely related field deserves its own post.
These are the early days of adopting protein embeddings in function prediction, but these methods have already proven their tremendous potential in decoding protein sequences. Natural language processing is a subject of intensive research, and groundbreaking methods are developed every year. At the same time, bioinformatics is attracting growing interest in the context of precision medicine. All this suggests that our understanding of protein functions will soon improve, and that this will be achieved without directly comparing sequences to one another. The models will determine the properties of proteins just by analyzing their sequences.
[1] J. Devlin, M.-W. Chang, K. Lee, and K. Toutanova, “BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding,” arXiv [cs.CL], Oct. 11, 2018.
[2] M. E. Peters et al., “Deep contextualized word representations,” arXiv [cs.CL], Feb. 15, 2018.
[3] J. Howard and S. Ruder, “Universal Language Model Fine-tuning for Text Classification,” arXiv [cs.CL], Jan. 18, 2018.
[4] Y. Li et al., “DEEPre: sequence-based enzyme EC number prediction by deep learning,” Bioinformatics, vol. 34, no. 5, pp. 760–769, Mar. 2018, doi: 10.1093/bioinformatics/btx680.
[5] Z. Zou, S. Tian, X. Gao, and Y. Li, “mlDEEPre: Multi-functional enzyme function prediction with hierarchical multi-label deep learning,” Front. Genet., vol. 9, p. 714, 2018, doi: 10.3389/fgene.2018.00714.
[6] M. Kulmanov, M. A. Khan, R. Hoehndorf, and J. Wren, “DeepGO: predicting protein functions from sequence and interactions using a deep ontology-aware classifier,” Bioinformatics, vol. 34, no. 4, pp. 660–668, Feb. 2018, doi: 10.1093/bioinformatics/btx624.
[7] M. Kulmanov and R. Hoehndorf, “DeepGOPlus: improved protein function prediction from sequence,” Bioinformatics, vol. 36, no. 2, pp. 422–429, Jan. 2020, doi: 10.1093/bioinformatics/btz595.
[8] S. M. S. Islam and M. M. Hasan, “DEEPGONET: Multi-label Prediction of GO Annotation for Protein from Sequence Using Cascaded Convolutional and Recurrent Network,” arXiv [cs.CV], Oct. 31, 2018.
[9] D. Duong et al., “Annotating Gene Ontology terms for protein sequences with the Transformer model,” bioRxiv, p. 2020.01.31.929604, Feb. 02, 2020.
[10] M. Nauman, H. Ur Rehman, G. Politano, and A. Benso, “Beyond Homology Transfer: Deep Learning for Automated Annotation of Proteins,” Int. J. Grid Util. Comput., Jul. 2018, doi: 10.1007/s10723-018-9450-6.
[11] A. Sureyya Rifaioglu, T. Doğan, M. Jesus Martin, R. Cetin-Atalay, and V. Atalay, “DEEPred: Automated Protein Function Prediction with Multi-task Feed-forward Deep Neural Networks,” Sci. Rep., vol. 9, no. 1, p. 7344, May 2019, doi: 10.1038/s41598-019-43708-3.
[12] X. Liu, “Deep Recurrent Neural Network for Protein Function Prediction from Sequence,” arXiv [q-bio.QM], Jan. 28, 2017.
[13] S. Seo, M. Oh, Y. Park, and S. Kim, “DeepFam: deep learning based alignment-free method for protein family modeling and prediction,” Bioinformatics, vol. 34, no. 13, pp. i254–i262, Jul. 2018, doi: 10.1093/bioinformatics/bty275.
[14] M. L. Bileschi et al., “Using Deep Learning to Annotate the Protein Universe,” bioRxiv, p. 626507, May 06, 2019.
[15] A. S. Schwartz et al., “Deep Semantic Protein Representation for Annotation, Discovery, and Engineering,” bioRxiv, p. 365965, Jul. 10, 2018.
[16] A. Bateman et al., “UniProt: the universal protein knowledgebase,” Nucleic Acids Res., vol. 45, no. D1, pp. D158–D169, Jan. 2017, doi: 10.1093/nar/gkw1099.
[17] R. Rao et al., “Evaluating Protein Transfer Learning with TAPE,” in Advances in Neural Information Processing Systems 32, H. Wallach, H. Larochelle, A. Beygelzimer, F. d'Alché-Buc, E. Fox, and R. Garnett, Eds. Curran Associates, Inc., 2019, pp. 9689–9701.
[18] E. C. Alley, G. Khimulya, S. Biswas, M. AlQuraishi, and G. M. Church, “Unified rational protein engineering with sequence-based deep representation learning,” Nat. Methods, vol. 16, no. 12, pp. 1315–1322, Dec. 2019, doi: 10.1038/s41592-019-0598-1.
[19] A. Rives et al., “Biological Structure and Function Emerge from Scaling Unsupervised Learning to 250 Million Protein Sequences,” bioRxiv, p. 622803, Apr. 29, 2019.
[20] M. Heinzinger et al., “Modeling the Language of Life – Deep Learning Protein Sequences,” bioRxiv, p. 614313, Apr. 19, 2019.
[21] M. AlQuraishi, “The Future of Protein Science will not be Supervised,” Some Thoughts on a Mysterious Universe, Apr. 01, 2019. https://moalquraishi.wordpress.com/2019/04/01/the-future-of-protein-science-will-not-be-supervised/ (accessed May 01, 2020).
[22] J. K. Senter, T. M. Royalty, A. D. Steen, and A. Sadovnik, “Unaligned Sequence Similarity Search Using Deep Learning,” arXiv [cs.LG], Sep. 16, 2019.
[23] A. Elnaggar, M. Heinzinger, C. Dallago, and B. Rost, “End-to-end multitask learning, from protein language to protein features without alignments,” bioRxiv, p. 864405, Jan. 24, 2020.
[24] D. P. Melidis, B. Malone, and W. Nejdl, “dom2vec: Assessable domain embeddings and their use for protein prediction tasks,” bioRxiv, p. 2020.03.17.995498, Mar. 18, 2020.
[25] E. Asgari and M. R. K. Mofrad, “Continuous Distributed Representation of Biological Sequences for Deep Proteomics and Genomics,” PLoS One, vol. 10, no. 11, p. e0141287, Nov. 2015, doi: 10.1371/journal.pone.0141287.
[26] E. Asgari, A. McHardy, and M. R. K. Mofrad, “Probabilistic variable-length segmentation of protein sequences for discriminative motif discovery (DiMotif) and sequence embedding (ProtVecX),” bioRxiv, p. 345843, Nov. 15, 2018.
[27] D. Kimothi, A. Soni, P. Biyani, and J. M. Hogan, “Distributed Representations for Biological Sequence Analysis,” arXiv [cs.LG], Aug. 21, 2016.
[28] D. Kimothi, P. Biyani, J. M. Hogan, A. Soni, and W. Kelly, “Learning supervised embeddings for large scale sequence comparisons,” bioRxiv, p. 620153, Apr. 26, 2019.
[29] D. Kimothi, P. Biyani, and J. M. Hogan, “Sequence representations and their utility for predicting protein-protein interactions,” bioRxiv, p. 2019.12.31.890699, Jan. 10, 2020.
[30] S. Min, S. Park, S. Kim, H.-S. Choi, and S. Yoon, “Pre-Training of Deep Bidirectional Protein Sequence Representations with Structural Information,” arXiv [q-bio.BM], Nov. 25, 2019.