Abstract
Machine Learning in Bioinformatics of Protein Sequences, pp. 81-127 (2023)
Chapter 4: NLP-based Encoding Techniques for Prediction of Post-translational Modification Sites and Protein Functions
Suresh Pokharel, Evgenii Sidorov, Doina Caragea, and Dukka B KC
Suresh Pokharel, Department of Computer Science, Michigan Technological University, Houghton, MI 49931, USA
Evgenii Sidorov, Department of Computer Science, Michigan Technological University, Houghton, MI 49931, USA
Doina Caragea, Department of Computer Science, Kansas State University, Manhattan, KS 66502, USA
Dukka B KC, Department of Computer Science, Michigan Technological University, Houghton, MI 49931, USA. Corresponding author.
https://doi.org/10.1142/9789811258589_0004

Abstract: With advances in sequencing and proteomics, computational functional annotation of proteins is becoming increasingly crucial. Two especially important annotation problems are the prediction of post-translational modification (PTM) sites and the prediction of function from a protein sequence. Several recent breakthroughs in Natural Language Processing (NLP) have led to a growing use of NLP-based techniques in protein bioinformatics. In this chapter, we review various NLP-based encoding techniques for representing protein sequences. In particular, we classify these approaches into local/sparse encodings, distributed representation encodings, context-independent word embeddings, contextual word embeddings, and encodings from recent pre-trained language models.
We summarize some recent approaches that use these NLP-based encodings to predict various types of protein PTM sites and protein functions based on Gene Ontology (GO). Finally, we provide an outlook on possible future research directions for NLP-based approaches to PTM site and protein function prediction.
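As a rough illustration of the simplest category named in the abstract, local/sparse encodings, the sketch below one-hot encodes a fixed-length window of residues around a candidate PTM site. It is not taken from the chapter: the function names `extract_window` and `one_hot_encode`, the `-` padding symbol, the window size, and the choice to encode non-standard characters as all-zero vectors are all assumptions made for this example.

```python
from typing import List

# Hypothetical illustration of a local/sparse (one-hot) encoding.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues
AA_INDEX = {aa: i for i, aa in enumerate(AMINO_ACIDS)}


def extract_window(sequence: str, site: int, flank: int, pad: str = "-") -> str:
    """Cut a window of 2*flank+1 characters centred on 0-based position `site`.

    Positions that fall outside the sequence are filled with `pad`
    (an assumed convention for this sketch).
    """
    left = max(0, site - flank)
    right = min(len(sequence), site + flank + 1)
    prefix = pad * (flank - (site - left))
    suffix = pad * (flank - (right - site - 1))
    return prefix + sequence[left:right] + suffix


def one_hot_encode(window: str) -> List[List[int]]:
    """Return one 20-dimensional 0/1 vector per character in `window`.

    Characters outside the 20 standard residues (including the padding
    symbol) become all-zero vectors; this is an assumption of the sketch.
    """
    encoded = []
    for residue in window.upper():
        vector = [0] * len(AMINO_ACIDS)
        index = AA_INDEX.get(residue)
        if index is not None:
            vector[index] = 1
        encoded.append(vector)
    return encoded


if __name__ == "__main__":
    # Toy sequence with a candidate site at position 2 (the serine).
    window = extract_window("MKSTA", site=2, flank=3)
    matrix = one_hot_encode(window)
    print(window, len(matrix), len(matrix[0]))
```

Each residue maps to a vector with a single non-zero entry, so the encoding carries no notion of similarity between residues or of sequence context. That limitation is what the distributed and contextual embeddings listed in the abstract aim to address.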