Abstract
• Complementary enrichment strategies combined with membrane filtration and C8 SPE.
• A combined database of comprehensive putative SEPs and canonical proteins was used.
• Seven hundred sixty-two novel SEPs identified from human cell lines, mouse cell lines, and mouse tissues.
• Nineteen SEPs have been validated by fusion expression or synthetic peptides.

Many small ORFs embedded in long noncoding RNA (lncRNA) transcripts have been shown to encode biologically functional polypeptides (small ORF-encoded polypeptides [SEPs]) in different organisms. Although some novel SEPs have been found, their identification is still hampered by their poor predictability, diminutive size, and low relative abundance. Here, we take advantage of NONCODE, a repository containing the most complete collection and annotation of lncRNA transcripts from different species, to build a novel database that attempts to maximize the collection of SEPs from human and mouse lncRNA transcripts. To further improve SEP discovery, we implemented two effective and complementary polypeptide enrichment strategies using a 30-kDa molecular weight cutoff filter and a C8 solid-phase extraction column. These combined strategies enabled us to discover 353 SEPs from eight human cell lines and 409 SEPs from three mouse cell lines and eight mouse tissues. Importantly, 19 of them were then verified through in vitro expression, immunoblotting, parallel reaction monitoring, and synthetic peptides. Subsequent bioinformatics analysis revealed that some of the physical and chemical properties of these novel SEPs, including amino acid composition and codon usage, differ from those commonly found in canonical proteins. Intriguingly, nearly 65% of the identified SEPs were found to be initiated with non-AUG start codons. The 762 novel SEPs probably represent the largest number of SEPs detected by MS reported to date. These novel SEPs might not only provide new clues for the annotation of noncoding elements in the genome but also serve as a valuable resource for functional study.

Long noncoding RNAs (lncRNAs), a family of noncoding RNAs that are greater than 200 nucleotides in length and lack long or conserved ORFs, were formerly regarded as “junk RNAs.” Recently, however, a growing body of evidence has demonstrated that many short or small ORFs (smORFs) embedded in lncRNA transcripts are able to encode functional polypeptides (smORF-encoded polypeptides [SEPs]). These SEPs contain fewer than 100 amino acids in eukaryotes (50 amino acids in prokaryotes) and play vital regulatory roles in diverse physiological processes, including cancer growth (1: Huang J.Z. et al. A peptide encoded by a putative lncRNA HOXB-AS3 suppresses colon cancer growth. Mol. Cell 2017, 68: 171-171.e6), mucosal immunity (2: Jackson R. et al. The translation of non-canonical open reading frames controls mucosal immunity. Nature 2018, 564: 434-438), and fatty acid β-oxidation (3: Makarewich C.A. et al. MOXI is a mitochondrial micropeptide that enhances fatty acid β-oxidation. Cell Rep. 2018, 23: 3701-3709). These findings have overturned our understanding of lncRNAs and expanded our knowledge of the coding potential of the genome. Moreover, the development of genomics and bioinformatics, in particular the advent of high-throughput sequencing technology, has accelerated the discovery of thousands of additional lncRNA transcripts with smORFs.
Considering such large numbers of lncRNAs and smORFs, it is expected that SEPs may represent a large albeit neglected portion of nonannotated peptides involved in diverse physiological processes. Therefore, large-scale discovery and functional characterization of unknown SEPs might provide new clues for the annotation and functional analysis of noncoding elements in the genome and their effects on biological evolution.

A variety of methodologies, such as smORF prediction by computational sequence analysis, deep sequencing–based ribosome profiling, and MS-based proteomics, have been developed for the identification and characterization of SEPs across different biological samples. However, each of these strategies has caveats. First, while bioinformatics analysis of lncRNA transcript sequences is a typical first step to predict the existence of smORFs, achieving high prediction sensitivity and specificity remains a significant challenge (4: Pauli A. et al. Identifying (non-)coding RNAs and small peptides: challenges and opportunities. Bioessays 2015, 37: 103-112; 5: Cohen S.M. Everything old is new again: (linc)RNAs make proteins! EMBO J. 2014, 33: 937-938). Furthermore, despite the power of deep sequencing–based ribosome profiling for identifying regions of active translation in lncRNA transcripts, it provides only indirect evidence of translation (6: Aspden J.L. et al. Extensive translation of small open reading frames revealed by Poly-Ribo-seq. Elife 2014, 3: e03528; 7: Guttman M. et al. Ribosome profiling provides evidence that large noncoding RNAs do not encode proteins. Cell 2013, 154: 240-251; 8: Chew G.L. et al. Ribosome profiling reveals resemblance between long non-coding RNAs and 5′ leaders of coding RNAs. Development 2013, 140: 2828-2834). Finally, while MS-based proteomics directly identifies SEPs by detecting the peptides generated from smORFs embedded in lncRNA transcripts (9: Slavoff S.A. et al. Peptidomic discovery of short open reading frame-encoded peptides in human cells. Nat. Chem. Biol. 2013, 9: 59-64; 10: Ma J. et al. Discovery of human sORF-encoded polypeptides (SEPs) in cell lines and tissue. J. Proteome Res. 2014, 13: 1757-1765; 11: Budamgunta H. et al. Comprehensive peptide analysis of mouse brain striatum identifies novel sORF-encoded polypeptides. Proteomics 2018, 18: e1700218), the number of SEPs identified by MS from different biological samples is still small (12: Pueyo J.I. et al. New peptides under the s(ORF)ace of the genome. Trends Biochem. Sci. 2016, 41: 665-678).

The relatively low number of SEPs detected by MS is largely attributed to the fact that this type of detection remains analytically challenging. First, because of the low concentration and small size of SEPs, accurate, consistent, and comprehensive measurement is difficult and is strongly affected by sample preparation.
Even though multiple methods are available to enrich SEPs from different biological samples by fractionating or removing highly abundant and large proteins to reduce sample complexity, individual methods often overlook the varied physical and chemical properties of different SEPs, which may bias their discovery. Second, SEPs are identified by MS by matching acquired spectra against the theoretical spectra of all candidate peptides present in a reference protein sequence database. Crucially, this implies that the strategy behind the generation of a reference database can dramatically affect the identification of novel SEPs. For example, the most straightforward approach is six-frame translation of the entire genome. Unfortunately, such a dataset is difficult to use because of its extremely large size and the abundance of unknown protein sequences. While it is possible, alternatively, to create a smaller database from RNA transcripts identified by RNA-Seq or Ribo-Seq, this strategy captures only actively translated RNA transcripts and depends heavily on sequencing depth.

In the present study, we address these challenges by developing an effective SEP enrichment workflow that integrates two complementary enrichment methods based on a 30-kDa MWCO filter and a C8 solid-phase extraction (SPE) column. Alongside this workflow, we built a robust SEP database containing all putative smORFs from lncRNA transcripts deposited in the NONCODE database, a strategy that significantly improves the discovery of SEPs from different cell lines and tissues. We subsequently employed multiple technologies to experimentally validate the existence of these SEPs.

Human HeLa, human embryonic kidney 293T (HEK293T), 22Rv1, Du145, LNCaP, PC3, and A375 cells were cultured in Dulbecco's modified Eagle's medium (DMEM) (Gibco) supplemented with 10% (v/v) fetal bovine serum (Gibco) and 1% (v/v) penicillin/streptomycin (Gibco).
Human U251 cells were cultured in DMEM/Nutrient Mixture F-12 (Gibco) supplemented with 10% (v/v) fetal bovine serum and 1% (v/v) penicillin/streptomycin. Mouse 4T1 cells were cultured in RPMI1640 (Gibco) supplemented with 10% (v/v) fetal bovine serum and 1% (v/v) penicillin/streptomycin. Mouse embryonic fibroblast (MEF) and mouse embryonic stem cell (mESC) D3 cells were obtained from the stem cell core facility at the Shanghai Institute of Biochemistry and Cell Biology, Chinese Academy of Sciences (Shanghai, China). mESCs were grown on mitomycin C-treated MEFs and cultured in DMEM supplemented with 15% (v/v) fetal bovine serum and 1% (v/v) penicillin/streptomycin, plus 2 mM L-glutamine, 0.1 mM 2-mercaptoethanol, 0.1 mM nonessential amino acids, and 10³ units/ml mouse leukemia inhibitory factor.

Twelve-week-old male and female mice (C57BL/6J) were obtained from the Animal Core Facility at the Institute of Biophysics, Chinese Academy of Sciences. All animal protocols were approved by the Animal Care and Use Committee of the Institute of Biophysics, Chinese Academy of Sciences.

For total cell protein extraction, ~1 × 10⁶ cells were resuspended in 100 μl extraction buffer (8 M urea/100 mM NH4HCO3) containing Protease Inhibitor Cocktail Tablets (Roche) and sonicated for 24 bursts with a 50% duty cycle (Scientz-IID). The supernatant was carefully collected after centrifugation at 20,000g at 4 °C for 20 min. For whole tissue protein extraction, ~20 mg of tissue was cut into small pieces and homogenized in 500 μl extraction buffer (8 M urea/100 mM NH4HCO3) containing Protease Inhibitor Cocktail Tablets. The lysate was sonicated for 24 bursts with a 50% duty cycle, and the remaining debris was removed by centrifugation at 20,000g for 20 min at 4 °C.

SEP enrichment from cell samples was performed with 30-kDa MWCO filters by resuspending ~1 × 10⁷ cells in 500 μl ice-cold water containing Protease Inhibitor Cocktail Tablets.
After three bursts of sonication with a 50% duty cycle, the mixture was heated at 95 °C for 5 min and then cooled on ice for a few minutes. Subsequently, ice-cold 0.1 N HCl was added to the sample to a final concentration of 10 mM, and the sample was incubated on ice for 10 min. After centrifugation at 20,000g for 20 min at 4 °C in a bench-top centrifuge, the supernatant was filtered through a 30-kDa MWCO filter (Millipore), and the flow-through was collected and evaporated to dryness by vacuum centrifugation at 4 °C. The dried residue was then dissolved in 50 μl 8 M urea/100 mM NH4HCO3.

For SEP enrichment from tissue with 30-kDa MWCO filters, ~200 mg of frozen tissue was cut into small pieces and homogenized in 500 μl ice-cold water containing Protease Inhibitor Cocktail Tablets. The subsequent steps were the same as described above for SEP enrichment from cells.

For SEP extraction and enrichment from cell samples by C8 SPE, we used an acidic lysis buffer containing detergent. The C8 SPE method was slightly modified from previously described protocols (13: Vale W. et al. Assay of growth hormone-releasing factor. Methods Enzymol. 1986, 124: 389-401; 14: Ma J. et al. Improved identification and analysis of small open reading frame encoded polypeptides. Anal. Chem. 2016, 88: 3967-3975). Specifically, ~1 × 10⁷ cells were lysed in 1 ml lysis buffer (50 mM HCl, 0.1% β-mercaptoethanol, and 0.05% Triton X-100) containing Protease Inhibitor Cocktail Tablets for 30 min at room temperature. After centrifugation at 20,000g for 20 min at 4 °C, the supernatant was collected.
Subsequently, Bond Elut C8 silica cartridges (Agilent Technologies) were conditioned with one column volume of methanol and two column volumes of triethylammonium formate buffer (pH 3.0) before the lysate was applied. Enriched SEPs were eluted sequentially with 400 μl each of 25%, 50%, and 75% acetonitrile (ACN) in triethylammonium formate buffer. The eluted fractions were then combined and concentrated to less than 100 μl at 4 °C in a vacuum concentrator. Finally, enriched SEPs were precipitated with chloroform/methanol to remove residual detergent, and the precipitate was dissolved in 50 μl 8 M urea/100 mM NH4HCO3. For SEP enrichment from tissue samples using the C8 SPE column, ~200 mg of frozen tissue was first cut into small pieces and homogenized in 1.5 ml lysis buffer (50 mM HCl, 0.1% β-mercaptoethanol, and 0.05% Triton X-100) containing Protease Inhibitor Cocktail Tablets. The subsequent steps were the same as described above for SEP extraction from cell samples.

An aliquot of 20 μg protein was dissolved in loading buffer (50 mmol/l Tris-HCl, pH 6.8, 2% SDS, 10% glycerol, 0.1% bromophenol blue, and 1% β-mercaptoethanol). After denaturation for 5 min at 95 °C, the protein samples were loaded onto homemade 10% tricine SDS-PAGE gels (15: Schagger H. Tricine-SDS-PAGE. Nat. Protoc. 2006, 1: 16-22) and run at 120 V for 80 min. The gel was stained with One-Step Blue Protein Gel Stain (BIOTIUM) and then washed with distilled water.

Proteins/SEPs were reduced with 10 mM dithiothreitol (37 °C, 1 h) and alkylated with 20 mM iodoacetamide (room temperature, in the dark, 45 min), after which they were digested overnight with trypsin (Promega) at a ratio of 1:50 (enzyme/protein, w/w) at 37 °C in less than 2 M urea/100 mM NH4HCO3. Formic acid (FA) was added to the digested samples to a final concentration of 0.1% to stop the reaction.
The tryptic peptide sample was then desalted using Pierce C18 Tips (Thermo Fisher Scientific) with 0.1% FA. The peptides were eluted with 50 μl of 20% ACN/0.1% FA, 50 μl of 40% ACN/0.1% FA, and 50 μl of 60% ACN/0.1% FA. The eluted peptide solutions were combined and evaporated to dryness in a vacuum concentrator.

Digested peptides were analyzed by LC–tandem MS (LC–MS/MS) on an Easy-nLC1000 (Thermo Fisher Scientific) coupled to a Q Exactive Mass Spectrometer (Thermo Fisher Scientific). A 100 μm × 2 cm trap column packed with Reprosil-Pur C18 5 μm particles (Dr Maisch GmbH) and a 75 μm × 25 cm analytical column packed with Reprosil-Pur C18 3 μm particles (Dr Maisch GmbH) were used to separate the peptides with mobile phase A (0.1% FA in water) and mobile phase B (0.1% FA in ACN) over a 78-min gradient: 5 to 8% B in 8 min, 8 to 22% B in 50 min, 22 to 32% B in 12 min, 32 to 95% B in 1 min, and then held at 95% B for 7 min. The flow rate was set to 300 nl/min. The Q Exactive Mass Spectrometer was operated in data-dependent acquisition mode with a spray voltage of 2 kV and a heated capillary temperature of 320 °C. MS1 data were collected at a resolution of 70,000 (m/z 200) with a mass range of 300 to 1600 m/z, a target value of 3e6, and a maximum injection time of 60 ms. For each full MS scan, the 20 most abundant precursor ions were selected for MS2 with an isolation window of 2 m/z and fragmented by higher-energy collisional dissociation with a normalized collision energy of 27. MS2 spectra were collected at a resolution of 17,500 (m/z 200). The target value was 5e4 with a maximum fill time of 80 ms and a dynamic exclusion time of 40 s.

We downloaded lncRNA transcripts for human (NONCODE V4) and mouse (NONCODE 2016) from the NONCODE database (http://www.noncode.org/). ORFfinder and six-frame translation were employed to detect all possible smORFs, which were then considered putative SEPs.
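As a rough illustration of the six-frame scan described above, the following Python sketch translates a transcript in all six reading frames and keeps every stop-free stretch within a length window. It is a simplification and not the ORFfinder algorithm: the function names, the stop-to-stop definition of a candidate (no start codon is required, and a stretch running to the end of the transcript is kept), and the toy sequence are assumptions made for illustration. The default window follows the 5 to 100 amino acid range used for the SEP databases in this study.

```python
# Standard genetic code, built from the conventional TCAG codon ordering.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = dict(
    zip((a + b + c for a in BASES for b in BASES for c in BASES), AMINO_ACIDS)
)


def reverse_complement(seq):
    """Return the reverse complement of a DNA sequence (A/C/G/T only)."""
    return seq.translate(str.maketrans("ACGT", "TGCA"))[::-1]


def translate(seq):
    """Translate codon by codon; unknown codons become 'X', stops become '*'."""
    return "".join(
        CODON_TABLE.get(seq[i:i + 3], "X") for i in range(0, len(seq) - 2, 3)
    )


def putative_seps(transcript, min_len=5, max_len=100):
    """Collect stop-free peptide stretches from all six reading frames.

    Hypothetical helper: candidates are delimited by stop codons only,
    so non-AUG-initiated stretches are included as well.
    """
    seq = transcript.upper().replace("U", "T")
    peptides = set()
    for strand in (seq, reverse_complement(seq)):
        for frame in range(3):
            for segment in translate(strand[frame:]).split("*"):
                # Skip stretches with ambiguous residues or outside the window.
                if "X" not in segment and min_len <= len(segment) <= max_len:
                    peptides.add(segment)
    return peptides


if __name__ == "__main__":
    # Toy transcript (an assumption, not a NONCODE entry).
    for peptide in sorted(putative_seps("ATGGCTAAACCCGGGTAA")):
        print(peptide)
```

Because candidates are delimited only by stop codons, a scan of this kind produces very large candidate sets, which is consistent with the multi-million-entry databases reported for human and mouse.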
We built SEP databases for human and mouse by collecting all putative SEPs with a length of 5 to 100 amino acids.

The LC–MS/MS raw data were analyzed with Thermo Scientific Proteome Discoverer (version 1.4) using the SEQUEST HT search engine. Four different protein databases were used in this study: (1) the Homo sapiens canonical protein database, downloaded from the UniProt website on February 2, 2018 and consisting of 93,637 entries; (2) the Mus musculus canonical protein database, downloaded from the UniProt website on February 2, 2018 and consisting of 61,314 entries; (3) the in-house putative human SEP database, containing 3,969,981 entries; and (4) the in-house putative mouse SEP database, containing 8,710,195 entries. For identification of candidate novel peptides from the digests, data were searched against the merged database of the corresponding species, which comprised the canonical protein database and the in-house putative SEP database. The search space included all fully tryptic and semitryptic peptides. Other common search parameters were set as follows: peptides with a maximum of two missed cleavages were considered; the mass tolerances of precursor and product ions were set to 10 ppm and 0.02 Da, respectively; carbamidomethylation of cysteine was set as a static modification; and oxidation of methionine was set as a dynamic modification. For protein identification, we set a significance threshold of p < 0.05 (with 95% confidence) and a false discovery rate <1%, which was estimated using a target-decoy search strategy. For data derived from nondigested samples, no enzyme was specified. For identification of canonical proteins, data were searched against the UniProt protein database of the corresponding species, with the other search parameters set as described above.

All parallel reaction monitoring (PRM) experiments were performed on the same LC–MS/MS system as described above.
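The target-decoy estimate used for the database searches above can be illustrated with a short Python sketch. It is not the procedure implemented in Proteome Discoverer: the function name, the `(score, is_decoy)` input format, and the simple decoys/targets ratio are assumptions chosen for illustration, and tied scores are not treated specially.

```python
def score_threshold_at_fdr(psms, max_fdr=0.01):
    """Return the lowest target score whose cumulative FDR stays below max_fdr.

    psms: iterable of (score, is_decoy) pairs, where a higher score means a
    better peptide-spectrum match. FDR is estimated as decoys / targets among
    all matches scoring at least as high as the current one.
    Returns None if no target match satisfies the cutoff.
    """
    ranked = sorted(psms, key=lambda psm: psm[0], reverse=True)
    targets = 0
    decoys = 0
    threshold = None
    for score, is_decoy in ranked:
        if is_decoy:
            decoys += 1
            continue
        targets += 1
        if decoys / targets < max_fdr:
            threshold = score
    return threshold


if __name__ == "__main__":
    # Toy data (an assumption): ten target hits, then one decoy, then one target.
    toy = [(s, False) for s in range(20, 10, -1)] + [(5, True), (4, False)]
    print(score_threshold_at_fdr(toy))
```

In this toy list, the single decoy pushes the estimated FDR of every lower-scoring match above 1%, so only the ten highest-scoring target matches would be accepted.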
In this study, 21 SEPs were randomly selected from the 196 SEPs identified in HEK293T cells for PRM analyses. The theoretically predicted and identified tryptic peptides of the selected endogenous SEPs were analyzed by PRM with a semitargeted data acquisition approach in order to verify the identified SEPs. Briefly, using high-resolution data-dependent scanning, an extensive MS1 fragmentation inclusion list of the theoretically predicted and identified tryptic peptides of the selected endogenous SEPs was generated to confirm the identified peptides and to discover novel peptides in the same SEPs. Peptides that were identical to annotated proteins or nonunique in the putative SEP database were excluded from the list. In total, the inclusion list contained 53 peptide targets (corresponding to 21 SEPs).

To further validate the identification of novel SEPs, 15 standard peptides from 14 SEPs were synthesized by GenScript Biotech Corporation and analyzed on the same LC–MS/MS instruments as described above.

To generate fusion protein constructs of the SEP ORF and enhanced GFP (EGFP), we amplified SEP ORF sequences without the endogenous 5′ UTR using RT-PCR and cloned them into a pEGFPmut-N1 vector in which the GFP start codon (ATGGTG) was mutated to ATTGTT (pEGFPmut). A list of the primers used in this study is available in supplemental Table S1. HEK293T cells were transfected with the plasmid constructs using Lipofectamine 2000 (Invitrogen; 11668-019) according to the manufacturer's instructions.

Total RNA was extracted from cells using the Trizol Total RNA Isolation Reagent (Invitrogen). RNA levels of GFP, GFPmut, and SEP ORF-EGFPmut were detected by RT-PCR. A list containing all the primers used in this study is available in supplemental Table S2.

Western blotting was performed according to standard protocols.
The primary antibodies used in this study were anti-GFP (ABclonal Technology; AE012) and anti-β-tubulin (Yeasen Tech; 30303ES50); anti-NONHSAT130014+unORF+2+peptide9 and anti-NONHSAT077882+1+orf4 were custom-raised by GenScript Biotech Corporation.

HEK293T cells were transfected with SEP ORF-EGFPmut, EGFPwt, and EGFPmut vectors for 24 h, and GFP fluorescence was directly visualized and recorded. HEK293T cells were plated on glass coverslips and then fixed with 4% paraformaldehyde, permeabilized with 0.5% Triton X-100, incubated with anti-NONHSAT077882+1+orf4 antibodies, and subsequently incubated with Goat Anti-Rabbit IgG H&L (Alexa Fluor 488). Cell nuclei were stained with 4′,6-diamidino-2-phenylindole.

To test the two SEP enrichment methods, we performed and analyzed three technical replicates per method using the same cell or tissue samples. Data were analyzed by a two-tailed unpaired Student's t test (unless otherwise indicated), and p < 0.05 was chosen as the threshold of statistical significance. We use the notation ∗, ∗∗, and ∗∗∗ for p < 0.05, p < 0.01, and p < 0.001, respectively. Unless otherwise indicated, all data in the figures are presented as arithmetic means ± standard deviations from at least three independent experiments.

As discussed above, the inherently low abundance and small size of SEPs contribute to their poor detectability, so it is critical to carefully consider sample preparation and to build putative SEP reference databases for MS-based analysis in order to improve SEP discovery from different biological samples. Database searching is the key step in MS-based SEP identification.
To build a SEP database that could maximally cover the putative SEPs in human and mouse, we scanned lncRNA transcripts deposited in the NONCODE database (http://www.noncode.org/), an interactive repository that currently represents the most complete collection and annotation of noncoding RNAs, especially lncRNAs. Specifically, lncRNA transcripts were scanned by ORFfinder in six-frame translation mode to obtain all possible smORFs, which were then assumed to represent putative SEPs (Fig. 1A). This resulted in 3,969,981 and 8,710,195 polypeptides in the newly constructed human and mouse putative SEP databases, respectively. To verify the quality of these two databases, we chose recently reported functional SEPs, including myoregulin (16: Anderson D.M. et al. A micropeptide encoded by a putative long noncoding RNA regulates muscle performance. Cell 2015, 160: 595-606), myomixer (17: Bi P. et al. Control of muscle formation by the fusogenic micropeptide myomixer. Science 2017, 356: 323-327), minion (18: Zhang Q. et al. The microprotein Minion controls cell fusion and muscle formation. Nat. Commun. 2017, 8: 15664), SPAR (19: Matsumoto A. et al. mTORC1 and muscle regeneration are regulated by the LINC00961-encoded SPAR polypeptide. Nature 2017, 541: 228-232), HOXB-AS3 (1), NoBody (20: D'Lima N.G. et al. A human microprotein that interacts with the mRNA decapping complex. Nat. Chem. Biol. 2017, 13: 174-180), and LINC-PINT (21: Zhang M. et al. A peptide encoded by circular form of LINC-PINT suppresses oncogenic transcriptional elongation in glioblastoma. Nat. Commun. 2018, 9: 4475), and searched them with BLAST against our newly assembled database. All of these SEPs were found within our putative SEP database, which attests to its quality, accuracy, and comprehensiveness.

The isolation and enrichment of SEPs from biological samples is another critical step for their characterization. Accordingly, various methodologies have been applied for this purpose, including the 30-kDa MWCO filter, C8 SPE, and organic solvent–based or inorganic salt–based precipitation. Among these, the 30-kDa MWCO filter and C8 SPE are commonly used, although they are based on different principles. With the 30-kDa MWCO filter, SEPs are separated and enriched according to their molecular size and/or weight. By contrast, C8 SPE relies on selective adsorption and selective elution to enrich, separate, and purify SEPs. Therefore, we hypothesized that these may represent two complementary strategies for SEP enrichment and that their combined use could significantly improve SEP discovery.
We tested our hypothesis by enriching SEPs from equal amounts of HEK293T cell lysates using both the 30-kDa MWCO filter and C8 SPE and then performing SDS-PAGE and LC–MS/MS analysis based on our in-house database (Fig. 1, B and C). Tricine SDS-PAGE showed that both the 30-kDa MWCO filter and C8 SPE are very effective in enriching proteins/peptides in the molecular weight range between 5 and 15 kDa (Fig. 2A). LC–MS/MS data further confirmed these results by showing that 12.6% and 13.2% of the total identified annotated proteins in the 30-kDa MWCO filter and C8 SPE approaches, respectively, were low molecular weight proteins/peptides (≤100 aa), in comparison to only 7.1% in total lysates without enrichment (Fig. 2C). Importantly, an average of 30 and 29 candidate SEPs were identified with the 30-kDa MWCO filter and C8 SPE, respectively, both significantly higher than the 21 candidates identified using total lysates without enrichment (Fig. 2D). Interestingly, and as expected given the complementary nature of the two approaches, there was little overlap between the SEP candidates identified with the 30-kDa MWCO filter and those identified with C8 SPE (Fig. 2F), despite the comparable enrichment efficiency. Similar results showing low overlap between the two methods were obtained in mouse kidney lysate, HeLa, and MEF cell lysate (supplemental Fig. S1, A and C–E). This is likely the result of differences in enrichment efficiency according to SEP hydrophobicity between the two methods, since protein hydrophobicity analysis s