Abstract
Bottom-up proteomics is a mass spectrometry (MS)-based method for analyzing the contents of complex protein samples. Pioneered in the 1990s, it consists of converting protein samples into peptide samples by enzymatic digestion, separating the peptides by (typically) reversed-phase liquid chromatography (LC), and analyzing the eluting peptides by tandem mass spectrometry. This general approach, while tremendously successful and widely used, has from the beginning faced the fundamental challenge that the number of peptides generated by digesting a complex protein sample, such as a cell extract or body fluid, is significantly larger than the number of peptides expected from applying the tryptic digestion rule [1]. In fact, the number of peptides expected from a proteome is presently unknown. Ironically, while genes and transcripts can be comprehensively sequenced and characterized, the exact number of protein types and their cellular copy numbers in any biomedical sample remain unknown. The effort to address this fundamental issue has spawned a large number of strategies for mass spectrometric data acquisition and analysis. Two major bottom-up proteomics approaches have been developed. Data-dependent acquisition (DDA) prioritizes peptide precursors based on their signal intensity in a precursor ion scan and then sequentially selects a number of precursors for fragmentation, generating MS2 spectra. This is a well-established MS method, which gains sample throughput when coupled with stable isotope labeling of the peptides using, for example, TMTpro. However, since the number of peptide precursors is substantially larger than the number of fragment ion spectra a mass spectrometer can acquire, only a limited number of peptide precursors can be analyzed, leaving a varying and unknown portion of the proteome uncharacterized in each DDA data acquisition.
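
The intensity-based precursor selection described above, and the undersampling it entails, can be illustrated with a minimal Python sketch. The `Precursor` class, the `select_top_n` function, and all m/z and intensity values are hypothetical and chosen only for illustration; real instruments apply additional criteria that are not modeled here.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Precursor:
    """A peptide precursor observed in a precursor ion (MS1) scan. Hypothetical type."""
    mz: float
    intensity: float


def select_top_n(precursors, top_n):
    """Return the top_n most intense precursors, most intense first."""
    ranked = sorted(precursors, key=lambda p: p.intensity, reverse=True)
    return ranked[:top_n]


# Made-up precursor ion scan with five precursors.
survey_scan = [
    Precursor(mz=400.20, intensity=1.0e6),
    Precursor(mz=512.30, intensity=5.0e5),
    Precursor(mz=623.80, intensity=2.0e7),
    Precursor(mz=701.40, intensity=3.0e4),
    Precursor(mz=850.90, intensity=8.0e5),
]

# Assume the instrument can acquire only three MS2 spectra in this cycle.
selected = select_top_n(survey_scan, top_n=3)

# Precursors that were detected but never fragmented in this cycle.
unsampled = [p for p in survey_scan if p not in selected]

print("fragmented:", [p.mz for p in selected])
print("not fragmented:", [p.mz for p in unsampled])
```

In this toy scan the two least intense precursors are never fragmented, which is the undersampling effect in miniature: whatever falls below the intensity cutoff in a given cycle leaves no MS2 spectrum.
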
This undersampling issue becomes more pronounced when the LC gradient is shortened to maximize sample throughput. It is therefore unlikely that DDA data acquisition, even with extensive sample fractionation and extremely long LC gradients, will overcome this fundamental undersampling issue. Another emerging and widely adopted approach for bottom-up proteomics data acquisition is data-independent acquisition (DIA). DIA bins the peptide precursors into predefined groups based on their m/z values, fragments each group (also called a "window") of peptide precursors sequentially, and records the highly convoluted MS2 spectrum of the fragments and unfragmented precursors in each window [2]. This method essentially generates a comprehensive digital map of all the flyable and fragmentable peptide precursors of a proteome. Therefore, in contrast to DDA, which is inherently limited by the undersampling issue, it is theoretically possible to identify every protein in a proteome from a digital proteome map generated by DIA. Various computational methods have been developed to analyze data acquired by DIA. They can be grouped conceptually into peptide-centric and spectrum-centric approaches, following the terminology of MacCoss and colleagues [3]. With the spectrum-centric approach, each tandem mass spectrum is interpreted by searching against a theoretical or experimental protein sequence database and a matched decoy database. This approach is usually used for DDA data. In principle, it can also be applied to the highly convoluted DIA data, but DIA data are most effectively interpreted with the peptide-centric approach, which essentially asks the question: is a peptide of interest present in the data? Briefly, the peptide-centric approach first compiles the characteristics of a peptide precursor of interest (including the m/z of the precursor and its fragments, the retention time, and the elution profiles, among others) into a data table (e.g., a
reference spectral library) and then searches for this pattern in the DIA data using statistical and machine learning algorithms [4]. In principle, the combination of DIA data acquisition and a peptide-centric data analysis strategy allows analysis of every analyzable protein in a proteome, within the limits of the analytical techniques used. Since 2010, over 1000 studies using DIA have been published. This special issue features some of the latest advances in the field. Penny et al. reported an acquisition scheme called ion mobility-gas phase fractionation (IM-GPF) for rapid diaPASEF library generation [5]. Most DIA analyses are performed as single injections, even for complex samples. The elimination of extensive sample fractionation not only minimizes technical variability and the required sample amount, but also substantially increases sample throughput. In this issue, Bons et al. applied DIA to study small amounts of extracellular matrix from lung cancer tissue specimens [6], while Wang et al. analyzed enriched glycoproteins in urine samples from prostate cancer patients [7]. Kverneland et al. developed a simple ultracentrifugation protocol for the enrichment of extracellular vesicles from plasma samples, enabling characterization of over 2500 plasma proteins with DIA runs of less than 1 h [8]. These three applications exemplify the superb sensitivity and comprehensiveness of DIA-MS for analyzing a specific subproteome. Oliinyk et al. reported that 1 h of MS time using diaPASEF sufficed to characterize over 13,000 phosphopeptides from about 20 µg of protein digest, and that shortening the gradient by a factor of 4 led to similar coverage of the phosphoproteome [9]. Applications of this type, which require both high sensitivity and high throughput, are currently only practical with DIA-MS. Messner et al.
argued that perturbation proteomics is an essential approach for studying highly dynamic biological systems and that short-gradient DIA coupled with fast LC systems is the method of choice for such applications [10]. In addition, since DIA acquires peptide precursor and fragment data for all flyable and fragmentable ions, it has unique advantages for the comprehensive analysis of protein post-translational modifications (PTMs). Yang et al. reviewed the current status of DIA-based PTM detection, site localization, and characterization of glycans [11]. In particular, they reviewed the contribution of deep learning to DIA library generation and data interpretation. Pham et al. reported a deep learning approach based on the transformer architecture for retention time prediction, which exhibited superior performance compared with multiple existing deep learning software tools [12]. DIA is effective for analyzing the proteome of small amounts of clinical tissue specimens at high throughput and with a high degree of reproducibility, and it has been widely used in large cohorts [13]. Encouraged by the recent advances in DIA-based clinical proteomics, Boys et al. from ProCan asked the question: where are we in the context of (large-scale) clinical applications of MS-based proteomics? Even though DIA has been applied to identify diagnostic biomarkers and potential therapeutic targets, MS-based proteomics has not yet been widely implemented in routine clinical diagnostics. They discussed the hurdles and proposed actions to move forward, including the integration of multiomics measurements and the development of targeted proteomic assays [14]. Poulos et al., also from ProCan, discussed the potential application of DIA-MS from a different angle, namely drug discovery [15]. High-throughput DIA analyses can be applied to analyze the perturbation proteome in tumor cells. The resulting proteomic data could potentially be used to predict drug responsiveness using machine learning.
Although MS-based proteomics has not yet seen wide application in drug discovery, the potential is clear. For clinical applications of DIA-based proteomics, standardization of sample preparation, MS data acquisition, data storage, and data analysis is essential. Indeed, although the volume of DIA data keeps growing, discussion of DIA data management lags behind. Here, Jones et al. [16] discussed the findability, accessibility, interoperability, and reusability (FAIR) of the increasing volume of DIA data and proposed expert recommendations for the future. Several exciting research fields that are also driven by DIA-based proteomics, such as single-cell proteomics, spatial proteomics, and proteogenomics, are not included in this special issue but are actively advancing.