摘要
Peroxisomes are universal eukaryotic organelles hosting various metabolic pathways, with particularly diverse metabolic roles in plants (Pan et al., 2020). Severe peroxisomal dysfunction can lead to fatal genetic disorders in humans and embryonic lethality in plants (Hu et al., 2012; Honsho et al., 2020). The proteome and metabolism of peroxisomes vary significantly between species, organs, and developmental stages, and in response to different environmental conditions (Gabaldón, 2018; Pan & Hu, 2018; Yifrach et al., 2018; Corpas, 2019). Peroxisomes not only have shared but also species-specific functions (Sibirny, 2016; Kao et al., 2018; Pan & Hu, 2018; Pan et al., 2020). Mass spectrometry (MS)-based peroxisome proteomics studies in different Arabidopsis developmental stages and tissue types, such as etiolated seedlings, green leaves, senescent leaves, and cultured cells (Fukao et al., 2002, 2003; Reumann et al., 2007, 2009; Eubel et al., 2008; Quan et al., 2013; Pan et al., 2018; Pan & Hu, 2018), as well as in several other plant species, such as soybean, spinach, sweet pepper, and castor bean (Arai et al., 2008; Babujee et al., 2010; González-Gordo et al., 2022; Wrobel et al., 2023), have uncovered hundreds of peroxisomal proteins and significantly improved our understanding of these metabolically plastic organelles. Another powerful strategy in uncovering peroxisome functions is in silico identification of proteins containing peroxisome targeting signals (PTSs), which requires full defining of the PTSs (Lingner et al., 2011; Reumann, 2011; Nakanishi et al., 2024). This is of particular importance for plant peroxisomes, which possess a larger proteome and play more diverse metabolic and signaling roles than those in many other eukaryotes (Sibirny, 2016; Pan & Hu, 2018; Pan et al., 2020). Moreover, the peroxisome has emerged as a highly valuable compartment for organelle engineering, particularly in the fields of biological manufacturing and agriculture (Song et al., 2024). Thus, decoding PTSs will not only lay the foundation for comprehensively predicting lineage-specific peroxisomal proteins and metabolic pathways but also provide more accurate sequence information to localize enzymes and transporters to peroxisomes as desired. Peroxisomal membrane protein (PMP) targeting is believed to use two mechanisms: direct peroxisomal targeting and indirect targeting through the endoplasmic reticulum (ER; Cross et al., 2016; Mayerhofer, 2016). The targeting signal of PMPs has not been clearly defined yet. As one of the best-characterized plant PMPs in trafficking, the targeting signal for ascorbate peroxidase 3 (APX3) was found to lie within the C-terminal region including the transmembrane domain (Cross et al., 2016). By comparison, peroxisomal matrix proteins rely on two types of PTSs located at the C-terminus (PTS1) and N-terminus (PTS2). PTS1, carried by most peroxisomal matrix proteins at their extreme C-termini, was initially recognized as a simple tripeptide with a ‘canonical’ consensus of (S/A)-(K/R)-(L/M), and PTS2 is a nonapeptide with a loose consensus sequence (R/K)-(L/V/I)-(X)5-(H/Q)-(L/A) usually located within variable distances near the N-terminus (Kunze, 2020). However, more and more ‘noncanonical’ derivatives of PTS1 have been discovered, demonstrating the complexity of PTS1 and the likely existence of many unknown PTS1 tripeptides (Lametschwandtner et al., 1998; Brocard & Hartig, 2006; Lingner et al., 2011; Reumann & Chowdhary, 2018). Recently, using a large-scale statistical analysis followed by experimental validation, we determined that for weak PTS1 tripeptides, a c. 12-amino acid auxiliary peptide (AuxP) upstream of PTS1 plays a supportive but pivotal role in peroxisome targeting, with RILVRTKRPRPR as the likely strongest AuxP in plants (Deng et al., 2022). In the same study, we identified 12 additional PTS1 peptides, increasing the number of validated plant PTS1s to 54 (Deng et al., 2022). Most of these known PTS1s are divergent in sequences and poorly conserved, indicating that PTS1 tripeptides may be far from completely identified. Machine learning (ML) has proven to be powerful in predicting protein motifs (Savojardo et al., 2023). In ML-based motif prediction, constructing a sizable training set and properly encoding the input data are critical for the final output. However, poorly conserved motifs like PTS1 tend to contain divergent sequences that are under-represented in empirical experimental data based on known functional proteins (Brocard & Hartig, 2006; Deng et al., 2022). Moreover, creating a dataset using experimental data can be laborious and costly, which limits the amount of experimentally validated data to be used for the training set. To provide sufficient and accurate training information for ML-assisted prediction of the extremely diversified PTS1 sequences, we devised a ML-assisted method utilizing two aspects of the information obtained from the PTS1 tripeptides: their frequency of appearance in proteins and peroxisome targeting data with amino acid substitutions. This cost-effective strategy enabled us to achieve the goal of comprehensively identifying PTS1 in plants. Apart from the 54 known PTS1 tripeptides, we were able to additionally validate 445 PTS1s and predict the existence of another 348. This study has significantly deepened our understanding of PTS1, the most important ‘postal code’ for a major metabolic hub in all eukaryotic cells, and thus will facilitate the discovery of proteins residing in peroxisomes of various plant species and likely nonplant species as well. Our ML-assisted strategy may also be applicable to studies to define other protein motifs. The ‘mutual best-match’ dataset of 20 712 proteins was generated as described previously (Deng et al., 2022). Briefly, 362 species covering all the main clades of angiosperms were selected from species with completely sequenced genomes (https://www.plabipd.de), including 88 monocots, 263 eudicots, and 11 others (Supporting Information Table S1). A protein was selected only if it was the ‘mutual best-match’ in the two-way Blast search between Arabidopsis and the plant species from which the protein was identified. The targeting plasmid mVenus-AuxP-3aa was transformed into Agrobacterium tumefaciens strain GV3101 (pMP90) by freeze-thaw (Höfgen & Willmitzer, 1988). Transient protein expression in tobacco (Nicotiana tabacum) leaves followed by confocal microscopy was carried out as described previously (Deng et al., 2022) to analyze protein targeting. A previously generated 35S promoter-driven moxCerulean3-enhancer-PTS1(SKL) protein in the pGWB545 vector (Deng et al., 2022) was used as the peroxisome marker. A Fluoview FV3000 confocal laser-scanning microscope (Olympus, Tokyo, Japan) was used for image capturing, where mVenus was excited with 514-nm lasers and detected at 530–630 nm, and moxCerulean3 was excited with 445-nm lasers and detected at 460–500 nm. The primers used to construct vectors for expressing the mVenus-AuxP-3aa fusion proteins are shown in Table S2. The ‘ML-set’, a 3aa dataset containing 224 PTS1 and 149 non-PTS1 peptide sequences, was generated as described in the Results section. Data storage and processing were conducted using Pandas (v.2.1.4; McKinney, 2010) and Numpy (v.1.24.3; Harris et al., 2020). The PTS1 tripeptides each possessed a specific eFrequency, that is the frequency (number of times) of its appearance in the 3aa library, of ≥ 1. The eFrequency was arbitrarily assigned as 0 for all the non-PTS1 peptides and as 1 for those PTS1 tripeptides not in the 3aa library. SUM indicates the summation operator, eFrequenc y aa position $$ \mathrm{eFrequenc}{\mathrm{y}}_{\left[\mathrm{aa},\mathrm{position}\right]} $$ denotes the eFrequency of the 3aa peptides containing the amino acid at a specific position, and Position indicates a specific position in a tripeptide. Numbe r aa position Loss $$ \mathrm{Numbe}{\mathrm{r}}_{\left(\mathrm{aa},\mathrm{position}\right)}^{\mathrm{Loss}} $$ denotes the total number of times when the substitution of one amino acid in a specific position caused loss of peroxisome targeting function, and Numbe r aa position Gain $$ \mathrm{Numbe}{\mathrm{r}}_{\left(\mathrm{aa},\mathrm{position}\right)}^{\mathrm{Gain}} $$ denotes the total number of times when the substitution of an amino acid at a specific position induced gain of peroxisome targeting function. Given the limited size of our training dataset, we initially evaluated several models suitable for smaller datasets rather than more complex models like deep neural networks. To this end, linear regression (LR), random forest (RF), gradient boosting decision tree (GBDT), support vector machine (SVM), and linear discriminant analysis (LDA) were compared using fivefold cross-validation and evaluation metrics including accuracy and recall (Fig. S1). Overall, all five models demonstrated satisfactory performance. SVM was chosen for eValue-based evolutionary information due to its relatively higher accuracy and the fact that it offers robust regularization and generalization capabilities. LDA was chosen for sValue-based experimental information due to its relatively higher recall and the fact that it does not require hyperparameter tuning, making it a straightforward and efficient choice. Both LDA and SVM are well-suited for low-dimensional data, which is three-dimensional in our study, as they exhibit strong performance with fewer hyperparameters to tune and therefore ensuring the efficiency and reliability of the analysis. For the SVM model construction, each 3aa was converted to a three-dimensional vector, where each dimension represents a residue within the 3aa with an amino acid-specific eValue. The model was written in Scikit-learn (v.1.2.5; Pedregosa et al., 2011), opting for the support vector classification (SVC) with a radial basis function (RBF) kernel to balance model complexity with generalization capability. To evaluate the model's performance, a fivefold cross-validation was conducted. The ML-set was divided into five equal parts (or fivefold), each containing the same ratio of PTS1 vs non-PTS1 peptides as in the total ML-set. In each fold, 4/5 of the dataset (294 3aa sequences) served as the training set, while the remaining 1/5 (79 3aa sequences) served as the test set. All the hyperparameters, that is the kernel width and slack-weight regularization parameter, were selected by fivefold cross-validation on the ML-set. All data were standardized by removing the mean and scaling to unit variance to ensure fair training conditions. The model was evaluated by a set of metrics, that is accuracy, recall, and the area under the curve (AUC) values of the receiver operating characteristic (ROC) curves and the precision-recall (PR) curves. After training, the model's predictive accuracy and recall were evaluated based on the number of correct labeling by the test set. This process was repeated through five iterations. To further assess the model's performance, prediction probabilities from SVM were used to plot ROC and PRC curves. After analyzing the model behavior and selecting hyperparameters on the fivefold cross-validation data, the model was retrained on the whole ML-set of 373 3aa peptides. For LDA model construction, each 3aa was converted to a three-dimensional vector, where each dimension represents a residue position with an amino acid-specific sValue. These vectors were normalized to have zero mean and unit variance. The LDA model with default settings was employed to project the data from the three-dimensional space to a one-dimensional form. Similar to the evaluation of SVM, we performed fivefold cross-validation using the ML-set to evaluate the performance of the LDA model. The model was evaluated based on its classification accuracy, recall, and AUC and PRC scores that were averaged over the five iterations. Given that LDA does not produce probabilistic predictions, logistic regression was further employed to classify features extracted by the LDA model. After evaluating the LDA model using the fivefold cross-validation, the model was retrained on the whole ML-set of 373 3aa peptides. The codes for models used in this study are available at https://github.com/Jenelolen/Xiao-Hong-Papers/tree/master/PTS1. To assemble a sizable PTS1 training dataset that contains more sequences besides the 54 peptides previously identified, we selected 85 PTS1-containing and previously validated Arabidopsis peroxisomal proteins (Table S1; Pan et al., 2018). After Blast searches for homologs in the genomes of 362 angiosperms, we assembled a ‘mutual best-match’ dataset of 20 712 proteins, in which a protein was included only if it was identified from the two-way Blast search between Arabidopsis and the species from which the protein was identified (Table S1). The C-terminal tripeptide (3aa) of the proteins was extracted to generate a plant ‘3aa library’ that contains 20 712 tripeptides (Table S1; Fig. 1a). We then selected 299 3aa to experimentally validate their peroxisome targeting ability. These 299 3aa tripeptides included 288 from the 3aa library and 11 that contain SKL (the strongest canonical PTS1) derivatives but absent from the library (Fig. 1a). Most of these 3aa sequences contained amino acids in the 54 previously established PTS1 peptides, that is (T/S/P/C/A/Q) (K/T/S/R/Q/N/M/L/H/G/F/E/D/A/Y/C) (L/M/I/F/V/Y). Specifically, 116 tripeptides contained previously established PTS1 amino acids at all three positions, and 150, 26, and 7 tripeptides contained established PTS1 amino acids at 2, 1, and 0 positions, respectively (Table S3). Each 3aa was fused to the C-terminus of the previously identified AuxP, ‘RILVRTKRPRPR’ (Deng et al., 2022), to support weak PTS1. The resulting AuxP-3aa was then fused to the C-terminus of the yellow fluorescent protein mVenus to generate the mVenus-AuxP-3aa fusion protein (Fig. S2). Using tobacco transient protein expression followed by confocal microscopy, we successfully validated 170 new PTS1 peptides whose mVenus-AuxP fusions showed with the peroxisome moxCerulean3-enhancer-PTS1(SKL) The 3aa (Fig. with 20 tripeptides with 3 was as non-PTS1 (Fig. A of which included 224 functional PTS1 and 149 non-PTS1 sequences, was generated for model training (Fig. We that PTS1 peptides with targeting be more conserved in thus higher of appearance in the 3aa library. To this end, we used the frequency (number of times) of appearance of a 3aa in the 3aa library, as a of the functional of each PTS1 (Table S1). The eFrequency was arbitrarily as 1 for those sequences not in the 3aa library and 0 for all the non-PTS1 was then generated for an amino acid at a specific position of the using the eFrequency of all 3aa peptides containing this amino acid at this specific each 3aa was converted to a three-dimensional vector, where each dimension represents a residue in the tripeptide with an amino acid-specific (Fig. After with several models and the SVM model with a kernel was chosen to classify each 3aa as PTS1 The model was evaluated with cross-validation & on the which demonstrated an accuracy of and an recall of in predicting the peroxisome targeting function (Fig. S1). The ROC curves showed an AUC of indicating a classification and the curves showed an AUC of demonstrating the robust classification performance of the model (Fig. to the use of the is its only PTS1 sequences eFrequency values while all non-PTS1 sequences were arbitrarily as an eFrequency of To provide a more of the of an amino acid at a specific position to the targeting of each we employed an which is based on the number of times that the substitution of an amino acid at a specific position caused loss vs gain of peroxisome targeting (Fig. Each 3aa was then converted to a three-dimensional vector, where each dimension represents a residue of the tripeptide with an amino acid-specific (Fig. After with several models and we the LDA which was able to the three-dimensional space to a one-dimensional and the between PTS1 and non-PTS1 The of the LDA model was using fivefold cross-validation the which an accuracy of and a recall of in predicting the peroxisome targeting (Fig. S1). LDA probabilistic predictions, we used to for the ROC and curve This model demonstrated strong performance in with an ROC and curve of and respectively (Fig. After the SVM and LDA models were retrained on the whole ML-set used to predict PTS1 peptides from all the tripeptide SVM and LDA and PTS1 tripeptides, respectively the 373 3aa peptides in the the remaining tripeptide were into by only by SVM, only by and by models (Fig. To assess the for each and PTS1 tripeptides, we used confocal microscopy of mVenus-AuxP-3aa fusion proteins to test the peroxisome targeting of selected 3aa peptides from each As 1 showed the of (Fig. out of the peroxisome targeting followed by out of for 2, out of for and out of for The of our that the two models can predict all functional PTS1 sequences (Fig. on the we the number of functional PTS1 tripeptides to be 299 in 1, in 2, in and 0 in which a total of peptides in this study (Fig. After the 224 tripeptides from the the total functional plant peptides can (Fig. To the highly diversified PTS1 sequences, we the from two the SVM model utilizing the eValue-based evolutionary information and the LDA model utilizing the sValue-based experimental on our targeting for all the tripeptide the SVM model has of an c. but a recall of c. the LDA model has of a c. but recall of c. These two models each other well and enabled us to the functional plant PTS1 tripeptides from 54 to and the total number of plant PTS1 tripeptides to be (Fig. We that the and of this can be to defining other highly diversified protein motifs. SVM and LDA are suitable for our low-dimensional dataset in this study, they have LDA is a linear model and SVM is a which the features from the training data to be linear for LDA and for the used for SVM is as it is as 0 for all non-PTS1 tripeptides, which may lead to predicting the its in identifying the Moreover, when many tripeptides in the ML-set be they not have tripeptide that has a amino acid substitution the loss gain of peroxisome targeting a to comprehensively the of each amino acid at each position in peroxisome targeting and the prediction of the model. In LDA and SVM are for low-dimensional for our these models may not be applicable to data, where space complexity In higher LDA the data due to its and SVM require more and to its performance. To the for SVM and LDA in motif prediction, more models and data to ensure generalization and different has been known that PTS1 tripeptides in different all have many et al., 2003; Brocard & Hartig, 2006; Deng et al., 2022), a of this has been Our study is a understanding plant PTS1 peptides by the of its Similar to the number of PTS1 tripeptides, the of amino acids that can in each of the three has also been significantly on the previously established 54 PTS1 peptides, the amino acid can be as (T/S/P/C/A/Q) (K/T/S/R/Q/N/M/L/H/G/F/E/D/A/Y/C) (L/M/I/F/V/Y). Our study the amino acids at position and to all 20 amino acids and those at position to indicating the of to different amino acids on the PTS1 tripeptides et al., 2003; Brocard & Hartig, as well as the of the strong upstream auxiliary peptide (Deng et al., 2022). ML-based PTS1 predicting in plants (Lingner et al., 2011; et al., have limited in predicting PTS1 proteins with weak and tripeptides, these PTS1 sequences are and absent from the training The number of PTS1 tripeptides in this study the number of for peroxisomal proteins, therefore can facilitate in identifying new peroxisomal the 20 712 ‘mutual best-match’ proteins of the 85 PTS1-containing and previously validated Arabidopsis peroxisomal proteins identified from 362 angiosperms (Table we found proteins with the validated in this study and proteins with PTS1 tripeptides (Table As an of the of this study may to new peroxisomal proteins, we selected two proteins, and for targeting analysis. Both proteins were previously unknown to be peroxisomal and contain new PTS1 tripeptides with many and in the upstream AuxP which are features known to PTS1 function (Deng et al., 2022). Both the full and the C-terminal can peroxisome targeting for these two proteins (Fig. indicating that they are new peroxisomal proteins in that may previously unknown peroxisomal the of validated PTS1 tripeptides and PTS1 tripeptides in this study can provide a of PTS1 tripeptides, which be used as a to by thus their prediction is important to that in this study all the new PTS1 tripeptides are experimentally in the of a strong upstream with these new PTS1 tripeptides but AuxP may not be into is also that PTS1 tripeptides, a strong AuxP, only to peroxisome targeting, indicating that these tripeptides have weak in peroxisome targeting. the PTS1 tripeptides in and which were by only one model and not experimentally have a relatively than that of 1. of these PTS1 tripeptides may not have peroxisome targeting ability. In the of the new PTS1 tripeptides only be as a but not for peroxisome targeting. is also important to that the of a strong PTS1 is not sufficient for peroxisome targeting. For the PTS1 sequence may be from the by other parts of the a strong targeting signal is with PTS1. In the PTS1 tripeptide may be located of the The Arabidopsis peroxisomal enzymes are of these PTS1 proteins et al., 2011; et al., et al., 2024). For with an established PTS1 tripeptide but a at its N-terminus is to the of the that the PTS1 peroxisome et al., 2024). 1 contains as an which is by an when to its with the can the signal on be and peroxisome et al., when a fusion of a fluorescent protein and a protein containing a strong PTS1 does not localize to peroxisomes, it may be this PTS1 is well conserved in other species, which may be a of its function in peroxisome targeting, at under conditions. This was by the of the of the and and the project and the and conducted in analysis. and performed in silico analysis. and to data analysis. and the and to this Fig. evaluation of different Fig. for construct to the mVenus-AuxP-3aa fusion Fig. of peroxisome targeting analysis of PTS1 peptides for the ML-set Fig. of peroxisome targeting analysis of PTS1 peptides for the ML-set Fig. of peroxisome targeting analysis of non-PTS1 peptides for the ML-set. Fig. targeting analysis of 3aa by models Fig. targeting analysis of 3aa by models Fig. targeting analysis of 3aa by models Fig. targeting analysis of 3aa only by one model. Fig. targeting analysis of two proteins with identified PTS1. Table The ‘mutual best-match’ dataset of plant PTS1 proteins and the plant 3aa library. Table used in this Table 3aa to be validated for the ML-set. Table by the SVM model. Table by the LDA model. Table in Table that contain a validated a PTS1 tripeptide identified in this is not for the of Information by the than be to the The is not for the of information by the than be to the for the