作者
Changjin Yuan,Bin Wang,Hong Wang,Fang Wang,Xiangze Li,Yanan Zhen
摘要
T-cell receptor (TCR) repertoires provide insights into tumor immunology, yet their variations across digestive system cancers are not well understood. Characterizing TCR differences between colorectal cancer (CRC) and gastric cancer (GC), as well as developing machine learning models to distinguish cancer types, metastatic status, and disease stages are crucial for guiding clinical practices. A cohort study of 143 tumor patients (96 CRC, 47 GC) was conducted. High-throughput TCR sequencing was performed to capture TCR beta (TRB), delta (TRD), and gamma (TRG) chain data. Tissue-specific patterns in TCR repertoire features, such as V-J gene recombination, complementarity-determining region 3 (CDR3) sequences, and motif distributions, were analyzed. Multi-layer machine learning-based diagnostic models were developed by leveraging motif-based feature and deep learning-based feature extraction using ProteinBERT from the 100 most abundant CDR3 sequences per sample. These models were used to differentiate CRC from GC, distinguish between primary and metastatic CRC lesions, and predict disease stages in CRC. Tissue-specific differences in TCR repertoires were observed across CRC, GC, and between primary and metastatic lesions, as well as across disease stages in CRC. Distinct V-J gene recombination patterns were identified, with CRC showing enrichment in TRBV*-TRBJ* combinations, while GC exhibited higher levels of γδT-cell-related recombination. Primary and metastatic lesions of CRC patients displayed distinct V-J recombination preferences (e.g., TRBV7-9/TRBJ2-1 higher in metastatic; TRBV20-1/TRBJ1-2 higher in primary) and CDR3 sequence differences, with metastatic having shorter TRG CDR3 lengths (p-value = 0.019). Across CRC stages, later stages (III-IV) showed higher clonal diversity (p-value < 0.05) and stage-specific V-J patterns, alongside distinct CDR3 amino acid preferences at N-terminal (positions 1-2) and central positions (positions 5-12). Multi-dimensional machine learning models demonstrated exceptional diagnostic performance across all classification tasks. For distinguishing CRC from GC, the model achieved an accuracy of 97.9% and an area under the curve (AUC) of 0.996. For differentiating primary from metastatic CRC, the model achieved 100% accuracy with an AUC of 1.000. In predicting CRC disease stages, the model attained an accuracy of 96.9% and an AUC of 0.993. Extensive validation using simulated and publicly available datasets, confirmed the robustness and reliability of the models, demonstrating consistent performance across diverse datasets and experimental conditions. Our investigation provides novel insights into TCR repertoire variations in digestive system tumors, and highlight the potential of immune repertoire features as powerful diagnostic tools for understanding cancer progression and potentially improving clinical decision-making.