Deep-Learning-Guided Mining and Clustering of Remote Amino Acid Residues for the Simultaneous Engineering of the Catalytic Activity and Thermostability of a Processive Endoglucanase
Processive endoglucanases, which possess both endo- and exoglucanase activities, are considered highly promising catalysts for cellulose degradation. In this study, we employed multiple deep learning models, including MutCompute, DeepSequence, and ESM-1v, to guide the engineering of EG5C-1, a processive endoglucanase derived from Bacillus subtilis BS-5, enabling a systematic exploration of the enzyme's sequence space. By combining clustering analysis with a greedy algorithm, we optimized combinations of amino acid substitutions and ultimately identified an elite variant, M8 (R23Q/E43Q/K91I/K191P/A198T/Q237D/V240P/S245A), carrying eight amino acid substitutions. Compared to the wild-type enzyme, M8 exhibited 10-fold and 5-fold improvements in catalytic efficiency (kcat/Km) toward the soluble substrate carboxymethyl cellulose-Na (CMC) and the insoluble substrate phosphoric acid-swollen cellulose (PASC), respectively, along with a higher optimal temperature and improved thermostability. Molecular mechanistic analyses revealed that all of the distal substituted residues enhanced dynamic coupling and coordination, primarily influencing the conformation of three loops near the substrate pocket. These structural changes modulated substrate binding and product release, thereby contributing to the improved catalytic efficiency. This work not only suggests a feasible strategy for exploring the "dark space" within sequences but also provides insights into the practical application of machine learning in experiments.
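The abstract does not specify how the clustering and greedy steps were implemented, so the following is only a minimal sketch of one way such a procedure could look. It assumes that candidate substitutions have already been grouped into clusters and that some scoring function (here a hypothetical additive toy score with made-up integer values, not measured data) ranks a combination. The names `Substitution`, `parse_variant`, `greedy_combine`, and `toy_effects` are illustrative and do not come from the study.

```python
import re
from typing import Callable, List, NamedTuple, Sequence, Tuple


class Substitution(NamedTuple):
    """A single amino acid substitution such as R23Q."""
    wild_type: str
    position: int
    mutant: str

    def __str__(self) -> str:
        return f"{self.wild_type}{self.position}{self.mutant}"


_TOKEN = re.compile(r"^([A-Z])(\d+)([A-Z])$")


def parse_variant(label: str) -> List[Substitution]:
    """Parse a slash-separated variant label, e.g. 'R23Q/E43Q'."""
    subs = []
    for token in label.split("/"):
        match = _TOKEN.match(token.strip())
        if match is None:
            raise ValueError(f"unrecognized substitution: {token!r}")
        subs.append(Substitution(match.group(1), int(match.group(2)), match.group(3)))
    return subs


def greedy_combine(
    clusters: Sequence[Sequence[Substitution]],
    score: Callable[[List[Substitution]], float],
) -> Tuple[List[Substitution], float]:
    """Visit clusters in order and keep at most one substitution per cluster,
    accepting it only if it raises the score of the current combination."""
    chosen: List[Substitution] = []
    best = score(chosen)
    for cluster in clusters:
        pick = None
        for sub in cluster:
            trial = score(chosen + [sub])
            if trial > best:
                best, pick = trial, sub
        if pick is not None:
            chosen.append(pick)
    return chosen, best


# The M8 label from the abstract, parsed into structured substitutions.
M8_LABEL = "R23Q/E43Q/K91I/K191P/A198T/Q237D/V240P/S245A"
m8 = parse_variant(M8_LABEL)

# Toy demonstration with invented substitutions and invented scores
# (assumption: effects are additive, which real enzymes need not obey).
toy_effects = {
    Substitution("A", 1, "C"): 5,
    Substitution("A", 2, "D"): -2,
    Substitution("G", 5, "P"): 3,
    Substitution("L", 9, "K"): -1,
}
toy_clusters = [
    [Substitution("A", 1, "C"), Substitution("A", 2, "D")],
    [Substitution("G", 5, "P")],
    [Substitution("L", 9, "K")],
]
chosen, best = greedy_combine(
    toy_clusters, lambda subs: sum(toy_effects[s] for s in subs)
)
```

In practice the score would more plausibly come from an experimental assay or a model prediction than from an additive table, and a greedy search of this kind can miss combinations whose benefit only appears when several substitutions are introduced together.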