Abstract
A prominent application of machine learning in therapeutic antibody design is the development of models that can generate or screen antibody candidates with a high probability of success in manufacturing and clinical trials. These models must accurately represent sequence-structure-function relationships, also known as the fitness landscape. Previous protein function benchmarks examine fitness landscapes across diverse protein families, but they exclude antibody data. Here, we introduce the second iteration of the Fitness Landscape for Antibodies (FLAb2), the largest public therapeutic antibody design benchmark to date. The datasets collected in FLAb2 contain developability assay data for over 4M antibodies across 32 studies, encompassing seven properties of therapeutic antibodies: thermostability, expression, aggregation, binding affinity, pharmacokinetics, polyreactivity, and immunogenicity. Using the curated data, we evaluate the performance of 30 artificial intelligence (AI) and biophysical models in learning these properties. On average, protein AI models fail to produce statistically significant correlations on most (80%) of the developability datasets. No model correlates with all properties or across multiple datasets of similar properties. Zero-shot predictions from pretrained models cannot accurately predict all developability properties, although several models (IgLM, ProGen2, Chai-1, ESM2, ISM, IgFold) produce statistically significant correlations on multiple datasets for thermostability, expression, binding, or immunogenicity. Fine-tuning on as few as 10^2 data points improves performance on thermostability, aggregation, and binding, but the polyreactivity and pharmacokinetics datasets lack enough data to reach significance. Yet it is humbling to observe that, given enough developability data (10^3 points), a fine-tuned one-hot encoding model can match the performance of fine-tuned billion-parameter pretrained models.
Training data composition influences performance more than model architecture, and intrinsic biophysical properties (thermostability) are more readily learned than extrinsic properties (immunogenicity, pharmacokinetics). Controlling for germline distance with partial correlation reveals that protein language models draw substantially on evolutionary signal; on average, germline edit distance accounts for 40% of their apparent predictive power. FLAb2 data are accessible at https://github.com/Graylab/FLAb, together with scripts that allow researchers to benchmark, compare, and iteratively improve new AI-based developability prediction models.