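Robustness (evolution)
Computer science
Machine learning
Artificial intelligence
Decision tree
Feature learning
Variety (cybernetics)
Feature engineering
Training set
Embedding
Feature (linguistics)
Representation (politics)
Class (philosophy)
Tree (set theory)
Data mining
Ensemble learning
External Data Representation
Scarcity
Deep learning
Reinforcement learning
Reliability (semiconductor)
Feature selection
Drug discovery
Key (lock)
Knowledge extraction
Labeled data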
Authors
Woruo Chen, Yao Tian, Youchao Deng, Dejun Jiang, Dongsheng Cao
Identifier
DOI:10.26434/chemrxiv-2025-szk5s
Abstract
Early-stage drug discovery often suffers from data scarcity and out-of-distribution (OOD) shifts, which constrain the reliability of predictive models. While deep learning has advanced representation learning from molecular and biological data, tabular modeling remains indispensable, particularly in small-sample and OOD scenarios. For over a decade, gradient-boosted decision trees (GBDTs) such as XGBoost have been the dominant choice, yet their robustness is limited under such conditions. TabPFN, a recently introduced transformer-based tabular foundation model, enables accurate predictions on small datasets without task-specific retraining. Applying TabPFN to a variety of molecular datasets, we find that it performs on par with XGBoost in classification but demonstrates clear and stable advantages in regression, with its strongest gains on small and medium datasets and under OOD evaluations. Feature and data ablations (10–90%) further highlight its robustness: performance degrades gracefully and shows minimal sensitivity compared with tree ensembles. On quantum tasks, TabPFN shows competitive accuracy on QM7 but is challenged by the larger QM8 dataset, where tree ensembles regain strength. Beyond metrics, embedding analyses indicate that TabPFN yields smoother structure–property relationships and enhanced class separability, reflecting beneficial inductive biases rather than overfitting. Collectively, these findings demonstrate that TabPFN offers a robust and data-efficient alternative for tabular learning in drug discovery, shedding new light on predictive modeling under small-data and OOD challenges.