Keywords
Hyperparameter
Computer science
Benchmark (surveying)
Benchmarking
Artificial intelligence
Machine learning
Tree (set theory)
Set (abstract data type)
Deep learning
Raw data
Data set
Point (geometry)
Data mining
Mathematics
Geometry
Mathematical analysis
Business
Marketing
Geodesy
Programming language
Geography
Authors
Léo Grinsztajn,Edouard Oyallon,Gaël Varoquaux
Source
Journal: Cornell University - arXiv
Date: 2022-07-18
Citations: 10
Identifier
DOI: 10.48550/arxiv.2207.08815
Abstract
While deep learning has enabled tremendous progress on text and image datasets, its superiority on tabular data is not clear. We contribute extensive benchmarks of standard and novel deep learning methods as well as tree-based models such as XGBoost and Random Forests, across a large number of datasets and hyperparameter combinations. We define a standard set of 45 datasets from varied domains with clear characteristics of tabular data and a benchmarking methodology accounting for both fitting models and finding good hyperparameters. Results show that tree-based models remain state-of-the-art on medium-sized data ($\sim$10K samples) even without accounting for their superior speed. To understand this gap, we conduct an empirical investigation into the differing inductive biases of tree-based models and Neural Networks (NNs). This leads to a series of challenges which should guide researchers aiming to build tabular-specific NNs: 1. be robust to uninformative features, 2. preserve the orientation of the data, and 3. be able to easily learn irregular functions. To stimulate research on tabular architectures, we contribute a standard benchmark and raw data for baselines: every point of a 20 000 compute hours hyperparameter search for each learner.
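The abstract describes a benchmarking methodology that accounts both for fitting models and for searching good hyperparameters for each learner. As a rough illustration only (not the paper's 45-dataset benchmark suite; the dataset, models, and search spaces below are assumptions chosen for brevity), the following sketch compares a tree-based model against a neural network on a single tabular dataset, with a small random hyperparameter search per learner in scikit-learn.

```python
# Illustrative sketch: tree-based model vs. MLP on one tabular dataset,
# each tuned with a small random hyperparameter search. Dataset and
# search spaces are assumptions, not the paper's benchmark definition.
from sklearn.datasets import fetch_california_housing
from sklearn.ensemble import RandomForestRegressor
from sklearn.neural_network import MLPRegressor
from sklearn.model_selection import RandomizedSearchCV, train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Load a medium-sized tabular regression dataset (downloads on first use).
X, y = fetch_california_housing(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

searches = {
    "random_forest": RandomizedSearchCV(
        RandomForestRegressor(random_state=0),
        {"n_estimators": [100, 300], "max_depth": [None, 10, 20]},
        n_iter=5, cv=3, random_state=0,
    ),
    "mlp": RandomizedSearchCV(
        # NNs are sensitive to feature scale, so standardize inside a pipeline.
        make_pipeline(StandardScaler(),
                      MLPRegressor(max_iter=500, random_state=0)),
        {"mlpregressor__hidden_layer_sizes": [(64,), (256, 256)],
         "mlpregressor__learning_rate_init": [1e-3, 1e-4]},
        n_iter=4, cv=3, random_state=0,
    ),
}

for name, search in searches.items():
    search.fit(X_train, y_train)  # fits every sampled hyperparameter point
    # Report held-out R^2 of the best configuration found by the search.
    print(name, round(search.score(X_test, y_test), 3))
```

In the paper's full setup the analogous search is repeated across many datasets and a far larger hyperparameter budget (roughly 20,000 compute hours), and every evaluated point is released as raw benchmark data.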