Canonical correlation
Partial least squares regression
Overfitting
Spurious relationship
Dimensionality reduction
Sample size determination
Computer science
Multivariate statistics
Human Connectome Project
Partial correlation
Curse of dimensionality
Artificial intelligence
Correlation
Hyperparameter
Data mining
Machine learning
Statistics
Mathematics
Psychology
Artificial neural network
Geometry
Neuroscience
Functional connectivity
Authors
Agoston Mihalik, James Chapman, Rick A. Adams, Nils R. Winter, Fabio S. Ferreira, John Shawe-Taylor, Janaina Mourão-Miranda
Identifier
DOI:10.1016/j.bpsc.2022.07.012
Abstract
Canonical correlation analysis (CCA) and partial least squares (PLS) are powerful multivariate methods for capturing associations across 2 modalities of data (e.g., brain and behavior). However, when the sample size is similar to or smaller than the number of variables in the data, standard CCA and PLS models may overfit, i.e., find spurious associations that generalize poorly to new data. Dimensionality reduction and regularized extensions of CCA and PLS have been proposed to address this problem, yet most studies using these approaches have some limitations. This work gives a theoretical and practical introduction to the most common CCA/PLS models and their regularized variants. We examine the limitations of standard CCA and PLS when the sample size is similar to or smaller than the number of variables. We discuss how dimensionality reduction and regularization techniques address this problem and explain their main advantages and disadvantages. We highlight crucial aspects of the CCA/PLS analysis framework, including optimizing the hyperparameters of the model and testing the identified associations for statistical significance. We apply the described CCA/PLS models to simulated data and real data from the Human Connectome Project and Alzheimer's Disease Neuroimaging Initiative (both with n > 500). We use both low- and high-dimensionality versions of these data (i.e., ratios between sample size and variables in the range of ∼1-10 and ∼0.1-0.01, respectively) to demonstrate the impact of data dimensionality on the models. Finally, we summarize the key lessons of the tutorial.