Keywords
Softmax function
Computer science
Scalability
Transformer
Quadratic equation
Autoencoding
Prior probability
Machine learning
Artificial intelligence
Algorithm
Theoretical computer science
Pattern recognition (psychology)
Mathematics
Deep learning
Bayesian probability
Database
Quantum mechanics
Geometry
Physics
Voltage
Authors
Krzysztof Choromański, Valerii Likhosherstov, David Dohan, Xingyou Song, Andreea Gane, Tamás Sarlós, Peter Hawkins, Jared Quincy Davis, Afroz Mohiuddin, Łukasz Kaiser, David Belanger, Lucy J. Colwell, Adrian Weller
Source
Journal: Cornell University - arXiv
Date: 2020-09-30
Citations: 122
Identifier
DOI: 10.48550/arxiv.2009.14794
Abstract
We introduce Performers, Transformer architectures which can estimate regular (softmax) full-rank-attention Transformers with provable accuracy, but using only linear (as opposed to quadratic) space and time complexity, without relying on any priors such as sparsity or low-rankness. To approximate softmax attention-kernels, Performers use a novel Fast Attention Via positive Orthogonal Random features approach (FAVOR+), which may be of independent interest for scalable kernel methods. FAVOR+ can also be used to efficiently model kernelizable attention mechanisms beyond softmax. This representational power is crucial to accurately compare softmax with other kernels for the first time on large-scale tasks, beyond the reach of regular Transformers, and to investigate optimal attention-kernels. Performers are linear architectures fully compatible with regular Transformers and with strong theoretical guarantees: unbiased or nearly-unbiased estimation of the attention matrix, uniform convergence and low estimation variance. We tested Performers on a rich set of tasks stretching from pixel-prediction through text models to protein sequence modeling. We demonstrate competitive results with other examined efficient sparse and dense attention methods, showcasing the effectiveness of the novel attention-learning paradigm leveraged by Performers.
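As a rough illustration of the linear-attention idea summarized in the abstract, the sketch below approximates softmax attention with positive random features in NumPy: the feature map phi(x) = exp(omega^T x - ||x||^2 / 2) / sqrt(m) gives an unbiased estimate of exp(q^T k), so softmax(Q K^T / sqrt(d)) V can be computed via an (L x m) by (m x d) product in linear time in L. The function names, the i.i.d. (non-orthogonal) Gaussian draws, and the toy shapes are illustrative assumptions, not the paper's reference implementation, which additionally uses orthogonal random features for variance reduction.

```python
# Minimal sketch of FAVOR+-style linear attention (assumptions noted above):
# softmax(Q K^T / sqrt(d)) V is approximated with positive random features, so
# the cost is O(L m d) instead of the O(L^2 d) of exact attention.
import numpy as np

def positive_random_features(x, omega):
    # x: (L, d), omega: (d, m) with entries drawn i.i.d. from N(0, 1).
    # phi(x)_j = exp(omega_j^T x - ||x||^2 / 2) / sqrt(m)  -> strictly positive features.
    m = omega.shape[1]
    sq_norm = 0.5 * np.sum(x ** 2, axis=-1, keepdims=True)       # (L, 1)
    return np.exp(x @ omega - sq_norm) / np.sqrt(m)              # (L, m)

def favor_attention(q, k, v, num_features=256, seed=0):
    # q, k, v: (L, d); returns an approximation of softmax(q k^T / sqrt(d)) v.
    rng = np.random.default_rng(seed)
    L, d = q.shape
    omega = rng.standard_normal((d, num_features))
    # Fold the 1/sqrt(d) temperature symmetrically into queries and keys.
    q_prime = positive_random_features(q / d ** 0.25, omega)     # (L, m)
    k_prime = positive_random_features(k / d ** 0.25, omega)     # (L, m)
    kv = k_prime.T @ v                                           # (m, d), computed once
    normalizer = q_prime @ k_prime.sum(axis=0)                   # (L,) row-sum estimate
    return (q_prime @ kv) / normalizer[:, None]                  # (L, d)

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    q, k, v = (rng.standard_normal((128, 16)) for _ in range(3))
    logits = q @ k.T / np.sqrt(16)
    weights = np.exp(logits - logits.max(axis=-1, keepdims=True))
    exact = (weights / weights.sum(axis=-1, keepdims=True)) @ v
    approx = favor_attention(q, k, v, num_features=1024, seed=1)
    print("max abs error:", np.abs(exact - approx).max())
```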