Computer science
Speedup
Scalability
Parallel computing
Memory bandwidth
DRAM
Latency
Embedding
Inference
Distributed computing
Computer architecture
Operating system
Artificial intelligence
Computer hardware
Memory controller
Semiconductor memory
Authors
Liu Ke,Udit Gupta,Benjamin Youngjae Cho,David Brooks,Vikas Chandra,Utku Diril,Amin Firoozshahian,Kim Hazelwood,Bill Jia,Hsien-Hsin S. Lee,Meng Li,Bert Maher,Dheevatsa Mudigere,Maxim Naumov,Martin Schatz,Mikhail Smelyanskiy,Xiaodong Wang,Brandon Reagen,Carole-Jean Wu,Mark Hempstead
Identifier
DOI:10.1109/isca45697.2020.00070
Abstract
Personalized recommendation systems leverage deep learning models and account for the majority of data center AI cycles. Their performance is dominated by memory-bound sparse embedding operations with unique irregular memory access patterns that pose a fundamental challenge to acceleration. This paper proposes a lightweight, commodity-DRAM-compliant, near-memory processing solution to accelerate personalized recommendation inference. An in-depth characterization of production-grade recommendation models shows that embedding operations with high model-, operator-, and data-level parallelism lead to memory bandwidth saturation, limiting recommendation inference performance. We propose RecNMP, which provides a scalable solution to improve system throughput, supporting a broad range of sparse embedding models. RecNMP is specifically tailored to production environments with heavy co-location of operators on a single server. Several hardware/software co-optimization techniques, such as memory-side caching, table-aware packet scheduling, and hot-entry profiling, are studied, providing up to 9.8× memory latency speedup over a highly optimized baseline. Overall, RecNMP offers 4.2× throughput improvement and 45.8% memory energy savings.
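To make the memory-bound behavior concrete, the following is a minimal sketch (not the paper's implementation) of the kind of sparse embedding pooling operator the abstract describes: each output gathers a few rows scattered across a large embedding table and sums them, so there is very little compute per byte fetched and the irregular row gathers dominate runtime. Function and variable names here are illustrative, loosely following the SparseLengthsSum convention used in production recommendation models.

```python
import numpy as np

def embedding_pooling(table: np.ndarray, indices, lengths):
    """Gather rows of `table` at `indices` and sum them per bag.

    `lengths[i]` gives how many consecutive entries of `indices`
    belong to output bag i. The scattered row reads (`table[...]`)
    are the irregular, memory-bandwidth-bound accesses that RecNMP
    targets by performing the gather-and-sum near memory.
    """
    out = np.empty((len(lengths), table.shape[1]), dtype=table.dtype)
    pos = 0
    for i, n in enumerate(lengths):
        # Irregular gather: rows may land in different DRAM pages/ranks.
        out[i] = table[indices[pos:pos + n]].sum(axis=0)
        pos += n
    return out

# Tiny example: a 6-row, 4-dim table; bag 0 pools rows {0, 3}, bag 1 pools row {5}.
table = np.arange(24, dtype=np.float32).reshape(6, 4)
pooled = embedding_pooling(table, indices=[0, 3, 5], lengths=[2, 1])
```

In production the tables hold millions of rows, many such operators are co-located on one server, and the per-row arithmetic (a vector add) is trivial, which is why the abstract reports memory bandwidth saturation rather than compute limits.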