An LPDDR-based CXL-PNM Platform for TCO-efficient Inference of Transformer-based Large Language Models

Topics: Inference · Computer Science · Transformer · Artificial Intelligence · Electrical Engineering · Engineering · Voltage
Authors
Sangsoo Park,Kyung-Soo Kim,Jinin So,Jin Chul Jung,Jong-Geon Lee,Kyoungwan Woo,Nayeon Kim,Younghyun Lee,Hyungyo Kim,Yongsuk Kwon,Jinhyun Kim,Jieun Lee,Yeongon Cho,Yong-Min Tai,Jeonghyeon Cho,Hoyoung Song,Jung Ho Ahn,Nam Sung Kim
Identifier
DOI:10.1109/hpca57654.2024.00078
Abstract

Transformer-based large language models (LLMs) such as the Generative Pre-trained Transformer (GPT) have become popular due to their remarkable performance across diverse applications, including text generation and translation. For LLM training and inference, the GPU has been the predominant accelerator, with its pervasive software development ecosystem and powerful computing capability. However, as the size of LLMs keeps increasing for higher performance and/or more complex applications, a single GPU cannot efficiently accelerate LLM training and inference due to its limited memory capacity, which demands frequent transfers, from host CPU memory/storage, of the model parameters the GPU needs to compute the current layer(s). A GPU appliance may provide enough aggregate memory capacity with multiple GPUs, but it suffers from frequent transfers of intermediate values among the GPU devices, each accelerating specific layers of a given LLM. As the frequent transfers of these model parameters and intermediate values are performed over relatively slow device-to-device interconnects such as PCIe or NVLink, they become the key bottleneck for efficient acceleration of LLMs. Focusing on accelerating LLM inference, which is essential for many commercial services, we develop CXL-PNM, a processing-near-memory (PNM) platform based on the emerging interconnect technology Compute eXpress Link (CXL). Specifically, we first devise an LPDDR5X-based CXL memory architecture with 512 GB of capacity and 1.1 TB/s of bandwidth, which boasts 16× larger capacity and 10× higher bandwidth than GDDR6- and DDR5-based CXL memory architectures, respectively, under a module form-factor constraint. Second, we design a CXL-PNM controller architecture integrated with an LLM inference accelerator, exploiting the unique capabilities of such CXL memory to overcome the disadvantages of competing technologies such as HBM-PIM and AxDIMM. Lastly, we implement a CXL-PNM software stack that supports seamless and transparent use of CXL-PNM for Python-based LLM programs. Our evaluation shows that a CXL-PNM appliance with 8 CXL-PNM devices offers 23% lower latency, 31% higher throughput, and 2.8× higher energy efficiency at 30% lower hardware cost than a GPU appliance with 8 GPU devices for an LLM inference service.
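Illustrative note (not from the paper): the abstract says the CXL-PNM software stack lets Python-based LLM programs use the device seamlessly and transparently, but it does not describe the API. The sketch below is a minimal, hypothetical illustration of what such transparent offload could look like: a linear layer routes its matrix multiply to a stand-in device object when one is present and falls back to the host CPU otherwise. Every name here (CxlPnmDevice, linear, matmul) is an assumption made for illustration, not the paper's actual interface.

```python
# Hypothetical sketch of "transparent offload" to a CXL-PNM device.
# CxlPnmDevice, linear, and matmul are illustrative assumptions only;
# the paper's real software stack is not public in this listing.
import numpy as np

class CxlPnmDevice:
    """Stand-in for one CXL-PNM device (per the abstract: 512 GB of
    LPDDR5X capacity and ~1.1 TB/s of internal bandwidth)."""
    def __init__(self, device_id: int):
        self.device_id = device_id

    def matmul(self, a: np.ndarray, b: np.ndarray) -> np.ndarray:
        # A real stack would keep `b` (the weights) resident in CXL
        # memory and launch the on-controller accelerator; here we
        # simply model the result on the host CPU.
        return a @ b

def linear(x: np.ndarray, weight: np.ndarray,
           device: CxlPnmDevice | None = None) -> np.ndarray:
    """Drop-in linear layer: runs near memory when a device is given,
    otherwise falls back to the host. Caller code is unchanged either way."""
    if device is not None:
        return device.matmul(x, weight)
    return x @ weight

if __name__ == "__main__":
    dev = CxlPnmDevice(device_id=0)
    x = np.random.rand(4, 1024).astype(np.float32)     # small activations
    w = np.random.rand(1024, 4096).astype(np.float32)  # weights stay device-resident
    y = linear(x, w, device=dev)
    print(y.shape)  # (4, 4096)
```

The appeal of this style of interface follows from simple arithmetic on the abstract's own figures: an 8-device appliance aggregates 8 × 512 GB = 4 TB of capacity and 8 × 1.1 TB/s = 8.8 TB/s of bandwidth, so the model parameters can stay resident near memory and only small activations cross the CXL link, avoiding the PCIe/NVLink parameter-transfer bottleneck the abstract identifies.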