Keywords: Floating point, Computer science, Computation, Throughput, Inference, Integer (computer science), Parallel computing, Algorithm, Computer engineering, Enhanced Data Rates for GSM Evolution (EDGE), Computational science, Computer hardware, Artificial intelligence, Programming language, Operating system, Wireless
Authors
Haikang Diao, Haoyang Luo, Jiahao Song, Bocheng Xu, Runsheng Wang, Yuan Wang, Xiyuan Tang
Source
Journal: IEEE Journal of Solid-State Circuits (Institute of Electrical and Electronics Engineers)
Date: 2025-01-07
Volume/Issue: 60 (9): 3403-3415
Citations: 1
Identifier
DOI: 10.1109/JSSC.2024.3522304
Abstract
With the rapid advancement of edge AI, the complexity of tasks on edge devices is continually increasing, demanding better efficiency and precision from AI accelerators. Pre-aligned floating-point computing-in-memory (FP CIM) has been proposed to achieve high-precision neural network (NN) computation at floating-point (FP) data precision. However, the complex digital circuitry required for integer (INT) mantissa multiply-accumulate (MAC) computation and exponent alignment severely limits the efficiency and throughput of FP CIM. This work proposes an energy- and area-efficient computing-in-memory (CIM) engine for one-shot FP NN inference and on-device fine-tuning. To improve the throughput of FP CIM, a one-shot compute scheme is proposed to perform FP operations within one cycle. It adopts a multiply-less NN instead of a multiply-based NN to simplify the integer mantissa MAC to minimum selection. A customized 8-bit parallel minimum selector is also designed to further reduce the parallel computation cost. To simplify the FP/INT conversion process, an input–weight co-alignment workflow is proposed to eliminate maximum-exponent selection and simplify the mantissa-shifting logic. To minimize the inference accuracy loss caused by environmental changes, a lightweight on-device fine-tuning core (ODFC) is designed to support online weight updates. The fabricated 28-nm chip achieves an energy efficiency of 128 TFLOPS/W and a computational density of 7.02 TFLOPS/mm² at BF16, a 4.1× and 3.4× improvement, respectively, over previous state-of-the-art work.
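The NumPy sketch below illustrates, in simplified form, the two ideas the abstract names: input–weight co-alignment (both operands' mantissas are shifted once against a single shared exponent, so no per-accumulation maximum-exponent selection is needed) and multiply-less mantissa computation (each INT mantissa product is replaced by an 8-bit minimum selection). The paper's exact multiply-less network definition, shift depths, and normalization are not given in the abstract, so split_bf16, the shift clamp, and the final scale factor are illustrative assumptions, not the chip's actual dataflow.

```python
import numpy as np

def split_bf16(x):
    """Decompose float32 values (BF16-like layout) into sign bit,
    unbiased exponent, and 8-bit integer mantissa with the hidden 1
    restored. Illustrative only; zeros/subnormals are not handled."""
    bits = np.asarray(x, dtype=np.float32).view(np.uint32)
    sign = ((bits >> 31) & 0x1).astype(np.int32)
    exp = ((bits >> 23) & 0xFF).astype(np.int32) - 127
    mant = (((bits >> 16) & 0x7F) | 0x80).astype(np.int32)  # 7 fraction bits + hidden 1
    return sign, exp, mant

def coaligned_minsel_dot(x, w):
    """One-shot FP 'dot product' sketch:
    1) co-align input and weight mantissas once against the global
       maximum pair exponent (no per-step max-exponent selection);
    2) replace each INT mantissa multiply with an 8-bit minimum
       selection, the multiply-less operator assumed here."""
    sx, ex, mx = split_bf16(x)
    sw, ew, mw = split_bf16(w)

    e_pair = ex + ew                         # exponent of each would-be product term
    e_max = int(e_pair.max())                # single shared alignment reference
    shift = np.minimum(e_max - e_pair, 15)   # clamped shift depth (assumed)
    terms = np.minimum(mx, mw) >> shift      # 8-bit parallel minimum selection

    sign = np.where((sx ^ sw) != 0, -1, 1)
    acc = int(np.sum(sign * terms))          # integer accumulation in one shot

    # Renormalize back to FP; the 2**-7 scale treats the min output as a
    # 1.7-format mantissa (an illustrative choice, not the paper's spec).
    return acc * 2.0 ** (e_max - 7)

x = [0.5, -1.25, 2.0]
w = [1.5, 0.75, -0.5]
print(coaligned_minsel_dot(x, w))  # multiply-less result; intentionally != np.dot(x, w)
```

Because both operands are aligned to one shared exponent before accumulation, the FP/INT conversion collapses to a fixed shift of each mantissa, which is the simplification the input–weight co-alignment workflow targets.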