Quantization (signal processing)
Computer science
Arithmetic
Algorithms
Computer hardware
Mathematics
Computer engineering
Authors
Jiawei Xu, Jiangshan Fan, Baolin Nan, Chen Ding, Li-Rong Zheng, Zhuo Zou, Yuxiang Huan
Source
Journal: IEEE Transactions on Circuits and Systems I: Regular Papers
[Institute of Electrical and Electronics Engineers]
Date: 2023-10-02
Volume/Issue: 70 (12): 5380-5393
Identifiers
DOI: 10.1109/tcsi.2023.3315299
Abstract
Post-training quantization (PTQ) has been proven an efficient model compression technique for Convolutional Neural Networks (CNNs), requiring no re-training or access to labeled datasets. However, it remains challenging for a CNN accelerator to realize the efficiency potential of PTQ methods. Many PTQ techniques blindly pursue high theoretical compression ratios and accuracy while ignoring their impact on the actual hardware implementation, incurring more hardware overhead than benefit. This paper introduces ASLog, a PTQ-friendly CNN accelerator that explores four key designs in an algorithm-hardware co-optimizing manner: the first practical 4-bit logarithmic PTQ pipeline SLogII, the multiplier-free arithmetic element (AE) design, the energy-efficient bias correction element (BCE) design, and the per-channel quantization friendly (PCF) architecture and dataflow. The proposed SLogII PTQ pipeline pushes the limit of logarithmic PTQ to 4-bit with <2.5% accuracy degradation on various image classification and face recognition tasks. Exploiting approximate computing and a novel encoding and decoding scheme, the proposed SLogII AE consumes >40% less power and area than a common 8-bit multiplier. The BCE and PCF designs proposed in this paper are the first to consider the hardware impact of the widely used per-channel quantization and bias correction techniques, enabling an efficient PTQ-friendly implementation with a small hardware overhead. ASLog is validated in a UMC 40-nm process, achieving 12.2 TOPS/W energy efficiency with a 0.80 mm² core area. ASLog achieves 336.3 GOPS/mm² area efficiency and >500 OPs/Byte operational intensity, over 1.85× and 1.12× improvements compared with previous related works.
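To illustrate the core idea behind logarithmic PTQ, the sketch below shows a generic power-of-two weight quantizer; it is an assumption-laden simplification, not the actual SLogII pipeline (whose encoding, approximation, and calibration details are not given in the abstract). Each weight becomes sign · 2^(−e) with a clamped integer exponent e, so a MAC's multiply reduces to an arithmetic shift of the activation, which is what makes a multiplier-free AE possible.

```python
import math

def log2_quantize(w, bits=4):
    # Hypothetical sketch of logarithmic quantization (NOT the exact SLogII
    # scheme). Weights are assumed pre-normalized to roughly (-1, 1].
    # Each value maps to sign * 2**(-e), with e = round(-log2(|w|)) clamped
    # to the representable range; one of the 2**bits codes is reserved for 0.
    if w == 0.0:
        return 0.0
    sign = 1.0 if w > 0 else -1.0
    e = round(-math.log2(abs(w)))
    e = min(max(e, 0), 2**bits - 2)  # clamp exponent; reserve one code for zero
    return sign * 2.0 ** (-e)

# Quantizing a few example weights to 4-bit power-of-two levels:
weights = [0.8, -0.3, 0.05, 0.0]
quantized = [log2_quantize(w) for w in weights]
# 0.8 -> 1.0, -0.3 -> -0.25, 0.05 -> 0.0625, 0.0 -> 0.0
```

With weights stored only as (sign, e) pairs, the product a·w in hardware is an arithmetic right shift of the activation a by e positions, avoiding a full multiplier array.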