Computer science
Acceleration
Parallel computing
Graphics processing unit
Convolutional neural network
Hardware acceleration
Kernel (algebra)
Computational science
Artificial intelligence
Algorithm
Computer hardware
Field-programmable gate array
Mathematics
Combinatorics
Authors
Dingkun Yang, Zhiyong Luo
Identifier
DOI:10.1109/jiot.2023.3277869
Abstract
In the field of machine vision and pattern recognition, the convolutional neural network (CNN) is one of the hottest research topics. However, deploying CNNs, which require complicated operations, is exceptionally difficult on resource-constrained embedded devices. In this article, a parallel-processing CNN accelerator based on an optimized MobileNet is presented. By replacing the fully connected layer of the MobileNet topology with an equivalent convolution and postponing the global pooling layer, the model topology is unified, which simplifies the design of the hardware accelerator. After applying an 8-bit quantization strategy to the network model parameters, the depthwise separable convolution is accelerated by parallel processing between channels and pipelined processing between layers, improving the processing speed and throughput of the accelerator. On a ZYNQ AXZU5EV platform, the designed accelerator achieved 580.6 frames per second (fps) on ImageNet classification with a system power consumption of only 6.51 W. This represents a 22.3× speedup over a CPU and a 1.7× speedup over a graphics processing unit (GPU), at lower power consumption than either, providing a reference for the application of CNNs in embedded devices.
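The operation the accelerator parallelizes is MobileNet's depthwise separable convolution: a per-channel (depthwise) convolution followed by a 1×1 pointwise convolution that mixes channels. The depthwise step has no data dependence between channels, which is what makes the channel-parallel hardware design possible. The following NumPy sketch is illustrative only (stride 1, "valid" padding, no quantization); it is not the paper's implementation.

```python
import numpy as np

def depthwise_separable_conv(x, dw_kernels, pw_kernels):
    """Depthwise separable convolution, as used in MobileNet.

    x          -- input feature map, shape (C, H, W)
    dw_kernels -- one k x k kernel per input channel, shape (C, k, k)
    pw_kernels -- 1x1 pointwise weights, shape (C_out, C)

    Simplified sketch: stride 1, 'valid' padding, float arithmetic.
    """
    C, H, W = x.shape
    k = dw_kernels.shape[-1]
    Ho, Wo = H - k + 1, W - k + 1

    # Depthwise step: each channel is convolved independently with its
    # own kernel -- the channel loop below has no cross-channel
    # dependence, so a hardware design can run all channels in parallel.
    dw = np.zeros((C, Ho, Wo))
    for c in range(C):
        for i in range(Ho):
            for j in range(Wo):
                dw[c, i, j] = np.sum(x[c, i:i + k, j:j + k] * dw_kernels[c])

    # Pointwise step: a 1x1 convolution is a linear combination across
    # channels at each spatial position, i.e. a matrix product.
    out = np.tensordot(pw_kernels, dw, axes=([1], [0]))  # (C_out, Ho, Wo)
    return out
```

Because the pointwise step only needs the depthwise outputs at a given spatial position, the two stages can also be pipelined: pointwise processing of early positions starts while the depthwise stage is still producing later ones, which is the layer-level pipelining the abstract describes.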