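Dataflow
Field-programmable gate array
Latency (audio)
Computer science
Embedded system
Computer architecture
Computer hardware
Parallel computing
Telecommunications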
Authors
M. Kim, Kyoungseok Oh, Youngmock Cho, H.S. Seo, Xuan Truong Nguyen, Hyuk‐Jae Lee
Source
Journal: IEEE Transactions on Circuits and Systems I: Regular Papers
[Institute of Electrical and Electronics Engineers]
Date: 2023-12-14
Volume (issue): 71 (3): 1158-1171
Citations: 7
Identifier
DOI:10.1109/tcsi.2023.3335949
Abstract
Object detection models have demonstrated outstanding accuracy. However, mapping convolutional neural network-based object detection models onto memory- and compute-constrained devices remains challenging and commonly leads to accuracy degradation and long latency. To address this problem, this work presents a design methodology for mapping the YOLOv3-tiny model onto a small FPGA board, in this case the Nexys A7-100T, which has only 0.5 MB of on-chip SRAM and 240 DSPs. First, we design four identical MAC arrays that maximize throughput by utilizing both DSPs and LUTs. Second, to fully exploit the MACs, we propose a dynamic data reuse scheme that handles inter-layer and intra-layer execution effectively within a small on-chip SRAM footprint. As a result, the proposed accelerator achieves an inference speed of 76.75 frames per second and a throughput of 95.08 GOPS at 100 MHz while consuming 2.203 W. In particular, it achieves a hardware utilization rate of 82.53%, significantly outperforming existing YOLOv3-tiny accelerators.
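The figures quoted in the abstract can be cross-checked with a little arithmetic. The sketch below is not taken from the paper. It assumes that the utilization rate is achieved throughput divided by peak throughput, and that one MAC counts as two operations (a multiply and an add). The function name `derived_metrics` and the dictionary keys are hypothetical, chosen only for illustration.

```python
# Figures quoted in the abstract.
THROUGHPUT_GOPS = 95.08   # achieved throughput
FPS = 76.75               # inference speed, frames per second
CLOCK_MHZ = 100.0         # operating frequency
UTILIZATION = 0.8253      # hardware utilization rate (82.53%)
POWER_W = 2.203           # power consumption


def derived_metrics():
    """Back-of-the-envelope quantities implied by the abstract's numbers."""
    # Assumption: utilization = achieved throughput / peak throughput.
    peak_gops = THROUGHPUT_GOPS / UTILIZATION
    # Peak operations issued per clock cycle at the stated frequency.
    ops_per_cycle = peak_gops * 1e9 / (CLOCK_MHZ * 1e6)
    # Assumption: each MAC contributes 2 operations per cycle.
    implied_macs = ops_per_cycle / 2
    # Work per frame and energy efficiency follow by simple division.
    gop_per_frame = THROUGHPUT_GOPS / FPS
    gops_per_watt = THROUGHPUT_GOPS / POWER_W
    return {
        "peak_gops": peak_gops,
        "implied_macs": implied_macs,
        "gop_per_frame": gop_per_frame,
        "gops_per_watt": gops_per_watt,
    }


if __name__ == "__main__":
    for name, value in derived_metrics().items():
        print(f"{name}: {value:.2f}")
```

Under these assumptions the numbers imply a peak of roughly 115 GOPS, which corresponds to about 576 MAC units in total, along with about 1.24 GOP of work per frame and about 43 GOPS/W. These are inferences from the abstract only; the paper itself may define utilization or count operations differently.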