Computer science
Inference
Parallel computing
Throughput
Latency
Software
Deep learning
CUDA
General-purpose computing on graphics processing units (GPGPU)
Computer architecture
Artificial intelligence
Programming language
Graphics
Operating system
Telecommunications
Wireless
Authors
Eun-Jin Jeong,Jangryul Kim,Samnieng Tan,Jae-Seong Lee,Soonhoi Ha
Identifier
DOI:10.1109/les.2021.3087707
Abstract
As deep learning (DL) inference applications increase, embedded devices tend to be equipped with neural processing units (NPUs) in addition to a CPU and a GPU. For fast and efficient development of DL applications, TensorRT is provided as the software development kit for the NVIDIA hardware platform, including an optimizer and a runtime that deliver low latency and high throughput for DL inference. Like most DL frameworks, TensorRT assumes that inference is executed on a single processing element, either a GPU or an NPU, not both. In this letter, we propose a parallelization methodology to maximize the throughput of a single DL application using both the GPU and the NPU by exploiting various types of parallelism on TensorRT. With six real-life benchmarks, we achieve 81%–391% throughput improvement over the baseline inference that uses the GPU only.
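The sketch below is not the authors' implementation; it only illustrates, with the public TensorRT and PyCUDA Python APIs, the basic mechanism the abstract relies on: building one engine for the GPU and one for a DLA core (the NPU on NVIDIA Jetson-class devices, with GPU fallback for unsupported layers), then enqueuing work for both on separate CUDA streams so independent batches run concurrently. The model file "model.onnx", the assumption that binding 0 is the input and the last binding is the output, and the TensorRT 8.x-era calls (execute_async_v2, num_bindings) are all placeholders or assumptions, and the paper's actual pipeline/data-parallel scheduling is not reproduced here.

```python
# Minimal sketch (assumed API usage, not the paper's methodology): run the same
# ONNX model on the GPU and on a DLA core concurrently via separate CUDA streams.
import numpy as np
import pycuda.autoinit          # creates a default CUDA context
import pycuda.driver as cuda
import tensorrt as trt

LOGGER = trt.Logger(trt.Logger.WARNING)


def build_engine(onnx_path, use_dla=False, dla_core=0):
    """Build a TensorRT engine from an ONNX file, optionally targeting a DLA core."""
    builder = trt.Builder(LOGGER)
    network = builder.create_network(
        1 << int(trt.NetworkDefinitionCreationFlag.EXPLICIT_BATCH))
    parser = trt.OnnxParser(network, LOGGER)
    with open(onnx_path, "rb") as f:
        if not parser.parse(f.read()):
            raise RuntimeError("ONNX parsing failed")

    config = builder.create_builder_config()
    if use_dla:
        config.default_device_type = trt.DeviceType.DLA
        config.DLA_core = dla_core
        # Layers the DLA cannot execute fall back to the GPU.
        config.set_flag(trt.BuilderFlag.GPU_FALLBACK)

    plan = builder.build_serialized_network(network, config)
    return trt.Runtime(LOGGER).deserialize_cuda_engine(plan)


def make_io(engine):
    """Allocate pinned host buffers and device buffers for every binding."""
    host, device = [], []
    for i in range(engine.num_bindings):
        shape = engine.get_binding_shape(i)
        dtype = trt.nptype(engine.get_binding_dtype(i))
        h = cuda.pagelocked_empty(trt.volume(shape), dtype)
        d = cuda.mem_alloc(h.nbytes)
        host.append(h)
        device.append(d)
    return host, device


if __name__ == "__main__":
    # "model.onnx" is a placeholder model path.
    gpu_engine = build_engine("model.onnx", use_dla=False)
    npu_engine = build_engine("model.onnx", use_dla=True)

    runners = []
    for engine in (gpu_engine, npu_engine):
        ctx = engine.create_execution_context()
        host, device = make_io(engine)
        host[0][:] = np.random.random(host[0].size).astype(host[0].dtype)
        runners.append((ctx, host, device, cuda.Stream()))

    # Enqueue one batch per accelerator; the two inferences overlap because
    # each execution context uses its own CUDA stream.
    for ctx, host, device, stream in runners:
        cuda.memcpy_htod_async(device[0], host[0], stream)
        ctx.execute_async_v2(bindings=[int(d) for d in device],
                             stream_handle=stream.handle)
        cuda.memcpy_dtoh_async(host[-1], device[-1], stream)

    for _, _, _, stream in runners:
        stream.synchronize()
```

In this arrangement, feeding each accelerator its own batches is a simple form of the data parallelism the letter exploits; the reported 81%–391% gains additionally depend on how work is partitioned and scheduled across the GPU and NPU, which this sketch does not attempt.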