Latency (audio)
Computer science
Inference
Microservices
Operating system
Artificial intelligence
Cloud computing
Telecommunications
Authors
Kenji Tanaka, Yuki Arikawa, Kazutaka Morita, Tsuyoshi Ito, Takashi Uchida, Natsuko Saito, Shinya Kaji, Takeshi Sakamoto
Identifiers
DOI: 10.1109/hcs55958.2022.9895617
Abstract
VTA-NIC Chip Architecture
We aim to achieve DL inference serving (DLIS) without CPU interference. To that end, we integrate hardware data paths consisting of a NIC (Network Interface Card), a REST API parser/deparser, and multiple VTAs (Versatile Tensor Accelerators).

Configuration
Process node: 16 nm FinFET @ Xilinx FPGA
Number of cores: 8 VTA cores
Core frequency: 213 MHz
MACs per core: 169
Memory throughput: 19.2 GB/s (DDR4-2400)
Number precision: INT8

Performance
Power efficiency: the DLIS power efficiency of VTA-NIC is 6.1x better than that of a GPU (Nvidia V100).
Tail latency: at high load, the tail latency of heterogeneous systems unexpectedly increases. With our chip, the tail latency is predictable since it is proportional to the load.

Background
Web applications are now often built on microservices, and DL inference serving (DLIS) is one of those microservices [1]. DLIS is provisioned with a special accelerator instance [2], and the microservices/instances are loosely coupled via APIs. Accelerator instances, however, risk inefficient data movement:
1. Moving data via host processors decreases the accelerator's utilization [3]. In our preliminary experiments, half of the DLIS latency was caused by moving data.
2. Under high-load conditions, the interference of host processors degrades DLIS tail latency by up to 100 times [4]. In the real cloud, 9% of light DLIS tasks suffer severe tail latency, and half of the serving time is waiting time [5].

Setup
Model: ResNet-18 @ TensorRT
Precision: INT8
System: Triton Inference Server
Accelerator: Nvidia V100
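Since the chip parses REST inference requests directly in hardware, it is worth seeing what such a request looks like on the wire. Below is a minimal Python sketch of an inference call against the kind of software baseline the abstract names (Triton Inference Server, which exposes the KServe v2 HTTP API); the server address, model name, and input tensor name are illustrative assumptions, not values from the paper.

# Minimal sketch of a DLIS REST request, assuming a Triton Inference
# Server at localhost:8000 serving a ResNet-18 model named "resnet18"
# (address, model name, and input tensor name are assumed).
import requests

def infer(image_chw):
    # KServe v2 HTTP inference API: POST /v2/models/<name>/infer
    payload = {
        "inputs": [{
            "name": "input",            # model input tensor name (assumed)
            "shape": [1, 3, 224, 224],  # NCHW ResNet-18 input
            "datatype": "INT8",         # matches the paper's precision
            "data": image_chw,          # flattened pixel values
        }]
    }
    r = requests.post("http://localhost:8000/v2/models/resnet18/infer",
                      json=payload, timeout=5.0)
    r.raise_for_status()
    return r.json()["outputs"]

Every byte of this JSON envelope would normally be parsed by the host CPU; the VTA-NIC design moves that parsing/deparsing step into the NIC data path.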
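The tail-latency result concerns high percentiles under increasing offered load, not average latency. The following is a rough sketch of how p99 latency versus load could be measured against any DLIS endpoint; the endpoint URL and request rates are assumptions for illustration, and a production load generator would use open-loop pacing.

# Sketch: measure p99 tail latency as a function of offered load.
# Endpoint and request rates are illustrative assumptions.
import time
import requests

def p99_at_rate(url, rate_rps, duration_s=10.0):
    """Send requests at roughly rate_rps and return the p99 latency."""
    latencies = []
    interval = 1.0 / rate_rps
    end = time.monotonic() + duration_s
    while time.monotonic() < end:
        t0 = time.monotonic()
        requests.get(url, timeout=5.0)   # stand-in for an inference POST
        latencies.append(time.monotonic() - t0)
        # Pace to the target rate (closed-loop; simplification for a sketch).
        time.sleep(max(0.0, interval - (time.monotonic() - t0)))
    latencies.sort()
    return latencies[int(0.99 * (len(latencies) - 1))]

for rate in (10, 50, 100, 200):  # offered load in requests/s (assumed)
    print(rate, p99_at_rate("http://localhost:8000/v2/health/ready", rate))

A predictable accelerator shows p99 growing in proportion to the offered load; the abstract's point is that heterogeneous CPU+GPU stacks instead show sudden p99 blow-ups at high load.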
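The background claim that half of DLIS latency comes from moving data through the host can be checked on a GPU baseline by timing the host-to-device copy separately from the compute. A rough sketch, assuming PyTorch with CUDA and using a torchvision ResNet-18 as a stand-in for the paper's TensorRT setup:

# Sketch: split inference time into host->device copy vs. GPU compute.
# Assumes PyTorch with a CUDA device; numbers are illustrative only.
import time
import torch
import torchvision

model = torchvision.models.resnet18().eval().cuda()
x_host = torch.randn(1, 3, 224, 224)

torch.cuda.synchronize()
t0 = time.monotonic()
x_dev = x_host.cuda()                 # data movement via the host
torch.cuda.synchronize()
t1 = time.monotonic()
with torch.no_grad():
    y = model(x_dev)                  # accelerator compute
torch.cuda.synchronize()
t2 = time.monotonic()
print(f"copy {1e3*(t1-t0):.2f} ms, compute {1e3*(t2-t1):.2f} ms")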