Computer science
Server
Latency
Inference
Throughput
Testbed
Distributed computing
Deep learning
Real-time computing
Computer engineering
Computer networks
Artificial intelligence
Operating system
Telecommunications
Wireless
Authors
Di Liu, Zimo Ma, Aolin Zhang, Kuangyu Zheng
Identifier
DOI: 10.1109/mass58611.2023.00074
Abstract
Recent rapid development of deep learning (DL) applications imposes stringent requirements on DL inference services provided by GPU servers. On one hand, a high volume of diverse DL workloads demands ever-higher processing throughput. On the other hand, GPU servers must satisfy both latency and power constraints: each inference request must be served in real time under strict latency requirements, and GPU servers must operate within a fixed power cap to prevent system failures caused by power overload or overheating. How to efficiently manage GPU resources to achieve better throughput under both latency and power constraints has therefore become a key challenge. To address this issue, we first perform comprehensive measurements of inference tasks and study the impact of several critical knobs, including batch size, GPU frequency, and GPU spatial sharing, on system throughput, latency, and power. We then propose Morak, a multi-knob resource management framework for DL inference under latency and power constraints. A key mechanism of Morak is GPU resource partitioning with efficient spatial multiplexing across DL models. To further improve throughput, Morak efficiently explores the search space of GPU frequency and batch size under these constraints. Experimental results on a hardware testbed show that Morak achieves up to 67.7% throughput improvement over several state-of-the-art baselines under tight latency and power constraints.
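The abstract describes a search over GPU frequency and batch size subject to latency and power constraints. Below is a minimal, hypothetical Python sketch of a brute-force variant of such a search, using NVML to lock the GPU clock and read power draw. The latency SLO, power cap, candidate grids, and the run_batch() stub are all illustrative assumptions, not details from the paper; Morak's actual exploration is described as more efficient than exhaustive enumeration.

```python
# Hypothetical sketch of a frequency/batch-size search in the spirit of
# Morak's multi-knob exploration. run_batch() and all constants below are
# illustrative stand-ins, not the paper's code.
import time
import pynvml

LATENCY_SLO_MS = 50.0            # assumed per-request latency bound
POWER_CAP_W = 250.0              # assumed server power cap
FREQS_MHZ = [900, 1200, 1500]    # candidate GPU core clocks (illustrative)
BATCH_SIZES = [1, 4, 8, 16, 32]  # candidate batch sizes (illustrative)

def run_batch(batch_size):
    """Stand-in for one DL inference batch; replace with a real model call."""
    time.sleep(0.001 * batch_size)  # placeholder cost model

def measure(handle, freq_mhz, batch_size, iters=20):
    """Lock the GPU clock, run a few batches, and report
    (throughput in req/s, worst-case latency in ms, mean power in W)."""
    # Locking clocks typically requires root/admin privileges.
    pynvml.nvmlDeviceSetGpuLockedClocks(handle, freq_mhz, freq_mhz)
    lat_ms, power_w = [], []
    for _ in range(iters):
        t0 = time.perf_counter()
        run_batch(batch_size)
        lat_ms.append((time.perf_counter() - t0) * 1e3)
        power_w.append(pynvml.nvmlDeviceGetPowerUsage(handle) / 1e3)  # mW -> W
    thr = batch_size * iters / (sum(lat_ms) / 1e3)
    return thr, max(lat_ms), sum(power_w) / len(power_w)

def search():
    pynvml.nvmlInit()
    handle = pynvml.nvmlDeviceGetHandleByIndex(0)
    best = None
    try:
        for f in FREQS_MHZ:
            for b in BATCH_SIZES:
                thr, lat, pwr = measure(handle, f, b)
                # Keep only configurations satisfying both constraints,
                # and among those pick the highest throughput.
                if lat <= LATENCY_SLO_MS and pwr <= POWER_CAP_W:
                    if best is None or thr > best[0]:
                        best = (thr, f, b)
    finally:
        pynvml.nvmlDeviceResetGpuLockedClocks(handle)
        pynvml.nvmlShutdown()
    return best  # (throughput, frequency MHz, batch size) or None

if __name__ == "__main__":
    print(search())
```

Morak's other key mechanism, GPU resource partitioning with spatial multiplexing of co-located models, is not shown in this sketch; on NVIDIA GPUs such spatial sharing is commonly realized with mechanisms like MPS or MIG, though the paper's specific partitioning scheme may differ.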