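Cloud computing
Computer science
Container (type theory)
Restructuring
Resource (disambiguation)
Distributed computing
Resource management (computing)
Field (mathematics)
Job shop scheduling
Artificial intelligence
Operating system
Computer network
Mechanical engineering
Metro train timetable
Mathematics
Finance
Pure mathematics
Engineering
Economics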
Authors
Ying Mao,Sharma Vp,Wenjia Zheng,Long Cheng,Qiang Guan,Ang Li
Source
Journal: IEEE Transactions on Cloud Computing
[Institute of Electrical and Electronics Engineers]
Date: 2023-04-01
Volume (issue), pages: 11 (2): 2204-2216
Citations: 5
Identifier
DOI:10.1109/tcc.2022.3194128
Abstract
The increasing demand for learning from massive datasets is restructuring our economy. Effective learning, however, involves nontrivial computing resources. Most businesses utilize commercial infrastructure providers (e.g., AWS) to host their computing clusters in the cloud, where various jobs compete for available resources. While cloud resource management is a fruitful research field that has made many advances in production, such as Kubernetes and YARN, few efforts have been invested to further optimize the system performance, especially for Deep Learning (DL) training jobs in a container cluster. This work introduces FlowCon, a system that is able to monitor the individual evaluation functions of DL jobs at runtime, and thus to make placement decisions and resource allocations elastically. We present a detailed design and implementation of FlowCon and conduct intensive experiments over various DL models. The results demonstrate that FlowCon significantly improves DL job completion time and resource utilization efficiency when compared to default systems. According to the results, FlowCon can improve the completion time by up to 68.8% and meanwhile, reduce the makespan by 18.0%, in the presence of various DL job workloads.