Voda: A GPU Scheduling Platform for Elastic Deep Learning in Kubernetes Clusters

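Computer science; Scheduling (production processes); Popularity; Job shop scheduling; Server; Artificial intelligence; Cloud computing; Distributed computing; Machine learning; Job scheduler; Computer network; Mathematical optimization; Routing (electronic design automation); Operating system; Social psychology; Mathematics; Psychology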
Authors
Tsung-Hsin Hsieh, Che-Rung Lee
Identifier
DOI:10.1109/ic2e59103.2023.00023
Abstract

Nowadays, machine learning has become an indispensable service for cloud providers. Elastic training, a novel training paradigm that dynamically adjusts resource allocation for a group of training jobs, has gained popularity due to its ability to effectively utilize accelerators, which are essential for training a massive number of deep learning models. Despite the existence of numerous scheduling algorithms for elastic training, most lack an easy-to-use yet efficient platform to execute them. In this paper, we present Voda, a GPU scheduling platform for elastic deep learning. In contrast to prior approaches that employ parameter servers for elastic training, Voda is designed for AllReduce-style communication, which proves to be more effective, albeit more complex to adjust. Voda, built on top of Kubernetes, consists of a set of loosely coupled components that collect runtime information, dynamically alter the resource allocation, and optimize job placement based on communication costs among underlying GPUs. We implement and compare four scheduling algorithms for elastic training, including three existing methods and one newly proposed, on Voda, with different workloads, job distributions, and arrival patterns. Experimental results demonstrate that no single algorithm dominates all performance metrics, such as average job completion time, running time, or makespan. However, certain algorithms outperform others under specific workloads and job distributions. Additionally, our experiments highlight the significance of job placement in GPU clusters, and our proposed method effectively optimizes communication costs among different workers of a job.