Abstract
Machine learning has become an indispensable service for cloud providers. Elastic training, a novel training paradigm that dynamically adjusts the resources allocated to a group of training jobs, has gained popularity because it makes effective use of accelerators, which are essential for training large numbers of deep learning models. Although numerous scheduling algorithms for elastic training have been proposed, most lack an easy-to-use yet efficient platform on which to run. In this paper, we present Voda, a GPU scheduling platform for elastic deep learning. In contrast to prior approaches that rely on parameter servers for elastic training, Voda is designed for AllReduce-style communication, which is more effective but also more complex to adjust. Built on top of Kubernetes, Voda consists of a set of loosely coupled components that collect runtime information, dynamically alter resource allocations, and optimize job placement based on the communication costs among the underlying GPUs. On Voda, we implement and compare four scheduling algorithms for elastic training, three existing methods and one newly proposed, under different workloads, job distributions, and arrival patterns. Experimental results show that no single algorithm dominates across all performance metrics, such as average job completion time, running time, and makespan; however, certain algorithms outperform others under specific workloads and job distributions. Our experiments also highlight the importance of job placement in GPU clusters, and our proposed method effectively reduces communication costs among the workers of a job.