Deep Learning-Based Job Placement in Distributed Machine Learning Clusters With Heterogeneous Workloads

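Keywords: computer science; reinforcement learning; artificial intelligence; distributed computing; machine learning; server; workload; scalability; computer network; operating system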

Authors

Yixin Bao,Yanghua Peng,Chuan Wu

Source

Journal: IEEE/ACM Transactions on Networking [Institute of Electrical and Electronics Engineers]
Date: 2022-09-08; Volume (issue): 31 (2): 634-647; Citations: 22

Identifier

DOI: 10.1109/tnet.2022.3202529

Abstract

Nowadays, most leading IT companies host a variety of distributed machine learning (ML) workloads in ML clusters to support AI-driven services, such as speech recognition, machine translation, and image processing. While multiple jobs are executed concurrently in a shared cluster to improve resource utilization, interference among co-located ML jobs can lead to significant performance downgrade. Existing cluster schedulers, such as YARN and Mesos, are interference-agnostic in their job placement, leading to suboptimal resource efficiency and usage. Some literature has studied interference-aware job placement policy, but relies on detailed workload profiling and interference modeling, which is not a general solution. In this work, we present Harmony, a deep learning-driven ML cluster scheduler that places heterogeneous training jobs (either with parameter server architecture or all-reduce architecture) in a manner that minimizes interference and maximizes performance (i.e., training completion time minimization). The design of Harmony is based on a carefully designed deep reinforcement learning (DRL) framework enhanced with reward modeling. The DRL integrates a dynamic sequence-to-sequence model with the state-of-the-art techniques to stabilize training and improve convergence, including actor-critic algorithm, job-aware action space exploration, multi-head attention, and experience replay. In view of a common lack of reward samples corresponding to different placement decisions, we build an auxiliary sequence-to-sequence reward prediction model, which is trained with historical samples and used for producing reward for unseen placement. Experiments using real ML workloads in a Kubernetes cluster of 6 GPU servers show that Harmony outperforms representative schedulers by 16%–42% in terms of average job completion time.
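
The headline result above is stated as a percentage gain in average job completion time (JCT). The abstract does not spell out the formula, so the following is only a minimal sketch of one common reading: the relative reduction in mean JCT against a baseline scheduler. The function names and all JCT values here are hypothetical and chosen for illustration; they are not taken from the paper.

```python
from statistics import mean


def average_jct(completion_times):
    """Mean job completion time, in the same unit as the inputs."""
    if not completion_times:
        raise ValueError("need at least one job")
    return mean(completion_times)


def relative_improvement(baseline_avg, candidate_avg):
    """Fractional reduction in average JCT relative to the baseline.

    Assumption: 'outperforms by X%' means (baseline - candidate) / baseline.
    """
    return (baseline_avg - candidate_avg) / baseline_avg


# Hypothetical per-job completion times in minutes (not data from the paper).
baseline_jct = [100.0, 120.0, 80.0]   # mean is 100.0
candidate_jct = [70.0, 90.0, 80.0]    # mean is 80.0

improvement = relative_improvement(
    average_jct(baseline_jct), average_jct(candidate_jct)
)
print(f"{improvement:.0%}")  # prints "20%"
```

Under this reading, a figure such as 16% would mean the candidate scheduler's mean JCT is 84% of the baseline's over the same set of jobs.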
