PredictDDL: Reusable Workload Performance Prediction for Distributed Deep Learning

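Workload, Computer science, Retraining, Machine learning, Artificial intelligence, Artificial neural network, Inference, Deep learning, Performance prediction, Simulation, Operating system, International trade, Business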
Authors
Kevin Assogba, Eduardo Lima, M. Mustafa Rafique, Minseok Kwon
Identifier
DOI:10.1109/cluster52292.2023.00009
Abstract

Accurately predicting the training time of deep learning (DL) workloads is critical for optimizing the utilization of data centers and allocating the cluster resources required to complete critical model training tasks before a deadline. State-of-the-art prediction models, e.g., Ernest and Cherrypick, treat DL workloads as black boxes and require running the given DL job on a fraction of the dataset. Moreover, they require retraining their prediction models every time a change occurs in the given DL workload. This significantly limits the reusability of prediction models across DL workloads with different deep neural network (DNN) architectures. In this paper, we address this challenge and propose a novel approach in which the prediction model is trained only once for a particular dataset type, e.g., ImageNet, thus completely avoiding tedious and costly retraining tasks when predicting the training time of new DL workloads. Our proposed approach, called PredictDDL, provides an end-to-end system for predicting the training time of DL models in distributed settings. PredictDDL leverages Graph HyperNetworks, a class of neural networks that take computational graphs as input and produce vector representations of their DNNs. PredictDDL is the first prediction system that eliminates the need to retrain a performance prediction model for each new DL workload and maximizes the reuse of the prediction model by requiring a DL workload to be run only once to train the prediction model. Our extensive evaluation using representative workloads shows that PredictDDL achieves up to 9.8× lower average prediction error and 10.3× lower inference time compared to the state-of-the-art system, i.e., Ernest, on multiple DNN architectures.
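
The reuse idea in the abstract (embed a DNN's computational graph as a vector, then predict training time from that vector without running the new workload) can be illustrated with a small sketch. Everything in it is an assumption made for illustration: the abstract does not describe PredictDDL's features or regressor, so the op-count `embed` function is only a toy stand-in for a Graph HyperNetwork embedding, the nearest-neighbor regressor is a placeholder for whatever model the authors use, and the graphs and timings are made-up values, not results from the paper.

```python
from collections import Counter
from math import dist

# Hypothetical operator vocabulary; not taken from the paper.
OPS = ("conv", "linear", "norm", "relu", "pool")


def embed(graph):
    """Toy stand-in for a learned graph embedding (NOT a Graph HyperNetwork):
    per-operator node counts followed by the number of edges."""
    counts = Counter(graph["nodes"])
    return [float(counts[op]) for op in OPS] + [float(len(graph["edges"]))]


def features(graph, num_workers):
    """Concatenate the architecture vector with one cluster feature."""
    return embed(graph) + [float(num_workers)]


class NearestNeighborPredictor:
    """Placeholder regressor: averages the times of the k closest past runs."""

    def fit(self, runs):
        # runs: iterable of (graph, num_workers, observed_training_time)
        self.samples = [(features(g, w), t) for g, w, t in runs]
        return self

    def predict(self, graph, num_workers, k=1):
        x = features(graph, num_workers)
        nearest = sorted(self.samples, key=lambda s: dist(x, s[0]))[:k]
        return sum(t for _, t in nearest) / len(nearest)


def chain(nodes):
    """Build a simple sequential graph from a list of operator names."""
    return {"nodes": nodes, "edges": [(i, i + 1) for i in range(len(nodes) - 1)]}


small = chain(["conv", "relu", "pool", "linear"])
large = chain(["conv", "norm", "relu"] * 4 + ["pool", "linear"])

# Made-up training times in seconds, for illustration only.
history = [(small, 1, 100.0), (small, 4, 30.0), (large, 1, 400.0), (large, 4, 120.0)]
predictor = NearestNeighborPredictor().fit(history)

# An architecture absent from the history is scored without being executed
# and without refitting the predictor.
unseen = chain(["conv", "norm", "relu"] * 3 + ["pool", "linear"])
estimate = predictor.predict(unseen, num_workers=4)
```

The point of the sketch is the division of labor: the predictor is fitted once on past runs, and a new architecture only needs to be embedded, not executed, to obtain an estimate. A learned embedding would replace `embed` in a real system.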