Abstract
Machine learning has become an indispensable service for cloud providers. Elastic training, a novel training paradigm that dynamically adjusts the resources allocated to a group of training jobs, has gained popularity because it makes effective use of accelerators, which are essential for training large numbers of deep learning models. Although numerous scheduling algorithms for elastic training have been proposed, most lack an easy-to-use yet efficient platform on which to run. In this paper, we present Voda, a GPU scheduling platform for elastic deep learning. In contrast to prior approaches that rely on parameter servers for elastic training, Voda is designed for AllReduce-style communication, which is more effective but also more complex to adjust. Built on top of Kubernetes, Voda consists of a set of loosely coupled components that collect runtime information, dynamically alter resource allocations, and optimize job placement based on the communication costs among the underlying GPUs. On Voda, we implement and compare four scheduling algorithms for elastic training, three existing methods and one newly proposed, under different workloads, job distributions, and arrival patterns. Experimental results show that no single algorithm dominates across all performance metrics, such as average job completion time, running time, and makespan; however, certain algorithms outperform others under specific workloads and job distributions. Our experiments also highlight the importance of job placement in GPU clusters, and our proposed method effectively reduces communication costs among the workers of a job.