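Computer science
Reinforcement learning
Scheduling (production processes)
Artificial intelligence
Machine learning
Job shop scheduling
Task (project management)
Distributed computing
Mathematical optimization
Computer network
Mathematics
Routing (electronic design automation)
Economy
Management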
Authors
Yihong Li, Xiaoxi Zhang, Tianyu Zeng, Jingpu Duan, Chuan Wu, Di Wu, Xu Chen
Identifier
DOI:10.1109/tpds.2023.3313779
Abstract
Machine learning (ML) tasks are one of the major workloads in today's edge computing networks. Existing edge-cloud schedulers allocate the requested amounts of resources to each task, falling short of best utilizing the limited edge resources for ML tasks. This paper proposes TapFinger, a distributed scheduler for edge clusters that minimizes the total completion time of ML tasks through co-optimizing task placement and fine-grained multi-resource allocation. To learn the tasks' uncertain resource sensitivity and enable distributed scheduling, we adopt multi-agent reinforcement learning (MARL) and propose several techniques to make it efficient, including a heterogeneous graph attention network as the MARL backbone, a tailored task selection phase in the actor network, and the integration of Bayes' theorem and masking schemes. We first implement a single-task scheduling version, which schedules at most one task each time. Then we generalize to the multi-task scheduling case, in which a sequence of tasks is scheduled simultaneously. Our design can mitigate the expanded decision space and yield fast convergence to optimal scheduling solutions. Extensive experiments using synthetic and test-bed ML task traces show that TapFinger can achieve up to 54.9% reduction in the average task completion time and improve resource efficiency as compared to state-of-the-art schedulers.