Migrating Deep Learning Data and Applications among Kubernetes Edge Nodes

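Keywords: Computer science; Stateful firewall; Node (physics); Enhanced Data Rates for GSM Evolution; Container (type theory); Edge computing; Edge device; Distributed computing; Analytics; Inference; Security token; Cluster (spacecraft); Computer network; Cluster analysis; Data mining; Artificial intelligence; Operating system; Engineering; Cloud computing; Transport engineering; Mechanical engineering; Structural engineering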

Authors

Suchanat Mangkhangcharoen, Jason Haga, Prapaporn Rattanatamrong

Identifiers

DOI: 10.1109/hpcc-dss-smartcity-dependsys53884.2021.00299

Abstract

Many current IoT applications deployed at the edge use deep learning (DL) in their real-time processing and analytics. Not only inference but also training is moving to edge devices. Migrating DL applications and datasets among these devices is necessary in scenarios such as node failure, user mobility, or when nodes need to collaborate (e.g., distributed training). Container technologies and Kubernetes (K8s) are increasingly being adopted to manage infrastructure at the edge. Unfortunately, K8s has no built-in mechanism to support migration of stateful containers between its cluster nodes. The K8s cluster's master node generally launches a fresh container on another node to replace the failed one. While there is an existing mechanism for migrating a Pod between K8s nodes, no past work has investigated the migration of DL datasets and containerized DL applications among K8s cluster nodes. In this paper, we present 1) a comprehensive study of the effectiveness and limitations of existing checkpointing mechanisms for containerized DL applications and 2) a comparative performance study of several approaches to migrating DL datasets and applications in a K8s cluster. Our results show that migrating the state of a DL application and restoring it from that state enables faster recovery (reducing training time by 10 to 73 percent) than re-running the model from the beginning, regardless of the percentage of epochs that have completed. Additionally, our experimental results show that transferring a dataset between K8s workers using the K8s persistent volume with kubectl cp is generally suitable and efficient. However, when network latency is high, using our customized middleware with a feedback controller to migrate data in parallel can reduce total migration time compared to the K8s persistent volume approach alone.
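
The recovery benefit described in the abstract (resuming from a saved state instead of re-running from the first epoch) can be illustrated with a minimal application-level sketch. This is not the checkpointing mechanism evaluated in the paper, which the abstract does not specify. The function name `train`, the `stop_after` parameter, and the toy "weight" update are all hypothetical stand-ins for a real training job.

```python
import os
import pickle
import tempfile


def train(total_epochs, ckpt_path, stop_after=None):
    """Toy training loop that checkpoints after every epoch.

    Returns (state, epochs_run_in_this_call).
    All names here are illustrative assumptions, not the paper's API.
    """
    # Resume from an existing checkpoint, otherwise start from scratch.
    if os.path.exists(ckpt_path):
        with open(ckpt_path, "rb") as f:
            state = pickle.load(f)
    else:
        state = {"epoch": 0, "weight": 0.0}

    epochs_run = 0
    while state["epoch"] < total_epochs:
        if stop_after is not None and epochs_run >= stop_after:
            break  # simulate a node failure or a migration trigger

        # Stand-in for one epoch of real training.
        state["weight"] += 0.5
        state["epoch"] += 1
        epochs_run += 1

        # Write to a temporary file and rename it, so an interrupted
        # write cannot leave a corrupted checkpoint behind.
        tmp_path = ckpt_path + ".tmp"
        with open(tmp_path, "wb") as f:
            pickle.dump(state, f)
        os.replace(tmp_path, ckpt_path)

    return state, epochs_run


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as workdir:
        ckpt = os.path.join(workdir, "ckpt.pkl")
        train(10, ckpt, stop_after=4)      # "source node" stops after 4 epochs
        state, ran = train(10, ckpt)       # "target node" resumes the job
        print(state["epoch"], ran)
```

In this sketch the second call runs only the remaining epochs, because the checkpoint file carries the epoch counter and model state across the interruption. In a K8s setting the checkpoint file would additionally have to be moved to, or shared with, the target node, which is the dataset and state transfer problem the abstract discusses.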
