Computer Science
Deep Learning
Artificial Neural Network
Scalability
Scale (ratio)
Artificial Intelligence
Convolutional Neural Network
Authors
Lipeng Wang, Songgao Ye, Baichen Yang, Youyou Lu, Hequan Zhang, Shengen Yan, Qiong Luo
Source
Venue: International Conference on Parallel Processing
Date: 2020-08-17
Cited by: 5
Identifiers
DOI: 10.1145/3404397.3404472
Abstract
We observe three problems in existing storage and caching systems for deep-learning training (DLT) tasks: (1) accessing a dataset containing a large number of small files takes a long time, (2) global in-memory caching systems are vulnerable to node failures and slow to recover, and (3) repeatedly reading a dataset of files in shuffled orders is inefficient when the dataset is too large to be cached in memory. Therefore, we propose DIESEL, a dataset-based distributed storage and caching system for DLT tasks. Our approach is via a storage-caching system co-design. Firstly, since accessing small files is a metadata-intensive operation, DIESEL decouples the metadata processing from metadata storage, and introduces metadata snapshot mechanisms for each dataset. This approach speeds up metadata access significantly. Secondly, DIESEL deploys a task-grained distributed cache across the worker nodes of a DLT task. This way node failures are contained within each DLT task. Furthermore, the files are grouped into large chunks in storage, so the recovery time of the caching system is reduced greatly. Thirdly, DIESEL provides chunk-based shuffle so that the performance of random file access is improved without sacrificing training accuracy. Our experiments show that DIESEL achieves a linear speedup on metadata access, and outperforms an existing distributed caching system in both file caching and file reading. In real DLT tasks, DIESEL halves the data access time of an existing storage system, and reduces the training time by hours without changing any training code.
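The chunk-based shuffle described in the abstract can be pictured as a two-level permutation: the order of chunks is randomized so storage serves large sequential reads, and the files inside each chunk are shuffled again after the chunk is fetched, so the training loop still sees a randomized file sequence. The sketch below is a minimal Python illustration of that idea under these assumptions; the function name `chunk_based_shuffle`, the chunk size, and the file-ID format are hypothetical and do not reflect DIESEL's actual interface.

```python
# Hypothetical sketch of a two-level, chunk-based shuffle (not DIESEL's API).
# Files are grouped into fixed-size chunks; each epoch we shuffle the chunk
# order and the file order inside every chunk, so storage sees large
# sequential chunk reads while training still gets a randomized sequence.
import random
from typing import Iterator, List, Sequence


def chunk_based_shuffle(
    file_ids: Sequence[str],
    chunk_size: int,
    seed: int,
) -> Iterator[str]:
    """Yield file IDs in a chunk-shuffled order for one epoch."""
    rng = random.Random(seed)

    # Group file IDs into chunks as they are laid out in storage.
    chunks: List[List[str]] = [
        list(file_ids[i:i + chunk_size])
        for i in range(0, len(file_ids), chunk_size)
    ]

    # Level 1: shuffle the order in which chunks are read.
    rng.shuffle(chunks)

    for chunk in chunks:
        # Level 2: shuffle files inside the chunk after it is fetched.
        rng.shuffle(chunk)
        yield from chunk


# Example: iterate one epoch over 10 files packed into chunks of 4.
if __name__ == "__main__":
    files = [f"img_{i:04d}.jpg" for i in range(10)]
    for name in chunk_based_shuffle(files, chunk_size=4, seed=0):
        print(name)
```

The design trade-off this illustrates is that randomness is restricted to the chunk granularity plus intra-chunk permutation, which trades a small amount of shuffling entropy for much better sequential I/O on large datasets; the abstract reports that this does not sacrifice training accuracy.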