Edge-MoE: Memory-Efficient Multi-Task Vision Transformer Architecture with Task-Level Sparsity via Mixture-of-Experts

Authors
Rishov Sarkar,Hanxue Liang,Zhiwen Fan,Zhangyang Wang,Cong Hao
Identifier
DOI:10.1109/iccad57390.2023.10323651
Abstract

The computer vision community is embracing two promising learning paradigms: the Vision Transformer (ViT) and Multi-task Learning (MTL). ViT models show extraordinary performance over traditional convolutional networks but are commonly recognized as computation-intensive, especially the self-attention with its quadratic complexity. MTL uses one model to infer multiple tasks with better performance by enforcing shared representations among tasks, but a major drawback is that most MTL regimes require activating the entire model even when only one or a few tasks are needed, causing significant wasted computation. M³ViT is the latest multi-task ViT model that introduces mixture-of-experts (MoE), where only a small portion of subnetworks ("experts") are sparsely and dynamically activated based on the current task. M³ViT achieves better accuracy and over 80% computation reduction, paving the way for efficient real-time MTL using ViT. Despite the algorithmic advantages of MTL, ViT, and even M³ViT, many challenges remain for efficient deployment on FPGA. For instance, in general Transformer/ViT models, self-attention is computationally intensive and requires high memory bandwidth. In addition, softmax operations and the GELU activation function are used extensively, and together they can consume more than half of the entire FPGA's LUT resources. In the M³ViT model, the promising MoE mechanism for multi-tasking exposes new memory access overhead and further increases resource usage because of the additional layer types. To address these challenges in both general Transformer/ViT models and the state-of-the-art multi-task M³ViT with MoE, we propose Edge-MoE, the first end-to-end FPGA accelerator for multi-task ViT, with a rich collection of architectural innovations.
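The task-level sparsity described above can be illustrated with a minimal sketch: a task-conditioned gate scores the experts, and only the top-k experts actually run for the current task. The function and variable names here (`task_moe_layer`, `gate_weights`) are hypothetical illustrations of the general MoE routing idea, not the M³ViT implementation.

```python
import numpy as np

def task_moe_layer(x, experts, gate_weights, task_id, k=2):
    """Illustrative task-level MoE routing: only the top-k experts
    selected by a task-conditioned gate are executed; the remaining
    experts contribute no computation."""
    logits = gate_weights[task_id]          # one gate logit per expert
    topk = np.argsort(logits)[-k:]          # indices of the k active experts
    weights = np.exp(logits[topk] - logits[topk].max())
    weights /= weights.sum()                # softmax over active experts only
    # Only the selected experts run; the others stay idle.
    return sum(w * experts[i](x) for w, i in zip(weights, topk))

# Toy usage: 4 linear experts, 3 tasks; task 0 activates 2 of the 4 experts.
rng = np.random.default_rng(0)
experts = [lambda x, W=rng.standard_normal((8, 8)): x @ W for _ in range(4)]
gate_weights = rng.standard_normal((3, 4))  # 3 tasks x 4 experts
x = rng.standard_normal((16, 8))            # 16 patch tokens, embedding dim 8
y = task_moe_layer(x, experts, gate_weights, task_id=0)
print(y.shape)  # (16, 8)
```

With k of N experts active, the expert FLOPs scale roughly with k/N, which is the source of the 80%+ computation reduction reported for M³ViT.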
First, for general Transformer/ViT models, we propose (1) a novel reordering mechanism for self-attention, which reduces the bandwidth requirement from proportional to constant regardless of the target parallelism; (2) a fast single-pass softmax approximation; (3) an accurate, low-cost GELU approximation, which significantly reduces computation latency and resource usage; and (4) a unified, flexible computing unit that can be shared by almost all computational layers to minimize resource usage. Second, for the advanced multi-task M³ViT with MoE, we propose a novel patch reordering method that completely eliminates memory access overhead. Third, we deliver an on-board implementation and measurements on a Xilinx ZCU102 FPGA, with verified functionality and an open-sourced hardware design, achieving 2.24× and 4.90× better energy efficiency compared with a GPU (A6000) and a CPU (Xeon 6226R), respectively. A real-time video demonstration of our accelerated multi-task ViT on an autonomous driving dataset is available on GitHub (https://github.com/sharc-lab/Edge-MoE/raw/main/demo.mp4), together with our FPGA design using High-Level Synthesis, host code, FPGA bitstream, and on-board performance results.
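Two of the listed techniques have well-known software analogues that can be sketched briefly. A "single-pass" softmax can be realized with the standard online-softmax recurrence, which maintains a running maximum and a rescaled running sum in one sweep instead of separate max/sum/normalize passes; and a low-cost GELU can use the common tanh-based approximation in place of the exact erf form. Both are shown here as plausible illustrations of the general ideas, not as the paper's hardware implementations.

```python
import math

def softmax_single_pass(xs):
    """Online softmax: one sweep maintains a running max m and a
    rescaled running sum s, avoiding a separate max-finding pass."""
    m, s = float("-inf"), 0.0
    for x in xs:
        m_new = max(m, x)
        s = s * math.exp(m - m_new) + math.exp(x - m_new)  # rescale old sum
        m = m_new
    return [math.exp(x - m) / s for x in xs]

def gelu_tanh(x):
    """Common tanh-based GELU approximation, cheaper than the exact
    erf-based definition: 0.5*x*(1 + tanh(sqrt(2/pi)*(x + 0.044715*x^3)))."""
    return 0.5 * x * (1.0 + math.tanh(math.sqrt(2.0 / math.pi)
                                      * (x + 0.044715 * x ** 3)))

probs = softmax_single_pass([1.0, 2.0, 3.0])
print([round(p, 4) for p in probs])  # [0.09, 0.2447, 0.6652]
print(round(gelu_tanh(1.0), 4))     # 0.8412
```

On hardware, the single-pass structure matters because each extra pass over an attention row costs another round of buffer reads; collapsing to one pass halves the memory traffic for the softmax stage.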