Tianshi Wang, Hongwei Kan, Qibo Sun, Shan Xiao, Shangguang Wang
Identifier
DOI:10.1109/icss55994.2022.00010
Abstract
Researchers and practitioners are exploiting Remote Direct Memory Access (RDMA) technology to improve the efficiency of distributed machine learning and to meet the demands of data-center applications. RDMA requires a lossless network link to reach its full performance. RDMA over Converged Ethernet version 2 (RoCEv2) focuses on congestion control but fails to achieve efficient packet loss recovery. Improved RoCE NIC (IRN) builds on RoCEv2 to address this issue, but because it does not use Priority-based Flow Control (PFC), it does not retain RoCEv2's advantage in detecting congestion. This paper proposes Feedback Data Flow Control (FDFC), a feedback-based method for congestion detection and link control in RDMA transmission that does not rely on PFC. FDFC detects and controls the link condition in real time to achieve precise congestion detection, congestion control, and efficient packet loss recovery.