计算机科学
可扩展性
并行计算
PCI Express
库达
程序设计范式
架空(工程)
嵌入式系统
操作系统
现场可编程门阵列
程序设计语言
作者
Harini Muthukrishnan,Daniel C. Lustig,Oreste Villa,Thomas F. Wenisch,David Nellans
标识
DOI:10.1109/hpca56546.2023.10070949
摘要
Recent studies have shown that using fine-grained peer-to-peer (P2P) stores to communicate among devices in multi-GPU systems is a promising path to achieve strong performance scaling. In many irregular applications, such as graph algorithms and sparse linear algebra, small sub-cache line (4-32B) stores arise naturally when using the P2P paradigm. This is particularly problematic in multi-GPU systems because inter-GPU interconnects are optimized for bulk transfers rather than small operations. As a consequence, application developers either resort to complex programming techniques to work around this small transfer inefficiency or fall back to bulk inter-GPU DMA transfers that have limited performance scalability. We propose FinePack, a set of limited I/O interconnect and GPU hardware enhancements that enable small peer-to-peer stores to achieve interconnect efficiency that rivals bulk transfers while maintaining the simplicity of a peer-to-peer memory access programming model. Exploiting the GPU’s weak memory model, FinePack dynamically coalesces and compresses small writes into a larger I/O message that reduces link-level protocol overhead. FinePack is fully transparent to software and requires no changes to the GPU’s virtual memory system. We evaluate FinePack on a system comprising 4 Volta GPUs on a PCIe 4.0 interconnect to show FinePack improves interconnect efficiency for small peer-to-peer stores by 3×. This results in 4-GPU strong scaling performance 1.4× better than traditional DMA based multi-GPU programming and comes within 71% of the maximum achievable strong scaling performance.
科研通智能强力驱动
Strongly Powered by AbleSci AI