PRC: Process-centric Rate Control for Transient Stragglers in Hybrid-Parallel Training (Journal Article)

Overview

abstract

  • Stragglers remain an open challenge in large model training, forcing other GPUs to wait at synchronization barriers. While severe, persistent stragglers have been actively studied, their transient counterparts are often overlooked due to their short-lived nature. We reveal that the interplay between the transient straggler effects and network contention drastically amplifies the relative delays across workers, significantly slowing down training iterations. To address this, an intuitive approach is to prioritize bandwidth for the straggler, allowing it to catch up to the front-runners. However, while the transport layer is well-positioned to accommodate such responsive bandwidth control, conventional designs lack visibility into process-level progress. To bridge this gap, we propose PRC (Process-centric Rate Control), a new sending-rate control designed to mitigate the transient straggler effects. PRC adjusts NIC sending rates by inferring local GPU process-level information, enabling transient straggler processes to utilize more bandwidth on time. Extensive experiments on both a real-world cluster and large-scale simulations confirm that PRC effectively accelerates transient stragglers, achieving a training speedup of up to 28% compared to using state-of-the-art datacenter congestion control schemes.
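The intuition stated in the abstract, that giving a transient straggler a larger share of bandwidth lets it catch up before the synchronization barrier, can be illustrated with a toy model. The sketch below is not PRC's algorithm, which the abstract does not specify. It assumes a single shared bottleneck link, a fixed per-worker start delay, and static per-worker rates, and all names and numbers (`capacity`, `delays`, `sizes`, `straggler_aware_rates`) are hypothetical. It compares an equal split of the link against rates chosen so that all workers finish at the same time.

```python
# Toy model (an assumption, not the paper's method): n workers share one link.
# Worker i starts sending after delays[i] seconds and must send sizes[i] units.
# With a static rate r_i, it finishes at delays[i] + sizes[i] / r_i.
# The iteration ends when the slowest worker finishes (the barrier).

def finish_times(delays, sizes, rates):
    """Per-worker finish time under fixed sending rates."""
    return [d + s / r for d, s, r in zip(delays, sizes, rates)]


def total_demand(finish, delays, sizes):
    """Aggregate rate needed for every worker to finish exactly at `finish`."""
    if any(finish <= d for d in delays):
        return float("inf")
    return sum(s / (finish - d) for d, s in zip(delays, sizes))


def straggler_aware_rates(capacity, delays, sizes, steps=100):
    """Bisect on the common finish time T so that the rates sum to capacity.

    Assumes capacity > 0 and every entry of sizes is positive.
    """
    lo = max(delays)                          # T must exceed every delay
    hi = max(delays) + sum(sizes) / capacity  # demand at hi is <= capacity
    for _ in range(steps):
        mid = (lo + hi) / 2
        if total_demand(mid, delays, sizes) > capacity:
            lo = mid   # finishing this early needs more than the link offers
        else:
            hi = mid   # feasible, try to finish earlier
    return [s / (hi - d) for d, s in zip(delays, sizes)]


# Hypothetical numbers: four workers, the last one is delayed by 0.5 s.
capacity = 100.0
delays = [0.0, 0.0, 0.0, 0.5]
sizes = [50.0, 50.0, 50.0, 50.0]

equal = [capacity / len(sizes)] * len(sizes)
aware = straggler_aware_rates(capacity, delays, sizes)

equal_iteration = max(finish_times(delays, sizes, equal))
aware_iteration = max(finish_times(delays, sizes, aware))

print("equal share iteration time:    ", round(equal_iteration, 3))
print("straggler-aware iteration time:", round(aware_iteration, 3))
print("straggler-aware rates:         ", [round(r, 2) for r in aware])
```

In this toy setting the delayed worker receives more than an equal share and the barrier is reached sooner than with an equal split. The static-rate assumption leaves the straggler's share idle before it starts sending, so a real transport-layer scheme would need to adapt rates over time; that part is outside what this sketch models.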

publication date

  • June 1, 2026

Date in CU Experts

  • June 11, 2026 5:26 AM

Full Author List

  • Han T; Hwang D; Park S; Yang D; Park G; Han S; Hwang J; Kang B; Ha S; Lee K

author count

  • 10

Other Profiles

Electronic International Standard Serial Number (EISSN)

  • 2834-5509

Additional Document Info

start page

  • 1

end page

  • 23

volume

  • 4

issue

  • CoNEXT2