PPoPP 2022
Sat 2 - Wed 6 April 2022
Mon 4 Apr 2022 13:20 - 13:35 - Session 3 Chair(s): Bin Ren

Communication overhead is one of the major obstacles to training large deep learning models at scale. Gradient sparsification is a promising technique to reduce the communication volume. However, it is very challenging to obtain real performance improvement because of (1) the difficulty of achieving a scalable and efficient sparse allreduce algorithm and (2) the sparsification overhead. This paper proposes Ok-Topk, a scheme for distributed training with sparse gradients. Ok-Topk integrates a novel sparse allreduce algorithm (less than 6k communication volume, which is asymptotically optimal) with the decentralized parallel Stochastic Gradient Descent (SGD) optimizer, and its convergence is proved. To reduce the sparsification overhead, Ok-Topk efficiently selects the top-k gradient values according to an estimated threshold. Evaluations are conducted on the Piz Daint supercomputer with neural network models from different deep learning domains. Empirical results show that Ok-Topk achieves similar model accuracy to dense allreduce. Compared with the optimized dense and the state-of-the-art sparse allreduces, Ok-Topk is more scalable and significantly improves training throughput (e.g., 3.29x-12.95x improvement for BERT on 256 GPUs).
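To illustrate the threshold-based top-k selection mentioned in the abstract, below is a minimal NumPy sketch (not the authors' implementation): the threshold that keeps roughly the k largest-magnitude gradient entries is estimated from a small random sample, so the full gradient never needs to be sorted. Function names and the sample size are illustrative assumptions.

```python
import numpy as np

def estimate_threshold(grad, k, sample_size=10_000):
    """Estimate the magnitude threshold that keeps roughly the k largest
    entries by sorting only a small random sample of the gradient.
    (Illustrative sketch, not the Ok-Topk implementation.)"""
    flat = np.abs(grad.ravel())
    n = flat.size
    sample = np.random.choice(flat, size=min(sample_size, n), replace=False)
    keep_frac = k / n  # fraction of entries to keep overall
    # The (1 - keep_frac) quantile of the sampled magnitudes approximates
    # the k-th largest magnitude of the full gradient.
    return np.quantile(sample, 1.0 - keep_frac)

def sparsify(grad, k):
    """Return (indices, values) of entries whose magnitude exceeds the
    estimated threshold, i.e. approximately the top-k gradient values."""
    thr = estimate_threshold(grad, k)
    flat = grad.ravel()
    idx = np.nonzero(np.abs(flat) >= thr)[0]
    return idx, flat[idx]

# Example: keep roughly 1% of a 1M-element gradient.
g = np.random.randn(1_000_000).astype(np.float32)
idx, vals = sparsify(g, k=10_000)
print(len(idx))  # approximately 10,000 selected entries
```

In a distributed setting, each worker would exchange only these (index, value) pairs in the sparse allreduce rather than the dense gradient, which is where the communication-volume savings come from.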

Mon 4 Apr

Displayed time zone: Eastern Time (US & Canada)

12:50 - 13:35
Session 3: Main Conference
Chair(s): Bin Ren Pacific Northwest National Laboratories
12:50
15m
Talk
QGTC: Accelerating Quantized Graph Neural Networks via GPU Tensor Core
Main Conference
Yuke Wang UC Santa Barbara, Boyuan Feng University of California Santa Barbara, Yufei Ding University of California at Santa Barbara
13:05
15m
Talk
FasterMoE: Modeling and Optimizing Training of Large-Scale Dynamic Pre-Trained Models
Main Conference
Jiaao He Tsinghua University, China, Jidong Zhai Tsinghua University, Tiago Antunes Tsinghua University, Haojie Wang Tsinghua University, Fuwen Luo Tsinghua University, Shangfeng Shi Tsinghua University, Qin Li Tsinghua University
13:20
15m
Talk
Near-Optimal Sparse Allreduce for Distributed Deep Learning
Main Conference
Shigang Li ETH Zurich, Torsten Hoefler ETH Zurich