PPoPP 2022
Sat 2 - Wed 6 April 2022
Mon 4 Apr 2022 13:05 - 13:20 - Session 3 Chair(s): Bin Ren

The current trend in deep learning is to scale models to extremely large sizes with the objective of increasing their accuracy. Mixture-of-Experts (MoE) is the most popular pre-trained model architecture that makes training models with beyond-trillion-scale parameters feasible. Thanks to the dynamic activation of experts, i.e., shallow layers specialized in certain domains, it allows sparse training of bigger models, removing the linear relationship between model size and computation. However, unlike traditional deep learning models, it poses huge challenges to the efficiency of training systems, including dynamic load imbalance, an inefficient synchronous execution mode, and congested all-to-all communication.
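The dynamic expert activation described above can be illustrated with a generic top-k gating sketch (this is a minimal illustration of MoE routing in general, not the paper's implementation; all names and sizes are assumptions):

```python
import numpy as np

def moe_gate(tokens, gate_weights, top_k=2):
    """Route each token to its top-k experts via a learned gate.

    Only the selected experts run for a given token, so compute per token
    stays constant as the total number of experts grows.
    """
    scores = tokens @ gate_weights                      # (num_tokens, num_experts)
    # Softmax over experts for each token (numerically stabilized)
    probs = np.exp(scores - scores.max(axis=1, keepdims=True))
    probs /= probs.sum(axis=1, keepdims=True)
    # Indices of the top-k highest-scoring experts per token
    top_experts = np.argsort(-probs, axis=1)[:, :top_k]
    return top_experts, probs

rng = np.random.default_rng(0)
tokens = rng.standard_normal((8, 16))   # 8 tokens, hidden size 16 (toy sizes)
gate_w = rng.standard_normal((16, 4))   # gate projecting onto 4 experts
experts, probs = moe_gate(tokens, gate_w)
```

Because each token activates only `top_k` of the experts, which experts are busy changes from batch to batch, which is exactly what creates the dynamic load imbalance and all-to-all traffic the paper targets.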

To address these challenges, we first propose a performance model that can both accurately predict the latency of different operations of a specific training task and intuitively analyze its end-to-end performance via a novel roofline-like model. Then, guided by this model, we invent a dynamic shadowing approach to cope with load imbalance, and a smart fine-grained schedule that splits different operations and executes them concurrently. We also design a congestion-avoiding expert selection strategy that relieves network congestion and lowers iteration latency, when modification of expert selection is allowed. We implement and integrate the above optimizations as a general system, FasterMoE, empowering efficient distributed MoE model training. FasterMoE is evaluated on different cluster systems using up to 64 GPUs. It achieves 1.37×–17.87× speedup compared with state-of-the-art systems for large models, including ZeRO, GShard, and BASE Layer.
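The intuition behind model-guided shadowing can be sketched as a toy cost comparison: when one expert attracts many tokens, broadcasting a replica of the expert's parameters may be cheaper than shipping all those tokens to it. The rule below is an illustrative assumption (uniform bandwidth, communication cost only), not the paper's actual performance model:

```python
def should_shadow(num_tokens, token_bytes, expert_param_bytes, bandwidth):
    """Toy decision rule: shadow (replicate) an overloaded expert locally
    when broadcasting its parameters costs less than moving its tokens.

    All quantities are illustrative; the real system uses a fitted
    performance model covering computation and communication.
    """
    move_tokens_cost = num_tokens * token_bytes / bandwidth   # all-to-all send
    replicate_cost = expert_param_bytes / bandwidth           # parameter broadcast
    return replicate_cost < move_tokens_cost

# A heavily loaded expert: replicating 64 MB of parameters beats
# moving 1M tokens of 4 KB each, so shadowing wins.
hot = should_shadow(1_000_000, 4096, 64_000_000, 1e9)   # True
# A lightly loaded expert: moving 10 tokens is far cheaper.
cold = should_shadow(10, 4096, 64_000_000, 1e9)          # False
```

Evaluating such a rule per expert per iteration lets the system react to the dynamic, batch-dependent imbalance of MoE routing instead of using a static placement.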

Source code of FasterMoE is now available at https://github.com/thu-pacman/FasterMoE.

Mon 4 Apr

Displayed time zone: Eastern Time (US & Canada)

12:50 - 13:35
Session 3
Main Conference
Chair(s): Bin Ren Pacific Northwest National Laboratories
QGTC: Accelerating Quantized Graph Neural Networks via GPU Tensor Core
Main Conference
Yuke Wang UC Santa Barbara, Boyuan Feng University of California Santa Barbara, Yufei Ding University of California at Santa Barbara
FasterMoE: Modeling and Optimizing Training of Large-Scale Dynamic Pre-Trained Models
Main Conference
Jiaao He Tsinghua University, China, Jidong Zhai Tsinghua University, Tiago Antunes Tsinghua University, Haojie Wang Tsinghua University, Fuwen Luo Tsinghua University, Shangfeng Shi Tsinghua University, Qin Li Tsinghua University
Near-Optimal Sparse Allreduce for Distributed Deep Learning
Main Conference
Shigang Li ETH Zurich, Torsten Hoefler ETH Zurich