FasterMoE: Modeling and Optimizing Training of Large-Scale Dynamic Pre-Trained Models
The current trend in deep learning is to scale models to extremely large sizes with the objective of increasing their accuracy. Mixture-of-Experts (MoE) is a popular pre-training architecture that makes training models with beyond trillion-scale parameters feasible. Because only a few experts, i.e., sub-networks specialized in certain domains, are dynamically activated per input, MoE enables sparse training of larger models and breaks the linear coupling between model size and computation. However, unlike traditional deep learning models, MoE poses serious challenges to the efficiency of training systems, including dynamic load imbalance, an inefficient synchronous execution mode, and congested all-to-all communication.
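The dynamic expert activation described above can be illustrated with a minimal sketch (not the paper's code; all names and sizes here are illustrative assumptions): each token is routed to its top-k experts, so per-token computation stays constant regardless of how many experts, and thus parameters, the model holds.

```python
# Illustrative MoE routing sketch: top-k gating with sparse expert activation.
# Shapes and weights are random placeholders, not from any real model.
import numpy as np

rng = np.random.default_rng(0)
d_model, n_experts, k = 8, 4, 2
tokens = rng.standard_normal((5, d_model))          # 5 input tokens
gate_w = rng.standard_normal((d_model, n_experts))  # gating network weights
experts = [rng.standard_normal((d_model, d_model)) for _ in range(n_experts)]

scores = tokens @ gate_w                      # routing score per (token, expert)
topk = np.argsort(scores, axis=1)[:, -k:]     # indices of top-k experts per token

out = np.zeros_like(tokens)
for t in range(tokens.shape[0]):
    # softmax over the selected experts' scores only
    s = scores[t, topk[t]]
    w = np.exp(s - s.max())
    w /= w.sum()
    for weight, e in zip(w, topk[t]):
        out[t] += weight * (tokens[t] @ experts[e])  # only k of n_experts run

print(out.shape)  # (5, 8): same shape as the input, computed sparsely
```

In a distributed setting, the experts selected by `topk` live on different GPUs, which is why token dispatch requires the all-to-all communication the abstract identifies as a bottleneck.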
To address these challenges, we first propose a performance model that both accurately predicts the latency of the different operations of a specific training task and supports intuitive end-to-end analysis via a novel roofline-like model. Guided by this model, we develop a dynamic shadowing approach to cope with load imbalance, and a smart fine-grained schedule that splits different operations and executes them concurrently. We also design a congestion-avoiding expert selection strategy that relieves network congestion and lowers iteration latency, when modification of expert selection is allowed. We implement and integrate the above optimizations into a general system, FasterMoE, enabling efficient distributed MoE model training. FasterMoE is evaluated on different cluster systems using up to $64$ GPUs. It achieves $1.37\times$ - $17.87\times$ speedup compared with state-of-the-art systems for large models, including ZeRO, GShard, and BASE Layer.
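A roofline-like latency model of the kind described above can be sketched as follows. This is a hedged illustration of the general idea, not the paper's actual model: the constants, function name, and the simple max-of-bounds form are assumptions for exposition.

```python
# Roofline-style latency sketch: an operation's time is bounded by either
# compute throughput or communication bandwidth, whichever dominates.
# The hardware constants below are illustrative placeholders, not measured.
PEAK_FLOPS = 100e12    # assumed GPU compute peak: 100 TFLOP/s
NET_BW = 25e9          # assumed per-GPU all-to-all bandwidth: 25 GB/s

def op_latency(flops, comm_bytes):
    """Predicted latency (seconds): max of compute time and transfer time."""
    return max(flops / PEAK_FLOPS, comm_bytes / NET_BW)

# A large expert GEMM is compute-bound; an all-to-all exchange of the same
# activations is bandwidth-bound.
print(op_latency(flops=2e12, comm_bytes=1e6))  # 0.02 s, compute dominates
print(op_latency(flops=1e9, comm_bytes=1e9))   # 0.04 s, communication dominates
```

Summing such per-operation bounds along the critical path gives an end-to-end iteration estimate, which is the kind of analysis that motivates overlapping computation with communication via fine-grained scheduling.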
Source code of FasterMoE is available at https://github.com/thu-pacman/FasterMoE.
Session: Mon 4 Apr, 12:50 - 13:35 (Eastern Time, US & Canada)
- QGTC: Accelerating Quantized Graph Neural Networks via GPU Tensor Core
- FasterMoE: Modeling and Optimizing Training of Large-Scale Dynamic Pre-Trained Models
- Near-Optimal Sparse Allreduce for Distributed Deep Learning