FasterMoE: Modeling and Optimizing Training of Large-Scale Dynamic Pre-Trained Models (PPoPP 2022 - Main Conference)

Who

Jiaao He, Jidong Zhai, Tiago Antunes, Haojie Wang, Fuwen Luo, Shangfeng Shi, Qin Li

Track

PPoPP 2022 Main Conference

Time Zone

The program is currently displayed in (GMT-04:00) Eastern Time (US & Canada).

When

Mon 4 Apr 2022, 13:05 - 13:20, in Session 3. Chair(s): Bin Ren

Abstract

The current trend in deep learning is to scale models to extremely large sizes in order to increase their accuracy. Mixture-of-Experts (MoE) is the most popular pre-trained model architecture that makes it feasible to train models with more than a trillion parameters. Thanks to the dynamic activation of experts, i.e., shallow layers specialized in certain domains, it allows sparse training of bigger models, so that computation no longer grows linearly with model size. However, unlike traditional deep learning models, MoE poses significant challenges to the efficiency of training systems, including dynamic load imbalance, an inefficient synchronous execution mode, and congested all-to-all communication.
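The dynamic load imbalance mentioned in the abstract can be illustrated with a minimal Python sketch. It is not taken from FasterMoE: the function names, the top-1 routing rule, and the gate scores below are all hypothetical, chosen only to show how a gate that favors one expert leaves the others underused.

```python
from collections import Counter


def route_top1(gate_scores):
    """For each token (row), return the index of the expert with the highest score."""
    return [max(range(len(row)), key=row.__getitem__) for row in gate_scores]


def expert_load(assignments, num_experts):
    """Count how many tokens each expert receives."""
    counts = Counter(assignments)
    return [counts.get(e, 0) for e in range(num_experts)]


def imbalance_ratio(load):
    """Busiest expert's load divided by the mean load (1.0 means balanced)."""
    mean = sum(load) / len(load)
    return max(load) / mean if mean > 0 else 0.0


# Hypothetical gate scores: 6 tokens (rows) scored against 3 experts (columns).
gate_scores = [
    [0.7, 0.2, 0.1],
    [0.6, 0.3, 0.1],
    [0.1, 0.8, 0.1],
    [0.5, 0.4, 0.1],
    [0.9, 0.05, 0.05],
    [0.2, 0.3, 0.5],
]

assignments = route_top1(gate_scores)
load = expert_load(assignments, num_experts=3)
print("assignments:", assignments)
print("tokens per expert:", load)
print("imbalance ratio:", imbalance_ratio(load))
```

In this toy input, expert 0 receives four of the six tokens. If each expert lived on its own worker and execution were synchronous, the other two workers would wait for it to finish.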

To address these challenges, we first propose a performance model that can both accurately predict the latency of different operations of a specific training task and intuitively analyze its end-to-end performance via a novel roofline-like model. Then, guided by this model, we devise a dynamic shadowing approach to cope with load imbalance, and a smart fine-grained schedule that splits different operations and executes them concurrently. We also design a congestion-avoiding expert selection strategy that relieves network congestion to lower iteration latency when modification of expert selection is allowed. We implement and integrate the above optimizations as a general system, FasterMoE, empowering efficient distributed MoE model training. FasterMoE is evaluated on different cluster systems using up to 64 GPUs. It achieves 1.37× to 17.87× speedup compared with state-of-the-art systems for large models, including ZeRO, GShard, and BASE Layer.

Source code of FasterMoE is now available at https://github.com/thu-pacman/FasterMoE.

Jiaao He, Tsinghua University, China
Jidong Zhai, Tsinghua University, China
Tiago Antunes, Tsinghua University
Haojie Wang, Tsinghua University
Fuwen Luo, Tsinghua University
Shangfeng Shi, Tsinghua University
Qin Li, Tsinghua University

Session Program

Mon 4 Apr
Displayed time zone: Eastern Time (US & Canada)

12:50 - 13:35	Session 3 (Main Conference). Chair(s): Bin Ren, Pacific Northwest National Laboratories

12:50 15m Talk	QGTC: Accelerating Quantized Graph Neural Networks via GPU Tensor Core. Main Conference. Yuke Wang (UC Santa Barbara), Boyuan Feng (University of California Santa Barbara), Yufei Ding (University of California at Santa Barbara)
13:05 15m Talk	FasterMoE: Modeling and Optimizing Training of Large-Scale Dynamic Pre-Trained Models. Main Conference. Jiaao He (Tsinghua University, China), Jidong Zhai (Tsinghua University), Tiago Antunes (Tsinghua University), Haojie Wang (Tsinghua University), Fuwen Luo (Tsinghua University), Shangfeng Shi (Tsinghua University), Qin Li (Tsinghua University)
13:20 15m Talk	Near-Optimal Sparse Allreduce for Distributed Deep Learning. Main Conference. Shigang Li (ETH Zurich), Torsten Hoefler (ETH Zurich)