The Frontier of Pipeline Parallelism: An Overview
The latest in optimizing conflicting tradeoffs for parallel model training at scale
This blog post about pipeline parallelism was originally a final project for a graduate course at CMU called “Machine Learning with Large Datasets”. Our assignment was to create a submission for the ICLR blog posts track.
This was written by me and my teammate Owen Li.
Motivation: Deep Learning and Distributed Systems
In modern machine learning, large deep neural networks with billions of parameters are run on distributed systems, with a single model being trained on many different machines. This presents a problem for model developers, who have to design machine learning systems for distributed training.
Training modern large-scale models requires effective techniques for partitioning and parallelizing computation. This problem ultimately results in three key traits to optimize for:
- how long the model will take to train
- how much memory it will require, and
- how efficiently the available accelerators (i.e. GPUs) will be used.
The efficiency of accelerator usage is related both to how long the model takes to train and how many resources may be required for a given model size.
Multiple types of parallelization techniques have emerged as model sizes and distributed training architectures have grown.
Parallel Model Training: Data vs Model Parallelism
Data parallelism parallelizes computation by creating a copy of the model on each accelerator, then partitioning, or “sharding”, the data and distributing it across the devices. The resulting gradients must be aggregated across devices before the weights are updated. This helps distribute the work for a model that fits on one device, but cannot address the case where the model itself is too large to fit on a single GPU.
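To make this concrete, below is a minimal NumPy sketch of the data-parallel pattern, with the “devices” simulated in-process and a toy least-squares model standing in for a real network (all names and sizes are illustrative):

```python
import numpy as np

# Minimal NumPy sketch of data parallelism (devices simulated in-process).
# Each "device" holds a full copy of the weights and a shard of the batch;
# gradients are averaged (all-reduce) before the shared update.
rng = np.random.default_rng(0)
n_devices, batch, d_in = 4, 32, 8
W = rng.normal(size=(d_in, 1))               # replicated weights
X, y = rng.normal(size=(batch, d_in)), rng.normal(size=(batch, 1))

shards = zip(np.array_split(X, n_devices), np.array_split(y, n_devices))
grads = []
for Xs, ys in shards:                        # runs "in parallel" on real hardware
    err = Xs @ W - ys                        # forward pass on the local shard
    grads.append(2 * Xs.T @ err / len(Xs))   # local gradient of the MSE loss

g = np.mean(grads, axis=0)                   # all-reduce: average the gradients
W -= 0.1 * g                                 # identical update on every replica
```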
In model parallelism, the model itself is divided into multiple partitions, which are then allocated to different devices. There are two ways to partition a model: “horizontally” and “vertically”.
The horizontal approach, “tensor parallelism” or intra-layer parallelism, splits each tensor in the model into chunks, and the partial results must be synchronized at each step during processing. The vertical approach, “pipeline parallelism” or inter-layer parallelism, segments the model into stages, like a pipeline, so that only the resulting activation values and gradients need to be transmitted between GPUs. Tensor parallelism generally has higher communication overhead than pipeline parallelism because the partial results of every tensor computation must be exchanged.
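The contrast is easier to see in code. The toy NumPy sketch below, with two simulated “devices” and arbitrary layer sizes, shows what has to cross the device boundary in each case:

```python
import numpy as np

# Toy contrast between the two model-parallel splits (2 simulated "devices").
rng = np.random.default_rng(0)
x = rng.normal(size=(1, 8))
W1, W2 = rng.normal(size=(8, 8)), rng.normal(size=(8, 4))

# Tensor (intra-layer) parallelism: split W1 by output columns.
# Each device multiplies against its shard; the partial outputs must be
# gathered/concatenated before the next layer, i.e. per-layer communication.
W1_a, W1_b = np.hsplit(W1, 2)
h = np.concatenate([x @ W1_a, x @ W1_b], axis=1)

# Pipeline (inter-layer) parallelism: device 0 owns layer 1, device 1 owns layer 2.
# Only the activation h crosses the device boundary.
h_dev0 = x @ W1            # computed on device 0
y_dev1 = h_dev0 @ W2       # computed on device 1 after receiving h_dev0

assert np.allclose(h, h_dev0)   # both splits compute the same math
```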
Of course, it is possible to use the idea of sharding data from data parallelism in addition to partitioning the model, known as hybrid parallelism. Recent approaches use many specific techniques to try to improve overall efficiency. Though it is unrealistic to absolutely minimize all three of storage, computation, and communication costs, a given approach will try to find the best tradeoff between them by attacking each of these issues with a specific technique.
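As a rough, hypothetical illustration (the 8-GPU layout below is made up and not tied to any particular framework), a hybrid setup can be pictured as a 2-D grid of devices:

```python
import numpy as np

# Hypothetical 8-GPU layout for hybrid parallelism: 2 data-parallel replicas,
# each replica being a 4-stage pipeline. This only illustrates the bookkeeping;
# real frameworks build their communication groups from the same idea.
n_data_parallel, n_pipeline_stages = 2, 4
grid = np.arange(n_data_parallel * n_pipeline_stages).reshape(
    n_data_parallel, n_pipeline_stages)
# grid[r, s] = GPU id holding stage s of replica r.
# GPUs in the same row form one pipeline (stages of a single replica);
# GPUs in the same column hold the same stage in different replicas and
# all-reduce that stage's gradients (the data-parallel part).
print(grid)
```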
Naive Model Parallelism, and a Problem
Consider the process of training a model consisting of p fully connected layers: there is a sequence of forward steps f_1, …, f_p, followed by a sequence of back-propagation steps b_p, b_{p-1}, …, b_1. If the model is too big, it may be necessary to spread the training task across several devices. In particular, since the forward and backward passes of the same stage use the same data, like the weight W of that fully connected layer, a natural choice is to designate a single “device” to handle both the forward and backward passes of a given stage.
Naively, given computational devices d_1, …, d_p, one might accomplish such a task in the following manner:
1. Use d_1 to compute x_1 = sigma(W_1 @ x_0)
2. Use d_2 to compute x_2 = sigma(W_2 @ x_1)
…
p. Use d_p to compute y = sigma(W_p @ x_{p-1}); obtain the gradient from the loss function associated with y.
p+1. Use d_p to compute gradient L_p using the gradient of the loss and W_p, then update W_p.
p+2. Use d_{p-1} to compute gradient L_{p-1} using L_p and W_{p-1}, then update W_{p-1}.
…
2p. Use d_1 to compute gradient L_1, then update W_1.
If we were to create a plot representing what each device is doing at any time, it would look like the following:
Notice how each device (i.e., GPU) is idle for most of the time? If we consider a very simplistic case where each task, whether a forward or backward computation, takes exactly 1 second to complete, then each device is only working for 2 out of the 2p seconds, meaning a lot of hardware is sitting around doing nothing. This idling is indicated by the white spaces in the plot, known as bubbles. The term bubble ratio describes the fraction of time the hardware sits idle.
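Under the same simplifying assumption of 1 second per forward or backward step, the bubble ratio of this naive schedule works out to 1 − 1/p, as the toy calculation below shows:

```python
# Bubble ratio of the naive schedule, assuming every forward/backward step
# takes 1 second and there are p devices (one per stage).
def naive_bubble_ratio(p: int) -> float:
    total_slots = p * 2 * p      # p devices x 2p seconds of wall-clock time
    busy_slots = 2 * p           # each device works 1s forward + 1s backward
    return 1 - busy_slots / total_slots

for p in (2, 4, 8):
    print(p, naive_bubble_ratio(p))   # 0.5, 0.75, 0.875  ->  1 - 1/p
```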
Evidently, a high bubble ratio is an inefficient use of computing resources. By allocating computing tasks to GPUs more carefully, we can make hardware usage and model training more efficient; such allocation strategies are called Pipeline Scheduling, which we discuss in the following section.
Pipeline Scheduling
In general, pipeline parallelism can be divided into synchronous and asynchronous scheduling approaches. The naive approach described is a synchronous approach, referring to the fact that, periodically, all model parameters are synchronously updated with the accumulated gradients at a given stage. This is the most statistically efficient approach as each update uses all learned information. As shown earlier in the example of naive pipeline parallelism, letting computing devices wait for the accumulated gradients leads to considerable device idling, and thus a high bubble ratio.
Asynchronous approaches do not wait to update with all accumulated gradients, allowing higher GPU utilization and reducing the bubble ratio. But this also means that later mini-batches in the pipeline may derive gradients from stale weights, that is, weights that were computed in the past. This harms statistical efficiency, as we are updating the model weights using slightly outdated information.
Given the tradeoffs between the two schedules, any particular implementation of each uses different techniques to try to improve the bubble ratio in the synchronous case, and the learning efficiency in the asynchronous case. Each attempt to mitigate these traits can come with memory and communication trade-offs in exchange for more efficient compute utilization or statistical efficiency.
If synchronous scheduling is the most statistically efficient, why would you ever use an asynchronous schedule? Well, if the learning task were some ideal, parabolic convex optimization problem, and moreover we had all the GPUs in the world, there might indeed be no point in using any approach that harms the numerical progression toward the true global minimum. But real machine learning scenarios typically have the following traits:
- The objective function is messier, so in most cases we care about the model’s actual performance on a problem, and precise convergence to a global minimum may not be the biggest concern, and
- We do not have all the GPUs in the world, so having some of them staying idle during model training is not economical.
Essentially, it may be advantageous to reach a less-than-ideal convergence faster while making better use of our available compute resources, especially if “faster” is the difference between practically possible and not.
Practical Approaches to Pipeline Parallelism
Now, we’ll look at several different pipeline parallelism approaches and how they choose to mitigate different tradeoffs. As a reminder, in the ideal case, we’d like to achieve:
- minimal memory consumption
- low communication overhead
- efficient compute utilization (for synchronous approaches, we would like to reduce the bubble ratio)
- maximum statistical efficiency (for asynchronous approaches, we have to handle weight staleness and inconsistency)
Finally, we also want an even workload distributed across our GPUs, or load balance, since one GPU working very hard and another not working that hard is not much better than one active GPU and one idle GPU.
GPipe: The Representative Synchronous Approach
One simple way to improve on this naive baseline is to segment each mini-batch of data so that, during each forward and backward pass, each device can be working on a different segment of the mini-batch, or “micro-batch”. This is how GPipe by Huang et al. [3] improves on the naive baseline, and it is representative of most practical synchronous approaches.
- GPipe communicates only the output data of one model partition (possibly larger or smaller than a layer)
- GPipe synchronizes gradients and the optimizer step across micro-batches within a mini-batch
- The user specifies the number of micro-batches and the number of model partitions (equal to the number of available GPUs)
- GPipe stores one version of weights total
A direct consequence of GPipe’s approach is higher peak memory, required to store more activation values. This is somewhat mitigated by discarding and re-calculating activation values during the backward pass. However, re-calculation introduces roughly a 20% increase in computation overhead [4].
It should also be noted that since GPipe is a synchronous approach, some computing devices must wait for gradients to be accumulated and synchronized, causing nontrivial bubble ratios.
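Below is a minimal sketch of the micro-batching idea on a single stage, again using NumPy and a toy least-squares loss in place of a real model partition (this is an illustration of the scheme, not GPipe’s implementation):

```python
import numpy as np

# GPipe-style micro-batching on one stage: the mini-batch is cut into
# micro-batches, gradients are accumulated across all of them, and the
# weights are updated once, synchronously, at the end (one weight version).
rng = np.random.default_rng(0)
W = rng.normal(size=(8, 1))
X, y = rng.normal(size=(32, 8)), rng.normal(size=(32, 1))

n_micro = 4
grad_accum = np.zeros_like(W)
for Xm, ym in zip(np.array_split(X, n_micro), np.array_split(y, n_micro)):
    # forward + backward for one micro-batch (collapsed into one expression);
    # in GPipe the stored activations can be discarded and recomputed later
    grad_accum += 2 * Xm.T @ (Xm @ W - ym) / len(Xm)

W -= 0.1 * grad_accum / n_micro      # single synchronous optimizer step

# With p stages and m micro-batches of equal cost, the idealized bubble ratio
# drops to (p - 1) / (m + p - 1), so more micro-batches shrink the bubble.
```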
PipeDream: The Representative Asynchronous Approach
PipeDream by Narayanan et al. [5] is representative of the group of asynchronous approaches to pipeline parallelism, meaning that it occasionally must compute gradients of a mini-batch with “stale” weights, which can reduce learning efficiency (i.e., not every update carries the same amount of information that it would in traditional model training).
- PipeDream communicates only the output data of a set of model layers (partitioned according to the number of GPUs)
- PipeDream uses asynchronous computation of gradients
— This technically reduces statistical efficiency, since gradients can be computed on stale weights.
- PipeDream calculates the optimal number of in-flight mini-batches at runtime
- PipeDream stores one version of weights per mini-batch
PipeDream introduced the “1F1B” or “one forward, one backward” approach, where forward and backward passes of different mini-batches are interleaved to eliminate bubbles. However, PipeDream’s approach to preserving statistical efficiency increases peak memory. Weight stashing maintains multiple versions of the weights, one matching each active mini-batch, and these are used for the forward pass. They are then also cached for the backward pass, ensuring that within a stage the same version of the parameters is used. Across stages, stashed weights are used instead of the latest weight update; this addresses inconsistency across the pipeline, but does not eliminate weight staleness. (The degree of weight staleness is formalized in [5].)
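Below is a minimal sketch of the weight-stashing idea for a single stage, with NumPy standing in for a real framework (the function names and shapes are illustrative, not PipeDream’s API):

```python
import numpy as np

# Sketch of PipeDream-style weight stashing on one stage: the weight version
# used for a mini-batch's forward pass is stashed (keyed by the mini-batch id)
# and reused for its backward pass, so forward and backward of the same
# mini-batch always see consistent parameters on this stage.
rng = np.random.default_rng(0)
W = rng.normal(size=(8, 8))
stash = {}                                  # mini-batch id -> weight version

def forward(mb_id, x):
    stash[mb_id] = W.copy()                 # stash the version used here
    return x @ stash[mb_id]

def backward(mb_id, grad_out, x):
    W_used = stash.pop(mb_id)               # same (possibly stale) version
    grad_W = x.T @ grad_out                 # gradient w.r.t. this stage's weights
    grad_x = grad_out @ W_used.T            # gradient sent to the previous stage
    return grad_W, grad_x
# W itself is updated asynchronously by the optimizer between mini-batches.
```

The stash holds one copy of the stage’s weights per in-flight mini-batch, which is exactly where the extra peak memory comes from.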
PipeMare: Improved Asynchronous Approach
PipeMare by Yang et al. [6] improves on the memory usage of PipeDream by approximating the weights that appeared earlier in the pipeline instead of caching them. To ensure convergence, it also schedules the learning rate accordingly. It strikes a balance between GPipe and PipeDream: it has virtually no bubble, and its memory usage is also relatively low.
The idea of PipeMare is to schedule work much as PipeDream does, but simply use whatever weight W is currently in memory to compute the gradient during the backward step, avoiding the caching of stale weights that PipeDream relies on. Since no device waits for a specific version of the weights, there is no bubble; since there is no need to store older weights, memory usage is efficient. Both PipeDream’s and GPipe’s issues appear to be resolved!
Sounds intuitive, but what could go wrong?
Imagine training a stack of fully connected layers. Note that at the k-th fully connected layer, computing the gradient g_k in the back-propagation step depends on two things: (1) some version of the model weights stored on device d_k, and (2) the loss at the model output layer, which depends on the forward pass through the k-th layer, processed by the same device d_k at an earlier time, using an earlier version of the model weights. In essence, the idea of PipeMare results in computing the gradient using two different versions of the model weights. We can express this dependency roughly as:
g_k = ∇f(w_newer, w_older)
where w_newer is the version available at the backward step and w_older is the version that was used in the forward pass. This phenomenon is called delay discrepancy. In comparison, gradient descent is only defined using a single fixed version of its input (i.e., the model weights), as below:
w ← w − α ∇f(w′)
where for GPipe, w′ = w, and for PipeDream, w′ stands for a single version of stale model weights.
This discrepancy brings two issues:
- Since we are performing gradient descent using inconsistent versions of weights, would convergence be an issue?
- How do we know “w_older” without caching it?
PipeMare’s novelty involves resolving these two problems separately:
For the first issue, PipeMare locally approximates the objective function with a quadratic for simplicity. Under this approximation, the gradient can be written with a noise term η and a term Δ that captures the sensitivity of ∇f to the discrepancy between weight versions. If we then treat the recent versions of the weights over time as a single vector, the gradient update rule can be written as a linear recurrence involving a suitable matrix C and the one-hot vector e_1 whose first entry is 1. Convergence of gradient descent can then be guaranteed by solving for a learning rate α that keeps the eigenvalues of C inside the unit ball, so that the weight update rule is stable. Specifically, [6] shows that longer delays require smaller step sizes to ensure convergence.
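To give a flavor of this argument, here is a simplified, one-dimensional stand-in (not the paper’s exact derivation), assuming a quadratic objective and a fixed gradient delay τ:

```latex
% A simplified stand-in for the analysis: 1-D quadratic objective, fixed delay tau.
\[
  f(w) = \tfrac{\lambda}{2} w^2,
  \qquad
  w_{t+1} = w_t - \alpha \bigl(\nabla f(w_{t-\tau}) + \eta_t\bigr)
          = w_t - \alpha \lambda\, w_{t-\tau} - \alpha \eta_t .
\]
% Stacking the recent iterates u_t = (w_t, w_{t-1}, ..., w_{t-tau})^T gives the
% linear recurrence u_{t+1} = C u_t - alpha * eta_t * e_1 with companion matrix
\[
  C =
  \begin{pmatrix}
    1      & 0      & \cdots & 0 & -\alpha\lambda \\
    1      & 0      & \cdots & 0 & 0 \\
    0      & 1      & \cdots & 0 & 0 \\
    \vdots &        & \ddots &   & \vdots \\
    0      & 0      & \cdots & 1 & 0
  \end{pmatrix}.
\]
% The iteration is stable exactly when every eigenvalue of C lies inside the
% unit ball; larger delays tau force a smaller step size alpha.
```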
For the second issue, we substitute the approximation
w_older ≈ w_newer − δ · Δtime(w_older, w_newer)
into the linear recurrence mentioned above, where Δtime(·) denotes the difference in time-stamps and δ is a trainable parameter that estimates how quickly the model weights change over time. This allows older weights to be estimated from newer ones, eliminating the need to cache multiple versions of stale weights, as PipeDream did.
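A minimal sketch of this discrepancy correction is shown below; the names are illustrative, and the update of δ itself (which the method learns) is omitted:

```python
import numpy as np

# Sketch of PipeMare-style discrepancy correction: instead of stashing the
# older weights used in the forward pass, estimate them by extrapolating
# backwards from the current weights with a learned rate-of-change delta.
# (Names and the handling of delta are illustrative, not the paper's code.)
d = 8
W = np.random.default_rng(0).normal(size=(d, 1))
delta = np.zeros_like(W)                    # estimate of dW/dt per parameter

def approx_older_weights(W_now, steps_ago):
    # w_older ~ w_newer - delta * (difference in time-stamps)
    return W_now - delta * steps_ago

W_forward_est = approx_older_weights(W, steps_ago=3)
# W_forward_est is then used where PipeDream would have read a stashed copy.
```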
Zero-Bubble: An Improved Synchronous Approach
The Zero-Bubble (or “ZB”) approach introduced by Qi et al. [4] achieves zero-bubble pipeline parallelism under synchronous scheduling. This was made possible by splitting the gradient computation in the backward pass into two parts, the gradient with respect to the layer’s input and the gradient with respect to the layer’s weights, so that these, too, could be scheduled independently to fill bubbles. The approach demonstrates that grouping the backward-pass calculations together sequentially is unnecessary.
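For a single linear layer, the split is easy to see in a few lines of NumPy (a toy sketch, not the authors’ implementation): the input gradient sits on the critical path, while the weight gradient can wait.

```python
import numpy as np

# Sketch of the backward-pass split that zero-bubble scheduling exploits,
# shown for a single linear layer y = x @ W. The input gradient must be
# produced promptly because the previous stage is waiting on it; the weight
# gradient only feeds the optimizer and can be scheduled later to fill
# what would otherwise be bubbles.
rng = np.random.default_rng(0)
x, W = rng.normal(size=(4, 8)), rng.normal(size=(8, 3))
grad_out = rng.normal(size=(4, 3))           # gradient arriving from the next stage

def backward_B(grad_out, W):
    return grad_out @ W.T                    # dL/dx: send upstream immediately

def backward_W(grad_out, x):
    return x.T @ grad_out                    # dL/dW: can be deferred

grad_x = backward_B(grad_out, W)             # scheduled on the critical path
grad_W = backward_W(grad_out, x)             # scheduled into idle slots
```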
There are two versions of the ZB approach: ZB-H1, which consumes the same peak memory as the 1F1B schedule introduced by PipeDream [5], and ZB-H2, which eliminates bubbles completely but increases peak memory and requires some extra computation. To eliminate bubbles completely, ZB-H2 also bypasses optimizer synchronization, instead introducing a validation-and-rollback step to rectify any miscalculations after the optimizer step (this is the source of the extra computation).
While ZB’s handcrafted schedules work under the assumption that the execution times of the forward pass and each piece of the split backward pass are identical, an algorithm to automatically compose schedules is also introduced. The scheduling problem is formulated as an integer linear program that can be solved by an off-the-shelf ILP solver.
Comparisons and Trade-offs
The previous sections only discussed a few of these approaches in detail. However, if we broadly examine pipeline parallelism methods and their performance in memory usage, compute resource utilization, and convergence, we see some interesting trade-offs. Below is a notation key and a table comparing different approaches, as compiled by Guan et al. [2].
Some of the techniques in this comparison table also require extra memory, computation, and communication overhead that do not scale directly with the included parameters.
Observe that these approaches can be broadly categorized as follows:
- Synchronous approaches, like GPipe and DAPPLE, which do not use stale weights for computing gradients.
— These wait for newest weights before computing gradients.
— Computing devices may need to stay idle, leading to nonzero bubble ratios.
— Due to being synchronous, they have high statistical efficiency and therefore excellent convergence.
- Asynchronous approaches, which use stale weights for gradient calculation, allowing devices to never go idle.
— These effectively eliminate bubbles.
— Due to being asynchronous, they have relatively lower statistical efficiency, and slightly worse convergence behavior.
Asynchronous approaches can be further divided into:
- Approaches that cache stale weights to handle weight discrepancy, like PipeDream, AvgPipe, and WPipe. These approaches need no extra computation overhead, but always use at least 2M_θ memory for storing weights.
- Approaches that use extra computation to predict or approximate older versions of the weights, like PipeMare, XPipe, and SpecTrain. These approaches only need to store one copy of the model weights, so the weight memory usage is always M_θ.
Two notable exceptions include AMPNet and ZeroBubble. AMPNet neither caches extra copies of stale weights nor has computation overhead, but converges poorly. ZeroBubble’s ZB-H2 approach separates back-propagation calculations so they can be interleaved, and achieves zero bubbles under synchronous training semantics.
Conclusion
The comparison above can assist model developers in choosing which approach to apply when optimizing computation and memory efficiency as models scale. Ultimately, for the ideal balance of zero bubbles in the pipeline, synchronous training semantics, and a schedule optimally computed with integer linear programming (ILP), ZeroBubble’s ZB-H2 approach would be an ideal starting point and represents the current state of the art in pipeline parallelism. However, the algorithm for computing ideal schedules may not scale well with an off-the-shelf ILP solver. This could lead to new approaches that optimize the algorithm, or to ILP formulations and solvers built specifically for machine learning applications. Like other trends in machine learning, such as custom-built accelerators, pipeline parallelism could benefit from optimization tools customized to machine learning workloads.
References
- UvA Deep Learning Tutorials. P. Lippe. 2024.
- Advances of pipeline model parallelism for deep learning training: An overview. L. Guan, D. Li, J. Liang, W. Wang, K. Ge, X. Lu. Journal of Computer Science and Technology, Vol 39(3), pp. 567–584. Springer. 2024.
- GPipe: Efficient Training of Giant Neural Networks Using Pipeline Parallelism. Y. Huang, Y. Cheng, A. Bapna, O. Firat, D. Chen, M. Chen, H. Lee, J. Ngiam, Q.V. Le, Y. Wu, et al. Advances in Neural Information Processing Systems, Vol 32. 2019.
- Zero Bubble (Almost) Pipeline Parallelism. P. Qi, X. Wan, G. Huang, M. Lin. The Twelfth International Conference on Learning Representations. 2024.
- PipeDream: Generalized Pipeline Parallelism for DNN Training. D. Narayanan, A. Harlap, A. Phanishayee, V. Seshadri, N.R. Devanur, G.R. Ganger, P.B. Gibbons, M. Zaharia. Proceedings of the 27th ACM Symposium on Operating Systems Principles, pp. 1–15. 2019.
- PipeMare: Asynchronous Pipeline Parallel DNN Training. B. Yang, J. Zhang, J. Li, C. Ré, C. Aberger, C. De Sa. Proceedings of Machine Learning and Systems, Vol 3, pp. 269–296. 2021.