There is a great deal of technical complexity in distributed data parallel, model parallel, tensor parallel, fully-sharded-data-parallel, ZeRO 1/2/3, context parallel etc. To fit ideas into a picture I tend to segment in-practice parallelism into three buckets.
- Data parallel
- Pipeline parallelism
- Everything else
I will tackle these out of order:
Data parallelism
Data parallel is a very convenient training concept because you take multiple copies of the model and give different data to them, do a forward-backward pass, then all-reduce gradients between them i.e., aggregate gradients across all devices so they operate as if they had trained on the unified set of data. This lets you train more quickly by using more devices!
This all-reduce operation can usually be effectively overlapped with other computations, minimizing the overhead of data parallelism, and the copies of the model can be largely anything: a model running on a single GPU, or a cluster of devices running many parallelisms of their own.
Because you get to run a whole batch of the computation through the model before needing comms, its relatively network efficient as well, which makes it plausible to do distributed data parallel over more normal network links (than the ones I’m about to mention).
Everything else
Almost everything about these parallelism methods are annoying: either the communication collectives are difficult to overlap with other operations, or they must occur more frequently, leading to increased sensitivity to latency.
This means you want faster links, which means you need very tight networking like, canonically, NVLink, which is Nvidia’s high-speed interconnect between GPUs. This tends to mean you have tensor parallelism, context parallelism etc. implemented within these fast domains, which traditionally is one host (8 GPUs), but with Blackwell can be quite a lot more (up to 36 or 72). There are a lot of options for efficiently packing the compute and scheduling work between devices in there based on the characteristics of the model, and they form this big set of potential parallelisms.
Pipeline parallelism
When your model unit is too big for the available “fast” networking domain, you try and divide the model into multiple sequential stages (like layers) and process them in parallel, overlapping the computation of one stage with the communication of the next.
This is painful for all the reasons pipelining in anything is painful but is additionally painful in that you have to write your training code with, effectively, a bunch of if statements for whichever stage it happens to be in.
There is some leakage between these buckets (e.g. the first and last stages of the pipeline tend to vary a lot due to dealing with embeddings for input and output), but it’s a reasonable first approximation to treat them as discrete.
(Or FSDP everything)
This does presume quite a lot of scale, as the cost of getting everything working well together is non-trivial. FSDP (preferably FSDP2 in PyTorch at least) directly mixes up between the DP and Everything Else buckets, and works pretty well for most folks up to a decently large (100s+) number of GPUs. So roughly:
- Smallish model, lots of data: Distributed data parallel.
- Medium sized model, medium sized number of GPUs: FSDP2
- Anything expensive: units of <parallelisms>, chunked into pipelines if the model is too big, copied into multiple DP copies to speed up training.