A very lazy mental model for parallelisms

Written by

There is a great deal of technical complexity in distributed data parallel, model parallel, tensor parallel, fully-sharded-data-parallel, ZeRO 1/2/3, context parallel etc. To fit ideas into a picture I tend to segment in-practice parallelism into three buckets.

Data parallel
Pipeline parallelism
Everything else

I will tackle these out of order:

Data parallelism

Data parallel is a very convenient training concept because you take multiple copies of the model and give different data to them, do a forward-backward pass, then all-reduce gradients between them i.e., aggregate gradients across all devices so they operate as if they had trained on the unified set of data. This lets you train more quickly by using more devices!

This all-reduce operation can usually be effectively overlapped with other computations, minimizing the overhead of data parallelism, and the copies of the model can be largely anything: a model running on a single GPU, or a cluster of devices running many parallelisms of their own.

Because you get to run a whole batch of the computation through the model before needing comms, its relatively network efficient as well, which makes it plausible to do distributed data parallel over more normal network links (than the ones I’m about to mention).

Everything else

Almost everything about these parallelism methods are annoying: either the communication collectives are difficult to overlap with other operations, or they must occur more frequently, leading to increased sensitivity to latency.

This means you want faster links, which means you need very tight networking like, canonically, NVLink, which is Nvidia’s high-speed interconnect between GPUs. This tends to mean you have tensor parallelism, context parallelism etc. implemented within these fast domains, which traditionally is one host (8 GPUs), but with Blackwell can be quite a lot more (up to 36 or 72). There are a lot of options for efficiently packing the compute and scheduling work between devices in there based on the characteristics of the model, and they form this big set of potential parallelisms.

Pipeline parallelism

When your model unit is too big for the available “fast” networking domain, you try and divide the model into multiple sequential stages (like layers) and process them in parallel, overlapping the computation of one stage with the communication of the next.

This is painful for all the reasons pipelining in anything is painful but is additionally painful in that you have to write your training code with, effectively, a bunch of if statements for whichever stage it happens to be in.

There is some leakage between these buckets (e.g. the first and last stages of the pipeline tend to vary a lot due to dealing with embeddings for input and output), but it’s a reasonable first approximation to treat them as discrete.

(Or FSDP everything)

This does presume quite a lot of scale, as the cost of getting everything working well together is non-trivial. FSDP (preferably FSDP2 in PyTorch at least) directly mixes up between the DP and Everything Else buckets, and works pretty well for most folks up to a decently large (100s+) number of GPUs. So roughly:

Smallish model, lots of data: Distributed data parallel.
Medium sized model, medium sized number of GPUs: FSDP2
Anything expensive: units of <parallelisms>, chunked into pipelines if the model is too big, copied into multiple DP copies to speed up training.

A very lazy mental model for parallelisms

Data parallelism

Everything else

Pipeline parallelism

(Or FSDP everything)

More posts

The elusive order of things

Loss Exploded.

Unbundling Work

Native DSLs Ops in PyTorch

Discover more from Ian’s Blog