https://github.com/deepseek-ai/DeepSeek-V3/tree/main
The DeepSeek V3 paper is very clearly written and goes into the infrastructure they used to get a very strong model on a relatively small amount of compute.
One example that stood out was the work they did to get efficient utilization on their expert routing.
In detail, we employ the warp specialization technique (Bauer et al., 2014) and partition 20 SMs into 10 communication channels. During the dispatching process, (1) IB sending, (2) IB-to-NVLink forwarding, and (3) NVLink receiving are handled by respective warps. The number of warps allocated to each communication task is dynamically adjusted according to the actual workload across all SMs. Similarly, during the combining process, (1) NVLink sending, (2) NVLink-to-IB forwarding and accumulation, and (3) IB receiving and accumulation are also handled by dynamically adjusted warps.
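The shape of this scheme can be sketched as a CPU analogue: worker "warps" are partitioned by role (IB send, IB-to-NVLink forward, NVLink receive), with queues standing in for the network links so the three hops overlap in time. This is a toy illustration of the specialization idea, not DeepSeek's actual kernel code, and all names here are made up.

```python
# CPU analogue of warp specialization: each worker owns one stage of the
# dispatch pipeline, and queues stand in for the IB and NVLink hops.
import queue
import threading

NUM_TOKENS = 32
DONE = object()  # sentinel marking end of traffic

ib_link = queue.Queue()   # stands in for the IB hop
nvlink = queue.Queue()    # stands in for the NVLink hop
received = []

def ib_sender():
    # role 1: push tokens onto the IB link
    for t in range(NUM_TOKENS):
        ib_link.put(t)
    ib_link.put(DONE)

def ib_to_nvlink_forwarder():
    # role 2: forward everything (including the sentinel) from IB to NVLink
    while True:
        t = ib_link.get()
        nvlink.put(t)
        if t is DONE:
            break

def nvlink_receiver():
    # role 3: drain the NVLink side
    while True:
        t = nvlink.get()
        if t is DONE:
            break
        received.append(t)

# The three roles run concurrently, so the hops overlap in time, just as
# the specialized warps overlap IB and NVLink traffic on the GPU.
workers = [threading.Thread(target=f)
           for f in (ib_sender, ib_to_nvlink_forwarder, nvlink_receiver)]
for w in workers:
    w.start()
for w in workers:
    w.join()
```

The paper's dynamic warp allocation would correspond to resizing each role's worker pool based on queue depth, which this sketch leaves fixed.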
The name of the game in this kind of efficiency work is overlapping comms, and they are doing it at multiple levels: overlapping InfiniBand (backend network) calls with NVLink (much faster) calls, using warp specialization to enable concurrency and overlap within a GPU, and so on. They have some feedback for hardware manufacturers on this:
We aspire to see future vendors developing hardware that offloads these communication tasks from the valuable computation unit SM, serving as a GPU co-processor or a network co-processor like NVIDIA SHARP Graham et al. (2016). Furthermore, to reduce application programming complexity, we aim for this hardware to unify the IB (scale-out) and NVLink (scale-up) networks from the perspective of the computation units.
Reminds me a little of the specialized comms cores in the Tenstorrent hardware!
They also train in FP8 – getting stable training with FP8 is non-trivial, so having a list of which components they included/excluded is very helpful for other folks who might be interested in similar training:
For this reason, after careful investigations, we maintain the original precision (e.g., BF16 or FP32) for the following components: the embedding module, the output head, MoE gating modules, normalization operators, and attention operators.
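The rule they describe amounts to a simple predicate over module types: keep the numerically sensitive components in their original precision and quantize the rest (mostly the big linear layers) to FP8. A minimal sketch, with made-up module names and dtype strings:

```python
# Sketch of the selective-precision rule quoted above. The keyword list
# mirrors the components the paper keeps in original precision; the names
# matched against it are illustrative, not DeepSeek's actual module names.
HIGH_PRECISION_KEYWORDS = (
    "embedding", "output_head", "gate", "norm", "attention",
)

def compute_dtype(module_name: str, original: str = "bf16") -> str:
    """Return the precision a module runs in under this scheme."""
    if any(k in module_name for k in HIGH_PRECISION_KEYWORDS):
        return original  # e.g. BF16 or FP32, per the paper
    return "fp8"         # everything else, i.e. the GEMM-heavy layers
```

So something like `compute_dtype("layers.3.mlp.up_proj")` lands in FP8, while `compute_dtype("moe.gate")` stays in the original precision.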
Getting the same performance still takes some work, though:
One key modification in our method is the introduction of per-group scaling factors along the inner dimension of GEMM operations. This functionality is not directly supported in the standard FP8 GEMM.
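The idea can be simulated numerically: split the inner (K) dimension into groups, give each group its own scale, and rescale each partial product before accumulating. The toy quantizer and group size below are stand-ins (the paper uses 128-wide tiles/blocks and real FP8 formats); the point is the per-group rescale inside the accumulation loop, which a standard FP8 GEMM with one scale per tensor cannot express.

```python
# NumPy sketch of per-group scaling along the inner dimension of a GEMM.
import numpy as np

def fake_fp8(x):
    """Toy low-precision rounding standing in for a real FP8 cast."""
    return np.round(x * 16) / 16  # inputs are pre-scaled into [-1, 1]

def grouped_fp8_gemm(A, B, group=4):
    M, K = A.shape
    K2, N = B.shape
    assert K == K2 and K % group == 0
    C = np.zeros((M, N))
    for k0 in range(0, K, group):
        a = A[:, k0:k0 + group]
        b = B[k0:k0 + group, :]
        # per-group scales: one per row of A, one per column of B
        sa = np.abs(a).max(axis=1, keepdims=True) + 1e-12
        sb = np.abs(b).max(axis=0, keepdims=True) + 1e-12
        qa = fake_fp8(a / sa)
        qb = fake_fp8(b / sb)
        # rescale each group's partial product before accumulation --
        # the step that standard single-scale FP8 GEMMs don't support
        C += sa * (qa @ qb) * sb
    return C
```

Because the scale is constant within a group along K, it can be factored out of each partial product, which is what makes this rescale-then-accumulate formulation work.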