Kernel launch overhead can eat a lot of GPU performance, especially for workloads with many short kernels – CUDA Graphs let you chain a bunch of kernels together and launch them as one unit, and they’re now more accessible from PyTorch:
CUDA Graphs, which made their debut in CUDA 10, let a series of CUDA kernels be defined and encapsulated as a single unit, i.e., a graph of operations, rather than a sequence of individually launched operations. They provide a mechanism to launch multiple GPU operations through a single CPU operation, and hence reduce launch overhead.
https://pytorch.org/blog/accelerating-pytorch-with-cuda-graphs/
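The linked post describes PyTorch's capture-and-replay API (`torch.cuda.CUDAGraph` and the `torch.cuda.graph` context manager, available since PyTorch 1.10). A minimal sketch of the pattern, assuming a CUDA-capable GPU; the model and tensor sizes are illustrative placeholders:

```python
import torch

def capture_and_replay():
    # Illustrative static-shape workload; any fixed-shape model works.
    device = torch.device("cuda")
    model = torch.nn.Linear(256, 256).to(device)
    static_input = torch.randn(8, 256, device=device)

    # Warm up on a side stream so capture sees steady-state allocations.
    s = torch.cuda.Stream()
    s.wait_stream(torch.cuda.current_stream())
    with torch.cuda.stream(s):
        for _ in range(3):
            model(static_input)
    torch.cuda.current_stream().wait_stream(s)

    # Record the whole forward pass into a graph. Inputs/outputs must
    # live at fixed addresses, hence the "static" tensors.
    g = torch.cuda.CUDAGraph()
    with torch.cuda.graph(g):
        static_output = model(static_input)

    # Replay: one CPU-side call re-launches every captured kernel,
    # reading whatever data static_input currently holds.
    static_input.copy_(torch.randn(8, 256, device=device))
    g.replay()
    return static_output

if torch.cuda.is_available():
    print(capture_and_replay().shape)
```

Note the key constraint: graphs replay against fixed memory addresses, so you update results by copying new data into the captured input tensor rather than passing fresh tensors.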