TIL: TunableOp in PyTorch

I wasn’t aware of this particular autotuning lever! There is a breakdown of TunableOp on the AMD blog from back in July:

https://rocm.blogs.amd.com/artificial-intelligence/pytorch-tunableop/README.html

Instead of using the default GEMMs (general matrix multiplications), TunableOp searches for the best GEMM implementation for your specific environment. It first queries the underlying BLAS library for a list of all solutions for a given GEMM, benchmarks each of them, and selects the fastest. TunableOp then writes the selected solutions to disk, so subsequent runs can reuse them instead of tuning again.
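To make that concrete, here is a minimal sketch of how I'd switch it on from Python using the `PYTORCH_TUNABLEOP_*` environment variables. The helper name `tunableop_env` and the matrix sizes are my own, purely for illustration, and I'm assuming the variables should be set before `torch` is imported to be safe.

```python
import os


def tunableop_env(results_file="tunableop_results.csv", tuning=True):
    """Build the environment variables that turn TunableOp on.

    Hypothetical helper for this post, not part of PyTorch.
    """
    return {
        # Route GEMMs through TunableOp.
        "PYTORCH_TUNABLEOP_ENABLED": "1",
        # 1 = benchmark unseen GEMM shapes; 0 = only reuse saved results.
        "PYTORCH_TUNABLEOP_TUNING": "1" if tuning else "0",
        # Where the selected solutions are written and later read back.
        "PYTORCH_TUNABLEOP_FILENAME": results_file,
    }


# Assumption: set these before importing torch so they are seen at startup.
os.environ.update(tunableop_env())

try:
    import torch
except ImportError:  # keeps the sketch runnable on machines without PyTorch
    torch = None

if torch is not None and torch.cuda.is_available():
    a = torch.randn(512, 512, device="cuda", dtype=torch.float16)
    b = torch.randn(512, 512, device="cuda", dtype=torch.float16)
    c = a @ b  # the first GEMM of a new shape is where tuning happens
    torch.cuda.synchronize()
    print(c.shape)
```

On a later run, passing `tuning=False` should reuse whatever is already in the results file without benchmarking new shapes.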

Though the infrastructure is generic, this is effectively an AMD-specific tuning tool right now, as the original docs note:

Currently only a TunableGemm for ROCm is implemented. Note that CUDA builds of PyTorch will function correctly when using TunableOp but the only solution available to CUDA builds is the ‘Default’ implementation i.e. the original cuBLAS default, now called through TunableOp.
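If you're not sure which kind of build you have, a quick check like the one below can tell you whether TunableOp has anything to choose from. This is a rough sketch: `describe_build` is a hypothetical helper of mine, and it relies on `torch.version.hip` being a version string on ROCm builds and `None` otherwise. The messages only restate the docs quoted above.

```python
def describe_build(hip_version, cuda_version):
    """Summarise what TunableOp can do for a build, per the quoted docs.

    Hypothetical helper for this post, not part of PyTorch.
    """
    if hip_version is not None:
        return "ROCm build: TunableOp can search the available GEMM solutions"
    if cuda_version is not None:
        return "CUDA build: TunableOp works but only offers the 'Default' solution"
    return "CPU-only build: no GPU GEMMs to tune"


try:
    import torch

    # torch.version.hip is set on ROCm builds, torch.version.cuda on CUDA builds.
    print(describe_build(torch.version.hip, torch.version.cuda))
except ImportError:
    print("PyTorch is not installed")
```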
