TIL: TunableOp in PyTorch


I wasn’t aware of this particular autotuning lever! There is a breakdown of TunableOp on the AMD blog from back in July:

https://rocm.blogs.amd.com/artificial-intelligence/pytorch-tunableop/README.html

Instead of using the default GEMM (matrix multiply) kernels, TunableOp searches for the fastest ones for your specific environment. It does so by first querying the underlying BLAS library for a list of all solutions for a given GEMM, benchmarking each of them, and then selecting the fastest. TunableOp then writes the selected solutions to disk so they can be reused on subsequent runs.
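For my own notes, here is a minimal sketch of how I would switch it on. It assumes the environment variables described in the PyTorch TunableOp docs (PYTORCH_TUNABLEOP_ENABLED, PYTORCH_TUNABLEOP_TUNING, PYTORCH_TUNABLEOP_FILENAME). The helper name tunableop_env and the script name train.py are made up for illustration, so treat this as a starting point rather than the canonical way to do it.

```python
import os


def tunableop_env(results_file="tunableop_results.csv", tune=True):
    """Return a copy of the current environment with TunableOp switched on.

    Hypothetical helper: the variable names follow the PyTorch TunableOp
    docs, everything else here is illustrative.
    """
    env = dict(os.environ)
    # Route GEMM calls through TunableOp.
    env["PYTORCH_TUNABLEOP_ENABLED"] = "1"
    # "1": benchmark GEMMs that have no saved result yet.
    # "0": only reuse results already in the file, no new tuning.
    env["PYTORCH_TUNABLEOP_TUNING"] = "1" if tune else "0"
    # Where the selected solutions are written to and read from.
    env["PYTORCH_TUNABLEOP_FILENAME"] = results_file
    return env


# Example usage (train.py is a placeholder for your own script):
#   import subprocess, sys
#   subprocess.run([sys.executable, "train.py"], env=tunableop_env())
```

The idea would be to do one run with tune=True to populate the results file, then later runs with tune=False so they pick up the saved solutions without paying the benchmarking cost again.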

Though the infrastructure is generic, this is effectively an AMD-specific tuning tool right now, as the original docs note:

Currently only a TunableGemm for ROCm is implemented. Note that CUDA builds of PyTorch will function correctly when using TunableOp but the only solution available to CUDA builds is the ‘Default’ implementation i.e. the original cuBLAS default, now called through TunableOp.
