Transformer model inference optimizations

A practical exploration of how to make transformer models run faster at inference time:

In general, we consider the following goals for model inference optimization:

  • Reduce the memory footprint of the model by using fewer GPU devices and less GPU memory;
  • Reduce the required computation by lowering the number of FLOPs needed;
  • Reduce the inference latency and make things run faster.
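To make the first goal concrete, here is a back-of-the-envelope sketch (not from the linked post) of how weight precision drives the memory footprint: storing each parameter in fewer bytes shrinks the GPU memory needed for the weights proportionally, which is the basic lever behind half-precision and quantized inference. The 7B parameter count is an illustrative assumption.

```python
def weight_memory_gb(n_params: float, bytes_per_param: float) -> float:
    """Memory needed just to store the model weights, in GB (1 GB = 1e9 bytes)."""
    return n_params * bytes_per_param / 1e9

# A hypothetical 7B-parameter model at common precisions:
for label, nbytes in [("fp32", 4), ("fp16/bf16", 2), ("int8", 1), ("int4", 0.5)]:
    print(f"{label:>9}: {weight_memory_gb(7e9, nbytes):.1f} GB")
```

Note this counts only the weights; activations and the KV cache add to the real footprint, so actual deployments need headroom beyond these numbers.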

https://lilianweng.github.io/posts/2023-01-10-inference-optimization/