A very practical exploration of how to speed up transformer model inference:
In general, we consider the following goals for model inference optimization:
- Reduce the memory footprint of the model by using fewer GPU devices and less GPU memory;
- Reduce the required computational complexity by lowering the number of FLOPs needed;
- Reduce the inference latency and make things run faster.
https://lilianweng.github.io/posts/2023-01-10-inference-optimization/
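As a rough back-of-the-envelope illustration of the first goal, weight precision directly scales the memory footprint: halving bytes per parameter halves the GPU memory needed just to hold the weights. The sketch below is illustrative only (the model size and numbers are assumptions, not from the post), and it ignores activations, the KV cache, and framework overhead.

```python
def model_memory_gib(num_params: int, bytes_per_param: float) -> float:
    """Approximate memory (GiB) needed just to store the weights."""
    return num_params * bytes_per_param / 1024**3

# Hypothetical 7B-parameter model, purely for illustration.
n = 7_000_000_000
print(f"fp32: {model_memory_gib(n, 4):.1f} GiB")  # ~26.1 GiB
print(f"fp16: {model_memory_gib(n, 2):.1f} GiB")  # ~13.0 GiB
print(f"int8: {model_memory_gib(n, 1):.1f} GiB")  # ~6.5 GiB
```

This is why quantization (fp16, int8, or lower) is often the first lever pulled when a model does not fit on the available devices.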