A very practical exploration of how to speed up transformer model inference:
In general, we consider the following goals for model inference optimization:
- Reduce the memory footprint of the model by using fewer GPU devices and less GPU memory;
- Reduce the required computational complexity by lowering the number of FLOPs needed;
- Reduce the inference latency and make things run faster.
https://lilianweng.github.io/posts/2023-01-10-inference-optimization/
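As a rough back-of-the-envelope illustration of the first goal, weight precision directly scales the memory footprint: halving bytes per parameter halves the GPU memory needed just to hold the weights. The sketch below is illustrative only (the model size and numbers are assumptions, not from the post), and it ignores activations, the KV cache, and framework overhead.

```python
def model_memory_gib(num_params: int, bytes_per_param: float) -> float:
    """Approximate memory (GiB) needed just to store the weights."""
    return num_params * bytes_per_param / 1024**3

# Hypothetical 7B-parameter model, purely for illustration.
n = 7_000_000_000
print(f"fp32: {model_memory_gib(n, 4):.1f} GiB")  # ~26.1 GiB
print(f"fp16: {model_memory_gib(n, 2):.1f} GiB")  # ~13.0 GiB
print(f"int8: {model_memory_gib(n, 1):.1f} GiB")  # ~6.5 GiB
```

This is why quantization (fp16, int8, or lower) is often the first lever pulled when a model does not fit on the available devices.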