TorchFT: Fault tolerant training

https://github.com/pytorch-labs/torchft

The repo from Tristan Rice and Chirag Pandya’s poster at the PyTorch conference has been continually updated with refinements and improvements.

torchft implements a lighthouse server that coordinates across the different replica groups, plus a per-replica-group manager and fault tolerance library that can be used in a standard PyTorch training loop. This allows for membership changes at the training-step granularity, which can greatly improve efficiency by avoiding stop-the-world restarts on errors.
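The step-granularity membership idea can be illustrated with a toy sketch. This is not the torchft API; the `quorum` and `train_step` functions and the heartbeat scheme below are simplified assumptions to show the shape of the mechanism: each step, a lighthouse-like coordinator decides which replica groups are healthy, and only those groups participate in that step's gradient average.

```python
# Toy sketch (NOT the torchft API): step-granularity membership.
# A lighthouse-style coordinator tracks heartbeats per replica group and
# forms a quorum each step; only quorum members join the gradient average.

def quorum(heartbeats, step, timeout=2):
    """Replica groups whose last heartbeat is recent enough join this step."""
    return sorted(g for g, last in heartbeats.items() if step - last <= timeout)

def train_step(grads, members):
    """Average gradients across only the quorum members (the all-reduce)."""
    live = [grads[g] for g in members]
    return sum(live) / len(live)

# group2 last reported at step 5, so at step 10 it is considered unhealthy
heartbeats = {"group0": 10, "group1": 10, "group2": 5}
members = quorum(heartbeats, step=10)          # ["group0", "group1"]
avg = train_step({"group0": 1.0, "group1": 3.0, "group2": 100.0}, members)
```

The key property is that the unhealthy group is excluded for a single step rather than forcing the whole job to restart.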

There are lots of clever techniques in here. They center around the idea of having replica groups serve as the failure domain, rather than the whole training job. In a loose sense, this means that when there is a failure you simply drop the replica group it's in and carry on with the rest to the next batch, adding the replica group back in once it's recovered.
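A minimal sketch of that failure-domain behavior, with hypothetical `on_failure` and `on_recovered` hooks standing in for whatever error handling the library actually wires up:

```python
# Toy sketch: replica groups as the failure domain. A failure fences off
# only the affected group; the remaining groups keep training, and the
# failed group rejoins at a later step boundary.

active = {"rg0", "rg1", "rg2"}

def on_failure(group):
    active.discard(group)   # drop only the failed replica group

def on_recovered(group):
    active.add(group)       # re-admit it at the next step

on_failure("rg1")
step_a = sorted(active)     # training continues with rg0 and rg2
on_recovered("rg1")
step_b = sorted(active)     # back to all three replica groups
```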

To make that possible, there are custom communication primitives that allow for error handling, health monitoring of individual processes, and fast checkpointing that lets recovered workers be quickly added back to replica sets.
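The fast-recovery idea can be sketched as a live state transfer: rather than replaying a checkpoint from disk, a rejoining worker copies the current weights and step counter from a healthy peer so it lands on the same step as the rest of the job. The `Worker` class and its `state_dict`/`load_state_dict` methods below are an assumed, PyTorch-flavored toy, not torchft's actual implementation.

```python
# Toy sketch: fast recovery by copying live state from a healthy peer.
import copy

class Worker:
    def __init__(self):
        self.step = 0
        self.weights = [0.0, 0.0]

    def state_dict(self):
        return {"step": self.step, "weights": list(self.weights)}

    def load_state_dict(self, state):
        self.step = state["step"]
        self.weights = list(state["weights"])

healthy = Worker()
healthy.step, healthy.weights = 42, [1.5, -0.5]

recovered = Worker()  # fresh process after a crash
# Pull the peer's in-memory state instead of reading a disk checkpoint.
recovered.load_state_dict(copy.deepcopy(healthy.state_dict()))
```

After the transfer, the recovered worker is on the same step as its peer and can rejoin the replica set at the next step boundary.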
