# Distributing Training
Section under construction. Feel free to contribute!
## Multi-GPU Training with TRL
Training with multiple GPUs in TRL is seamless, thanks to `accelerate`. You can switch from single-GPU to multi-GPU training with a simple command:
```bash
accelerate launch your_script.py
```
This automatically distributes the workload across all available GPUs.
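By default, `accelerate launch` uses every visible GPU, but you can also pin the process count explicitly. A quick example (`your_script.py` is a placeholder, as above):

```bash
# Launch the same script on exactly 4 GPUs of the current node
accelerate launch --num_processes 4 your_script.py
```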
Under the hood, `accelerate` creates one copy of the model per GPU. Each process:
- Processes its own batch of data
- Computes the loss and gradients for that batch
- Shares gradient updates across all GPUs
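In practice, this means a standard TRL training script needs no code changes to run on multiple GPUs. As an illustration, here is a minimal sketch of such a script (the model and dataset names are placeholders, not prescribed by TRL):

```python
# your_script.py — minimal sketch; model and dataset names are illustrative
from datasets import load_dataset
from trl import SFTConfig, SFTTrainer

dataset = load_dataset("trl-lib/Capybara", split="train")

trainer = SFTTrainer(
    model="Qwen/Qwen2.5-0.5B",
    args=SFTConfig(output_dir="Qwen2.5-0.5B-SFT"),
    train_dataset=dataset,
)
trainer.train()
```

Launching this script with `accelerate launch` instead of `python` is all it takes to train data-parallel across all GPUs.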
The effective batch size is calculated as:

$$
\text{Batch Size} = \text{per\_device\_train\_batch\_size} \times \text{Number of GPUs} \times \text{gradient\_accumulation\_steps}
$$
To maintain a consistent batch size when scaling to multiple GPUs, make sure to update `per_device_train_batch_size` and `gradient_accumulation_steps` accordingly.
For example, the following configurations are equivalent and should yield the same results:
| Number of GPUs | Per-device batch size | Gradient accumulation steps | Comments |
|---|---|---|---|
| 1 | 32 | 1 | Possibly high memory usage, but faster training |
| 1 | 4 | 8 | Lower memory usage, slower training |
| 8 | 4 | 1 | Multi-GPU to get the best of both worlds |
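In each row the effective batch size works out to 32: 1 × 32 × 1 = 1 × 4 × 8 = 8 × 4 × 1 = 32. As a sketch of how the last two rows translate into training arguments (using `SFTConfig` as an example; any `TrainingArguments` subclass exposes the same fields):

```python
from trl import SFTConfig

# Single GPU (table row 2): 1 GPU × 4 per device × 8 accumulation steps = 32
single_gpu_args = SFTConfig(
    output_dir="out-single-gpu",
    per_device_train_batch_size=4,
    gradient_accumulation_steps=8,
)

# 8 GPUs (table row 3): 8 GPUs × 4 per device × 1 accumulation step = 32
multi_gpu_args = SFTConfig(
    output_dir="out-multi-gpu",
    per_device_train_batch_size=4,
    gradient_accumulation_steps=1,
)
```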
## Multi-Node Training
Coming soon!