Distributing Training

Section under construction. Feel free to contribute!

Multi-GPU Training with TRL

Training with multiple GPUs in TRL is seamless, thanks to accelerate. You can switch from single-GPU to multi-GPU training with a simple command:

accelerate launch your_script.py

This automatically distributes the workload across all available GPUs.
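
By default, accelerate launch uses every visible GPU. If you want to control the number of processes explicitly, you can pass standard accelerate launch flags (the value below is illustrative):

accelerate launch --num_processes 4 your_script.py

Alternatively, run accelerate config once to save a default launch configuration instead of passing flags on every run.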

Under the hood, accelerate launches one process per GPU, each holding its own copy of the model (a minimal example script is sketched below). Each process:

  • Processes its own batch of data
  • Computes the loss and gradients for that batch
  • Synchronizes its gradients with the other GPUs, so every model copy receives the same update
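
For instance, a standard TRL training script requires no changes to run in this data-parallel setup. A minimal sketch, where the model and dataset names are only placeholders:

# train_sft.py - minimal sketch; model and dataset names are illustrative
from datasets import load_dataset
from trl import SFTConfig, SFTTrainer

# Each process loads the dataset; batches are sharded across processes during training
dataset = load_dataset("trl-lib/Capybara", split="train")

training_args = SFTConfig(
    output_dir="Qwen2.5-0.5B-SFT",
    per_device_train_batch_size=4,   # batch size seen by each GPU
    gradient_accumulation_steps=1,
)

trainer = SFTTrainer(
    model="Qwen/Qwen2.5-0.5B",
    args=training_args,
    train_dataset=dataset,
)
trainer.train()

Launched with accelerate launch train_sft.py, the same script runs on one GPU or several without modification.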

The effective batch size is calculated as:

$$
\text{Batch Size} = \text{per\_device\_train\_batch\_size} \times \text{num\_devices} \times \text{gradient\_accumulation\_steps}
$$

To keep the effective batch size constant when scaling to multiple GPUs, make sure to adjust per_device_train_batch_size and gradient_accumulation_steps accordingly.

For example, the following configurations are all equivalent (effective batch size of 32) and should yield the same results:

| Number of GPUs | Per device batch size | Gradient accumulation steps | Comments |
|---|---|---|---|
| 1 | 32 | 1 | Possibly high memory usage, but faster training |
| 1 | 4 | 8 | Lower memory usage, slower training |
| 8 | 4 | 1 | Multi-GPU to get the best of both worlds |
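
Concretely, moving from the first row (1 GPU) to the last row (8 GPUs) while keeping the effective batch size at 32 only requires changing per_device_train_batch_size. A sketch using SFTConfig (any TrainingArguments-based config works the same way; output directory names are placeholders):

from trl import SFTConfig

# 1 GPU: 32 (per device) x 1 (device)  x 1 (accumulation) = 32
single_gpu_args = SFTConfig(
    output_dir="single-gpu-run",
    per_device_train_batch_size=32,
    gradient_accumulation_steps=1,
)

# 8 GPUs: 4 (per device) x 8 (devices) x 1 (accumulation) = 32
multi_gpu_args = SFTConfig(
    output_dir="multi-gpu-run",
    per_device_train_batch_size=4,
    gradient_accumulation_steps=1,
)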

Multi-Node Training

Coming soon!
