Questions regarding the training config
Hi, I have a few questions regarding your training config:
Training Details
- Sequence Length: 4096
- Training Duration: Approximately 5 days on 2x3090Ti
- Epochs: 1 epoch training for minimized repetition sickness
- RS-QLORA+: 64-rank 64-alpha, resulting in ~2% trainable weights
- Learning Rate: 0.00001
- Gradient Accumulation: 32, kept low for better learning
How did you fit a 70B model on 3090 Ti GPUs?
Does RS-QLORA+ mean rslora + qlora + lora+? What lora+ ratio did you use?
Do 32 accumulation steps mean a total batch size of 64 with packed inputs (256k tokens/batch)?
Thank you
> How did you fit a 70B model on 3090 Ti GPUs?
Yes, you can fit a 70B model on 2x 3090 Ti using FSDP with offloading. It requires at least 150 GB or so of system RAM.
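In case it helps, the rough idea looks something like this. This is a simplified sketch, not the actual training script; in practice it is usually driven through accelerate or axolotl's FSDP settings, it assumes a torchrun launch on the two GPUs and a recent transformers/bitsandbytes with `bnb_4bit_quant_storage` support, and the model id is a placeholder:

```python
import torch
import torch.distributed as dist
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP
from torch.distributed.fsdp import CPUOffload, ShardingStrategy
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

dist.init_process_group("nccl")
torch.cuda.set_device(dist.get_rank())

# QLoRA-style 4-bit load; storing the packed weights in bf16 lets FSDP shard them.
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
    bnb_4bit_quant_storage=torch.bfloat16,
)

model = AutoModelForCausalLM.from_pretrained(
    "some-70b-base-model",   # placeholder id
    quantization_config=bnb_config,
    torch_dtype=torch.bfloat16,
)

# FULL_SHARD splits parameters/gradients/optimizer state across the two GPUs, and
# CPUOffload parks shards in system RAM when not in use, hence the ~150 GB of RAM.
model = FSDP(
    model,
    sharding_strategy=ShardingStrategy.FULL_SHARD,
    cpu_offload=CPUOffload(offload_params=True),
)
```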
> Does RS-QLORA+ mean rslora + qlora + lora+? What lora+ ratio did you use?
Yes. I used a LoRA+ ratio of 16.
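Roughly speaking, the combination maps onto PEFT like this. This is a sketch rather than the exact training code; it assumes a PEFT version that provides `use_rslora` and `create_loraplus_optimizer`, and the model id and `target_modules` are placeholders:

```python
import torch
import bitsandbytes as bnb
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model
from peft.optimizers import create_loraplus_optimizer

base_model = AutoModelForCausalLM.from_pretrained(
    "some-70b-base-model",                                      # placeholder id
    quantization_config=BitsAndBytesConfig(load_in_4bit=True),  # the "Q" in QLoRA
    torch_dtype=torch.bfloat16,
)

peft_config = LoraConfig(
    r=64,
    lora_alpha=64,
    use_rslora=True,              # the "RS": scale by alpha / sqrt(r) instead of alpha / r
    target_modules="all-linear",  # placeholder; the card does not list target modules
    task_type="CAUSAL_LM",
)
model = get_peft_model(base_model, peft_config)

# LoRA+ trains the B matrices with a higher learning rate than the A matrices.
optimizer = create_loraplus_optimizer(
    model=model,
    optimizer_cls=bnb.optim.AdamW8bit,
    lr=1e-5,                      # learning rate from the training details
    loraplus_lr_ratio=16,         # the ratio mentioned above
)
```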
> Do 32 accumulation steps mean a total batch size of 64 with packed inputs (256k tokens/batch)?
No, I used 16 gradient accumulation steps with a micro-batch size of 1, which works out to a total batch size of 32 across the two GPUs.
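The quick arithmetic behind that number, assuming the two GPUs and fully packed 4096-token sequences:

```python
micro_batch_size = 1
grad_accum_steps = 16
num_gpus = 2
seq_len = 4096

effective_batch = micro_batch_size * grad_accum_steps * num_gpus  # 1 * 16 * 2 = 32
tokens_per_step = effective_batch * seq_len                       # 32 * 4096 = 131,072 (~131k)
print(effective_batch, tokens_per_step)
```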
Interesting, thank you.
Do you think the improvements over v1.2 come more from rslora or lora+?
> Interesting, thank you.
> Do you think the improvements over v1.2 come more from rslora or lora+?
You're welcome! We already used LoRA+ before, so the changes are RSLoRA, the gradient accumulation fix in transformers, and a lower LoRA alpha to suit RSLoRA.
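For context on the alpha change: rsLoRA scales the adapter by alpha / sqrt(r) rather than alpha / r, so at rank 64 the same alpha gives an 8x larger effective scale, which is why alpha typically gets lowered when switching. A tiny illustration with the values from this run (the previous alpha is not stated here):

```python
import math

r, lora_alpha = 64, 64
print(lora_alpha / r)             # classic LoRA scaling   -> 1.0
print(lora_alpha / math.sqrt(r))  # rank-stabilized scale  -> 8.0
```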