Questions regarding the training config
Hi, I have a few questions regarding your training config:
Training Details
- Sequence Length: 4096
- Training Duration: Approximately 5 days on 2x3090Ti
- Epochs: 1 epoch training for minimized repetition sickness
- RS-QLORA+: 64-rank 64-alpha, resulting in ~2% trainable weights
- Learning Rate: 0.00001
- Gradient Accumulation: 32, kept low for better learning
How did you fit a 70B model on 3090 Ti GPUs?
Does RS-QLORA+ mean rslora + qlora + lora+? What lora+ ratio did you use?
Do 32 accumulation steps mean a total batch size of 64 with packed inputs (256k tokens/batch)?
Thank you
> How did you fit a 70B model on 3090 Ti GPUs?
Yes, you can fit a 70B model on 2x 3090 Ti using FSDP with offloading. It requires at least 150 GB or so of system RAM.
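In case it helps, the rough idea looks something like this. This is a simplified sketch, not the actual training script; in practice it is usually driven through accelerate or axolotl's FSDP settings, it assumes a torchrun launch on the two GPUs and a recent transformers/bitsandbytes with `bnb_4bit_quant_storage` support, and the model id is a placeholder:

```python
import torch
import torch.distributed as dist
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP
from torch.distributed.fsdp import CPUOffload, ShardingStrategy
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

dist.init_process_group("nccl")
torch.cuda.set_device(dist.get_rank())

# QLoRA-style 4-bit load; storing the packed weights in bf16 lets FSDP shard them.
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
    bnb_4bit_quant_storage=torch.bfloat16,
)

model = AutoModelForCausalLM.from_pretrained(
    "some-70b-base-model",   # placeholder id
    quantization_config=bnb_config,
    torch_dtype=torch.bfloat16,
)

# FULL_SHARD splits parameters/gradients/optimizer state across the two GPUs, and
# CPUOffload parks shards in system RAM when not in use, hence the ~150 GB of RAM.
model = FSDP(
    model,
    sharding_strategy=ShardingStrategy.FULL_SHARD,
    cpu_offload=CPUOffload(offload_params=True),
)
```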
> Does RS-QLORA+ mean rslora + qlora + lora+? What lora+ ratio did you use?
Yes. I used a LoRA+ ratio of 16.
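Roughly speaking, the combination maps onto PEFT like this. This is a sketch rather than the exact training code; it assumes a PEFT version that provides `use_rslora` and `create_loraplus_optimizer`, and the model id and `target_modules` are placeholders:

```python
import torch
import bitsandbytes as bnb
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model
from peft.optimizers import create_loraplus_optimizer

base_model = AutoModelForCausalLM.from_pretrained(
    "some-70b-base-model",                                      # placeholder id
    quantization_config=BitsAndBytesConfig(load_in_4bit=True),  # the "Q" in QLoRA
    torch_dtype=torch.bfloat16,
)

peft_config = LoraConfig(
    r=64,
    lora_alpha=64,
    use_rslora=True,              # the "RS": scale by alpha / sqrt(r) instead of alpha / r
    target_modules="all-linear",  # placeholder; the card does not list target modules
    task_type="CAUSAL_LM",
)
model = get_peft_model(base_model, peft_config)

# LoRA+ trains the B matrices with a higher learning rate than the A matrices.
optimizer = create_loraplus_optimizer(
    model=model,
    optimizer_cls=bnb.optim.AdamW8bit,
    lr=1e-5,                      # learning rate from the training details
    loraplus_lr_ratio=16,         # the ratio mentioned above
)
```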
> Do 32 accumulation steps mean a total batch size of 64 with packed inputs (256k tokens/batch)?
No, I used 16 gradient accumulation steps with a micro-batch size of 1, which works out to a total batch size of 32 across the two GPUs.
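The quick arithmetic behind that number, assuming the two GPUs and fully packed 4096-token sequences:

```python
micro_batch_size = 1
grad_accum_steps = 16
num_gpus = 2
seq_len = 4096

effective_batch = micro_batch_size * grad_accum_steps * num_gpus  # 1 * 16 * 2 = 32
tokens_per_step = effective_batch * seq_len                       # 32 * 4096 = 131,072 (~131k)
print(effective_batch, tokens_per_step)
```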
Interesting, thank you.
Do you think the improvements over v1.2 come more from rslora or lora+?
> Interesting, thank you.
> Do you think the improvements over v1.2 come more from rslora or lora+?
You're welcome! We already used LoRA+ before, so the changes are RSLoRA, the gradient accumulation fix in transformers, and a lower LoRA alpha to suit RSLoRA.
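For context on the alpha change: rsLoRA scales the adapter by alpha / sqrt(r) rather than alpha / r, so at rank 64 the same alpha gives an 8x larger effective scale, which is why alpha typically gets lowered when switching. A tiny illustration with the values from this run (the previous alpha is not stated here):

```python
import math

r, lora_alpha = 64, 64
print(lora_alpha / r)             # classic LoRA scaling   -> 1.0
print(lora_alpha / math.sqrt(r))  # rank-stabilized scale  -> 8.0
```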