Finetuning upstage/SOLAR-10.7B-Instruct-v1.0
I have 2 A10 GPUs (48 GB total memory). I loaded the quantised model (almost 9 GB) and tried finetuning, but got an "out of memory" error. I loaded the model in the following way:
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training

quant_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_use_double_quant=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16  # Changed from bflot16
)

config = LoraConfig(
    r=16,
    lora_alpha=32,
    lora_dropout=0.05,
    bias="none",
    task_type="CAUSAL_LM",
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj", "gate_proj", "up_proj", "down_proj"]
)

model_name = "./SOLAR-10.7B-Instruct-v1.0"
model = AutoModelForCausalLM.from_pretrained(
    model_name, device_map="auto", quantization_config=quant_config, trust_remote_code=True
)
# model.gradient_checkpointing_enable()  ## Added checkpointing
model = prepare_model_for_kbit_training(model, use_gradient_checkpointing=False)
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = get_peft_model(model, config)
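(A quick way to confirm at this point that the LoRA adapters are actually trainable, which is what the later error complains about; this is a minimal check using PEFT's built-in helper, not something from my original script:)

# Sanity check (not in my original script): confirm that the adapters injected
# by get_peft_model() require gradients, i.e. some parameters are trainable.
model.print_trainable_parameters()  # PEFT helper: prints trainable vs. total parameter counts
assert any(p.requires_grad for p in model.parameters()), "no trainable parameters found"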
To overcome this, I tried adding gradient_checkpointing=True in TrainingArguments:
import os
import transformers

def train_model(dsl_train, dsl_test, model, tokenizer, output_dir):
    os.environ["WANDB_DISABLED"] = "true"
    model.config.use_cache = False
    trainer = transformers.Trainer(
        model=model,
        train_dataset=dsl_train,
        eval_dataset=dsl_test,
        args=transformers.TrainingArguments(
            per_device_train_batch_size=1,
            per_device_eval_batch_size=1,
            gradient_accumulation_steps=4,
            gradient_checkpointing=True,
            evaluation_strategy='epoch',
            save_strategy='epoch',
            load_best_model_at_end=True,
            log_level='info',
            overwrite_output_dir=True,
            report_to=None,
            warmup_steps=1,
            num_train_epochs=3,
            learning_rate=2e-4,
            fp16=True,
            logging_steps=1,
            save_steps=1,
            output_dir=output_dir,
            # optim='paged_lion_8bit',  # "paged_adamw_8bit"
        ),
        data_collator=transformers.DataCollatorForLanguageModeling(tokenizer, mlm=False),
    )
    result = trainer.train()
    return result, model, tokenizer
I got the following error:
ERROR - Exception
Traceback (most recent call last):
File "/home/datascience/conda/pytorch20_p39_gpu_v2/lib/python3.9/site-packages/IPython/core/interactiveshell.py", line 3553, in run_code
exec(code_obj, self.user_global_ns, self.user_ns)
File "/tmp/ipykernel_11584/2258950767.py", line 1, in <cell line: 1>
result,model,tokenizer = train_model(dsl_train,dsl_test,model,tokenizer,output_dir)
File "/tmp/ipykernel_11584/1227025339.py", line 30, in train_model
result = trainer.train()
File "/home/datascience/conda/pytorch20_p39_gpu_v2/lib/python3.9/site-packages/transformers/trainer.py", line 1555, in train
return inner_training_loop(
File "/home/datascience/conda/pytorch20_p39_gpu_v2/lib/python3.9/site-packages/transformers/trainer.py", line 1860, in _inner_training_loop
tr_loss_step = self.training_step(model, inputs)
File "/home/datascience/conda/pytorch20_p39_gpu_v2/lib/python3.9/site-packages/transformers/trainer.py", line 2734, in training_step
self.accelerator.backward(loss)
File "/home/datascience/conda/pytorch20_p39_gpu_v2/lib/python3.9/site-packages/accelerate/accelerator.py", line 1851, in backward
self.scaler.scale(loss).backward(**kwargs)
File "/home/datascience/conda/pytorch20_p39_gpu_v2/lib/python3.9/site-packages/torch/_tensor.py", line 487, in backward
torch.autograd.backward(
File "/home/datascience/conda/pytorch20_p39_gpu_v2/lib/python3.9/site-packages/torch/autograd/__init__.py", line 200, in backward
Variable._execution_engine.run_backward( # Calls into the C++ engine to run the backward pass
RuntimeError: element 0 of tensors does not require grad and does not have a grad_fn
I am not aware of what is causing this. I tried the changes provided in https://github.com/huggingface/transformers/issues/25006, but they do not work because SOLAR requires updated versions of transformers, torch and accelerate. Please help me find the cause so I can debug this issue.
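(For readers hitting the same traceback: one workaround commonly suggested for this error when gradient checkpointing is combined with a quantised/PEFT model, shown only as a sketch since I am not certain it is the same change discussed in that issue, is to make the embedding outputs require gradients:)

# Sketch of a commonly suggested workaround (not verified with SOLAR's required
# library versions): make the embedding outputs require grad so the checkpointed
# backward pass stays connected to the autograd graph.
if hasattr(model, "enable_input_require_grads"):
    model.enable_input_require_grads()  # helper on transformers PreTrainedModel
else:
    def make_inputs_require_grad(module, inputs, output):
        output.requires_grad_(True)
    model.get_input_embeddings().register_forward_hook(make_inputs_require_grad)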
Hello,
I successfully fine-tuned this model for another task I have been working on recently. I do not think the problem you are encountering is due to your GPU, because I did it with a single 24 GB GPU. The problem you face is possibly a library configuration issue. Here are my packages; make sure to use a virtualenv and install these:
%pip install -Uqqq pip --progress-bar off
%pip install -qqq torch==2.0.1 --progress-bar off
#!pip install -qqq transformers==4.32.1 --progress-bar off
%pip install git+https://github.com/huggingface/transformers
%pip install -qqq datasets==2.14.4 --progress-bar off
%pip install -qqq peft==0.5.0 --progress-bar off
%pip install -qqq bitsandbytes==0.41.1 --progress-bar off
%pip install -qqq trl==0.7.1 --progress-bar off
%pip install scipy
%pip install accelerate==0.27.2
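After installing, it is worth confirming which versions the kernel actually picked up (a minimal check, not part of my original setup; restart the kernel after installing first):

# Illustrative version check: print what the current environment resolves.
import importlib.metadata as md

for pkg in ["torch", "transformers", "datasets", "peft", "bitsandbytes", "trl", "accelerate"]:
    try:
        print(f"{pkg}: {md.version(pkg)}")
    except md.PackageNotFoundError:
        print(f"{pkg}: not installed")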
Hope this helps!
@halilergul1, I have figured out the issue. I was passing use_gradient_checkpointing=False to prepare_model_for_kbit_training(model, use_gradient_checkpointing=False), while gradient_checkpointing=True was set in TrainingArguments. When I removed use_gradient_checkpointing=False, it worked.
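In other words, the two settings need to agree. A minimal sketch of one consistent combination (illustrative only, reusing model, config and output_dir from the snippets above, not my exact script):

# Keep gradient checkpointing consistent between PEFT's k-bit preparation and the Trainer.
model = prepare_model_for_kbit_training(model, use_gradient_checkpointing=True)  # default is True
model = get_peft_model(model, config)

training_args = transformers.TrainingArguments(
    output_dir=output_dir,
    per_device_train_batch_size=1,
    gradient_accumulation_steps=4,
    gradient_checkpointing=True,  # matches use_gradient_checkpointing=True above
    fp16=True,
)
# Forcing use_gradient_checkpointing=False while the Trainer re-enables checkpointing
# is the mismatch that triggered the "does not require grad" error in my case.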