Fine-tuning Phi-3-vision on custom dataset fails
Hello, thank you for this incredibly powerful model.
I'm trying to fine-tune Phi-3-vision on a custom dataset using LoRA, with the following data collator:
from PIL import Image

class CustomDataCollator:
    def __init__(self, processor):
        self.processor = processor

    def __call__(self, examples):
        texts = []
        images = []
        for example in examples:
            question = "sample question text"
            answer = "sample answer text"
            INST_PREFIX = 'sample instruction prefix'
            messages = [
                {
                    "role": "user",
                    "content": f"<|image_1|>\n{INST_PREFIX} {question}"
                },
                {
                    "role": "assistant",
                    "content": answer
                }
            ]
            text = self.processor.tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
            texts.append(text)
            # DATASET_DIR is defined elsewhere in the script
            image_file = Image.open(f"{DATASET_DIR}/{example['image']}")
            images.append(image_file)
        # The processor is called on the first example only
        batch = self.processor(texts[0], images[0], return_tensors="pt")
        # Labels are a plain copy of input_ids
        labels = batch["input_ids"].clone()
        batch["labels"] = labels
        return batch
I'm getting the following error during loss calculation, which makes me believe there is an issue with the labels (same as input_ids):
Traceback (most recent call last):
File "/root/.../IMMO-Research/src/train/", line 201, in <module>
File "/opt/conda/envs/phi3/lib/python3.12/site-packages/transformers/", line 1885, in train
return inner_training_loop(
File "/opt/conda/envs/phi3/lib/python3.12/site-packages/transformers/", line 2216, in _inner_training_loop
tr_loss_step = self.training_step(model, inputs)
File "/opt/conda/envs/phi3/lib/python3.12/site-packages/transformers/", line 3238, in training_step
loss = self.compute_loss(model, inputs)
File "/opt/conda/envs/phi3/lib/python3.12/site-packages/transformers/", line 3264, in compute_loss
outputs = model(**inputs)
File "/opt/conda/envs/phi3/lib/python3.12/site-packages/torch/nn/modules/", line 1532, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/opt/conda/envs/phi3/lib/python3.12/site-packages/torch/nn/modules/", line 1541, in _call_impl
return forward_call(*args, **kwargs)
File "/opt/conda/envs/phi3/lib/python3.12/site-packages/accelerate/utils/", line 822, in forward
return model_forward(*args, **kwargs)
File "/opt/conda/envs/phi3/lib/python3.12/site-packages/accelerate/utils/", line 810, in __call__
return convert_to_fp32(self.model_forward(*args, **kwargs))
File "/opt/conda/envs/phi3/lib/python3.12/site-packages/torch/amp/", line 16, in decorate_autocast
return func(*args, **kwargs)
File "/root/.cache/huggingface/modules/transformers_modules/microsoft/Phi-3-vision-128k-instruct/dbcdaaacf52c8e40cf8de6d6ffa6ff6860e5f256/", line 1332, in forward
loss = loss_fct(shift_logits, shift_labels)
File "/opt/conda/envs/phi3/lib/python3.12/site-packages/torch/nn/modules/", line 1532, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/opt/conda/envs/phi3/lib/python3.12/site-packages/torch/nn/modules/", line 1541, in _call_impl
return forward_call(*args, **kwargs)
File "/opt/conda/envs/phi3/lib/python3.12/site-packages/torch/nn/modules/", line 1185, in forward
return F.cross_entropy(input, target, weight=self.weight,
File "/opt/conda/envs/phi3/lib/python3.12/site-packages/torch/nn/", line 3086, in cross_entropy
return torch._C._nn.cross_entropy_loss(input, target, weight, _Reduction.get_enum(reduction), ignore_index, label_smoothing)
RuntimeError: CUDA error: device-side assert triggered
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.
On further debugging, I see that the processor output (input_ids/labels) contains a lot of -1 tokens, which might be causing issues when the labels are fed into the cross-entropy loss. Cross-entropy loss expects class indices from 0 up to num_classes - 1 (or config.vocab_size - 1 in this case).
How do I fix this issue? Is there something I'm missing?
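One possible workaround, sketched here under assumptions rather than taken from an official recipe: PyTorch's cross-entropy skips any target equal to its ignore_index, which defaults to -100, so the negative placeholder ids can be replaced with -100 in the labels before the loss is computed. The helper name mask_negative_labels is hypothetical, and plain Python lists stand in for tensors so the snippet runs anywhere; on a tensor the equivalent would be labels[labels < 0] = -100.

```python
IGNORE_INDEX = -100  # default ignore_index of torch.nn.functional.cross_entropy


def mask_negative_labels(input_ids, ignore_index=IGNORE_INDEX):
    """Build labels from input_ids, replacing negative ids with ignore_index."""
    return [ignore_index if token_id < 0 else token_id for token_id in input_ids]


# Toy ids (not real vocabulary entries): -1 marks image placeholder positions.
input_ids = [1, 15, -1, -1, -1, 27, 42, 2]
labels = mask_negative_labels(input_ids)
print(labels)
```

Only the labels are changed here. The input_ids should presumably stay exactly as the processor produced them, since the negative values appear to mark where the image goes.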
On a side note, it would be great if you could provide a fine-tuning script for Phi-3-vision-128k-instruct!
I have been trying the above approach for the past few days. The processor function is not optimised to handle batches of images and texts, and the output it generates for just two examples is also very large in memory.
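If the processor is run on one example at a time instead, the resulting token sequences have different lengths and need padding before they can be batched. A minimal sketch of that padding step, assuming right-padding, a hypothetical helper named pad_batch, and plain lists in place of tensors:

```python
def pad_batch(sequences, pad_id, label_pad_id=-100):
    """Right-pad variable-length id lists to the longest one in the batch."""
    max_len = max(len(seq) for seq in sequences)
    input_ids, attention_mask, labels = [], [], []
    for seq in sequences:
        n_pad = max_len - len(seq)
        input_ids.append(seq + [pad_id] * n_pad)
        attention_mask.append([1] * len(seq) + [0] * n_pad)
        # Padding positions get label_pad_id so the loss ignores them.
        labels.append(seq + [label_pad_id] * n_pad)
    return {"input_ids": input_ids, "attention_mask": attention_mask, "labels": labels}


batch = pad_batch([[5, 6, 7], [8, 9]], pad_id=0)
print(batch)
```

This covers only the text side. The image tensors would need their own handling, and the sketch does nothing about the memory footprint mentioned above.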
@samyak24jain did you fix the error??
@WilliamSotoM I guess it should.
Thanks @bdytx5 ! This is helpful.
I used the code from the blog post on dataset preparation and combined it with PEFT LoRA, but I am getting the error below when training with the Trainer.
I have added a link to download the dataset file (mars_dataset.csv); the original dataset is available on Hugging Face.
RuntimeError Traceback (most recent call last)
<ipython-input-8-3435b262f1ae> in <cell line: 1>()
----> 1 trainer.train()
7 frames
/usr/local/lib/python3.10/dist-packages/transformers/ in train(self, resume_from_checkpoint, trial, ignore_keys_for_eval, **kwargs)
1910 hf_hub_utils.enable_progress_bars()
1911 else:
-> 1912 return inner_training_loop(
1913 args=args,
1914 resume_from_checkpoint=resume_from_checkpoint,
/usr/local/lib/python3.10/dist-packages/transformers/ in _inner_training_loop(self, batch_size, args, resume_from_checkpoint, trial, ignore_keys_for_eval)
2209 step = -1
-> 2210 for step, inputs in enumerate(epoch_iterator):
2211 total_batched_samples += 1
/usr/local/lib/python3.10/dist-packages/accelerate/ in __iter__(self)
452 # We iterate one batch ahead to check when we are at the end
453 try:
--> 454 current_batch = next(dataloader_iter)
455 except StopIteration:
456 yield
/usr/local/lib/python3.10/dist-packages/torch/utils/data/ in __next__(self)
629 # TODO(
630 self._reset() # type: ignore[call-arg]
--> 631 data = self._next_data()
632 self._num_yielded += 1
633 if self._dataset_kind == _DatasetKind.Iterable and \
/usr/local/lib/python3.10/dist-packages/torch/utils/data/ in _next_data(self)
673 def _next_data(self):
674 index = self._next_index() # may raise StopIteration
--> 675 data = self._dataset_fetcher.fetch(index) # may raise StopIteration
676 if self._pin_memory:
677 data = _utils.pin_memory.pin_memory(data, self._pin_memory_device)
/usr/local/lib/python3.10/dist-packages/torch/utils/data/_utils/ in fetch(self, possibly_batched_index)
52 else:
53 data = self.dataset[possibly_batched_index]
---> 54 return self.collate_fn(data)
/usr/local/lib/python3.10/dist-packages/transformers/data/ in default_data_collator(features, return_tensors)
91 if return_tensors == "pt":
---> 92 return torch_default_data_collator(features)
93 elif return_tensors == "tf":
94 return tf_default_data_collator(features)
/usr/local/lib/python3.10/dist-packages/transformers/data/ in torch_default_data_collator(features)
152 if k not in ("label", "label_ids") and v is not None and not isinstance(v, str):
153 if isinstance(v, torch.Tensor):
--> 154 batch[k] = torch.stack([f[k] for f in features])
155 elif isinstance(v, np.ndarray):
156 batch[k] = torch.tensor(np.stack([f[k] for f in features]))
RuntimeError: stack expects each tensor to be equal size, but got [1523, 656, 3] at entry 0 and [583, 571, 3] at entry 1
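The error comes from the default collator calling torch.stack on the raw image arrays, which only works when every array in the batch has the same shape. A small sketch for spotting the offending samples, assuming a hypothetical helper find_shape_mismatches that receives the shapes as tuples:

```python
def find_shape_mismatches(shapes):
    """Return (index, shape) pairs whose shape differs from the first entry."""
    if not shapes:
        return []
    reference = tuple(shapes[0])
    return [(i, tuple(s)) for i, s in enumerate(shapes) if tuple(s) != reference]


# Shapes taken from the error message above.
shapes = [(1523, 656, 3), (583, 571, 3)]
print(find_shape_mismatches(shapes))
```

Resizing every image to one fixed size before converting it to an array, or letting the model's processor produce pixel_values, would be two ways to make the entries stackable.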
The following code was used for the PEFT LoRA fine-tuning:
from google.colab import drive
!pip install -q git+
!pip install -q accelerate datasets peft bitsandbytes flash_attn
# Import necessary libraries
from PIL import Image
import requests
from transformers import AutoModelForCausalLM
from transformers import AutoProcessor
from transformers import BitsAndBytesConfig
from transformers import TrainingArguments, Trainer
from peft import LoraConfig
import torch
import pandas as pd
import numpy as np
DEVICE = "cuda:0"
# Define model ID
checkpoint = "microsoft/Phi-3-vision-128k-instruct"
# Load processor
processor = AutoProcessor.from_pretrained(checkpoint, trust_remote_code=True)
# Define BitsAndBytes configuration for 4-bit quantization
nf4_config = BitsAndBytesConfig(
lora_config = LoraConfig(
target_modules=["q_proj", "k_proj", "v_proj"],
# Load model with 4-bit quantization and map to CUDA
model = AutoModelForCausalLM.from_pretrained(
model_name = checkpoint.split("/")[1]
from torch.utils.data import Dataset, DataLoader, random_split
processor = AutoProcessor.from_pretrained(checkpoint, trust_remote_code=True)
tokenizer = processor.tokenizer
# Custom Dataset for Mars Images
class MarsProductDataset(Dataset):
    def __init__(self, dataframe, tokenizer, max_length, image_size):
        self.dataframe = dataframe
        self.tokenizer = tokenizer
        self.tokenizer.padding_side = 'left'
        self.max_length = max_length

    def __len__(self):
        return len(self.dataframe)

    def __getitem__(self, idx):
        row = self.dataframe.iloc[idx]
        text = f"<|user|>\n<|image_1|>What is shown in this image?<|end|><|assistant|>\nCaption: {row['short_caption']}<|end|>"
        image_path = row['local_image_path']
        # Tokenize text
        encodings = self.tokenizer(text, truncation=True, padding='max_length', max_length=self.max_length)
        try:
            # Load and transform image
            image = Image.open(image_path).convert("RGB")
            image = self.image_transform_function(image)
        except (FileNotFoundError, IOError):
            # Skip the sample if the image is not found
            return None
        encodings['pixel_values'] = image
        #encodings['price'] = row['full_price']
        return {key: torch.tensor(val) for key, val in encodings.items()}

    def image_transform_function(self, image):
        # Converts the PIL image to an array without resizing it
        image = np.array(image)
        return image
# Code to prepare the dataset-
# # Function to download an image from a URL and save it locally
# def download_image(image_url, save_path):
# try:
# response = requests.get(image_url)
# response.raise_for_status() # Check if the request was successful
# image =
# return True
# except Exception as e:
# print(f"Failed to download {image_url}: {e}")
# return False
# # Load the dataset from Hugging Face
# dataset = load_dataset('Magneto/image_for_mars')
# # Convert the Hugging Face dataset to a Pandas DataFrame
# df = dataset['train'].to_pandas()
# import os
# import pandas as pd
# from tqdm import tqdm
# # Create directories to save the dataset and images
# dataset_dir = '/content/drive/MyDrive/Nasa_Phi3_Vision_Finetuning/data/mars_dataset'
# images_dir = os.path.join(dataset_dir, 'images')
# os.makedirs(images_dir, exist_ok=True)
# # Filter out rows where image download fails
# filtered_rows = []
# for idx, row in tqdm(df.iterrows(), total=len(df), desc="Downloading images"):
# image_url = row['image_url']
# image_name = f"{idx}.jpg"
# image_path = os.path.join(images_dir, image_name)
# if download_image(image_url, image_path):
# row['local_image_path'] = image_path
# filtered_rows.append(row)
# # Create a new DataFrame with the filtered rows
# filtered_df = pd.DataFrame(filtered_rows)
# # Save the updated dataset to disk
# dataset_path = os.path.join(dataset_dir, 'mars_dataset.csv')
# filtered_df.to_csv(dataset_path, index=False)
# print(f"Dataset and images saved to {dataset_dir}")
# Load dataset from disk
# link for the file- ""
dataset_path = '/content/drive/MyDrive/Nasa_Phi3_Vision_Finetuning/data/mars_dataset/mars_dataset.csv'
df = pd.read_csv(dataset_path)
# Split dataset into training and validation sets
train_size = int(0.998 * len(df))
val_size = len(df) - train_size
train_indices, val_indices = random_split(range(len(df)), [train_size, val_size])
train_indices = train_indices.indices
val_indices = val_indices.indices
train_df = df.iloc[train_indices]
val_df = df.iloc[val_indices]
# Create dataset and dataloader
train_dataset = MarsProductDataset(train_df, tokenizer, max_length=512, image_size=128)
val_dataset = MarsProductDataset(val_df, tokenizer, max_length=512, image_size=128)
train_loader = DataLoader(train_dataset, batch_size=1, shuffle=True)
val_loader = DataLoader(val_dataset, batch_size=1, shuffle=False)
training_args = TrainingArguments(
dataloader_pin_memory = False,
save_total_limit = 3,
output_dir = f"/content/drive/MyDrive/Nasa_Phi3_Vision_Finetuning/{model_name}-Mars-Rover",
eval_steps = 10,
save_steps = 25,
max_steps = 25,
label_names = ["labels"],
load_best_model_at_end = False,
optim = "paged_adamw_8bit",
trainer = Trainer(
eval_dataset=val_loader, # You can also evaluate (loss) on the eval set, note that it will incur some additional GPU memory
Can someone help? I have the same error.
@bdytx5 @samyak24jain @WilliamSotoM @digitalesingulary @Magneto
You could use this code for fine-tuning the model!
You could use this code. It also has options to tune img_projector and vision_model together, like LLaVA-1.6.
I got stuck on the same problem (the processor function not being optimised to handle batches of images and texts), so I had to prepare the dataset in DataLoader format.
Thanks a lot for the code, it was really helpful.
But while trying to reproduce your results, I got this error:
49 data = self.dataset.__getitems__(possibly_batched_index)
50 else:
---> 51 data = [self.dataset[idx] for idx in possibly_batched_index]
52 else:
53 data = self.dataset[possibly_batched_index]
TypeError: 'DataLoader' object is not subscriptable
I've installed the latest versions of everything:
!pip install -q git+
!pip install -q accelerate datasets peft bitsandbytes flash_attn
Can someone please help me understand why the Trainer doesn't accept a DataLoader as the train dataset?
Is this due to a version issue?
@EphronM As you can see here, the Trainer class in Hugging Face takes a Dataset as input, not a DataLoader. I think you should switch back to passing the dataset.
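The TypeError follows from how the fetcher in the traceback reads its input: it calls self.dataset[idx], which needs __getitem__, while a DataLoader only supports iteration. A minimal sketch of the difference, using hypothetical stand-in classes (ToyDataset, ToyLoader) rather than the real torch ones:

```python
class ToyDataset:
    """Map-style dataset: supports len() and integer indexing."""

    def __init__(self, rows):
        self.rows = rows

    def __len__(self):
        return len(self.rows)

    def __getitem__(self, idx):
        return self.rows[idx]


class ToyLoader:
    """Stand-in for a loader: iterable, but defines no __getitem__."""

    def __init__(self, dataset):
        self.dataset = dataset

    def __iter__(self):
        return (self.dataset[i] for i in range(len(self.dataset)))


dataset = ToyDataset(["a", "b", "c"])
loader = ToyLoader(dataset)

print(dataset[1])  # indexing works on the dataset
try:
    loader[1]  # same failure mode as in the traceback
except TypeError as err:
    print(err)
```

Applied to the code earlier in the thread, this would mean passing train_dataset and val_dataset (the MarsProductDataset instances) to Trainer instead of train_loader and val_loader.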
Thank you all for your interest in the Phi-3 Vision model.
You may want to try the official fine-tuning recipe.