Model Overview

This is a multimodal large language model fine-tuned from Qwen2.5-VL on the R1-Onevision dataset. The model enhances vision-language understanding and reasoning capabilities, making it suitable for various tasks such as visual reasoning, image understanding. With its robust ability to perform multimodal reasoning, R1-Onevision emerges as a powerful AI assistant capable of addressing a wide range of problem-solving challenges across different domains.

Training Configuration and Curve

  • Framework: The training process uses the open-source LLama-Factory library, with Qwen2.5-VL-Instruct as the base model. This model comes in three variants: 3B, 7B, and 32B.
  • Parameters: For efficiency, we use a resolution of 518 for image inputs to save GPU memory. The training follows a full model SFT (Supervised Fine-Tuning) approach with a learning rate of 1e-5, trained for one epoch.

The training configuration is as follows:

image_resolution: 518
cutoff_len: 8192
per_device_train_batch_size: 1
gradient_accumulation_steps: 16
learning_rate: 1.0e-5

num_train_epochs: 1.0
lr_scheduler_type: cosine
warmup_ratio: 0.05
bf16: true
flash_attn: fa2
Downloads last month
13
Safetensors
Model size
8.29B params
Tensor type
BF16
·
Inference Providers NEW
This model is not currently available via any of the supported Inference Providers.

Model tree for Daemontatox/R1_v_7b

Finetuned
(75)
this model

Dataset used to train Daemontatox/R1_v_7b