When Do We Not Need Larger Vision Models?

Model

This is a LLaVA-v1.5-13B model trained with S2-Wrapper, a simple approach that lets any vision model perceive high-resolution images by extracting and merging features at multiple image scales. This model uses image resolutions of up to 1008x1008.
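
As a rough illustration of what the S2-Wrapper call looks like, the sketch below runs a toy ViT-style encoder at 1x, 2x, and 3x of a 336x336 base resolution (matching the 1008x1008 maximum above). It assumes the `s2wrapper` package from the authors' scaling_on_scales repository and its `forward` entry point; the toy backbone is a placeholder, not the actual CLIP vision tower used by this model.

```python
# Hedged sketch of multi-scale feature extraction with S2-Wrapper.
# Assumes the `s2wrapper` package from github.com/bfshi/scaling_on_scales;
# the toy backbone and exact call arguments are illustrative only.
import torch
import torch.nn as nn
from s2wrapper import forward as multiscale_forward


class ToyViT(nn.Module):
    """Stand-in for a ViT-style encoder: images -> (B, num_patches, dim) tokens."""

    def __init__(self, patch_size: int = 14, dim: int = 64):
        super().__init__()
        self.proj = nn.Conv2d(3, dim, kernel_size=patch_size, stride=patch_size)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        tokens = self.proj(x)                      # (B, dim, H/p, W/p)
        return tokens.flatten(2).transpose(1, 2)   # (B, N, dim)


backbone = ToyViT().eval()
x = torch.randn(1, 3, 336, 336)  # base resolution of LLaVA-1.5's CLIP encoder

with torch.no_grad():
    # scales=[1, 2, 3] runs the backbone on 336, 672, and 1008 pixel views
    # (larger views are split into 336x336 crops), then merges the per-scale
    # features, giving an effective input resolution of 1008x1008.
    feats = multiscale_forward(backbone, x, scales=[1, 2, 3])

print(feats.shape)  # channel dimension grows with the number of scales
```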

Training

The training pipeline and dataset follow LLaVA-v1.5 exactly. The model is fine-tuned with LoRA.
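
For orientation, a generic LoRA configuration with Hugging Face PEFT is sketched below. This is not the exact LLaVA-v1.5 training script; the rank, alpha, and target modules are assumed values, and `lmsys/vicuna-13b-v1.5` is named only because it is the language backbone of LLaVA-v1.5-13B.

```python
# Generic LoRA fine-tuning setup with Hugging Face PEFT, shown for orientation
# only; this is not the exact LLaVA-v1.5 training recipe, and the rank, alpha,
# and target modules below are assumed values.
from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model

# Vicuna-13B-v1.5 is the language backbone of LLaVA-v1.5-13B.
base = AutoModelForCausalLM.from_pretrained("lmsys/vicuna-13b-v1.5")

lora_config = LoraConfig(
    r=128,                    # low-rank dimension (assumed value)
    lora_alpha=256,           # LoRA scaling factor (assumed value)
    lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],  # attention projections
    task_type="CAUSAL_LM",
)

model = get_peft_model(base, lora_config)
model.print_trainable_parameters()  # only the LoRA adapter weights are trainable
```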

Benchmarking

| Version | Size | Schedule | Checkpoint | VQAv2 | VizWiz | TextVQA | MMMU-val | MathVista | MM-Bench | SEED | MM-Vet |
|---|---|---|---|---|---|---|---|---|---|---|---|
| LLaVA-1.5 | 13B | full_ft-1e | liuhaotian/llava-v1.5-13b | 80.0 | 53.6 | 61.3 | 36.4 | 27.6 | 67.7 | 68.2 | 36.1 |
| LLaVA-1.5 | 13B | lora-1e | liuhaotian/llava-v1.5-13b-lora | 80.0 | 58.9 | 60.2 | - | - | 68.5 | - | 38.3 |
| LLaVA-1.5-S2 | 13B | lora-1e | this model | 80.9 | 56.0 | 63.1 | 37.4 | 27.8 | 67.9 | 68.9 | 36.4 |

License

Llama 2 is licensed under the LLAMA 2 Community License, Copyright (c) Meta Platforms, Inc. All Rights Reserved.
