LVM

This is the model implementation of the CVPR 2024 paper 'Sequential Modeling Enables Scalable Learning for Large Vision Models' (https://arxiv.org/abs/2312.00785).

LVM is a vision pretraining model that converts various kinds of visual data into visual sentences and performs next-token prediction autoregressively. It is compatible with both GPU and TPU.
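To make the idea of a visual sentence concrete, here is a minimal toy sketch in plain Python. It is not the LVM code: `toy_tokenize` stands in for the VQGAN tokenizer, `toy_next_token` stands in for the autoregressive transformer, and `TOKENS_PER_IMAGE` and `VOCAB_SIZE` are hypothetical values chosen only for illustration. It shows only the shape of the pipeline: images become token blocks, the blocks are concatenated into one sequence, and new tokens are appended one at a time.

```python
from typing import Callable, List, Sequence

TOKENS_PER_IMAGE = 4  # hypothetical; a real tokenizer emits far more tokens per image
VOCAB_SIZE = 16       # hypothetical codebook size


def toy_tokenize(image: Sequence[int]) -> List[int]:
    """Stand-in for a VQGAN encoder: map each value to a codebook index."""
    if len(image) != TOKENS_PER_IMAGE:
        raise ValueError("toy images must have exactly TOKENS_PER_IMAGE values")
    return [int(v) % VOCAB_SIZE for v in image]


def build_visual_sentence(images: Sequence[Sequence[int]]) -> List[int]:
    """Concatenate the tokens of each image into one flat token sequence."""
    sentence: List[int] = []
    for image in images:
        sentence.extend(toy_tokenize(image))
    return sentence


def generate(
    prompt: Sequence[int],
    next_token: Callable[[Sequence[int]], int],
    num_new_tokens: int,
) -> List[int]:
    """Autoregressive loop: each new token is predicted from all previous ones."""
    tokens = list(prompt)
    for _ in range(num_new_tokens):
        tokens.append(next_token(tokens))
    return tokens


def toy_next_token(context: Sequence[int]) -> int:
    """Stand-in for the transformer: copy the token one image earlier."""
    return context[-TOKENS_PER_IMAGE]


# Two identical toy "images" form the prompt; the loop then emits one more image.
prompt = build_visual_sentence([[1, 2, 3, 4], [1, 2, 3, 4]])
completed = generate(prompt, toy_next_token, TOKENS_PER_IMAGE)
print(completed[-TOKENS_PER_IMAGE:])  # tokens of the generated image
```

In the real model the generated tokens would be passed back through the VQGAN decoder to obtain pixels; that step is omitted here.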

You can try out the demo here.

LVM is built on top of OpenLLaMA (an autoregressive model) and OpenMuse (a VQGAN that converts images into visual tokens).

This model was trained in collaboration with HuggingFace. Thanks to Victor Sanh for supporting this project.

Key Differences from the Original Paper Version

  1. We are currently releasing the 7B model (previously 3B). Additional model size variants will be available soon.

  2. Deep filtering (including quality filters, deduplication, and known CSAM content removal) has been applied to the LAION dataset, reducing the dataset size from 1.5B to 1.2B images.

  3. The tokenizer has been improved for better performance.
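As a rough illustration of the kind of filtering mentioned in item 2, the sketch below applies a simple quality filter and exact-duplicate removal to a stream of image records. It is an assumption-laden toy, not the pipeline used for LVM: the `ImageRecord` type, the `MIN_SIDE` threshold, and the choice of SHA-256 content hashing are all hypothetical, and a production pipeline would typically also need near-duplicate detection.

```python
import hashlib
from typing import Iterable, Iterator, NamedTuple, Set


class ImageRecord(NamedTuple):
    """Hypothetical record type; not part of the LVM release."""
    url: str
    data: bytes
    width: int
    height: int


MIN_SIDE = 64  # hypothetical quality threshold, in pixels


def filter_records(
    records: Iterable[ImageRecord], min_side: int = MIN_SIDE
) -> Iterator[ImageRecord]:
    """Yield records that pass the size filter and are not exact duplicates."""
    seen: Set[str] = set()
    for record in records:
        # Quality filter: drop images whose shorter side is too small.
        if min(record.width, record.height) < min_side:
            continue
        # Exact deduplication: drop images whose bytes were already seen.
        digest = hashlib.sha256(record.data).hexdigest()
        if digest in seen:
            continue
        seen.add(digest)
        yield record


records = [
    ImageRecord("a.jpg", b"cat", 256, 256),
    ImageRecord("b.jpg", b"cat", 256, 256),  # same bytes as a.jpg
    ImageRecord("c.jpg", b"dog", 32, 512),   # shorter side below MIN_SIDE
    ImageRecord("d.jpg", b"bird", 128, 96),
]
kept_urls = [r.url for r in filter_records(records)]
print(kept_urls)
```

Because `filter_records` is a generator, it processes records as a stream and keeps only the set of digests in memory, not the images themselves.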

License

LVM is licensed under the Apache 2.0 License.

Citation

If you find LVM useful in your research or applications, please cite our work using the following BibTeX:

@article{bai2023sequential,
  title={Sequential modeling enables scalable learning for large vision models},
  author={Bai, Yutong and Geng, Xinyang and Mangalam, Karttikeya and Bar, Amir and Yuille, Alan and Darrell, Trevor and Malik, Jitendra and Efros, Alexei A},
  journal={arXiv preprint arXiv:2312.00785},
  year={2023}
}

Model size: 6.54B params (Safetensors, tensor type FP16)