Abstract
PaliGemma 2 is an upgrade of the PaliGemma open Vision-Language Model (VLM) based on the Gemma 2 family of language models. We combine the SigLIP-So400m vision encoder that was also used by PaliGemma with the whole range of Gemma 2 models, from the 2B one all the way up to the 27B model. We train these models at three resolutions (224px, 448px, and 896px) in multiple stages to equip them with broad knowledge for transfer via fine-tuning. The resulting family of base models covering different model sizes and resolutions allows us to investigate factors impacting transfer performance (such as learning rate) and to analyze the interplay between the type of task, model size, and resolution. We further increase the number and breadth of transfer tasks beyond the scope of PaliGemma including different OCR-related tasks such as table structure recognition, molecular structure recognition, music score recognition, as well as long fine-grained captioning and radiography report generation, on which PaliGemma 2 obtains state-of-the-art results.
Community
PaliGemma 2 paper is here!
Are there no -mix models (trained on a mixture of tasks) in this release, as there were with PaliGemma 1? Those were the most popular PaliGemma 1 variants.
This is an automated message from the Librarian Bot. I found the following papers similar to this paper.
The following papers were recommended by the Semantic Scholar API
- H2OVL-Mississippi Vision Language Models Technical Report (2024)
- Linear Chain Transformation: Expanding Optimization Dynamics for Fine-Tuning Large Language Models (2024)
- NeKo: Toward Post Recognition Generative Correction Large Language Models with Task-Oriented Experts (2024)
- Awaker2.5-VL: Stably Scaling MLLMs with Parameter-Efficient Mixture of Experts (2024)
- Learn from Downstream and Be Yourself in Multimodal Large Language Model Fine-Tuning (2024)
- Towards Multi-Modal Mastery: A 4.5B Parameter Truly Multi-Modal Small Language Model (2024)
- Multimodal Instruction Tuning with Hybrid State Space Models (2024)
Thanks for the great work!
A question about Table 7 (PaliGemma 2's accuracy on VSR) in the paper: are the PaliGemma 2 models fine-tuned on the VSR train split? Is the VSR train set part of the PaliGemma 2 training data at any stage, whether pretraining or fine-tuning?