Mini-Monkey: Multi-Scale Adaptive Cropping for Multimodal Large Language Models


Mingxin Huang, Yuliang Liu, Dingkang Liang, Lianwen Jin, Xiang Bai



Mini-Monkey is a lightweight MLLM that incorporates a plug-and-play method called the multi-scale adaptive cropping strategy (MSAC). Mini-Monkey adaptively generates multi-scale representations, allowing it to select non-segmented objects from various scales. To mitigate the computational overhead introduced by MSAC, we propose a Scale Compression Mechanism (SCM), which compresses image tokens. Mini-Monkey achieves state-of-the-art performance among 2B-parameter MLLMs. It demonstrates leading performance on a variety of general multimodal understanding tasks and shows consistent improvements in document understanding. On OCRBench, Mini-Monkey scores 802, outperforming the 8B-parameter state-of-the-art model InternVL2-8B. The model and training strategy are also efficient: training requires only eight RTX 3090 GPUs.
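The idea of cropping at more than one scale can be illustrated with a small sketch. This is not the authors' implementation: the tile-count ranges, the tie-breaking rule, and all function names (candidate_grids, closest_grid, shares_cut, multi_scale_grids) are assumptions made for illustration. The sketch picks a fine grid close to the image aspect ratio, then picks a coarse grid whose cut lines do not coincide with the fine ones, so an object split by one grid can remain whole in the other.

```python
from fractions import Fraction
from typing import List, Set, Tuple

Grid = Tuple[int, int]            # (cols, rows)
Box = Tuple[int, int, int, int]   # (left, top, right, bottom)


def candidate_grids(min_tiles: int, max_tiles: int) -> List[Grid]:
    """All (cols, rows) grids whose tile count lies in [min_tiles, max_tiles]."""
    return [(c, r)
            for c in range(1, max_tiles + 1)
            for r in range(1, max_tiles + 1)
            if min_tiles <= c * r <= max_tiles]


def closest_grid(width: int, height: int, grids: List[Grid]) -> Grid:
    """Grid whose aspect ratio is nearest the image's; ties prefer more tiles."""
    target = width / height
    return min(grids, key=lambda g: (abs(g[0] / g[1] - target), -g[0] * g[1]))


def cut_lines(n: int) -> Set[Fraction]:
    """Relative positions of the interior cuts when an axis is split into n parts."""
    return {Fraction(i, n) for i in range(1, n)}


def shares_cut(a: Grid, b: Grid) -> bool:
    """True if the two grids cut the image at the same place on either axis."""
    return bool(cut_lines(a[0]) & cut_lines(b[0])) or \
        bool(cut_lines(a[1]) & cut_lines(b[1]))


def crop_boxes(width: int, height: int, grid: Grid) -> List[Box]:
    """Pixel boxes of every tile in the grid, in row-major order."""
    cols, rows = grid
    return [(c * width // cols, r * height // rows,
             (c + 1) * width // cols, (r + 1) * height // rows)
            for r in range(rows) for c in range(cols)]


def multi_scale_grids(width: int, height: int,
                      fine_range: Tuple[int, int] = (4, 12),    # assumed range
                      coarse_range: Tuple[int, int] = (1, 3)    # assumed range
                      ) -> Tuple[Grid, Grid]:
    """Pick a fine grid, then a coarse grid that avoids the fine grid's cuts."""
    fine = closest_grid(width, height, candidate_grids(*fine_range))
    coarse_options = [g for g in candidate_grids(*coarse_range)
                      if not shares_cut(fine, g)]
    coarse = closest_grid(width, height, coarse_options)
    return fine, coarse


if __name__ == "__main__":
    fine, coarse = multi_scale_grids(1000, 500)
    print("fine grid:", fine, "coarse grid:", coarse)
    print("coarse boxes:", crop_boxes(1000, 500, coarse))
```

For a 1000x500 image, the fine grid cuts the width at x=500, while the coarse grid chosen here cuts at x=333 and x=666, so an object straddling x=500 stays intact in the middle coarse tile.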

TODO

  • Open source code, weight, and data
  • Support training on RTX 3090 GPUs (24 GB of video memory)
  • Mini-Monkey with different LLMs

Model Zoo

Mini-Monkey was trained using 8 3090 GPUs on a dataset

Model               #param  MME     RWQA  AI2D  CCB   SEED  HallB  POPE  MathVista  DocVQA  ChartQA  InfoVQA  TextVQA  OCRBench
Mini-Gemini         35B     2141.0  -     -     -     -     -      -     43.3       -       -        -        -        -
LLaVA-NeXT          35B     2028.0  -     74.9  49.2  75.9  34.8   89.6  46.5       -       -        -        -        -
InternVL 1.2        40B     2175.4  67.5  79.0  59.2  75.6  47.6   88.0  47.7       -       -        -        -        -
InternVL 1.5        26B     2187.8  66.0  80.7  69.8  76.0  49.3   88.3  53.5       90.9    83.8     72.5     80.6     724
DeepSeek-VL         1.7B    1531.6  49.7  51.5  37.6  43.7  27.6   85.9  29.4       -       -        -        -        -
Mini-Gemini         2.2B    1653.0  -     -     -     -     -      -     29.4       -       -        -        -        -
Bunny-StableLM-2    2B      1602.9  -     -     -     58.8  -      85.9  -          -       -        -        -        -
MiniCPM-V-2         2.8B    1808.6  55.8  62.9  48.0  -     36.1   86.3  38.7       71.9    55.6     -        74.1     605
InternVL 2          2B      1876.8  57.3  74.1  74.7  70.9  37.9   85.2  46.3       86.9    76.2     58.9     73.4     784
Mini-Monkey (ours)  2B      1881.9  57.5  74.7  75.5  71.3  38.7   86.7  47.3       87.4    76.5     60.1     75.7     802
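
To make the comparison with the closest 2B baseline easier to read, the following sketch computes the per-benchmark difference between the Mini-Monkey and InternVL 2 rows of the table above. The numbers are copied from those two rows; the variable and function names are illustrative only.

```python
BENCHMARKS = ["MME", "RWQA", "AI2D", "CCB", "SEED", "HallB", "POPE",
              "MathVista", "DocVQA", "ChartQA", "InfoVQA", "TextVQA", "OCRBench"]

# Scores copied from the "InternVL 2" (2B) and "Mini-Monkey (ours)" rows.
INTERNVL2_2B = [1876.8, 57.3, 74.1, 74.7, 70.9, 37.9, 85.2,
                46.3, 86.9, 76.2, 58.9, 73.4, 784]
MINI_MONKEY = [1881.9, 57.5, 74.7, 75.5, 71.3, 38.7, 86.7,
               47.3, 87.4, 76.5, 60.1, 75.7, 802]


def deltas() -> dict:
    """Mini-Monkey score minus InternVL 2 score, rounded to one decimal."""
    return {name: round(ours - base, 1)
            for name, base, ours in zip(BENCHMARKS, INTERNVL2_2B, MINI_MONKEY)}


if __name__ == "__main__":
    for name, diff in deltas().items():
        print(f"{name:10s} {diff:+.1f}")
```

Every difference is positive, with the largest absolute gap on OCRBench (784 versus 802).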

Environment

conda create -n minimonkey python=3.10
conda activate minimonkey
git clone https://github.com/Yuliang-Liu/Monkey.git
cd ./Monkey/project/mini_monkey
pip install -r requirements.txt

Install flash-attn==2.3.6:

pip install flash-attn==2.3.6 --no-build-isolation

Alternatively, you can compile from source:

git clone https://github.com/Dao-AILab/flash-attention.git
cd flash-attention
git checkout v2.3.6
python setup.py install

Evaluate

We use the VLMEvalKit repository for model evaluation.

Inference

We provide an example of inference code here

Train

Prepare Training Datasets

Inspired by InternVL 1.2, we adopted LLaVA-ZH, DVQA, ChartQA, AI2D, DocVQA, GeoQA+, and SynthDoG-EN as training data. Most of the data remains consistent with InternVL 1.2.

First, download the annotation files and place them in the playground/opensource/ folder.

Second, download all the images we used.

Then, organize the data as follows in playground/data:

playground/
├── opensource
│   ├── ai2d_train_12k.jsonl
│   ├── chartqa_train_18k.jsonl
│   ├── docvqa_train_10k.jsonl
│   ├── dvqa_train_200k.jsonl
│   ├── geoqa+.jsonl
│   ├── llava_instruct_150k_zh.jsonl
│   └── synthdog_en.jsonl
├── data
│   ├── ai2d
│   │   ├── abc_images
│   │   └── images
│   ├── chartqa
│   │   ├── test
│   │   ├── train
│   │   └── val
│   ├── coco
│   │   └── train2017
│   ├── docvqa
│   │   ├── test
│   │   ├── train
│   │   └── val
│   ├── dvqa
│   │   └── images
│   ├── llava
│   │   └── llava_pretrain
│   │       └── images
│   ├── synthdog-en
│   │   └── images
│   ├── geoqa+
│   │   └── images
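
Before launching training, the layout above can be checked with a short script. This is a sketch rather than part of the repository: the function name missing_paths is hypothetical, and the expected paths are simply transcribed from the directory tree shown here.

```python
from pathlib import Path
from typing import List

# Annotation files expected under playground/opensource (from the tree above).
ANNOTATION_FILES = [
    "ai2d_train_12k.jsonl", "chartqa_train_18k.jsonl", "docvqa_train_10k.jsonl",
    "dvqa_train_200k.jsonl", "geoqa+.jsonl", "llava_instruct_150k_zh.jsonl",
    "synthdog_en.jsonl",
]

# Image directories expected under playground/data (from the tree above).
IMAGE_DIRS = [
    "ai2d/abc_images", "ai2d/images",
    "chartqa/test", "chartqa/train", "chartqa/val",
    "coco/train2017",
    "docvqa/test", "docvqa/train", "docvqa/val",
    "dvqa/images",
    "llava/llava_pretrain/images",
    "synthdog-en/images",
    "geoqa+/images",
]


def missing_paths(root: str = "playground") -> List[str]:
    """Return every expected file or directory that does not exist under root."""
    base = Path(root)
    expected = [base / "opensource" / name for name in ANNOTATION_FILES]
    expected += [base / "data" / rel for rel in IMAGE_DIRS]
    return [str(p) for p in expected if not p.exists()]


if __name__ == "__main__":
    missing = missing_paths()
    for path in missing:
        print("missing:", path)
    if not missing:
        print("playground/ layout looks complete")
```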

Execute the training code:

sh shell/minimonkey/minimonkey_finetune_full.sh

Citing Mini-Monkey

If you wish to refer to the baseline results published here, please use the following BibTeX entry:

@article{huang2024mini,
  title={Mini-Monkey: Multi-Scale Adaptive Cropping for Multimodal Large Language Models},
  author={Huang, Mingxin and Liu, Yuliang and Liang, Dingkang and Jin, Lianwen and Bai, Xiang},
  journal={arXiv preprint arXiv:2408.02034},
  year={2024}
}

Copyright

We welcome suggestions to help us improve Mini-Monkey. For any queries, please contact Dr. Yuliang Liu: [email protected]. If you find something interesting, please feel free to share it with us by email or open an issue.
