---
license: mit
---
## INF-MLLM2: High-Resolution Image and Document Understanding

INF-MLLM2 introduces significant updates, particularly in high-resolution image processing, document understanding, and OCR. The key improvements include the following:

- Dynamic Image Resolution Support: The model now supports dynamic image resolutions up to 1344x1344 pixels (see the tiling sketch after this list).
- Enhanced OCR Capabilities: Significantly improved OCR enables robust document parsing, table and formula recognition, document layout analysis, and key information extraction.
- Advanced Training Strategies: We employ a progressive multi-stage training strategy along with an enhanced data-mixup strategy tailored to image and document multitask scenarios.
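
As a rough illustration of the dynamic-resolution idea, the sketch below tiles an arbitrary image into fixed-size crops plus a global thumbnail. This is a generic tiling scheme, not INF-MLLM2's actual implementation; the 1344x1344 cap comes from above, while the 448-pixel tile size is an assumption.

```python
# Generic dynamic-resolution tiling sketch -- NOT the model's actual code.
# Assumed tile size: 448 px, so 1344x1344 corresponds to a 3x3 tile grid.
from PIL import Image

MAX_SIDE = 1344  # resolution cap stated above
TILE = 448       # hypothetical vision-encoder input size

def tile_image(img: Image.Image):
    """Downscale so neither side exceeds MAX_SIDE, snap each side to the
    tile grid, and return the TILE x TILE crops plus a global thumbnail."""
    w, h = img.size
    scale = min(MAX_SIDE / max(w, h), 1.0)
    new_w = min(max(TILE, round(w * scale / TILE) * TILE), MAX_SIDE)
    new_h = min(max(TILE, round(h * scale / TILE) * TILE), MAX_SIDE)
    img = img.resize((new_w, new_h))
    tiles = [
        img.crop((x, y, x + TILE, y + TILE))
        for y in range(0, new_h, TILE)
        for x in range(0, new_w, TILE)
    ]
    thumbnail = img.resize((TILE, TILE))  # coarse global view
    return tiles, thumbnail
```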

<p align="center">
  <img src="docs/model.png" alt="" width="100%"/>
</p>

[Technical Report](docs/tech_report.pdf)

### Install

```bash
conda create -n infmllm2 python=3.9
conda activate infmllm2
conda install pytorch==2.2.1 torchvision==0.17.1 torchaudio==2.2.1

pip install transformers==4.40.2 timm==0.5.4 pillow==10.4.0 sentencepiece==0.1.99
pip install bigmodelvis peft einops spacy
```
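
After installing, a quick import check (a minimal sketch; these are exactly the packages pinned above) confirms the environment resolved correctly:

```python
# Sanity-check the pinned environment; versions should match the pins above.
import torch
import torchvision
import transformers
import timm
import PIL

print("torch:", torch.__version__)                # expect 2.2.1
print("torchvision:", torchvision.__version__)    # expect 0.17.1
print("transformers:", transformers.__version__)  # expect 4.40.2
print("timm:", timm.__version__)                  # expect 0.5.4
print("pillow:", PIL.__version__)                 # expect 10.4.0
print("CUDA available:", torch.cuda.is_available())
```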

### Model Zoo

We have released the INF-MLLM2-7B model on Hugging Face:

- [INF-MLLM2-7B](https://huggingface.co/QianYEee/InfMLLM2_7B_chat)
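
To fetch the weights locally, here is a minimal sketch using `huggingface_hub` (installed as a dependency of `transformers`); the local directory name is an arbitrary choice:

```python
# Download the released checkpoint from Hugging Face.
from huggingface_hub import snapshot_download

local_dir = snapshot_download(
    repo_id="QianYEee/InfMLLM2_7B_chat",
    local_dir="./InfMLLM2_7B_chat",  # arbitrary destination path
)
print("Model downloaded to:", local_dir)
```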

### Evaluation

Comparison with general multimodal LLMs across multiple benchmarks and OCR-related tasks:

<p align="center">
  <img src="docs/results_1.jpg" alt="" width="90%"/>
</p>

Comparison with OCR-free multimodal LLMs on content parsing of documents, tables, and formulas:

<p align="center">
  <img src="docs/results_2.jpg" alt="" width="90%"/>
</p>

Comparison with OCR-free multimodal LLMs on key information extraction:

<p align="center">
  <img src="docs/results_3.jpg" alt="" width="90%"/>
</p>

### Visualization

<p align="center">
  <img src="docs/demo1.png" alt="" width="90%"/>
</p>

<p align="center">
  <img src="docs/demo2.png" alt="" width="90%"/>
</p>

<p align="center">
  <img src="docs/demo3.png" alt="" width="90%"/>
</p>

<p align="center">
  <img src="docs/table_equation.png" alt="" width="90%"/>
</p>

### Usage

The inference process for INF-MLLM2 is straightforward. We also provide a simple [demo.py](demo.py) script as a reference.

```bash
CUDA_VISIBLE_DEVICES=0 python demo.py --model_path /path/to/InfMLLM2_7B_chat
```
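
For programmatic use, the sketch below shows the standard `trust_remote_code` loading pattern. The exact generation entry point is defined by the repository's remote code, so the `model.chat(...)` call here is hypothetical; defer to [demo.py](demo.py) for the authoritative interface.

```python
# Minimal loading sketch -- see demo.py for the authoritative interface.
import torch
from transformers import AutoModel, AutoTokenizer

model_path = "/path/to/InfMLLM2_7B_chat"
tokenizer = AutoTokenizer.from_pretrained(model_path, trust_remote_code=True)
model = AutoModel.from_pretrained(
    model_path,
    torch_dtype=torch.float16,  # half precision for a 7B model on one GPU
    trust_remote_code=True,     # the model class ships inside the repo
).cuda().eval()

# Hypothetical chat call: the real signature is defined by the remote code.
# response = model.chat(tokenizer, image="docs/demo1.png",
#                       question="Describe this image.")
# print(response)
```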

## Acknowledgement

We thank the authors of [LLaVA-Next](https://github.com/LLaVA-VL/LLaVA-NeXT.git) and [InternLM-XComposer](https://github.com/InternLM/InternLM-XComposer.git) for their great work.