---
license: mit
---
## INF-MLLM2: High-Resolution Image and Document Understanding

INF-MLLM2 introduces significant updates, particularly in high-resolution image processing, document understanding, and OCR.
The key improvements include the following:
- Dynamic Image Resolution Support: The model now supports dynamic image resolutions up to 1344x1344 pixels (see the preprocessing sketch after this list).
- Enhanced OCR Capabilities: Substantially stronger OCR enables robust document parsing, table and formula recognition, document layout analysis, and key information extraction.
- Advanced Training Strategies: We employ a progressive multi-stage training strategy along with an enhanced data mixup strategy tailored to image and document multitask scenarios.

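To make the dynamic-resolution support concrete, the sketch below shows one common way such preprocessing works: resize the image to the nearest grid of fixed-size tiles, capped at the maximum resolution. This is an illustration only, not INF-MLLM2's actual pipeline; the 336-pixel tile size (so 1344 = 4 x 336) and the grid-selection heuristic are assumptions.

```python
# Dynamic-resolution preprocessing sketch. The tile size and grid heuristic
# are illustrative assumptions, not INF-MLLM2's actual pipeline.
from PIL import Image

TILE = 336      # assumed base tile size: 1344 = 4 * 336
MAX_GRID = 4    # the 1344x1344 cap allows at most a 4x4 tile grid

def dynamic_resize(img: Image.Image):
    """Resize to the nearest tile grid (at most MAX_GRID x MAX_GRID)."""
    w, h = img.size
    cols = min(MAX_GRID, max(1, round(w / TILE)))
    rows = min(MAX_GRID, max(1, round(h / TILE)))
    resized = img.resize((cols * TILE, rows * TILE), Image.Resampling.BICUBIC)
    return resized, rows, cols

def to_tiles(img: Image.Image, rows: int, cols: int):
    """Split the resized image into fixed-size tiles, row-major."""
    return [
        img.crop((c * TILE, r * TILE, (c + 1) * TILE, (r + 1) * TILE))
        for r in range(rows)
        for c in range(cols)
    ]

resized, rows, cols = dynamic_resize(Image.open("docs/demo1.png").convert("RGB"))
tiles = to_tiles(resized, rows, cols)
print(f"{rows}x{cols} grid -> {len(tiles)} tiles of {TILE}x{TILE}")
```
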
<p align="center">
<img src="docs/model.png" alt="" width="100%"/>
</p>

[Technical Report](docs/tech_report.pdf)

### Install

```bash
conda create -n infmllm2 python=3.9
conda activate infmllm2
# torchaudio 2.2.1 is the release paired with pytorch 2.2.1 / torchvision 0.17.1
conda install pytorch==2.2.1 torchvision==0.17.1 torchaudio==2.2.1

pip install transformers==4.40.2 timm==0.5.4 pillow==10.4.0 sentencepiece==0.1.99
pip install bigmodelvis peft einops spacy
```

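After installation, a quick sanity check (a minimal sketch, not part of the official repo) confirms the pinned versions and that a GPU is visible:

```python
# check_env.py -- verify the pinned environment (illustrative, not from the repo)
import torch
import torchvision
import transformers
import timm
import PIL

for name, mod in [("torch", torch), ("torchvision", torchvision),
                  ("transformers", transformers), ("timm", timm), ("Pillow", PIL)]:
    print(f"{name:>12}: {mod.__version__}")

# Inference expects a GPU; warn early if none is visible.
print(f"CUDA available: {torch.cuda.is_available()}")
```
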
### Model Zoo
We have released the INF-MLLM2-7B model on Hugging Face.
- [INF-MLLM2-7B](https://huggingface.co/QianYEee/InfMLLM2_7B_chat)

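To fetch the weights ahead of time (for example, to pass a local path to the demo script below), the standard `huggingface_hub` download works; the target directory here is just an example:

```python
# Download the released checkpoint with huggingface_hub (illustrative).
from huggingface_hub import snapshot_download

local_path = snapshot_download(
    repo_id="QianYEee/InfMLLM2_7B_chat",
    local_dir="./InfMLLM2_7B_chat",  # example target directory
)
print(f"Model downloaded to {local_path}")
```
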
### Evaluation
Comparison with general multimodal LLMs across multiple benchmarks and OCR-related tasks:
<p align="center">
<img src="docs/results_1.jpg" alt="" width="90%"/>
</p>

Comparison with OCR-free multimodal LLMs on content parsing of documents, tables, and formulas:
<p align="center">
<img src="docs/results_2.jpg" alt="" width="90%"/>
</p>

Comparison with OCR-free multimodal LLMs on key information extraction:
<p align="center">
<img src="docs/results_3.jpg" alt="" width="90%"/>
</p>

### Visualization

<p align="center">
<img src="docs/demo1.png" alt="" width="90%"/>
</p>

<p align="center">
<img src="docs/demo2.png" alt="" width="90%"/>
</p>

<p align="center">
<img src="docs/demo3.png" alt="" width="90%"/>
</p>

<p align="center">
<img src="docs/table_equation.png" alt="" width="90%"/>
</p>

### Usage

Inference with INF-MLLM2 is straightforward. We also provide a simple [demo.py](demo.py) script as a reference:

```bash
CUDA_VISIBLE_DEVICES=0 python demo.py --model_path /path/to/InfMLLM2_7B_chat
```
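
For use outside the demo script, the checkpoint can be loaded through `transformers` with `trust_remote_code=True`, since the model class ships with the checkpoint. The sketch below is a guess at the call pattern, not the repo's documented API: the `chat` method name and its arguments are assumptions modeled on similar chat-style MLLMs, so consult [demo.py](demo.py) for the actual interface.

```python
# Minimal loading sketch -- the chat() call is an assumed interface, see demo.py.
import torch
from transformers import AutoModel, AutoTokenizer

model_path = "/path/to/InfMLLM2_7B_chat"
tokenizer = AutoTokenizer.from_pretrained(model_path, trust_remote_code=True)
model = AutoModel.from_pretrained(
    model_path,
    torch_dtype=torch.float16,   # half precision to fit a single GPU
    trust_remote_code=True,
).cuda().eval()

# Hypothetical chat-style call; the real signature is defined in the
# checkpoint's remote code and demonstrated in demo.py.
with torch.no_grad():
    response = model.chat(tokenizer, query="Describe this image.",
                          image="docs/demo1.png")
print(response)
```
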
## Acknowledgement

We thank the authors of [LLaVA-NeXT](https://github.com/LLaVA-VL/LLaVA-NeXT.git) and [InternLM-XComposer](https://github.com/InternLM/InternLM-XComposer.git) for their great work.