czczup committed
Commit f797792 • 1 Parent(s): ca1856d

Upload folder using huggingface_hub

README.md CHANGED
@@ -5,6 +5,7 @@ library_name: transformers
 base_model:
 - OpenGVLab/InternViT-6B-448px-V1-2
 - NousResearch/Nous-Hermes-2-Yi-34B
+new_version: OpenGVLab/InternVL2_5-38B
 base_model_relation: merge
 language:
 - multilingual
@@ -19,16 +20,20 @@ tags:
 
 # InternVL-Chat-V1-2
 
-[\[📂 GitHub\]](https://github.com/OpenGVLab/InternVL) [\[🆕 Blog\]](https://internvl.github.io/blog/) [\[📜 InternVL 1.0 Paper\]](https://arxiv.org/abs/2312.14238) [\[📜 InternVL 1.5 Report\]](https://arxiv.org/abs/2404.16821)
+[\[📂 GitHub\]](https://github.com/OpenGVLab/InternVL) [\[🆕 Blog\]](https://internvl.github.io/blog/) [\[📜 InternVL 1.0\]](https://arxiv.org/abs/2312.14238) [\[📜 InternVL 1.5\]](https://arxiv.org/abs/2404.16821) [\[📜 Mini-InternVL\]](https://arxiv.org/abs/2410.16261)
 
 [\[🗨️ Chat Demo\]](https://internvl.opengvlab.com/) [\[🤗 HF Demo\]](https://huggingface.co/spaces/OpenGVLab/InternVL) [\[🚀 Quick Start\]](#quick-start) [\[📖 中文解读\]](https://zhuanlan.zhihu.com/p/706547971) [\[📖 Documents\]](https://internvl.readthedocs.io/en/latest/)
 
+<div align="center">
+  <img width="500" alt="image" src="https://cdn-uploads.huggingface.co/production/uploads/64006c09330a45b03605bba3/zJsd2hqd3EevgXo6fNgC-.png">
+</div>
+
 ## Introduction
 
 We are excited to introduce [🤗 InternVL-Chat-V1-2](https://huggingface.co/OpenGVLab/InternVL-Chat-V1-2). Inspired by [LLaVA-NeXT-34B](https://llava-vl.github.io/blog/2024-01-30-llava-next/), we have also adopted [Nous-Hermes-2-Yi-34B](https://huggingface.co/NousResearch/Nous-Hermes-2-Yi-34B) as the language model. Below is the pipeline.
 
 <p align="center">
-  <img src="https://cdn-uploads.huggingface.co/production/uploads/64119264f0f81eb569e0d569/GIEKCvNc1Y5iMQqLv645p.png" style="width: 100%;">
+  <img width="600" alt="image" src="https://cdn-uploads.huggingface.co/production/uploads/64119264f0f81eb569e0d569/GIEKCvNc1Y5iMQqLv645p.png">
 </p>
 
 From the experimental results, we've observed that **a stronger language model (34B) can better leverage the powerful capabilities of our vision foundation model.**
@@ -100,7 +105,7 @@ We provide an example code to run InternVL-Chat-V1-2 using `transformers`.
 
 We also welcome you to experience the InternVL2 series models in our [online demo](https://internvl.opengvlab.com/).
 
-> Please use transformers==4.37.2 to ensure the model works normally.
+> Please use transformers>=4.37.2 to ensure the model works normally.
 
 ### Model Loading
 
@@ -455,7 +460,7 @@ print(f'User: {question}')
 print(f'Assistant: {response}')
 ```
 
-#### Streaming output
+#### Streaming Output
 
 Besides this method, you can also use the following code to get streamed output.
 
@@ -493,6 +498,12 @@ This project is released under the MIT license. Parts of this project contain co
 If you find this project useful in your research, please consider citing:
 
 ```BibTeX
+@article{gao2024mini,
+  title={Mini-internvl: A flexible-transfer pocket multimodal model with 5\% parameters and 90\% performance},
+  author={Gao, Zhangwei and Chen, Zhe and Cui, Erfei and Ren, Yiming and Wang, Weiyun and Zhu, Jinguo and Tian, Hao and Ye, Shenglong and He, Junjun and Zhu, Xizhou and others},
+  journal={arXiv preprint arXiv:2410.16261},
+  year={2024}
+}
 @article{chen2023internvl,
   title={InternVL: Scaling up Vision Foundation Models and Aligning for Generic Visual-Linguistic Tasks},
   author={Chen, Zhe and Wu, Jiannan and Wang, Wenhai and Su, Weijie and Chen, Guo and Xing, Sen and Zhong, Muyan and Zhang, Qinglong and Zhu, Xizhou and Lu, Lewei and Li, Bin and Luo, Ping and Lu, Tong and Qiao, Yu and Dai, Jifeng},
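
One practical README change above relaxes the dependency note from a strict pin (`transformers==4.37.2`) to a minimum version (`transformers>=4.37.2`). As a minimal sketch, not part of this commit, a loading script could check that requirement up front; the constant name and error message below are purely illustrative:

```python
# Illustrative only: fail fast if the installed transformers is older than the
# 4.37.2 minimum that the README documents for InternVL-Chat-V1-2.
from packaging import version

import transformers

MIN_VERSION = "4.37.2"

if version.parse(transformers.__version__) < version.parse(MIN_VERSION):
    raise RuntimeError(
        f"transformers {transformers.__version__} is installed, but "
        f"InternVL-Chat-V1-2 is documented to need transformers>={MIN_VERSION}."
    )
```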
configuration_internvl_chat.py CHANGED
@@ -38,11 +38,11 @@ class InternVLChatConfig(PretrainedConfig):
         super().__init__(**kwargs)
 
         if vision_config is None:
-            vision_config = {}
+            vision_config = {'architectures': ['InternVisionModel']}
             logger.info('vision_config is None. Initializing the InternVisionConfig with default values.')
 
         if llm_config is None:
-            llm_config = {}
+            llm_config = {'architectures': ['LlamaForCausalLM']}
             logger.info('llm_config is None. Initializing the LlamaConfig config with default values (`LlamaConfig`).')
 
         self.vision_config = InternVisionConfig(**vision_config)
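
The configuration change above only swaps the empty fallback dicts for ones that pre-populate `architectures`. A small standalone sketch of the effect, using the generic `transformers.PretrainedConfig` instead of the repository's `InternVisionConfig`/`LlamaConfig` classes (an assumption made purely for illustration):

```python
# Standalone sketch of the new fallback behaviour: when no sub-config is given,
# the default dict now records which model class backs that sub-config, so the
# resulting config carries `architectures` instead of leaving it as None.
from transformers import PretrainedConfig


def build_sub_config(sub_config=None, default_architectures=None):
    if sub_config is None:
        # Mirrors the updated defaults in configuration_internvl_chat.py.
        sub_config = {'architectures': default_architectures}
    return PretrainedConfig(**sub_config)


vision_cfg = build_sub_config(default_architectures=['InternVisionModel'])
llm_cfg = build_sub_config(default_architectures=['LlamaForCausalLM'])
print(vision_cfg.architectures)  # ['InternVisionModel']
print(llm_cfg.architectures)     # ['LlamaForCausalLM']
```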
modeling_intern_vit.py CHANGED
@@ -3,6 +3,7 @@
 # Copyright (c) 2024 OpenGVLab
 # Licensed under The MIT License [see LICENSE for details]
 # --------------------------------------------------------
+
 from typing import Optional, Tuple, Union
 
 import torch