How does the image/video captioning ability compare to other methods, such as InternVL-2.5 and ShareGPT4V?
#3 opened 10 days ago by menglan
How to use only the text and visual embeddings?
1
#2 opened 13 days ago by Labmem009