I'd like to fine-tune the model for a visual grounding task with multiple images. If that's possible, could you please give me an example? In particular, I want to know how to distinguish the bounding boxes of the first image from those of the second image.