Multi-image input support (#34)
opened by cynricfu
Does Molmo currently support an input with one text and multiple images?
And what about interleaved image-text input?
I am also wondering about this.
Hey @cynricfu and @Michael34234, Molmo does not support multiple images at the moment, but it is capable of responding to interleaved image-text input.
@amanrangapur obviously knows best, but I wanted to add a couple of thoughts to this convo:
- I have had real success with putting two images side by side in one image and having Molmo compare them.
example:
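  A minimal sketch of that side-by-side trick (the `combine_side_by_side` helper and the example file names are illustrative, not from the original post):

  ```python
  from PIL import Image

  def combine_side_by_side(img_a: Image.Image, img_b: Image.Image) -> Image.Image:
      """Paste two images onto one canvas so Molmo sees them as a single image."""
      img_a = img_a.convert("RGB")
      img_b = img_b.convert("RGB")
      # pad to the taller image's height so neither picture is distorted
      canvas = Image.new("RGB", (img_a.width + img_b.width, max(img_a.height, img_b.height)), "white")
      canvas.paste(img_a, (0, 0))
      canvas.paste(img_b, (img_a.width, 0))
      return canvas

  # e.g. combined = combine_side_by_side(Image.open("before.png"), Image.open("after.png"))
  #      inputs = processor.process(images=[combined], text="Compare the left and right halves.")
  ```

  Keeping each image at its native size and padding to the taller one means the prompt can simply refer to the left and right halves.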
- With regard to feeding 2 separate images... I was actually just trying an experiment with this, and it seems to me that it works very well:
```python
from io import BytesIO

import torch
from PIL import Image
from transformers import GenerationConfig

# `processor` and `model` are assumed to be loaded elsewhere;
# `image1`/`image2` are raw image bytes and `request` comes from the surrounding web route
img1 = Image.open(BytesIO(image1)).convert("RGB")
img2 = Image.open(BytesIO(image2)).convert("RGB")

prompt = request.form['prompt'] or "These 2 images are from before and after. Describe the specific differences."

with torch.no_grad():
    with torch.autocast('cuda', enabled=True, dtype=torch.bfloat16):
        print('Processing inputs')
        # process both images and the text in a single call
        inputs = processor.process(
            images=[img1, img2],
            text=prompt
        )

        # move inputs to the correct device and make a batch of size 1
        inputs = {k: v.to(model.device).unsqueeze(0) for k, v in inputs.items()}

        print('Generating outputs')
        # generate output; maximum 500 new tokens; stop generation when <|endoftext|> is generated
        output = model.generate_from_batch(
            inputs,
            GenerationConfig(max_new_tokens=500, stop_strings="<|endoftext|>"),
            tokenizer=processor.tokenizer,
        )

        # only decode the newly generated tokens (everything after the prompt)
        generated_tokens = output[0, inputs['input_ids'].size(1):]
        generated_text = processor.tokenizer.decode(generated_tokens, skip_special_tokens=True)
```
It's also a bit prone to hallucination, but I am running it at bf16...
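For reference, here is roughly how the `model` and `processor` used above can be loaded, following the public Molmo checkpoint on Hugging Face (the checkpoint name and the dtype/device settings are assumptions; adjust as needed):

```python
from transformers import AutoModelForCausalLM, AutoProcessor

MODEL_ID = "allenai/Molmo-7B-D-0924"  # assumed checkpoint; other Molmo variants load the same way

# trust_remote_code is needed because Molmo ships its own modeling and processing code
processor = AutoProcessor.from_pretrained(
    MODEL_ID, trust_remote_code=True, torch_dtype="auto", device_map="auto"
)
model = AutoModelForCausalLM.from_pretrained(
    MODEL_ID, trust_remote_code=True, torch_dtype="auto", device_map="auto"
)
```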
@mw44! That's a great workaround. Molmo currently only processes one image at a time in its official implementation. It is interesting that the processor function concatenates the image embeddings in the proper way.
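If you want to see how the processor packs the two images, a quick sanity check is to print the shapes it returns (a sketch; it only assumes the same `processor.process` call as in the snippet above):

```python
# Every value returned by processor.process() is a tensor, so printing the
# shapes shows how the two images' crops/patches are packed together.
inputs = processor.process(images=[img1, img2], text="Describe the differences.")
for name, tensor in inputs.items():
    print(f"{name}: {tuple(tensor.shape)} {tensor.dtype}")
```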