Generate text responses from image and text inputs
A tiny vision-language model
Generate text based on input prompts