Extract text from images using OCR
Combine text and images to generate responses
Transcribe or translate audio from files, a microphone, or YouTube