Extract text from images using various OCR modes
Generate animated avatars from images
Video captioning and tracking
4M: Massively Multimodal Masked Modeling