Dense Grounded Understanding of Images and Videos
Generate spatial audio from images (and optionally text)