Upgraded to v1.0!
Analyze an image to generate a descriptive prompt
A tiny vision-language model
Isolate vocals from audio files
Identify languages in text