Tony Zhao

tianchez

AI & ML interests

Multimodal Agent, Generative AI

Recent Activity

Organizations

Om AI Lab's profile picture

tianchez's activity

replied to their post 14 days ago
reacted to their post with πŸ‘ 15 days ago
view post
Post
4082
Introducing VLM-R1!

GRPO has helped DeepSeek R1 to learn reasoning. Can it also help VLMs perform stronger for general computer vision tasks?

The answer is YES and it generalizes better than SFT. We trained Qwen 2.5 VL 3B on RefCOCO (a visual grounding task) and eval on RefCOCO Val and RefGTA (an OOD task).

https://github.com/om-ai-lab/VLM-R1
Β·
New activity in omlab/VLM-R1-Referral-Expression 16 days ago
reacted to their post with ❀️πŸ”₯ 20 days ago
view post
Post
4082
Introducing VLM-R1!

GRPO has helped DeepSeek R1 to learn reasoning. Can it also help VLMs perform stronger for general computer vision tasks?

The answer is YES and it generalizes better than SFT. We trained Qwen 2.5 VL 3B on RefCOCO (a visual grounding task) and eval on RefCOCO Val and RefGTA (an OOD task).

https://github.com/om-ai-lab/VLM-R1
Β·
reacted to merve's post with πŸ‘ 23 days ago
view post
Post
4691
Your weekly recap of open AI is here, and it's packed with models! merve/feb-14-releases-67af876b404cc27c6d837767

πŸ‘€ Multimodal
> OpenGVLab released InternVideo 2.5 Chat models, new video LMs with long context
> AIDC released Ovis2 model family along with Ovis dataset, new vision LMs in different sizes (1B, 2B, 4B, 8B, 16B, 34B), with video and OCR support
> ColQwenStella-2b is a multilingual visual retrieval model that is sota in it's size
> Hoags-2B-Exp is a new multilingual vision LM with contextual reasoning, long context video understanding

πŸ’¬ LLMs
A lot of math models!
> Open-R1 team released OpenR1-Math-220k large scale math reasoning dataset, along with Qwen2.5-220K-Math fine-tuned on the dataset, OpenR1-Qwen-7B
> Nomic AI released new Nomic Embed multilingual retrieval model, a MoE with 500 params with 305M active params, outperforming other models
> DeepScaleR-1.5B-Preview is a new DeepSeek-R1-Distill fine-tune using distributed RL on math
> LIMO is a new fine-tune of Qwen2.5-32B-Instruct on Math

πŸ—£οΈ Audio
> Zonos-v0.1 is a new family of speech recognition models, which contains the model itself and embeddings

πŸ–ΌοΈ Vision and Image Generation
> We have ported DepthPro of Apple to transformers for your convenience!
> illustrious-xl-v1.0 is a new illustration generation model
Β·
reacted to their post with πŸš€ 23 days ago
view post
Post
4082
Introducing VLM-R1!

GRPO has helped DeepSeek R1 to learn reasoning. Can it also help VLMs perform stronger for general computer vision tasks?

The answer is YES and it generalizes better than SFT. We trained Qwen 2.5 VL 3B on RefCOCO (a visual grounding task) and eval on RefCOCO Val and RefGTA (an OOD task).

https://github.com/om-ai-lab/VLM-R1
Β·
posted an update 23 days ago
view post
Post
4082
Introducing VLM-R1!

GRPO has helped DeepSeek R1 to learn reasoning. Can it also help VLMs perform stronger for general computer vision tasks?

The answer is YES and it generalizes better than SFT. We trained Qwen 2.5 VL 3B on RefCOCO (a visual grounding task) and eval on RefCOCO Val and RefGTA (an OOD task).

https://github.com/om-ai-lab/VLM-R1
Β·
upvoted an article about 1 month ago
view article
Article

Replicating DeepSeek R1 for Information Extraction

By Ihor β€’
β€’ 38