AI & ML interests

Collection of JS libraries to interact with the Hugging Face Hub

Recent Activity

huggingfacejs's activity

merve
posted an update 1 day ago
Xenova
posted an update 7 days ago
Introducing Moonshine Web: real-time speech recognition running 100% locally in your browser!
🚀 Faster and more accurate than Whisper
🔒 Privacy-focused (no data leaves your device)
⚡️ WebGPU accelerated (w/ WASM fallback)
🔥 Powered by ONNX Runtime Web and Transformers.js

Demo: webml-community/moonshine-web
Source code: https://github.com/huggingface/transformers.js-examples/tree/main/moonshine-web
merve
posted an update 8 days ago
Aya by Cohere For AI can now see! 👀

C4AI community has built Maya 8B, a new open-source multilingual VLM built on SigLIP and Aya 8B 🌱 It works in 8 languages! 🗣️

The authors extend the Llava dataset using Aya's translation capabilities, with 558k examples!
Try it here: kkr5155/maya_demo

Dataset: maya-multimodal/pretrain

Model: maya-multimodal/maya
kudos @nahidalam and team
merve
posted an update 9 days ago
Apollo is a new family of open-source video language models by Meta, where the 3B model outperforms most 7B models and the 7B outperforms most 30B models 🧶

✨ the models come in 1.5B https://huggingface.co/Apollo-LMMs/Apollo-1_5B-t32, 3B https://huggingface.co/Apollo-LMMs/Apollo-3B-t32 and 7B https://huggingface.co/Apollo-LMMs/Apollo-7B-t32 with Apache 2.0 license, based on Qwen1.5 & Qwen2
✨ the authors also release a benchmark dataset https://huggingface.co/spaces/Apollo-LMMs/ApolloBench

The paper has a lot of experiments (they trained 84 models!) about what makes video LMs work ⏯️

Try the demo for best setup here https://huggingface.co/spaces/Apollo-LMMs/Apollo-3B
They evaluate sampling strategies, scaling laws for models and datasets, video representation and more!
> The authors find that design decisions applied to small models also scale properly when the model and dataset are scaled 📈 Scaling the dataset has diminishing returns for smaller models
> They evaluate frame sampling strategies and find that FPS sampling is better than uniform sampling, with 8-32 tokens per frame being optimal
> They also compare image encoders, trying a variety of models from shape-optimized SigLIP to DINOv2; they find google/siglip-so400m-patch14-384 to be the most powerful 🔥
> They also compare freezing different parts of the models; training all stages with some parts frozen gives the best results

They eventually release three models, where Apollo-3B outperforms most 7B models and Apollo-7B outperforms 30B models 🔥
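
To make the FPS-versus-uniform comparison in this post concrete, here is a minimal Python sketch. It is not Apollo's code: the function names `uniform_sample` and `fps_sample` and the example frame rates are assumptions made for illustration. Uniform sampling picks a fixed number of frames no matter how long the video is, while FPS sampling picks frames at a fixed rate, so the frame count grows with duration.

```python
from __future__ import annotations


def uniform_sample(total_frames: int, num_samples: int) -> list[int]:
    """Pick num_samples frame indices spread evenly over the whole video."""
    if total_frames <= 0 or num_samples <= 0:
        return []
    num_samples = min(num_samples, total_frames)
    step = total_frames / num_samples
    return [int(i * step) for i in range(num_samples)]


def fps_sample(total_frames: int, video_fps: float, target_fps: float) -> list[int]:
    """Pick frame indices at a fixed rate of roughly target_fps frames per second."""
    if total_frames <= 0 or video_fps <= 0 or target_fps <= 0:
        return []
    # Hypothetical rounding choice: keep every `stride`-th frame.
    stride = max(1, round(video_fps / target_fps))
    return list(range(0, total_frames, stride))


# A 10 s clip and a 20 s clip, both recorded at 30 FPS (made-up numbers).
short_clip, long_clip = 300, 600

# Uniform sampling always returns 4 frames, so the longer clip is covered more sparsely.
print(uniform_sample(short_clip, 4), uniform_sample(long_clip, 4))

# FPS sampling at 2 frames per second keeps the time gap between frames constant.
print(len(fps_sample(short_clip, 30.0, 2.0)), len(fps_sample(long_clip, 30.0, 2.0)))
```

With FPS sampling the spacing in time between sampled frames stays the same across clips, which is one plausible reading of why it would help; the post itself only reports the result.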
merve
posted an update 14 days ago
A complete RAG pipeline includes a reranker, which ranks the retrieved documents to find the best one 📓
The same goes for multimodal RAG: there are multimodal rerankers that we can integrate into multimodal RAG pipelines!
Learn how to build a complete multimodal RAG pipeline with vidore/colqwen2-v1.0 as retriever, lightonai/MonoQwen2-VL-v0.1 as reranker, Qwen/Qwen2-VL-7B-Instruct as VLM in this notebook that runs on a GPU as small as L4 🔥 https://huggingface.co/learn/cookbook/multimodal_rag_using_document_retrieval_and_reranker_and_vlms
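
The retrieve-then-rerank idea can be sketched in a few lines of Python. This is a toy illustration only, not the notebook's code: `retrieve_then_rerank`, `overlap_score` and `phrase_score` are hypothetical names, and the two text scorers are simple stand-ins for a real retriever and reranker such as the models named in the post.

```python
from __future__ import annotations

from typing import Callable

Scorer = Callable[[str, str], float]


def retrieve_then_rerank(
    query: str,
    documents: list[str],
    retriever_score: Scorer,
    reranker_score: Scorer,
    top_k: int = 3,
    top_n: int = 1,
) -> list[str]:
    """Keep the top_k documents by the cheap scorer, then reorder them with the costlier one."""
    candidates = sorted(documents, key=lambda d: retriever_score(query, d), reverse=True)[:top_k]
    return sorted(candidates, key=lambda d: reranker_score(query, d), reverse=True)[:top_n]


def overlap_score(query: str, doc: str) -> float:
    """Stand-in retriever: number of query words that appear in the document."""
    return float(len(set(query.lower().split()) & set(doc.lower().split())))


def phrase_score(query: str, doc: str) -> float:
    """Stand-in reranker: rewards documents that contain the query as an exact phrase."""
    return 1.0 if query.lower() in doc.lower() else 0.0


docs = [
    "the hub has limits on storage",
    "storage limits on the hub",
    "speech recognition in the browser",
]
best = retrieve_then_rerank("storage limits", docs, overlap_score, phrase_score, top_k=2)
print(best)
```

The first stage ties the two storage documents, and the second stage breaks the tie in favor of the exact phrase match. The document that survives reranking is what would be handed to the VLM as context.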
julien-c
posted an update 15 days ago
After some heated discussion 🔥, we clarify our intent re. storage limits on the Hub

TL;DR:
- public storage is free, and (unless blatant abuse) unlimited. We do ask that you consider upgrading to PRO and/or Enterprise Hub if possible
- private storage is paid above a significant free tier (1TB if you have a paid account, 100GB otherwise)

docs: https://huggingface.co/docs/hub/storage-limits

We optimize our infrastructure continuously to scale our storage for the coming years of growth in machine learning, to the benefit of the community 🔥

cc: @reach-vb @pierric @victor and the HF team
Xenova
posted an update 17 days ago
Introducing TTS WebGPU: The first ever text-to-speech web app built with WebGPU acceleration! 🔥 High-quality and natural speech generation that runs 100% locally in your browser, powered by OuteTTS and Transformers.js. 🤗 Try it out yourself!

Demo: webml-community/text-to-speech-webgpu
Source code: https://github.com/huggingface/transformers.js-examples/tree/main/text-to-speech-webgpu
Model: onnx-community/OuteTTS-0.2-500M (ONNX), OuteAI/OuteTTS-0.2-500M (PyTorch)
merve
posted an update 18 days ago
This week in open-source AI was insane 🤠 A small recap 🕺🏻 merve/dec-6-releases-67545caebe9fc4776faac0a3

Multimodal 🖼️
> Google shipped PaliGemma 2, a new iteration of PaliGemma with more sizes: 3B, 10B and 28B, with pre-trained and captioning variants
> OpenGVLab released InternVL2.5, seven new vision LMs in different sizes, with a SOTA checkpoint under MIT license ✨
> Qwen team at Alibaba released the base models of Qwen2VL with 2B, 7B and 72B ckpts

LLMs 💬
> Meta released a new iteration of Llama 70B, Llama3.3-70B, trained further
> EuroLLM-9B-Instruct is a new multilingual LLM for European languages with Apache 2.0 license 🔥
> Dataset: CohereForAI released GlobalMMLU, a multilingual version of MMLU with 42 languages with Apache 2.0 license
> Dataset: QwQ-LongCoT-130K is a new dataset to train reasoning models
> Dataset: FineWeb2 just landed with a multilinguality update! 🔥 nearly 8TB of pretraining data in many languages!

Image/Video Generation 🖼️
> Tencent released HunyuanVideo, a new photorealistic video generation model
> OminiControl is a new editing/control framework for image generation models like Flux

Audio 🔊
> Indic-Parler-TTS is a new text2speech model made by the community
merve
posted an update 19 days ago
New InternVL drop with a state-of-the-art 78B vision language model with MIT license 🔥 https://huggingface.co/collections/OpenGVLab/internvl-25-673e1019b66e2218f68d7c1c
The release comes with seven new vision LMs based on InternViT 300M/6B and Qwen2.5 (0.5B, 3B, 32B, 72B) and InternLM2 (8B, 7B, 20B) in different sizes
The 78B model combines InternViT 6B and Qwen2.5-72B Instruct and can accomplish a variety of tasks. Try it here: OpenGVLab/InternVL
merve
posted an update 24 days ago
small but mighty 🔥
you can fine-tune SmolVLM on an L4 with a batch size of 4 and it will only take 16.4 GB VRAM 🫰🏻 also, with gradient accumulation, the simulated batch size is 16 ✨
I made a notebook that includes all the goodies: QLoRA, gradient accumulation, gradient checkpointing with explanations on how they work https://github.com/huggingface/smollm/blob/main/finetuning/Smol_VLM_FT.ipynb
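
For readers new to gradient accumulation, here is a small, framework-free Python sketch of why a batch size of 4 can simulate a batch size of 16. It is not taken from the notebook: the toy model `y = w * x`, the function names, and the figure of 4 accumulation steps (16 / 4, assuming a single GPU) are assumptions for illustration.

```python
from __future__ import annotations


def mse_grad(w: float, xs: list[float], ys: list[float]) -> float:
    """Gradient of the mean squared error of the toy model y = w * x with respect to w."""
    return sum(2 * (w * x - y) * x for x, y in zip(xs, ys)) / len(xs)


def accumulated_grad(w: float, xs: list[float], ys: list[float], micro_batch_size: int) -> float:
    """Average the gradients of equally sized micro-batches before a single update.

    Assumes len(xs) is a multiple of micro_batch_size.
    """
    steps = len(xs) // micro_batch_size  # accumulation steps
    total = 0.0
    for i in range(steps):
        lo = i * micro_batch_size
        hi = lo + micro_batch_size
        # Scale each micro-batch gradient so that the sum is a mean over all steps.
        total += mse_grad(w, xs[lo:hi], ys[lo:hi]) / steps
    return total


xs = [float(i) for i in range(16)]  # 16 made-up training examples
ys = [3.0 * x for x in xs]          # targets generated with a "true" weight of 3
w = 1.0

full_batch = mse_grad(w, xs, ys)                              # one batch of 16
accumulated = accumulated_grad(w, xs, ys, micro_batch_size=4)  # 4 micro-batches of 4
print(full_batch, accumulated)
```

Only one micro-batch of 4 has to fit in memory at a time, yet the weight update matches what a batch of 16 would produce, at the cost of more forward and backward passes per update.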
merve
posted an update 24 days ago
Last week we were blessed with open-source models! A recap
merve/nov-29-releases-674ccc255a57baf97b1e2d31

🖼️ Multimodal
> At Hugging Face we released SmolVLM, a performant and efficient smol vision language model 💗
> Show Lab released ShowUI-2B: a new vision-language-action model to build GUI/web automation agents 🤖
> Rhymes AI has released the base models of Aria: Aria-Base-64K and Aria-Base-8K, named after their respective context lengths
> ViDoRe team released ColSmolVLM: a new ColPali-like retrieval model based on SmolVLM
> Dataset: Llava-CoT-o1-Instruct: a new dataset labelled using the Llava-CoT multimodal reasoning model 📖
> Dataset: LLaVA-CoT-100k, the dataset used to train Llava-CoT, released by the creators of Llava-CoT 📕

💬 LLMs
> Qwen team released QwQ-32B-Preview, a state-of-the-art open-source reasoning model that broke the internet 🔥
> Alibaba has released Marco-o1, a new open-source reasoning model 💥
> NVIDIA released Hymba 1.5B Base and Instruct, the new state-of-the-art SLMs with hybrid architecture (Mamba + transformer)

⏯️ Image/Video Generation
> Qwen2VL-Flux: a new image generation model based on the Qwen2VL image encoder, T5 and Flux for generation
> Lightricks released LTX-Video, a new DiT-based video generation model that can generate 24 FPS videos at 768x512 res ⏯️
> Dataset: Image Preferences is a new image generation preference dataset made with the DIBT community effort of Argilla 🏷️

Audio
> OuteAI released OuteTTS-0.2-500M, a new multilingual text-to-speech model based on Qwen-2.5-0.5B trained on 5B audio prompt tokens
julien-c
posted an update 26 days ago
wow 😮

INTELLECT-1 is the first collaboratively trained 10 billion parameter language model trained from scratch on 1 trillion tokens of English text and code.

PrimeIntellect/INTELLECT-1-Instruct
Xenova
posted an update 28 days ago
We just released Transformers.js v3.1 and you're not going to believe what's now possible in the browser w/ WebGPU! 🤯 Let's take a look:
🔀 Janus from Deepseek for unified multimodal understanding and generation (Text-to-Image and Image-Text-to-Text)
👁️ Qwen2-VL from Qwen for dynamic-resolution image understanding
🔢 JinaCLIP from Jina AI for general-purpose multilingual multimodal embeddings
🌋 LLaVA-OneVision from ByteDance for Image-Text-to-Text generation
🤸‍♀️ ViTPose for pose estimation
📄 MGP-STR for optical character recognition (OCR)
📈 PatchTST & PatchTSMixer for time series forecasting

That's right, everything running 100% locally in your browser (no data sent to a server)! 🔥 Huge for privacy!

Check out the release notes for more information. 👇
https://github.com/huggingface/transformers.js/releases/tag/3.1.0

Demo link (+ source code): webml-community/Janus-1.3B-WebGPU
merve
posted an update 29 days ago
The authors of ColPali trained a retrieval model based on SmolVLM 🤠 vidore/colsmolvlm-alpha
TLDR;

- ColSmolVLM performs better than ColPali and DSE-Qwen2 on all English tasks

- ColSmolVLM is more memory efficient than ColQwen2 💗
merve
posted an update 30 days ago
Small yet mighty! 💫

We are releasing SmolVLM: a new 2B small vision language model made for on-device use, fine-tunable on a consumer GPU, immensely memory efficient 🤠

We release three checkpoints under Apache 2.0: SmolVLM-Instruct, SmolVLM-Synthetic and SmolVLM-Base HuggingFaceTB/smolvlm-6740bd584b2dcbf51ecb1f39

Learn more from our blog here: huggingface.co/blog/smolvlm
This release comes with a demo, fine-tuning code, MLX integration and TRL integration for DPO
Try the demo: HuggingFaceTB/SmolVLM
Fine-tuning Recipe: https://github.com/huggingface/smollm/blob/main/finetuning/Smol_VLM_FT.ipynb
Also TRL integration for DPO 💗
merve
posted an update about 1 month ago
What a week! A recap for everything you missed ❄️
merve/nov-22-releases-673fbbcfc1c97c4f411def07
Multimodal ✨
> Mistral AI released Pixtral 124B, a gigantic open vision language model
> Llava-CoT (formerly known as Llava-o1) was released, a multimodal reproduction of the o1 model by PKU
> OpenGVLab released MMPR: a new multimodal reasoning dataset
> Jina has released Jina-CLIP-v2 0.98B multilingual multimodal embeddings
> Apple released new SotA vision encoders AIMv2

LLMs 🦙
> AllenAI dropped a huge release of models, datasets and scripts for Tülu, a family of models based on Llama 3.1 aligned with SFT, DPO and a new technique they have developed called RLVR
> Jina has released embeddings-v3: new multilingual embeddings with longer context
> Hugging Face released SmolTalk: the synthetic dataset used to align SmolLM2 using supervised fine-tuning
> Microsoft released orca-agentinstruct-1M-v1: a gigantic instruction dataset of 1M synthetic instruction pairs

Image Generation 🖼️
> Black Forest Labs released FLUX.1 Tools: four new models for different image modifications and two LoRAs to do image conditioning and better steer generations

Lastly, Hugging Face released a new library, Observers: a lightweight SDK for monitoring interactions with AI APIs and easily storing and browsing them 📚
$ pip install observers
merve
posted an update about 1 month ago
Apple released AIMv2, a family of state-of-the-art open-set vision encoders
apple/aimv2-6720fe1558d94c7805f7688c
> like CLIP, but add a decoder and train on autoregression 🤯
> 19 open models come in 300M, 600M, 1.2B, 2.7B with resolutions of 224, 336, 448
> Load and use with 🤗 transformers
merve
posted an update about 1 month ago
your hugging face profile now has your recent activities 🤗
Xenova
posted an update about 1 month ago
Have you tried out 🤗 Transformers.js v3? Here are the new features:
⚡ WebGPU support (up to 100x faster than WASM)
🔢 New quantization formats (dtypes)
🏛 120 supported architectures in total
📂 25 new example projects and templates
🤖 Over 1200 pre-converted models
🌐 Node.js (ESM + CJS), Deno, and Bun compatibility
🏡 A new home on GitHub and NPM

Get started with npm i @huggingface/transformers.

Learn more in our blog post: https://huggingface.co/blog/transformersjs-v3