vicgalle posted an update Jan 24
Can you merge models of different sizes? ⚗️

Well, yes, if the models are somewhat compatible. Here is an experiment I did. I wanted to merge two of the best-performing models: mlabonne/NeuralBeagle14-7B and jeonsworld/CarbonVillain-en-10.7B-v4

Here is my recipe:
1. Expand the layers of NeuralBeagle to 10.7B à la frankenmerge.
2. DPO-tune the previous model with a high-quality preference dataset, argilla/distilabel-intel-orca-dpo-pairs
3. Merge the previous model with CarbonVillain (needs --allow-crimes in mergekit! 🔪; a config sketch follows below)
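
For concreteness, here is a rough sketch of what steps 1 and 3 could look like as mergekit configs. The layer ranges, the slerp merge method, and the interpolation weight are assumptions following the usual SOLAR-style depth up-scaling recipe, not necessarily the exact settings behind CarbonBeagle:

```python
# Sketch of steps 1 and 3 as mergekit configs (assumed values, not the exact recipe).

# Step 1: frankenmerge (passthrough) expanding the 32-layer 7B model to 48 layers (~10.7B).
frankenmerge_config = """
slices:
  - sources:
      - model: mlabonne/NeuralBeagle14-7B
        layer_range: [0, 24]
  - sources:
      - model: mlabonne/NeuralBeagle14-7B
        layer_range: [8, 32]
merge_method: passthrough
dtype: bfloat16
"""

# Step 3: merge the DPO-tuned 10.7B model (hypothetical local path) with CarbonVillain.
# merge_method and t are assumptions; any two-model method could be used here.
merge_config = """
slices:
  - sources:
      - model: ./NeuralBeagle-10.7B-dpo
        layer_range: [0, 48]
      - model: jeonsworld/CarbonVillain-en-10.7B-v4
        layer_range: [0, 48]
merge_method: slerp
base_model: ./NeuralBeagle-10.7B-dpo
parameters:
  t: 0.5
dtype: bfloat16
"""

for name, cfg in [("frankenmerge.yaml", frankenmerge_config), ("merge.yaml", merge_config)]:
    with open(name, "w") as f:
        f.write(cfg)

# Then run mergekit on each config, e.g.:
#   mergekit-yaml frankenmerge.yaml ./NeuralBeagle-10.7B
#   mergekit-yaml merge.yaml ./CarbonBeagle-11B --allow-crimes
```

The passthrough merge simply stacks overlapping layer ranges of the same model to get the larger depth; the second config then interpolates the expanded model with CarbonVillain layer by layer.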

And here is the resulting model, CarbonBeagle-11B, which ranked at the top of the leaderboard for its size class:
vicgalle/CarbonBeagle-11B

Thank you for the insight. Are there any good tutorials or reading materials you can recommend as far as DPO-tuning?
Additionally, I'd like to ask if you by chance know anything about merging vision models. I can't find any straightforward documentation on it.

For DPO, I just use Hugging Face's trl library, which makes it very easy. Your dataset only needs to have three fields: "prompt", "chosen" (the preferred completion), and "rejected" (the negative one): https://huggingface.co./docs/trl/dpo_trainer
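
To make that concrete, here is a minimal sketch using the trl API as it looked around the time of this post (newer versions moved several of these arguments into a DPOConfig). The model path, hyperparameters, and the dataset column mapping are placeholders/assumptions; check the dataset card for the actual field names:

```python
# Minimal DPO sketch with trl + transformers (early-2024 trl API; details vary by version).
from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer, TrainingArguments
from trl import DPOTrainer

model_name = "./NeuralBeagle-10.7B"  # hypothetical path to the frankenmerged model
model = AutoModelForCausalLM.from_pretrained(model_name)
tokenizer = AutoTokenizer.from_pretrained(model_name)
tokenizer.pad_token = tokenizer.eos_token  # Mistral-style tokenizers have no pad token

# DPOTrainer expects "prompt", "chosen" and "rejected" columns.
dataset = load_dataset("argilla/distilabel-intel-orca-dpo-pairs", split="train")
# If the dataset uses different column names, map them first, e.g. (assumed name):
# dataset = dataset.rename_column("input", "prompt")

training_args = TrainingArguments(
    output_dir="./NeuralBeagle-10.7B-dpo",
    per_device_train_batch_size=1,
    gradient_accumulation_steps=8,
    learning_rate=5e-6,
    num_train_epochs=1,
)

trainer = DPOTrainer(
    model=model,
    ref_model=None,       # the trainer handles the reference model itself
    args=training_args,
    beta=0.1,             # strength of the KL penalty against the reference
    train_dataset=dataset,
    tokenizer=tokenizer,
)
trainer.train()
```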

This is a nice read on DPO and related variants: https://huggingface.co./blog/pref-tuning
And a slightly more comprehensive tutorial: https://mlabonne.github.io/blog/posts/Fine_tune_Mistral_7b_with_DPO.html

Also, trl is compatible with the peft library for LoRAs, so in my case I can DPO-tune 7B-11B models on a single 24 GB GPU.
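
For reference, a typical LoRA config for this looks roughly like the sketch below; the rank, alpha, and target modules are illustrative assumptions, not the exact values used:

```python
# Illustrative LoRA setup that keeps DPO training within a single ~24 GB GPU.
from peft import LoraConfig

peft_config = LoraConfig(
    r=16,
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    task_type="CAUSAL_LM",
)

# Pass it to the trainer instead of a full reference model:
# trainer = DPOTrainer(model, ref_model=None, peft_config=peft_config, ...)
# With a PEFT adapter, the frozen base weights serve as the implicit reference
# model, so only the adapter weights are trained and held in optimizer state.
```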

Regarding merging vision models, I have not seen any on the Hub. Mostly because the mergekit library seems to only support text language models (or at least there are no examples with vision models). Which is curious, since one of the most cited references for model merging actually experimented with vision models (ViT and CLIP): https://huggingface.co./papers/2203.05482
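
The core idea in that paper is simple enough to sketch by hand: uniformly average the weights of models fine-tuned from the same initialization ("model soups"). The checkpoint names below are placeholders; the same averaging applies to vision models like ViT or CLIP just as well as to language models:

```python
# Toy illustration of uniform weight averaging ("model soup") over two
# fine-tuned checkpoints of the same architecture (placeholder names).
import torch
from transformers import AutoModel

checkpoints = ["finetune-a", "finetune-b"]  # placeholders for real checkpoints
models = [AutoModel.from_pretrained(c) for c in checkpoints]

state_dicts = [m.state_dict() for m in models]
averaged = {
    key: torch.mean(torch.stack([sd[key].float() for sd in state_dicts]), dim=0)
    for key in state_dicts[0]
}

# Load the averaged weights back into one of the models to get the "soup".
soup = models[0]
soup.load_state_dict(averaged)
```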