Fusion vs. SLERP?

#2
by sometimesanotion - opened

I'm a fan of your cleanly formatted YAML! Did you use the arcee_fusion merge prior to the SLERP, or skip it entirely?

Thanks for reaching out and for the kind words about my YAML formatting. I appreciate the feedback!

Regarding your question about whether I used the arcee_fusion merge before SLERP, I did try it out. I compared running Fusion ahead of SLERP against SLERP alone, and the direct SLERP method didn't give me the results I was hoping for.

So, before moving on to li-0.4-slerp0.1, I first used the Fusion method to blend li-0.4 with sthenny-com/misc-14b-0218. This approach worked better for me in initial tests.

If you're interested, I'd be happy to share more details about what I did and why I chose this path. I'm also curious about your thoughts and experiences with these methods. Looking forward to hearing from you!
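To give a sense of the shape of that pipeline, here's a minimal sketch in mergekit YAML. The paths, layer count, and t value are placeholders for illustration, not my actual settings.

```yaml
# Step 1: arcee_fusion blends two closely related checkpoints into one intermediate.
# Both paths below are placeholders.
merge_method: arcee_fusion
base_model: placeholder/li-0.4
models:
  - model: placeholder/misc-14b-0218
dtype: bfloat16
```

```yaml
# Step 2 (a separate config and run): SLERP the fused intermediate with another branch.
# layer_range assumes a 48-layer 14B model; t is an illustrative interpolation weight.
merge_method: slerp
base_model: ./fusion-intermediate
slices:
  - sources:
      - model: ./fusion-intermediate
        layer_range: [0, 48]
      - model: placeholder/other-branch
        layer_range: [0, 48]
parameters:
  t: 0.3
dtype: bfloat16
```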

I have published one model that was a product of three fusions and one SLERP: https://huggingface.co./sometimesanotion/Lamarck-14B-v0.7-Fusion

Fusions appear to be good at capturing interesting differences between closely related models, but SLERP offers much more flexibility in how they are merged. As such, I often use model_stock as a stable base to begin experimental branches that I touch up, apply fusions and DELLA midway, then use SLERP to merge the branches and TIES to normalize/re-instruct the result. Though I hide numbers that are hazardous to get wrong and could easily break a lot of models if badly copied, the workflow for Lamarck v0.6 shows how to use stable and experimental methods in concert and mix to taste at the end: https://huggingface.co./sometimesanotion/Lamarck-14B-v0.6
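To sketch the branch-merging step without giving away tuned values: a SLERP between a stable branch and an experimental branch can use per-filter gradients, so attention and MLP weights follow different curves across the layer stack. All of the paths, the layer count, and the t values below are illustrative only.

```yaml
# SLERP two branches; gradient lists are interpolated across the 48 layers.
merge_method: slerp
base_model: ./stable-branch            # placeholder local path
slices:
  - sources:
      - model: ./stable-branch
        layer_range: [0, 48]
      - model: ./experimental-branch   # placeholder local path
        layer_range: [0, 48]
parameters:
  t:
    - filter: self_attn
      value: [0.0, 0.3, 0.5, 0.7, 0.7]  # lean toward the experimental branch later in the stack
    - filter: mlp
      value: [0.0, 0.2, 0.4, 0.5, 0.5]
    - value: 0.4                        # default for remaining tensors
dtype: bfloat16
```

A TIES pass with the favored instruct model as base_model can then normalize/re-instruct the result; its density and weight values are exactly the kind of numbers I keep out of public configs.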

Congratulations on your outstanding results, even compared to the solid scores in the models you chose!

Thanks for your congratulations and detailed sharing! In completing this work, I’ve greatly benefited from the contributions of the open-source community—their resources and support have enabled me to continuously optimize and experiment. Your workflow using Fusion, SLERP, DELLA, and TIES sounds very interesting, especially how you balance stable and experimental methods. I’ll check out the Lamarck-14B-v0.7 and v0.6 links. Could you share your principles for choosing weights between Fusion and SLERP when merging closely related models? Looking forward to continuing our discussion!

Likewise! I've learned from the community and that's why I like to document and give credit for the milestones.

I have two leading concepts: that model_stock is a good start for further adjustments, and that 30% of a model is usually enough to gain 90% of the performance. I want to find the document about LoRA adapters and their ranks, where a rank-512 LoRA, capturing around 30% of the parameters, recovered so much of the model's performance. That tells me that merging is a matter of models leaving room for each other, and allowing the key parts of each component to express themselves distinctly.
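As a rough back-of-envelope (my arithmetic, not a quote from that document): a rank-r LoRA on a weight matrix of shape d_out × d_in stores r(d_in + d_out) parameters instead of d_in · d_out, so the fraction of that matrix it captures is about

```latex
\frac{r\,(d_{\text{in}} + d_{\text{out}})}{d_{\text{in}}\, d_{\text{out}}}
\;=\; \frac{2r}{d} \quad \text{for a square } d \times d \text{ matrix.}
```

At rank 512 on a 14B-class model that lands in the tens of percent per targeted matrix, the same ballpark as that 30% figure; the exact share depends on which matrices the adapter covers and their shapes.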

Another key is keeping the early layers as clean and stable as possible. One cleanly fine-tuned model should dominate the first 1/4th of a model, in my opinion, with variety growing midway. Clear signal there reduces hallucinations later.
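One way to express that in a config is a layer-wise weight gradient that keeps the first quarter of the stack almost entirely from a single clean instruct model. The method choice, model names, and numbers below are illustrative, not from any of my actual recipes.

```yaml
# della_linear with weight gradients: the clean instruct model owns the early
# layers, and the experimental donor only ramps in from midway.
# All paths are placeholders.
merge_method: della_linear
base_model: placeholder/clean-instruct-14b
models:
  - model: placeholder/clean-instruct-14b
    parameters:
      weight: [1.0, 1.0, 0.6, 0.5, 0.5]   # full weight through roughly the first quarter
      density: 0.7
  - model: placeholder/experimental-donor-14b
    parameters:
      weight: [0.0, 0.0, 0.4, 0.5, 0.5]   # nothing early, more from midway onward
      density: 0.5
dtype: bfloat16
```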

Those were the first things I internalized from the open source community, and while I'm working on some experiments right now, I'm glad to see how my work is boosting others' projects!

Thank you for your insightful contributions and support to the open-source community! I’m glad to see your work boosting other projects. I completely agree with the idea of using model_stock as a starting point, as well as the importance of LoRA adapters (rank 512) and early layer stability strategies. I’m also experimenting with model merging—could you share your practical experience with LoRA selection or early layer fine-tuning? I’d love to hear more from you!

Here's my suggestion to make model_stock more performant. Often, when combining highly diverged models, it'll throw IFEVAL into a blender. What you need is a minimum of re-established commonality. I like to take rank 16, 32, 64, and 128 adapters from a favored instruct model and apply them to the majority of models that have gaps there.
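Concretely, mergekit configs can apply a LoRA adapter to a model by joining the two paths with a plus sign, so a model_stock built this way looks roughly like the sketch below. Every path and rank pairing here is hypothetical, and the rank 16/32/64/128 adapters are assumed to have already been extracted from the favored instruct model against its base.

```yaml
# model_stock over divergent checkpoints, each nudged back toward shared
# instruct behavior by a low-rank adapter extracted from the favored model.
# All paths are placeholders; the ranks are varied on purpose rather than uniform.
merge_method: model_stock
base_model: placeholder/favored-instruct-14b
models:
  - model: placeholder/reasoning-finetune-14b+placeholder/instruct-lora-r32
  - model: placeholder/prose-finetune-14b+placeholder/instruct-lora-r64
  - model: placeholder/multilingual-finetune-14b+placeholder/instruct-lora-r128
  - model: placeholder/strong-ifeval-finetune-14b   # already compliant, left untouched
dtype: bfloat16
```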

Here's an example of that working: https://huggingface.co./sometimesanotion/Qwenvergence-14B-v13-Prose-DS

Thank you for your detailed insights and the resources you shared! I completely resonate with your approach of extracting LoRA adapters at ranks 16, 32, 64, and 128 from an instruction model to rebuild commonality—this is an interesting model. I’ve also faced challenges merging highly divergent models in my experiments. I’m curious—how do you determine the optimal rank for LoRA adapters in different scenarios, or what metrics guide your decisions? I believe our discussion and your contributions are making a significant impact on the community, and I’d appreciate any additional tips you might have.

Thank you! I appreciate what you're doing, too, and how much thought you're putting into crafting open source models.

At some point, none of us have all the data to be completely precise with so many billions of parameters. It goes from science to guesswork and artistry. I don't have all the analysis I should like, but a little has gone a long way.

A single LoRA from a separately diverged model is very likely to break things. In a model_stock, you have a chance to smooth it out with all the averages. I choose the LoRA rank to apply not only from how low the IFEVAL may be in the model being added to the stock, but also to mix the ranks applied; we don't want to apply all rank 64 LoRAs to all models, because that will create a sharp break in the output model. It also is a way of weighting the contribution from ancestor models. Varying the rank of LoRAs allows a smoother gradient, which a SLERP merge at the end enhances further.

While this could use more tests, I think the results speak for themselves. https://huggingface.co./sometimesanotion/Qwenvergence-14B-v13-Prose-DS has a lot of great prose models in its ancestry, but combines them in a way a lot of people clearly appreciate. I'd never have gotten that quality without adding LoRAs.

Eh, hi @wanlige. I'm very pleased that your work has produced such high-quality results. Would you be willing to engage in some exchange and collaboration with us? You can find me on Discord at @prosantos.

Thank you for your support and sharing, sometimesanotion! Your insights (including the rank selection of LoRA adapters, the application of model_stock, and SLERP enhancing smoothness) have been very beneficial to me, especially the results of the Qwenvergence-14B-v13-Prose-DS model. I completely agree with the challenges you mentioned regarding data scarcity, but we aim to transform LoRA fine-tuning from 'guesswork and art' into a more scientific approach. I will select bad cases covering multiple domains and specific LoRA tasks (such as IFEval-type tasks) for systematic testing, attempting to establish a more scientific evaluation standard through data and metrics. Additionally, sthenno ( @sthenno ), thank you for your congratulations and collaboration invitation! I’m glad to see our work benefiting the community, and I’m willing to engage further on Discord (@gewanli). I look forward to a deeper discussion!

sometimesanotion changed discussion status to closed
