anthracite-core (Anthracite Core)

grimjim

posted an update 24 days ago

Post

2086

This recent paper points to an explanation for the unreasonable effectiveness of Frankenmerges: Scaling up Test-Time Compute with Latent Reasoning: A Recurrent Depth Approach (2502.05171)

Specifically, the duplication of layers in Frankenmerges serves a purpose similar to what occurs in their recurrent-depth architecture. Successful frankenmerges that operate without additional fine-tuning are able to recover or "heal" from any damage due to abrupt transitions between layer blocks. Operational replicated layer blocks can provide functional benefits grounded in latent reasoning. Frankenmerges can also result in hybrid reasoning, by splicing together the latent reasoning of different models.

Back in April 2024, I was able to duplicate a few layers in the Llama 3 8B model, turning it into a 9B model, without harming benchmarks significantly, despite any transition damage.
grimjim/llama-3-experiment-v1-9B
My informal experimentation suggested that latent reasoning circuits could occupy continguous stacks of 2-4 layers, though the result was highly sensitive to the choice of transition location between layers.

1 reply

·

grimjim

posted an update about 1 month ago

Post

2431

I've made yet another merge of reasoning models with incremental gains on the current Open LLM leaderboard.
open-llm-leaderboard/open_llm_leaderboard

Merging in DeepSeek R1 distillation to Llama 3.1 8B (at 10% task arithmetic weight, using the Llama 3.1 8B base model as the case rather than the instruct model) with a prior best merge resulted in a slightly lower IFEval, but a higher result in every other benchmark save for MMLU-PRO, which went down only marginally. MATH Lvl5 and GPQA went up palpably.
grimjim/DeepSauerHuatuoSkywork-R1-o1-Llama-3.1-8B

This result is currently my best Llama 3.1 8B merge result to date. The actual R1 distillation itself scored quite badly, so this would seem to be another case of unexpected formatting (reflected in IFEval) hurting the evaluation results, obscuring the strength of a model.

It is also possible to use the text generation feature of this model to generate roleplay completions. Based on informal testing, this model's bias toward problem-solving will subtly impact narration.

grimjim

posted an update about 1 month ago

Post

1893

A recent merge has provided another interesting result on the current Open LLM leaderboard.
open-llm-leaderboard/open_llm_leaderboard

Combining an o1 reasoning merge with VAGOsolutions's Llama-3.1 SauerkrautLM 8B Instruct model resulted in a lower IFEval, but a higher result in every other benchmark. This result is currently my best Llama 3.1 8B merge result to date.
grimjim/SauerHuatuoSkywork-o1-Llama-3.1-8B
The results suggest that defects in output format and/or output parsing may be limiting benchmark performance of various o1 models.

Nitral-AI

posted an update about 2 months ago

Post

5259

That moment when you spend 5 days up babysitting trains, only for colab pro + to randomly disconnect the environment at every chance with 0 error indication of any kind (it just disconnects without an error). Nuke the session from the interface, but continue to eat my colab credits while it reports to wandb. 0 way of saving the models when this happens since it nukes the code preset up to auto-execute. And since the sessions 'exist' but also at the same time doesn't exist i cant close it. And have to wait till they auto timeout after 24hrs. Guess, i won't be using colab for 'quick' test trains anymore. Thanks google for scheming the very little model training budget i had for the month.

3 replies

·

grimjim

posted an update 2 months ago

Post

1680

I've arrived at an interesting result on the current Open LLM leaderboard.
open-llm-leaderboard/open_llm_leaderboard
After I narrowed down the filter of models to be between 8-9B parameters, my recent merge of o1 reasoning models achieved the highest MATH eval result of any Llama 3.x 8B model currently on the board, hitting 33.99%, placing 973/2795.
grimjim/HuatuoSkywork-o1-Llama-3.1-8B

Unfortunately, I need more information to evaluate the parent models used in the merge.
The Skywork/Skywork-o1-Open-Llama-3.1-8B model scored 0% on the MATH eval, which I suspect was due to output formatting that was baked too hard into the model, and placed 2168/2795; the merge achieved a significant uplift in every benchmark across the board.
Unfortunately, FreedomIntelligence/HuatuoGPT-o1-8B was not currently benched as of this post, so I am unable to assess relative benchmarks. Nevertheless, it is intriguing that an ostensibly medical o1 model appears to have resulted in a sizable MATH boost.

grimjim

posted an update 2 months ago

Post

2790

I'm (finally) releasing a Python script that trims excess weights in Gemma2 full-weight models that bloated by ~1B parameters due to an early mergekit bug.
https://github.com/jim-plus/Gemma2-mergekit-remediation

I'd noticed something was off when merges of Gemma2 9B models ended up having ~10B parameters. The current mergekit package is fine, but there are still bloated models on HF that could stand to be fixed.

The script assumes that it will be run from the same directory as the model weights, and will trim the unnecessary lm_head.weight tensor and corresponding index entry.

2 replies

·

grimjim

posted an update 3 months ago

Post

1428

A reminder that literal base models are valid choices for base model in task arithmetic mergers. Each Instruct or fine-tuned model then becomes a vector against the base model. Example merge formula used can be found via this model page.
grimjim/Magnolia-v3-12B

grimjim

posted an update 3 months ago

Post

1080

Speculative decoding only requires that the tokenizers for the two LLMs used line up; the model architectures do not have to be otherwise compatible. As proof of concept, I used exllamav2 to run Llama 3.2 1B Instruct (at 6bpw, for speed) as the draft model to accelerate the target model of a Llama 3 8B merge of Instruct models (at 8bpw, for accuracy). The difference between tokenizers was minor enough to allow this. With 8k context length allocated for each model, both fit in under 13GB VRAM.
https://github.com/turboderp/exllamav2
meta-llama/Llama-3.2-1B-Instruct
grimjim/llama-3-Nephilim-v3-8B

The proof-of-concept Python script compared a zero-shot creative task of writing a story limited to 500 tokens. Speculative decoding improved performance by approximately one third (e.g., increasing from 31 tokens/sec to 46 tokens/sec) over conventional decoding, and was consistent over a few runs. While not statistically significant, this implies that smaller models aimed at edge computing can serve effectively as draft models in the general case.

It is straightforward to consult literature to affirm that fine-tuning draft models can be a way of inducing behavioral change in target models, in a manner not unlike how samplers can be used to induce changes. I speculate that the impact of a fine-tuned draft model would be on part with a LoRA (Low-Rank Adaptation), as the target model retains veto power. The small size of draft model candidates means that more people can perform local full fine-tuning.

It is intuitively obvious that a distilled model can be used as a draft model for the larger teacher model so long as tokenizers line up; e.g., a distilled 8B model can draft for a 70B teacher model. Perhaps Llama-3.1-SuperNova-Lite 8B could effectively draft for the original Llama-3.1-405B-Instruct model.
arcee-ai/Llama-3.1-SuperNova-Lite
meta-llama/Llama-3.1-405B-Instruct

grimjim

posted an update 5 months ago

Post

2156

To demonstrate that it was possible, I performed a "trapezoid" gradient merge of a Llama 3 8B model onto Llama 3.1 8B Instruct, favoring the L3.1 model at the ends in order to preserve coherence and limiting the influence of the L3 model to at most 0.1 weight. Tested to 16k context length.
grimjim/Llama-Nephilim-Metamorphosis-v2-8B

grimjim

posted an update 6 months ago

Post

1976

I was reading through an abstract and found myself wondering how much LLM performance is being left on the table due to insufficient curation of training datasets: "Instruct-SkillMix: A Powerful Pipeline for LLM Instruction Tuning" by Kaur, Park, Goyal, Arora.
https://arxiv.org/abs/2408.14774
In particular, the observation that "Introducing low quality answers ("shirkers") in 20% of Instruct-SkillMix examples causes performance to plummet..." had me wondering how many ostensibly good datasets out there are in fact populated with a significant number of "shirkers".

7 replies

·

grimjim

posted an update 6 months ago

Post

3252

I found this paper to be thought-provoking: "Smaller, Weaker, Yet Better: Training LLM Reasoners via Compute-Optimal Sampling" by Bansal, Hosseini, Agarwal, Tran, and Kazemi.
https://arxiv.org/abs/2408.16737
The direct implication is that smaller models could be used to create cost-effective synthetic datasets. And on that note, in the Gemma terms of use, Google explicitly claims no rights on outputs generated from those models, which means one is free to synthgen from the Gemma line. Meta's Llama 3 licence forbids synthetic generation of outputs if used to improve other models. Relevant Mistral, Qwen, and Yi models under the Apache 2.0 license are unrestricted for this purpose.

2 replies

·

grimjim

posted an update 7 months ago

Post

1648

This merge, this time grounded in Gemma2 9B Instruct fine-tunes, is another demonstration that models without any fine-tuning to support roleplay can still perform the function, maintaining coherence and attention to context. It should be evident that no overt fine-tuning is required for roleplay in text generation; pretraining should provide models with a requisite basic understanding of the world, so all that should be needed is some corrective fine-tuning to address observed defects in portraying the world along with datasets to promote a suitably entertaining writing style. Good Instruct tuning should promote reasoning, coherence, and attention to context.
grimjim/Kitsunebi-v1-Gemma2-8k-9B
grimjim/Kitsunebi-v1-Gemma2-8k-9B-GGUF

I opted not to incorporate the UCLA SPPO fine-tune for Gemma2 9B after observing context confusion occur with some frequency during complex scenarios.

Thanks to Axcxept co., ltd. for fine-tuning HODACHI/EZO-Common-9B-gemma-2-it, and to Princeton NLP Group for fine-tuning princeton-nlp/gemma-2-9b-it-SimPO.
AXCXEPT/EZO-Common-9B-gemma-2-it
princeton-nlp/gemma-2-9b-it-SimPO

grimjim

posted an update 7 months ago

Post

2835

I've observed that the layers targeted in various abliteration notebooks (e.g., https://colab.research.google.com/drive/1VYm3hOcvCpbGiqKZb141gJwjdmmCcVpR?usp=sharing ) appear to be arbitrary, reflecting probable brute-force exploration. This doesn't need to be the case.

Taking a cue from the paper "The Unreasonable Ineffectiveness of the Deeper Layers" ( https://arxiv.org/abs/2403.17887 ) and PruneMe (https://github.com/arcee-ai/PruneMe), it seems reasonable to target deeper layers identified as more redundant given measured similarity across layers, as the result should be less damaging to models, reducing the need for subsequent fine-tuning. Intuitively, one should expect the resulting intervention layers to be deep but not final. The only uncertainty is if the redundancy successfully encodes refusals, something which is almost certainly model-dependent. This approach only requires the redundancy to be computed once per model, and the result used as a starting point for which layer range to restrict intervention to.

grimjim

posted an update 7 months ago

Post

4192

I've come across theoretical justification for my prior experimentation with extremely low-weight mergers: they amount to flattening a model so its "massive activation" features remain as significant contributors. Extremely low-weight merge weights also effectively sparsify a contributing model with regard to the base model, but in a way which still preserves relationships within the flattened latent space. In the paper "Massive Activations in Large Language Models", the authors observed "very few activations exhibit significantly larger values than others (e.g., 100,000 times larger)", which in turn implies a lower bound in effective application of extremely low weight merging.
https://arxiv.org/abs/2402.17762

1 reply

·

grimjim

posted an update 8 months ago

Post

2308

In principle, it's possible to "abliterate" refusals in any Llama 3.1 8B models via application of a LoRA, using only mergekit.

Proof of concept below:
grimjim/Llama-3.1-8B-Instruct-abliterated_via_adapter

5 replies

·

grimjim

posted an update 8 months ago

Post

2763

Intelligence is all you need for roleplay.
Roleplay is overlooked as a special case of chain-of-thought, where context must be attended to and inferred state of the world and embodied minds must be persisted and evolved along credible narrative lines. LLMs are also being tasked to function as gamemasters. It's a challenging task which points to potential future benchmarks. The fact that the largest commercial LLMs are adept in generating text for roleplay intuitively implies that model intelligence is sufficient so long as it can generalize properly and pay attention to context without becoming confused.
This recent merge of mine composed using 3 academic fine-tunes, none of which were intended for roleplay, has survived the gauntlet of a Reddit post and appears to be a particularly strong 8B model when it comes to roleplay coherence.
grimjim/llama-3-Nephilim-v3-8B (bf16 weights)
grimjim/llama-3-Nephilim-v3-8B-GGUF (select quants)

1 reply

·

grimjim

posted an update 8 months ago

Post

2276

Below we experiment with negative merger weighting (-1.0!) using task arithmetic. Merge formula on the model card and in the repo itself.

This model is steered to behave opposite to what MopeyMule demonstrated.

Based on the implications of the merge technique, we also propose Orthogonalized Vector Adaptation (OVA). We also extract a LoRA of the counter-refusal abliteration steering vector.

The resulting merger is not a perfect model, but it's a behaviorally interesting model. The model name was inspired by a Philip K. Dick story.
grimjim/Llama-3-Perky-Pat-Instruct-8B

Refusal vector weights ready for use:
grimjim/Llama-3-Instruct-abliteration-OVA-8B
grimjim/Llama-3-Instruct-abliteration-LoRA-8B

3 replies

·

grimjim

posted an update 9 months ago

Post

2689

Uploaded two basic SLERP merges of princeton-nlp/Llama-3-Instruct-8B-SimPO and UCLA-AGI/Llama-3-Instruct-8B-SPPO-Iter3, alternating the choice of base model, for people to test out and potentially use as merge fuel. (Personally, I am drawn to intelligent and attentive models, hence the experimentation.)

grimjim/Llama-3-Instruct-8B-SPPO-Iter3-SimPO-merge
grimjim/Llama-3-Instruct-8B-SimPO-SPPO-Iter3-merge

grimjim

posted an update 9 months ago

Post

2205

We explore extremely low-weight merger as an alternative to fine-tuning; e.g., weight 1e-4. Merge formula details here:
grimjim/kukulemon-v3-soul_mix-32k-7B

grimjim

posted an update 10 months ago

Post

1691

I propose "merge densification", a style of merger which attempts to transfer the benefits of a denser model to a base model. The model weight in this case is 0.02, which is atypically small for mergers, but high compared to the learning rate used during training. In this case, the expectation is more creative text-generation. More details below:
grimjim/kunoichi-lemon-royale-v3-32K-7B

Anthracite Core

AI & ML interests

anthracite-core's activity

AI & ML interests

Team members 27

anthracite-core's activity