THE THREAD OF DOOM

#12
by jukofyork - opened

Just realised I deleted the old "thread of doom" as it was attached to the earliest alpha version of the control vectors :(

jukofyork pinned discussion

Okay, I was wondering if we crossed some sort of line.

Anyway.. the INCREDIBLY important thing I was saying before the thread disappeared was... I have a feeling it is going to be just like they say. They are going to be liberal with grants. I suspect they will target people who are using the space outside the purpose that was intended... somewhere out there, someone has all their RAW 8k videos of their cats...

Anyway.. the INCREDIBLY important thing I was saying before the thread disappeared was... I have a feeling it is going to be just like they say. They are going to be liberal with grants. I suspect they will target people who are using the space outside the purpose that was intended... somewhere out there, someone has all their RAW 8k videos of their cats...

Yeah, it's a pity it got deleted (I should have checked more carefully what was linked), but it was getting a bit out of hand with all that scrolling so perhaps not such a bad thing.

I'm just gonna keep up the models that people have downloaded the most and get rid of all the "experimental, but likely broken" stuff with 15 downloads as they really weren't serving much of a purpose.

Also, all the old versions of the control vectors were vastly inferior to the final version due to me figuring out how to get them working as I went along, so it's probably better to just keep up the final v3.0 ones to avoid a lot of the confusion.


image.png

image.png

It looks a lot more like I'm just uploading quality models that people like/use now at least... The creative-writer-v0.1-35b and creative-writer-v0.2-35b models will be going as soon as I get the v1.0 version uploaded, and possibly Dusk-Miqu-70B if they do set a hard limit (I still think Dark-Miqu-70B is worth keeping whatever though).


Also, if anybody really misses any of the models I have uploaded, then I can in theory recreate them and upload a LoRA created from the delta using extract_lora.py, but I strongly suspect nobody will even notice most of the models have gone... Of all that I have created, I've only ever used Dark-Miqu-70B myself!

:( Damn there was some good info in that thread.

If you've still got Firefox tabs open somewhere, you'll be able to save some of the thread.

Unfortunately, I cleaned my browser tabs up about an hour ago.

And yeah, if people were using it as free cloud storage then it makes sense. I just think they could have gone about it better, rather than having us wake up and see the limit.

I'm curious, did your quota drop after deleting that? I wonder if all the PNG files attached there were "billed" to you.

@jukofyork I think you're good man. If they start enforcing it, you'll get an exemption for sure.

I come across your contributions randomly all over the place, even on github repos like some fine tuning tool lol

I should probably deduplicate my quants. Often I was making one because I could not find what I was looking for, then it would turn out a few of us just happened to be making them at the same time. Then I started getting requests. So I just decided I would make a bunch. Need a Huggingverse quant global dedupe...

There is a snapshot on the wayback machine:

http://web.archive.org/web/20241130014411/https://huggingface.co./jukofyork/creative-writing-control-vectors-BETA-v0.1/discussions/2

but it looks like the "click to expand" stuff stopped it getting backed up properly?

The mistralai/Mistral-Large-Instruct-2407 fine-tune is cooking and should be ready in around 9-10 days.

This is going to be good. Mistral-Large is very tolerant of projects like this.

@jukofyork

Control-Vector question: how much VRAM is needed to train vectors for Wizard2-8x22b? I vaguely recall in the lost thread you were using 3 x ?

Control-Vector question: how much VRAM is needed to train vectors for Wizard2-8x22b? I vaguely recall in the lost thread you were using 3 x ?

Around 5/8ths of 140GB. I could train everything up to 70B-72B using a single A6000, but the larger models needed 2x A6000.

Thanks. Ended up managing on a single 94GB H100NVL in the cloud. Looks like it just misses fitting on an 80GB card by < 1GB of VRAM.

The Hyperfitting Phenomenon: Sharpening and Stabilizing LLMs for Open-Ended Text Generation

I'm so confused now... This literally does the exact opposite of everything I thought was the key to making LLMs write better! I wish they had analysed the names like in @ChuckMcSneed's experiments!?

This seems quite an interesting metric (used in that paper):

Screenshot_20241207-094538.png

From: https://www.sltinfo.com/wp-content/uploads/2014/01/type-token-ratio.pdf

Also: Type-Token Ratios: What do they tell us?
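
For reference, a minimal sketch of the type-token ratio idea, assuming a simple whitespace tokenisation and a (hypothetical) fixed window size to control for text length:

def type_token_ratio(text: str, window: int = 100) -> float:
    """Unique word forms ("types") divided by total words ("tokens").
    A sliding window is often used so the score isn't dominated by text length."""
    words = text.lower().split()
    if len(words) <= window:
        return len(set(words)) / max(len(words), 1)
    # Mean TTR over consecutive windows (moving-average variant).
    ratios = [
        len(set(words[i:i + window])) / window
        for i in range(0, len(words) - window + 1, window)
    ]
    return sum(ratios) / len(ratios)

print(type_token_ratio("the cat sat on the mat and the dog sat too"))  # ~0.67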

The Hyperfitting Phenomenon: Sharpening and Stabilizing LLMs for Open-Ended Text Generation

I'm so confused now... This literally does the exact opposite of everything I thought was the key to making LLMs write better! I wish they had analysed the names like in @ChuckMcSneed's experiments!?

They tested repetition within a text, but not between different texts generated by the same model. The modern problem of repetition is not that it keeps writing the same slop in one gen, the problem is that when you run multiple gens, you'll get the same fucking slop.

The Hyperfitting Phenomenon: Sharpening and Stabilizing LLMs for Open-Ended Text Generation

I'm so confused now... This literally does the exact opposite of everything I thought was the key to making LLMs write better! I wish they had analysed the names like in @ChuckMcSneed's experiments!?

They tested repetition within a text, but not between different texts generated by the same model. The modern problem of repetition is not that it keeps writing the same slop in one gen, the problem is that when you run multiple gens, you'll get the same fucking slop.

Yeah, I've been thinking about this too and wonder if a really well curated dataset of "openings" (sentences, paragraphs, chapters, etc) of books/stories might help somewhat with this?

Just checked on the mistral-large fine-tune and it's nearly 1/2 way now and still looking good: at 60% of the way it will switch to a cosine schedule, so fingers crossed it stays this way:

Screenshot_20241207-115133.png

I was a little worried when I saw those big jumps in the max-norm, but it's probably just due to the weird / non-standard hyper-parameters I have to use to increase the Entropy (ie: it can't use any momentum-based optimiser or it overshoots badly, so I have to use Adam with beta1 = 0; aka uncentered RMSprop).
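
For reference, a minimal sketch of what "Adam with beta1 = 0" looks like in PyTorch (the model and learning rate below are just placeholders, not the actual fine-tune settings):

import torch

model = torch.nn.Linear(16, 16)        # placeholder model
optimizer = torch.optim.Adam(
    model.parameters(),
    lr=5e-6,                           # placeholder learning rate
    betas=(0.0, 0.999),                # beta1 = 0 -> no momentum; effectively uncentered RMSprop
)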

From previous experiments, the Entropy should start to drop slightly now and hopefully end up being approximately the same as the log-loss by the end of training...

Considering I've optimised the hyper-parameters on command-r:35b, this looks pretty hopeful that the same will work for all models.

They tested repetition within a text, but not between different texts generated by the same model. The modern problem of repetition is not that it keeps writing the same slop in one gen, the problem is that when you run multiple gens, you'll get the same fucking slop.

I no longer think this is a solvable problem with these models. Ultimately; once trained, they are stateless and have no concept of how often they've produced the same slop over the past 100,000+ inference sessions.

Even if we get the entropy higher while maintaining coherence; at scale, we'll still see new slop patterns emerging.

Even with all the GPT-isms nuked, we'll end up with new -isms. The "Whispering Woods" will become something else. Almost like you need a new model per book, or maybe a bunch of LoRA applied at different scales.

I tested this briefly by corrupting the mlp modules of a model so that it produced weird names for characters and objects (and for some reason, it also caused a temporal displacement in its general knowledge), then generated a small 3k dataset with the story prompts in the control vectors git repo. I then had gemma-2-2b read them all and list all character names. Ended up with new Elara's like "Vi'tkol" showing up in half the stories lol

Or we all need our own private tunes with Jukeofyork's bespoke technique ^

They tested repetition within a text, but not between different texts generated by the same model. The modern problem of repetition is not that it keeps writing the same slop in one gen, the problem is that when you run multiple gens, you'll get the same fucking slop.

I no longer think this is a solvable problem with these models. Ultimately; once trained, they are stateless and have no concept of how often they've produced the same slop over the past 100,000+ inference sessions.

Even if we get the entropy higher while maintaining coherence; at scale, we'll still see new slop patterns emerging.

Even with all the GPT-isms nuked, we'll end up with new -isms. The "Whispering Woods" will become something else. Almost like you need a new model per book, or maybe a bunch of LoRA applied at different scales.

I tested this briefly by corrupting the mlp modules of a model so that it produced weird names for characters and objects (and for some reason, it also caused a temporal displacement in its general knowledge), then generated a small 3k dataset with the story prompts in the control vectors git repo. I then had gemma-2-2b read them all and list all character names. Ended up with new Elara's like "Vi'tkol" showing up in half the stories lol

Or we all need our own private tunes with Jukeofyork's bespoke technique ^

I think some of this is likely a failure of the associative memory again:

I've been thinking a lot about QwQ and I'm beginning to think the "power" of the model actually comes from being able to approximate higher-order interaction effects from the words it writes.

The associative memory in the transformer architecture (and the Hopfield networks that came before) only really looks at second-order interactions (directly).

Trying to extend the transformer architecture to cubic interactions (and beyond) is totally out of the question, as second-order interactions already cost O(n^2).

You can actually approximate higher order interactions to some degree, eg:

SimpleBayesNet.svg.png

https://en.m.wikipedia.org/wiki/Bayesian_network

But it quickly blows up...

So what I think QwQ might be doing is trawling through all the "linked associations", which in turn let it look "further" away from the input context than repeated transformer blocks allow (which can likely only consider a very constrained set of links; likely following a very restrictive pattern too).


So how is this related to creative writing?

Well, at the start the model only really has what you have given it in the prompt to go off, so it will likely only have this along with some kind of low-Entropy / pre-baked "template" story (that shows up again and again and again...).

One solution then would be to try to preload the KV-cache with some sort of jumbled up "superimposition" of story prompts, to try to kick-start it away from the boring "template", but I think this will likely be fraught with the model not following your instructions and other "weird shit" due to the randomised input possibly having nothing to do with what you actually want.

So what's the (an) alternative?

Try to start by "asking around" but be very careful to not give away what you actually want to do, eg:

  • What do you know about Joe Abercrombie?
  • What do you know about Rob J Hayes?
  • What do you know about Grimdark fantasy and how is it different to epic fantasy?
  • Let's think about some original settings and character names that might work in a setting like this.
  • Let's now summarise what we have thought about so far.
  • What are we missing here? Can you list some related stuff to consider that we haven't discussed yet?

and so on..

This is exactly what QwQ is doing, but then it finishes off by writing a heap of the worst qwen-slop imaginable! :D

We need to find a way to "pre-load" this higher-order, possibly useful, possibly useless, context into some of the better models.
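
As a rough sketch, the "asking around" approach above could look something like this as a multi-turn chat (the client.chat call is purely hypothetical, and only the final turn reveals the real task):

# Ask broad, association-building questions first; reveal the real task last.
priming_turns = [
    "What do you know about Joe Abercrombie?",
    "What do you know about Rob J Hayes?",
    "What do you know about Grimdark fantasy and how is it different to epic fantasy?",
    "Let's think about some original settings and character names that might work in a setting like this.",
    "Let's now summarise what we have thought about so far.",
    "What are we missing here? Can you list some related stuff to consider that we haven't discussed yet?",
]

messages = []
for question in priming_turns:
    messages.append({"role": "user", "content": question})
    # assistant_reply = client.chat(messages)   # hypothetical API call
    # messages.append({"role": "assistant", "content": assistant_reply})

# Only now give away what we actually want:
messages.append({"role": "user", "content": "Using everything above, write the opening chapter of a Grimdark story."})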

This method actually has a name in psychology / educational theory, but I've forgotten what it is called now:

Basically the idea is to "prime" the student with something novel/interesting that gets these sort of associations working and creates "anticipation", before actually giving the task...

IIRC, it has "prime" in the name.

I have done something similar to that before back when GPT3.5 came out.
I wrote a bunch of phrases at the start, then said "Oh sorry, wrong window, what I meant to say was: "

This is exactly what QwQ is doing

I hadn't realized that, but that makes perfect sense.

be very careful to not give away what you actually want to do

Why is that?

I have done something similar to that before back when GPT3.5 came out.
I wrote a bunch of phrases at the start, then said "Oh sorry, wrong window, what I meant to say was: "

This is exactly what QwQ is doing

I hadn't realized that, but that makes perfect sense.

be very careful to not give away what you actually want to do

Why is that?

It's a bit like the "don't think of an elephant" thing: if I start off telling you that we're ultimately gonna be writing "a Grimdark story in the style of..." then all the distant associations you know about are unlikely to be used effectively as you've "framed" the problem for them.

From a human perspective, I think it also likely triggers the "reward centres" more due to a mix of "anticipation" and the "satisfaction" of problem solving.

I don't know anything about psychology (at all) so may be using the wrong terminology; it's just that 20+ years ago I worked as a private maths teacher who had to deal with kids excluded from school, and often those who had failed to get anywhere with other private teachers too! Needless to say, I read a lot about educational theory in those years and even managed to get some of them to pass exams that nobody would have thought possible... :/

https://en.m.wikipedia.org/wiki/Priming_(psychology)

I think it is actually just called "priming" but sadly wokeism seems to have corrupted the Wikipedia article:

Priming is thought to play a large part in the systems of stereotyping.


https://www.teachwithmrst.com/post/priming

this is another example of priming, which is an increased sensitivity to a particular schema due to a recent experience. In other words, priming is when an experience or exposure to a stimulus puts a particular schema at the forefront of our mind. When this in turn influences our judgments and decisions, it's called the priming effect.

I no longer think this is a solvable problem with these models. Ultimately; once trained, they are stateless and have no concept of how often they've produced the same slop over the past 100,000+ inference sessions.

Even if we get the entropy higher while maintaining coherence; at scale, we'll still see new slop patterns emerging.

Even with all the GPT-isms nuked, we'll end up with new -isms. The "Whispering Woods" will become something else. Almost like you need a new model per book, or maybe a bunch of LoRA applied at different scales.

I tested this briefly by corrupting the mlp modules of a model so that it produced weird names for characters and objects (and for some reason, it also caused a temporal displacement in its general knowledge), then generated a small 3k dataset with the story prompts in the control vectors git repo. I then had gemma-2-2b read them all and list all character names. Ended up with new Elara's like "Vi'tkol" showing up in half the stories lol

Or we all need our own private tunes with Jukeofyork's bespoke technique ^

Have you tried it with base models? Take the good old llama1 or falcon-180b and see if it makes slop or not. The problem is instruction tuning.

I no longer think this is a solvable problem with these models. Ultimately; once trained, they are stateless and have no concept of how often they've produced the same slop over the past 100,000+ inference sessions.

Even if we get the entropy higher while maintaining coherence; at scale, we'll still see new slop patterns emerging.

Even with all the GPT-isms nuked, we'll end up with new -isms. The "Whispering Woods" will become something else. Almost like you need a new model per book, or maybe a bunch of LoRA applied at different scales.

I tested this briefly by corrupting the mlp modules of a model so that it produced weird names for characters and objects (and for some reason, it also caused a temporal displacement in its general knowledge), then generated a small 3k dataset with the story prompts in the control vectors git repo. I then had gemma-2-2b read them all and list all character names. Ended up with new Elara's like "Vi'tkol" showing up in half the stories lol

Or we all need our own private tunes with Jukeofyork's bespoke technique ^

Have you tried it with base models? Take the good old llama1 or falcon-180b and see if it makes slop or not. The problem is instruction tuning.

Interestingly, this paper (which sadly got lost when I deleted the old thread :/) shows that base models start off well:

https://openreview.net/forum?id=ZpQ2SqQNXf

but then start to gain way too much entropy as the sequence length increases:

Screenshot_20241207-181943.png

It almost looks like if we could do "late fusion" on the two sets of outputs we would have something close to human generation?!

When my machines finally finish training, then I think I might be able to hack together something that tests this...

I think it will need some heuristics adding to let the instruct model decide when to stop, but otherwise it's just a case of blending the probability outputs before deciding which token to accept.

(I've already experimented lots with merging base/instruct models and/or making MoE models with the gating weights all set to zero, and both are "interesting", but sadly they never stop and quickly go completely off the rails by talking to themselves, etc).
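
As a minimal sketch (not the actual implementation), the token-level "late fusion" idea above could look something like this, assuming both models share a tokeniser and the stopping heuristic is handled elsewhere:

import torch
import torch.nn.functional as F

def blend_next_token(base_logits, instruct_logits, alpha=0.5, temperature=1.0):
    """Mix the base and instruct next-token distributions before sampling.
    alpha=1.0 would be pure base model, alpha=0.0 pure instruct model."""
    base_probs = F.softmax(base_logits / temperature, dim=-1)
    instruct_probs = F.softmax(instruct_logits / temperature, dim=-1)
    mixed = alpha * base_probs + (1.0 - alpha) * instruct_probs
    return torch.multinomial(mixed, num_samples=1)   # sample from the blended distribution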

Interestingly, this paper (which sadly got lost when I deleted the old thread :/) shows that base models start off well:

You've still got it though right (you linked to it).

I've got a copy which I used to build a tool to replicate the graphs in the paper.

Have you tried it with base models?

Not really, even with few-shot prompting, couldn't get them to reliably produce synthetic data.

Take the good old llama1 or falcon-180b and see if it makes slop or not. The problem is instruction tuning.

Okay that was a beast to get running. It doesn't seem to produce gpt-isms, but I notice it re-uses the same names a lot (not Elara, but its own names).

That's what I mean, I think all of these models; once they've been (pre)trained and become stateless weights, will either have their own flavor of slop, or produce noise. Kind of like how we have our own patterns of speech, etc.

P.S. I see they've given us more storage now on HF, and it looks like public repos are free

image.png

So I've been reading up on the "Softmax Bottleneck":

https://arxiv.org/abs/1711.03953

which likely affects all LLMs to some degree (due to having n_vocab >> hidden_dim), but likely affects small LLMs the most:

https://arxiv.org/abs/2404.07647

(possibly one of the reasons Cohere and Mistral-Large, with their 12k hidden_dim, outperform the standard 8k hidden_dim of the 70B models for writing too?)

The "Mixture of Softmax" solution isn't very appealing as the lm_head tensors are already huge...

Another solution people have experimented with is passing the logits through a non-linear function:

https://arxiv.org/abs/1805.10829
https://arxiv.org/abs/1902.08077

Then it occurred to me that we already have an example of a range of models that do this, which are also quite good at creative writing and appear to "punch above their weight": gemma2, with their "logit soft capping":

https://arxiv.org/abs/2408.00118

Screenshot_20241211-124332.png

which originally came from this paper:

https://arxiv.org/abs/1611.09940 (Section 5.1, 'RL pretraining')

Interestingly, the "Sigsoftmax" paper above experimented with using the binary sigmoid function:

Screenshot_20241211-124447.png

and found it worked better than their function (which is a sort of "soft leaky ReLU") for one of the tests, but concluded capping at 1 was likely problematic...

But the gemma2 models use +/- 30 for their cap:

  "final_logit_softcapping": 30.0,

which when passed through exp(), is well outside the range of floating point values anyway...

So I wonder if the benefit of gemma2's "final logit softcapping" is actually nothing to do with clipping/capping; and simply because it solves the "Softmax Bottleneck" problem to some degree due to the non-linearity it introduces?!
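
For reference, gemma2's final logit soft capping is just a tanh() squashing of the logits, something like:

import torch

def soft_cap_logits(logits: torch.Tensor, cap: float = 30.0) -> torch.Tensor:
    """Gemma-2 style soft capping: smoothly squash logits into (-cap, +cap).
    The point above is that the tanh() non-linearity itself (rather than the
    capping) may be what helps with the softmax bottleneck."""
    return cap * torch.tanh(logits / cap)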

P.S. I see they've given us more storage now on HF, and it looks like public repos are free

image.png

Yeah, I saw that posted on Reddit too. I'm 1 day away from the mistral-large fine tune being ready:

Screenshot_20241211-130018.png

So at least I won't have to delete anything to upload it (I am gonna clear out the 2 remaining 35B "experimental" models when it gets uploaded though).

Pretty excited to see what it is like as 9 days has felt like a long time lol.

I've decided the next will be command-r-plus:104b (old version) and then after that qwen-1.5:110b.

I can't see any compelling reason to run on the new version of command-r-plus:104b or mistral-large:123b as for creative writing; they both seem like a downgrade...

I've decided the next will be command-r-plus:104b (old version) and then after that qwen-1.5:110b.

I can't see any compelling reason to run on the new version of command-r-plus:104b or mistral-large:123b as for creative writing; they both seem like a downgrade...

Enough slop in the new releases to keep the pigs happy...

I've decided the next will be command-r-plus:104b (old version) and then after that qwen-1.5:110b.

I can't see any compelling reason to run on the new version of command-r-plus:104b or mistral-large:123b as for creative writing; they both seem like a downgrade...

Enough slop in the new releases to keep the pigs happy...

Yeah, and I think some of the newer models are starting to filter out copyrighted data so they aren't gonna work well even if the slop can be reduced :/

I think qwen-1.5:110b is worth trying, as even though it was in the v1.5 line it came out way after the others, and does seem to not have been "benchmaxxed" as badly as the v2.0 and v2.5 models.

The older v1.5 models also didn't have great long context ability:

Screenshot_20241211-144115.png

https://github.com/NVIDIA/RULER

but I have a feeling qwen-1.5:110b was actually more like qwen-2:110b but just named as v1.5...

Before all the gemma2:9b clones took over, it scored fairly high on EQ-Bench:

http://eqbench.com/creative_writing.html

and did appear to do well in the sample "write in the style of" prompts they used to test it (meaning it's unlikely to have had the copyrighted data filtered out).

It also appears to be quite intelligent and actually scored higher than the commercial models when acting as a judge in this paper:

https://arxiv.org/abs/2406.08598v2

I think it will be interesting to see how it turns out anyway.

This paper makes me think merging might be back on the cards too:

https://arxiv.org/abs/2412.06769

and I noticed all the top places in the open-llm-leaderboard:

https://huggingface.co./spaces/open-llm-leaderboard/open_llm_leaderboard

appear to be using versions of qwen2:72b and qwen2.5:72b with around 6 layers self-merged (the authors are very cagey about saying exactly what the method is though...).

I wonder if command-r-plus with the middle 16 (or 24) layers duplicated (taking it up to 80 or 88 layers respectively), might be a worthwhile experiment?

I'm pretty sure the "multiplicative-LoRA" method is ideally suited to fixing a lot of the old weirdness caused by merging, and these middle layers are clearly related to concepts as they were the most important for the control vectors...

The discussion in this thread:

https://huggingface.co./MaziyarPanahi/calme-2.4-rys-78b/discussions/10

Is what makes me believe the "secret sauce" is really just a self-merge...

I also confirmed myself that the miqu:120b self-merge, although slightly broken; was more capable of solving puzzles...

If we can make command-r-plus just a little smarter, then it would be a big win IMO and only take the size up to around the same as mistral-large:123b and still less than wizard-lm-2:140b.

IIRC, @llmixer did some experiments and found deeper models generally wrote better (and he wasn't keen on command-r-plus:104b due to it only having 64 layers compared to the more standard 80 layers of the 70b models? Apologies if it wasn't you!).

@TheDrummer tried making largestral smaller by cutting out "unimportant layers", but it didn't go too well imo. While the vanilla knew all 8 of the styles, the cut down version almost completely forgot one and got worse at writing poems:
image.png

IIRC, @llmixer did some experiments and found deeper models generally wrote better (and he wasn't keen on command-r-plus:104b due to it only having 64 layers compared to the more standard 80 layers of the 70b models? Apologies if it wasn't you!).

image.png
Self-merges wrote better on my tests too.

I also confirmed myself that the miqu:120b self-merge, although slightly broken; was more capable of solving puzzles...

If we can make command-r-plus just a little smarter, then it would be a big win IMO and only take the size up to around the same as mistral-large:123b and still less than wizard-lm-2:140b.

IIRC, @llmixer did some experiments and found deeper models generally wrote better (and he wasn't keen on command-r-plus:104b due to it only having 64 layers compared to the more standard 80 layers of the 70b models? Apologies if it wasn't you!).

I for one would love smarter command r plus. Still one of my favorite writers but its continuity leaves something to be desired

I've decided the next will be command-r-plus:104b (old version) and then after that qwen-1.5:110b.

I can't see any compelling reason to run on the new version of command-r-plus:104b or mistral-large:123b as for creative writing; they both seem like a downgrade...

Enough slop in the new releases to keep the pigs happy...

image.png
Even pigs aren't happy with the new one.

Even pigs aren't happy with the new one.

Because it's worse for non-creative tasks. Its general knowledge is worse than 2407's (same as command-r-plus-08), even though 2411 appears to have the same knowledge cutoff as 2407.

I'm not sure they're trying to remove copyrighted data though; I suspect it's teething issues, as it's the first time Mistral have tried adding a proper system prompt to their template.

I also confirmed myself that the miqu:120b self-merge, although slightly broken; was more capable of solving puzzles...

Was this the one which had random spelling/grammatical errors? I wonder if that could be healed with a very light finetune. I've successfully taught a model I broke how to speak again with a quick r=16,a=32 tune on the mlp modules, using a dataset generated by the original model.

Is what makes me believe the "secret sauce" is really just a self-merge...

Could Vizdiff help you investigate this? https://huggingface.co./spaces/Steelskull/Vis_Diff

I am gonna clear out the 2 remaining 35B "experimental" models when it gets uploaded though

If you just want to tidy up, sure. But public models don't count towards the quota.

Took a snapshot of https://archive.is/M8Tr2 to avoid link-rot.

@jukofyork P.S. Since llama.cpp server has on-the-fly LoRA swapping and scaling (like control-vector-scaled) in the latest version, and Mistral-Large is huge to store locally, I don't suppose you could upload the LoRA adapter of your Mistral-Large as well, like rAIfle did with rAIfle/SorcererLM-8x22b-epoch2-LoRA?

Is what makes me believe the "secret sauce" is really just a self-merge...

Could Vizdiff help you investigate this? https://huggingface.co./spaces/Steelskull/Vis_Diff

Thanks, I'll have a look at this and see if I can spot what they did.

I am gonna clear out the 2 remaining 35B "experimental" models when it gets uploaded though

If you just want to tidy up, sure. But public models don't count towards the quota.

Yeah, I'm just trying to avoid a lot of the confusion and only have "good" models uploaded.

Took a snapshot of https://archive.is/M8Tr2 to avoid link-rot.

@jukofyork P.S. Since llama.cpp server has on-the-fly LoRA swapping and scaling (like control-vector-scaled) in the latest version, and Mistral-Large is huge to store locally, I don't suppose you could upload the LoRA adapter of your Mistral-Large as well, like rAIfle did with rAIfle/SorcererLM-8x22b-epoch2-LoRA?

The problem is that it's a Multiplicative-LoRA so the standard Additive-LoRA code won't work, and even a very high rank SVD still can't capture the full Multiplicative-LoRA :/

I could possibly save just the down_proj tensors using the modules_to_save option, but sadly it won't work with most stuff and I probably am best just uploading the full model.

@ChuckMcSneed Could you check out Endurance v1 & v1.1 to see if finetuning healed it to an extent?

@TheDrummer Will do.

@ChuckMcSneed Could you check out Endurance v1 & v1.1 to see if finetuning healed it to an extent?

Great, who left the door open again?!?! ;D

d9f8c62743d7ac0ca6d1f2709b58bec0.jpg
FLUX thinks demons gotta be phasing through doors to close them.

The problem is that it's a Multiplicative-LoRA so the standard Additive-LoRA code won't work, and even a very high rank SVD still can't capture the full Multiplicative-LoRA :/

All good, this is a special case then. I've cleared up space by deleting the new Mistral-Large, command-r+, and other models I don't need.

Looking forward to trying it out!

Bad news guys :(

It seems to have corrupted itself and tried to do an extra step (???) at the end:

GPU-SERVER-1: before GAS splitting, batch size: 10, total tokens: 81920
GPU-SERVER-1: [2024-12-12 14:38:52,276] [INFO] [logging.py:129:log_dist] [Rank 0] step=1159, skipped=0, lr=[0.0], mom=[0.0]
GPU-SERVER-1: [2024-12-12 14:38:52.456] [INFO] [qlora-pipe] step:  1159 /  1159 loss: 1.5680 iter time (s): 622.448 samples/sec: 0.048 eta: 46m41s 
GPU-SERVER-1: before GAS splitting, batch size: 10, total tokens: 81920
GPU-SERVER-1: [2024-12-12 14:49:11,957] [INFO] [logging.py:129:log_dist] [Rank 0] step=1160, skipped=0, lr=[1.1460462221279944e-09], mom=[0.0]
GPU-SERVER-1: [2024-12-12 14:49:12.019] [INFO] [qlora-pipe] step:  1160 /  1159 loss: 8.7767 iter time (s): 618.958 samples/sec: 0.048 eta: 36m18s 

and then crashed....

I tried quantizing this and can confirm it's completely broken (as the loss: 8.7767 indicates).

Even worse is I tried to go back to the step: 1100 snapshot and it turns out two of the ranks have been saving 2 copies (???) at the same time:

GPU-SERVER-1: [2024-12-12 04:26:28,592] [INFO] [torch_checkpoint_engine.py:33:commit] [Torch] Checkpoint global_step1100 is ready now!
GPU-SERVER-2: [2024-12-12 04:26:28,598] [INFO] [torch_checkpoint_engine.py:21:save] [Torch] Saving /mnt/WORK-PC/creativewriter__finetune/20241203_15-29-49/global_step1100/layer_46-model_states.pt...
GPU-SERVER-1: [2024-12-12 04:26:28,602] [INFO] [torch_checkpoint_engine.py:21:save] [Torch] Saving /mnt/WORK-PC/creativewriter__finetune/20241203_15-29-49/global_step1100/layer_02-model_states.pt...
GPU-SERVER-1: [2024-12-12 04:26:28,841] [INFO] [torch_checkpoint_engine.py:23:save] [Torch] Saved /mnt/WORK-PC/creativewriter__finetune/20241203_15-29-49/global_step1100/layer_02-model_states.pt.
GPU-SERVER-1: [2024-12-12 04:26:28,854] [INFO] [torch_checkpoint_engine.py:21:save] [Torch] Saving /mnt/WORK-PC/creativewriter__finetune/20241203_15-29-49/global_step1100/layer_03-model_states.pt...
GPU-SERVER-2: [2024-12-12 04:26:28,869] [INFO] [torch_checkpoint_engine.py:23:save] [Torch] Saved /mnt/WORK-PC/creativewriter__finetune/20241203_15-29-49/global_step1100/layer_46-model_states.pt.
GPU-SERVER-2: [2024-12-12 04:26:28,881] [INFO] [torch_checkpoint_engine.py:21:save] [Torch] Saving /mnt/WORK-PC/creativewriter__finetune/20241203_15-29-49/global_step1100/layer_47-model_states.pt...
GPU-SERVER-1: [2024-12-12 04:26:29,083] [INFO] [torch_checkpoint_engine.py:23:save] [Torch] Saved /mnt/WORK-PC/creativewriter__finetune/20241203_15-29-49/global_step1100/layer_03-model_states.pt.

so these all seems messed up too :(

I will have to power-cycle all the machines and/or try to investigate what caused this when I get back home, but not much point in redoing it or trying other models until then.

Looks possibly like GPU-SERVER-2 has a broken SSD :(

Meh.

Shit.

Before you reboot:

1160 / 1159 loss: 8.7767
step: 1100

Do you have a much earlier step like 500? If this sync issue is somehow related to the dead SSD it might have been okay earlier on so it's not all lost at least

broken SSD

Again, before you reboot, it's worth asking Claude/o1 if there's a way to get the data. Years ago I nuked an SSD and I forget what I did, but managed to get something back which was still loaded. Depends on the filesystem though (Claude/o1 would know).

I don't suppose you had something like wandb logging your checkpoints?

Shit.

Before you reboot:

1160 / 1159 loss: 8.7767
step: 1100

Do you have a much earlier step like 500? If this sync issue is somehow related to the dead SSD it might have been okay earlier on so it's not all lost at least

broken SSD

Again, before you reboot, it's worth asking Claude/o1 if there's a way to get the data. Years ago I nuked an SSD and I forget what I did, but managed to get something back which was still loaded . Depends on the filesystem though (claude/o1 would know)

I don't suppose you had something like wandb logging your checkpoints?

I think the SSD errors were a red herring and there actually was something wrong with mixing pipeline parallel and batch parallel at the same time.

It seems both rank 0 and rank 1 had been saving over the top of each other the whole run and I never noticed :/

I'm just gonna run on the 30B-ish models which don't use pipeline parallel whilst away and see how they get on... If they are fucked too then something more serious must have gone wrong as I did manage to train endless command-r:35b fine tunes before.

I've also reverted a lot of the fiddling about I did and made a fresh pull of qlora-pipe in case...

If I can't mix pipeline parallel and batch parallel then it's not the end of the world, as I can just run the training 3x and combine all the LoRA using the mean or even SVD (but sadly 9 days --> 27 days).

This might even be the better option, as the ratio of samples to tunable parameters for the large models is gonna be pretty bad anyway, and this would help with overfitting.

Oof sorry 😞

So I've been hunting through the qlora-pipe code to see if I could find where the "extra step" came from (which I think actually ended up with a negative learning rate and hence performed gradient ascent and ruined the model at the end). I didn't manage to find the answer, but I have found a way better method to create the training data, eg:

  1. Extract all paragraphs that are between 200 and 2000 characters (which is ~40-400 words or ~50-500 tokens). This gets rid of all the "dross" like tables of contents, page numbers, etc and leaves just nice clean paragraphs.
  2. So now we're left with ~1.1M paragraphs and for each of these, we trim any trailing whitespace and add two new lines (to be consistent with how most LLMs output paragraphs) and then append an <EOS> token.
  3. Randomly shuffle all the 1.1M paragraph + "\n\n" + <EOS> chunks and concatenate them to use as training data.

For example, for Cohere models:

Not unkindly, Mr. Nell told him, "There's two parts to the system. One part carries solid human waste--shit, if I'd not be offendin yer tender ears. The other part carries gray water--water flushed from toilets or run down the drains from sinks and washin-machines and showers; it's also the water that runs down the gutters into the city drains.

<|END_OF_TURN_TOKEN|>The aluminum sled on which Norah was transporting her testing gear resembled an oversized Flexible Flyer. The craft was prepacked with diagnostic gear and safety accessories she'd been using on the glacier over the past few days. All of her gear--including a battery pack, safety flares, and a powerful front-mounted spotlight--was bound under a secured, plastic tarp. Despite the heavy load, the sled glided effortlessly on long, straight runners. Even on the almost imperceptible incline, the sled moved downhill on its own accord, and Norah applied a gentle restraint, almost as if allowing the sled to lead the way. Sensing the distance growing between the group and the habisphere, Tolland looked over his shoulder. Only fifty yards away, the pale curvature of the dome had all but disappeared in the blustery blackness.

<|END_OF_TURN_TOKEN|>He packed a compartmentalized, hand-tooled Mark Cross briefcase with the blue bag, the Green Acres bag, and the tape recorder that he used for dictation. While he waited for the Keanuphobe to call, he would do some game planning and compose a chapter *of Fear Not for l Am with You.*

<|END_OF_TURN_TOKEN|>Well, the word was out. Cancer. Rhymes with *dancer* and You *just shit your pants, sir.* God knew the word had bobbed up in his own mind more than once since getting on the penny scale in front of the shoe store. It had bobbed up like some evil clown's dirty balloon and he had turned away from it. He had turned away from it the way you turned away from the bag ladies who sat rocking back and forth in their strange, sooty little nooks outside the Grand Central Station or the way you turned away from the capering Gypsy children who had come with the rest of the Gypsy band. The Gypsy children sang in voices that somehow managed to be both monotonous and strangely sweet at the same time. The Gypsy children walked on their hands with tambourines outstretched, held somehow by their bare dirty toes. The Gypsy children juggled. The Gypsy children put the local Frisbee jocks to shame by spinning two, sometimes three of the plastic disks at the same time - on fingers, on thumbs, sometimes on noses. They laughed while they did all those things, and they all seemed to have skin diseases or crossed eyes or harelips. When you suddenly found such a weird combination of agility and ugliness thrust in front of you, what else was there to do but turn away? Bag ladies, Gypsy children, and cancer. Even the skittery run of his thoughts frightened him.

<|END_OF_TURN_TOKEN|>

(sadly all the work of extracting, shuffling, formatting, etc is done using bash scripts as python was so slow it kept timing out the Deepspeed connection...)
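
For illustration, a rough Python equivalent of the extraction/shuffling steps above (file paths and the <EOS> string are placeholders; the real pipeline used bash for speed, as noted):

import glob
import random

eos = "<|END_OF_TURN_TOKEN|>"   # e.g. for Cohere models
chunks = []

for path in glob.glob("books_txt/*.txt"):           # hypothetical input directory
    with open(path, encoding="utf-8") as f:
        for paragraph in f.read().split("\n\n"):
            paragraph = paragraph.rstrip()
            # Step 1: keep only "clean" paragraphs of 200-2000 characters.
            if 200 <= len(paragraph) <= 2000:
                # Step 2: normalise the ending to "\n\n" and append the <EOS> string.
                chunks.append(paragraph + "\n\n" + eos)

# Step 3: randomly shuffle and concatenate into one big training text.
random.shuffle(chunks)
with open("train.txt", "w", encoding="utf-8") as f:
    f.write("".join(chunks))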

We then load in the new dataset files and create batches using this modified version of yield_sequences_from_token_batch:

from tqdm import tqdm  # progress bar used below

def yield_sequences_from_token_batch(tokenizer, token_batch, sequence_len):
    """Yields fixed-length sequences from batches of tokens, ensuring proper BOS/EOS token handling.
    
    Takes batches of tokens and yields sequences of fixed length, with each sequence:
    - Starting with BOS token if specified in tokeniser
    - Containing complete chunks terminated by EOS tokens (never splitting between EOS tokens)
    - Right-padded with extra EOS tokens if needed so all reach exactly sequence_len
    """
    sequence_tokens = [] if tokenizer.bos_token_id is None else [tokenizer.bos_token_id]
    for tokens in tqdm(token_batch):
        tokens = tokens.tolist()
        assert len(tokens) > 0, "empty token list"
        assert tokens[-1] == tokenizer.eos_token_id, "token lists must end with EOS"

        idx = 0
        # If present, skip the auto-generated BOS token
        if tokenizer.bos_token_id is not None and tokens[0] == tokenizer.bos_token_id:
            idx += 1

        while idx < len(tokens):          
            next_eos_idx = tokens.index(tokenizer.eos_token_id, idx)
            chunk = tokens[idx:next_eos_idx + 1]
            assert len(chunk) <= sequence_len, "chunk exceeds sequence length"
 
            if len(sequence_tokens) + len(chunk) > sequence_len:
                sequence_tokens.extend([tokenizer.eos_token_id] * (sequence_len - len(sequence_tokens)))
                yield sequence_tokens
                sequence_tokens = [] if tokenizer.bos_token_id is None else [tokenizer.bos_token_id]

            sequence_tokens.extend(chunk)
            idx += len(chunk)

    if len(sequence_tokens) >= sequence_len / 2:
        sequence_tokens.extend([tokenizer.eos_token_id] * (sequence_len - len(sequence_tokens)))
        yield sequence_tokens

Which then gets called like this:

    dataset = dataset.map(lambda x: tokenizer(x['text']), batched=True, batch_size=10, remove_columns=dataset.column_names, desc='tokenizing', num_proc=num_proc)
    dataset = dataset.map(lambda x: {'input_ids': list(yield_sequences_from_token_batch(tokenizer, x['input_ids'], sequence_len))}, batched=True, batch_size=None, remove_columns=dataset.column_names, desc='splitting')
    # Set labels for EOS tokens -100 to exclude them from training gradient calculations
    dataset = dataset.map(
        lambda x: {
            'attention_mask': torch.ones_like(x['input_ids']),
            'labels': torch.where(x['input_ids'] == tokenizer.eos_token_id, torch.full_like(x['input_ids'], -100), x['input_ids'])
        },
        desc='adding attention_mask and labels (with EOS labels set to -100)'
    )

to ensure the <EOS> tokens are attended to, but not used for gradient calculations (which would bias the response lengths of the fine-tuned model).

This also means I can right-pad all the batches up to the desired sequence length using <EOS> tokens.


Transformers only has a 1D attention_mask so I can't do proper sample packing without using this:

https://github.com/MeetKai/functionary/tree/main/functionary/train/packing

BUT: I'm not convinced this is actually beneficial, as during pre-training the LLMs were trained on data that looks just like what I am giving them, eg:

<BOS> sample text 1 <EOS> sample text 2 <EOS>...

and the interference might actually be beneficial and force the fine-tune to concentrate better on each example amid the surrounding "noise".


So now we have a dataset format that is sequence-length agnostic (eg: large clever models won't get hugely lower/different losses) and that no longer biases the response length to be shorter or longer (due to masking the <EOS> labels for gradient calculations).

We also have much higher entropy training data due to the randomised paragraphs being looked at in isolation (eg: things like names are only high-entropy when you first encounter them; after seeing the name(s) at the start of a story they become low-entropy for the remainder of the sequence...).

BUT: The most exciting possibility is to add some contextual text before each paragraph (or group of paragraphs if it turns out to be needed), such as: the author's name, book title, genre and so on, which can then be masked in the same way as the <EOS> tokens (in a similar way to instruction tuning "prompt-masking" method). So the model should then be able to learn the association between the contextual meta-data and the style of writing!!!
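
As a hedged sketch of that metadata idea (the field names, prefix format and tokenizer variable are just illustrative), the prefix tokens would get labels of -100 exactly like the <EOS> tokens:

def build_example(tokenizer, paragraph, author, genre):
    """Prepend contextual metadata, then mask it out of the loss (-100) while
    still letting the model attend to it."""
    prefix = f"Author: {author} | Genre: {genre}\n\n"
    prefix_ids = tokenizer(prefix, add_special_tokens=False).input_ids
    para_ids = tokenizer(paragraph + "\n\n", add_special_tokens=False).input_ids
    eos = [tokenizer.eos_token_id]

    input_ids = prefix_ids + para_ids + eos
    labels = [-100] * len(prefix_ids) + para_ids + [-100]   # mask prefix and <EOS>
    return {"input_ids": input_ids, "labels": labels}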

For the time being I am just going back to using stock cross-entropy loss (ie: no attempt to increase the entropy of the outputs), and just using the 1.1M randomised paragraphs as outlined above to hopefully get something much closer to the "multiplicative control-vectors" that I set out to create right at the start, but the possibilities this new dataset method opens up are huge IMO.

Another benefit of this is that it trains in about 1/2 the time as before, partly due to removing the 40% of the "dross" from the old books files converted to text, but also because I can now increase the batch size right up to the GPU memory limit and not worry that large/smart models with long context can just memorise everything easily; all models should now face the same prediction task, with a similar starting loss regardless of the batch size or their native context length.

I look forward to seeing the result!

So to make sure I understand, you're essentially doing the equivalent of the "train on completions" prompt-masking that unsloth supports, but since there's no instruction prompt, you're only masking the <EOS> tokens:

https://colab.research.google.com/drive/1T5-zKWM_5OD21QHwXHiV9ixTRR7k3iB9?usp=sharing#scrollTo=vITh0KVJ10qX

space = tokenizer(" ", add_special_tokens = False).input_ids[0]
tokenizer.decode([space if x == -100 else x for x in trainer.train_dataset[5]["labels"]])

Extract all paragraphs that are between 200 and 2000 characters

I like this idea, that's actually a really simple way to get rid of the junk.

Randomly shuffle all the 1.1M paragraph + "\n\n" + <EOS> chunks and concatenate them to use as training data.

So this would also teach the model to end every turn with \n\n

things like names are only high-entropy when you first encounter them; after seeing the name(s) at the start of a story they become low-entropy for the remainder of the sequence...

I've read your post a few times, but I'm not understanding why/how this part would work?

So to make sure I understand, you're essentially doing the equivalent of the "train on completions" prompt-masking that unsloth supports, but since there's no instruction prompt, you're only masking the <EOS> tokens:

https://colab.research.google.com/drive/1T5-zKWM_5OD21QHwXHiV9ixTRR7k3iB9?usp=sharing#scrollTo=vITh0KVJ10qX

space = tokenizer(" ", add_special_tokens = False).input_ids[0]
tokenizer.decode([space if x == -100 else x for x in trainer.train_dataset[5]["labels"]])

Yeah, setting the label to -100 like this causes it to get set to the same value as is used for "causal masking", which means it gets ignored for the loss calculations but still gets used for the attention mechanism (the attention_mask can be used for padding tokens to both ignore them for the gradient calculation and make them effectively "invisible" to the attention mechanism, but that's not what we want here).
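
A tiny illustration of the difference (toy values, not from the training code): labels of -100 are skipped by the loss via ignore_index, but the tokens stay visible to attention, whereas attention_mask = 0 would hide them entirely.

import torch
import torch.nn.functional as F

vocab_size = 8
logits = torch.randn(5, vocab_size)             # 5 positions, toy vocabulary
labels = torch.tensor([3, 1, -100, 2, -100])    # -100 = no gradient from these positions

loss = F.cross_entropy(logits, labels, ignore_index=-100)
print(loss)   # averaged only over the 3 positions with real labels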

Extract all paragraphs that are between 200 and 2000 characters

I like this idea, that's actually a really simple way to get rid of the junk.

Yeah, I found you can go smaller but the 50-100 character paragraphs in isolation give so little leading context that they aren't likely to be very useful, and by choosing ~200 characters you 100% remove all the useless junk like page numbers, tables of content, etc.

The reason for setting an upper limit is that things like markdown quotations using > characters can create long run-on "paragraphs" that are really several paragraphs joined.

Randomly shuffle all the 1.1M paragraph + "\n\n" + <EOS> chunks and concatenate them to use as training data.

So this would also teach the model to end every turn with \n\n

I'm hoping it will just learn to end every paragraph with \n\n as it's not actually getting any loss calculated for the following <EOS> token and it should just appear similar to training on larger texts that the model just happens to only see the first paragraph of.

things like names are only high-entropy when you first encounter them; after seeing the name(s) at the start of a story they become low-entropy for the remainder of the sequence...

I've read your post a few times, but I'm not understanding why/how this part would work?

Imagine I give you several chapters of a book to read. If you learn the protagonist is called "Tom" in chapter 1 then the point where you learn his name there could be a huge range of possible names (very high entropy), but as soon as you know his name is "Tom" then the range of valid names drops to just a single possibility (very low entropy).

If these several chapters can fit in a context of 16k or 32k tokens then each time you are about to generate the name "Tom" you aren't really going to get any gradient information from it as the model will be near 100% correct.

On the other hand, if you mix these same chapters up with 1000 other books' chapters, and then force the model to look at just a single paragraph (or possibly a handful of paragraphs), then the model will be left guessing much more and have to use the very sparse preceding context to guess the valid range of names based on whatever clues it can glean from it (ie: locale, sex, other nouns, etc).

This is quite an interesting article on prompt masking / prompt weighting:

https://towardsdatascience.com/to-mask-or-not-to-mask-the-effect-of-prompt-tokens-on-instruction-tuning-016f85fd67f4

(just open in an incognito tab if it won't show - it's pretty rare I ever find anything useful on Medium, but this is one rare case)

If this works then I'm actually most excited about adding masked metadata before each paragraph, as IMO that has the ability to really start to be useful and goes right back to the original idea of using Multiplicative-LoRAs as a kind of "conditional control vector" that can add a (signed) direction (from lora_B) only if it detects another direction (from the corresponding vector in lora_A).

So the hope would be one or more vectors in lora_A would learn to pick out hidden states that are relevant to a given bit of metadata (like author name, genre, era, and so on) and then bias the output of the LLM using the corresponding vector in lora_B to make it closer to what we want...

It could even be a way to recover some of the ability of newer/better LLMs that have been trained on filtered or synthetic data - most "smart" models still have the ability to write like anyone if they can continue on from a real example.
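
As a rough rank-1 sketch of that "conditional control vector" intuition (purely illustrative, not the actual fine-tuning code): a detector direction measures how strongly the hidden state points a certain way, and a corresponding output direction gets added in proportion.

import torch

hidden_dim = 16
h = torch.randn(hidden_dim)          # hidden state at some layer
a = torch.randn(hidden_dim)          # "detector" direction (conceptually a row of lora_A)
b = torch.randn(hidden_dim)          # "steering" direction (conceptually a column of lora_B)

h_out = h + b * torch.dot(a, h)      # add b, scaled by how strongly a fires on h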

I should know by tomorrow if it has any potential, as currently training on top of command-r:32b (new version) which is more prone to sloppy writing...

I just need to be careful of overfitting though, as 40% of my data has been pruned away and now only have around ~100M tokens, and even a rank-16 LoRA on command-r:32b is ~10M trainable parameters... I don't want to reject this method thinking it's broken, but later find it was because of overfitting! So back to using a more conservative rank, lora_dropout and weight_decay to hopefully mitigate the chance of this.

It is definitely learning something:

Screenshot_20241216-224828.png

but will likely be very conservative changes to the output if it isn't broken.

I've just noticed some interesting stuff about the Cohere tokeniser:

https://huggingface.co./CohereForAI/c4ai-command-r-v01/blob/main/tokenizer_config.json

{
  "add_bos_token": true,
  "add_eos_token": false,
  "add_prefix_space": false,
  "added_tokens_decoder": {
    "0": {
      "content": "<PAD>",
      "lstrip": false,
      "normalized": false,
      "rstrip": false,
      "single_word": false,
      "special": true
    },
    "1": {
      "content": "<UNK>",
      "lstrip": false,
      "normalized": false,
      "rstrip": false,
      "single_word": false,
      "special": true
    },
    "2": {
      "content": "<CLS>",
      "lstrip": false,
      "normalized": false,
      "rstrip": false,
      "single_word": false,
      "special": true
    },
    "3": {
      "content": "<SEP>",
      "lstrip": false,
      "normalized": false,
      "rstrip": false,
      "single_word": false,
      "special": true
    },
    "4": {
      "content": "<MASK_TOKEN>",
      "lstrip": false,
      "normalized": false,
      "rstrip": false,
      "single_word": false,
      "special": true
    },
    "5": {
      "content": "<BOS_TOKEN>",
      "lstrip": false,
      "normalized": false,
      "rstrip": false,
      "single_word": false,
      "special": true
    },
    "6": {
      "content": "<EOS_TOKEN>",
      "lstrip": false,
      "normalized": false,
      "rstrip": false,
      "single_word": false,
      "special": true
    },
    "7": {
      "content": "<EOP_TOKEN>",
      "lstrip": false,
      "normalized": false,
      "rstrip": false,
      "single_word": false,
      "special": true
    },
    "255000": {
      "content": "<|START_OF_TURN_TOKEN|>",
      "lstrip": false,
      "normalized": false,
      "rstrip": false,
      "single_word": false,
      "special": false
    },
    "255001": {
      "content": "<|END_OF_TURN_TOKEN|>",
      "lstrip": false,
      "normalized": false,
      "rstrip": false,
      "single_word": false,
      "special": true
    },
    "255002": {
      "content": "<|YES_TOKEN|>",
      "lstrip": false,
      "normalized": false,
      "rstrip": false,
      "single_word": false,
      "special": false
    },
    "255003": {
      "content": "<|NO_TOKEN|>",
      "lstrip": false,
      "normalized": false,
      "rstrip": false,
      "single_word": false,
      "special": false
    },
    "255004": {
      "content": "<|GOOD_TOKEN|>",
      "lstrip": false,
      "normalized": false,
      "rstrip": false,
      "single_word": false,
      "special": false
    },
    "255005": {
      "content": "<|BAD_TOKEN|>",
      "lstrip": false,
      "normalized": false,
      "rstrip": false,
      "single_word": false,
      "special": false
    },
    "255006": {
      "content": "<|USER_TOKEN|>",
      "lstrip": false,
      "normalized": false,
      "rstrip": false,
      "single_word": false,
      "special": false
    },
    "255007": {
      "content": "<|CHATBOT_TOKEN|>",
      "lstrip": false,
      "normalized": false,
      "rstrip": false,
      "single_word": false,
      "special": false
    },
    "255008": {
      "content": "<|SYSTEM_TOKEN|>",
      "lstrip": false,
      "normalized": false,
      "rstrip": false,
      "single_word": false,
      "special": false
    }
  },
  "bos_token": "<BOS_TOKEN>",
  "eos_token": "<|END_OF_TURN_TOKEN|>",

They used an actual <EOS_TOKEN> (and <EOP_TOKEN>) token during pre-training, but then it got switched to "eos_token": "<|END_OF_TURN_TOKEN|>" during fine-tuning.

Also the use of <CLS>, <SEP> and <MASK> during pre-training likely means it was trained (at least partly) using non-causal data (ie: like BERT where it gets to see the future tokens and has to fill in the masked/middle tokens):

https://huggingface.co./docs/transformers/en/main_classes/tokenizer

https://www.llama.com/docs/model-cards-and-prompt-formats/llama3_1/

It looks like llama3 might have done something similar with its tokeniser:

<|end_of_text|>

Model will cease to generate more tokens. This token is generated only by the base models.

<|eom_id|>

End of message. A message represents a possible stopping point for execution where the model can inform the executor that a tool call needs to be made. This is used for multi-step interactions between the model and any available tools. This token is emitted by the model when the Environment: ipython instruction is used in the system prompt, or if the model calls for a built-in tool.

<|eot_id|>

End of turn. Represents when the model has determined that it has finished interacting with the user message that initiated its response. This is used in two scenarios:

at the end of a direct interaction between the model and the user
at the end of multiple interactions between the model and any available tools

This token signals to the executor that the model has finished generating a response.


This makes me wonder if we can still use these tokens for fine-tuning if we set the labels to -100?

I'm gonna test using each of these:

  • <SEP>
  • <EOP_TOKEN>
  • \n + <EOS_TOKEN>
  • \n + <|END_OF_TURN_TOKEN|>
  • \n + \n + <EOS_TOKEN>
  • \n + \n + <|END_OF_TURN_TOKEN|>

to delimit the paragraphs (with the label set to -100), and see what it does to the losses for command-r:32b (I'm currently running \n + \n + <|END_OF_TURN_TOKEN|>).

I don't think using <EOS_TOKEN> or <|END_OF_TURN_TOKEN|> without any new lines prepended makes much sense, but from reading the paper (which I re-linked below after my post above vanished) the use of <EOP_TOKEN> and <SEP> seems worth trying.

One of my posts just vanished above, but in it I linked these two:

https://arxiv.org/abs/2004.02251

https://stackoverflow.com/questions/71306070/do-you-need-to-put-eos-and-bos-tokens-in-autoencoder-transformers

and said that the ordering of the Cohere models' token IDs makes it look like they might have first pre-trained bi-directionally, then pre-trained causally, then finally fine-tuned.

If this works then I'm actually most excited about adding masked metadata before each paragraph, as IMO that has the ability to really start to be useful and goes right back to the original idea of using Multiplicative-LoRAs as a kind of "conditional control vector" that can add a (signed) direction (from lora_B) only if it detects another direction (from the corresponding vector in lora_A).

So the hope would be one or more vectors in lora_A would learn to pick out hidden states that are relevant to a given bit of metadata (like author name, genre, era, and so on) and then bias the output of the LLM using the corresponding vector in lora_B to make it closer to what we want...

It could even be a way to recover some of the ability of newer/better LLMs that have been trained on filtered or synthetic data - most "smart" models still have the ability to write like anyone if they can continue on from a real example.
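Roughly, the way I picture the Multiplicative-LoRA acting as a "conditional control vector" (a toy sketch - assuming it modifies a projection's output h as h + B(A h); the sizes are placeholders):

import torch

hidden_size, rank = 4096, 16
lora_A = torch.randn(rank, hidden_size) / hidden_size**0.5  # "detector" directions
lora_B = torch.zeros(hidden_size, rank)                     # "steering" directions (zero-init)

def multiplicative_lora(h):
    # Each row of lora_A measures how strongly h points along some learned direction;
    # the corresponding column of lora_B then adds a (signed, scaled) bias direction.
    gate = h @ lora_A.T           # (..., rank) detection strengths
    return h + gate @ lora_B.T    # h + B(A h): the conditional control vectors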

Glad to see you're giving this model a go for us 24gb and below users :-)

If this works then I'm actually most excited about adding masked metadata before each paragraph, as IMO that has the ability to really start to be useful and goes right back to the original idea of using Multiplicative-LoRAs as a kind of "conditional control vector" that can add a (signed) direction (from lora_B) only if it detects another direction (from the corresponding vector in lora_A).

So the hope would be one or more vectors in lora_A would learn to pick out hidden states that are relevant to a given bit of metadata (like author name, genre, era, and so on) and then bias the output of the LLM using the corresponding vector in lora_B to make it closer to what we want...

It could even be a way to recover some of the ability of newer/better LLMs that have been trained on filtered or synthetic data - most "smart" models still have the ability to write like anyone if they can continue on from a real example.

Glad to see you're giving this model a go for us 24gb and below users :-)

Well, if I can get this working properly then I think it should work with smaller models too:

I think the reason @tdrussell 's "Instruct-Storywriter" method didn't work well on small models is because they got a huge drop in loss compared to larger models, whereas this method of using a bunch of randomised paragraphs gets a similar loss for all models, and big models can't rely so much on already having the stories encoded in their weights.

I'm gonna test using each of these:

  • <SEP>
  • <EOP_TOKEN>
  • \n + <EOS_TOKEN>
  • \n + <|END_OF_TURN_TOKEN|>
  • \n + \n + <EOS_TOKEN>
  • \n + \n + <|END_OF_TURN_TOKEN|>

to delimit the paragraphs (with the label set to -100), and see what it does to the losses for command-r:32b (I'm currently running \n + \n + <|END_OF_TURN_TOKEN|>).

After reading the paper I linked above about the use of <SEP> and <EOP_TOKEN>:

The most important observation is that, without EOP, the beginning of the generation is more relevant to the end of the input prompt, but the more it generates, the poor quality is. While the generator with EOP can generate multiple paragraphs related to the input with a reasonable ending but each paragraph is more independent than human writings.

(see Appendix B too)

Added to the fact that my paragraphs are all seen in isolation and randomised, I think the only ones I actually need to try now are:

  • <EOS_TOKEN>
  • \n + <EOS_TOKEN>
  • \n + \n + <EOS_TOKEN>

and:

  • <|END_OF_TURN_TOKEN|>
  • \n + <|END_OF_TURN_TOKEN|>
  • \n + \n + <|END_OF_TURN_TOKEN|>

It only takes around 20 hours per run, so I can easily test all of these, but it will be harder to compare the evaluation losses between the different newline variants as the models can probably "cheat" and learn the pattern from earlier examples...

and this bit from the paper:

This observation indicates that GPT2 tends not to generate the EOS following the NL even after fine-tuning, but it can learn better EOS with the help of a new EOP token.

makes me think that adding the newlines right before the <EOS> token might be a bad idea (though I'm not 100% sure this still applies if I'm setting the <EOS> label to -100).

So next I will try <|END_OF_TURN_TOKEN|> and <EOS_TOKEN> (with label set to -100) as these should be easier to compare.

If this works then I'm actually most excited about adding masked metadata before each paragraph, as IMO that has the ability to really start to be useful and goes right back to the original idea of using Multiplicative-LoRAs as a kind of "conditional control vector" that can add a (signed) direction (from lora_B) only if it detects another direction (from the corresponding vector in lora_A).

So the hope would be one or more vectors in lora_A would learn to pick out hidden states that are relevant to a given bit of metadata (like author name, genre, era, and so on) and then bias the output of the LLM using the corresponding vector in lora_B to make it closer to what we want...

It could even be a way to recover some of the ability of newer/better LLMs that have been trained on filtered or synthetic data - most "smart" models still have the ability to write like anyone if they can continue on from a real example.

Glad to see you're giving this model a go for us 24gb and below users :-)

Well, if I can get this working properly then I think it should work with smaller models too:

I think the reason @tdrussell 's "Instruct-Storywriter" method didn't work well on small models is because they got a huge drop in loss compared to larger models, whereas this method of using a bunch of randomised paragraphs gets a similar loss for all models, and big models can't rely so much on already having the stories encoded in their weights.

Mate, that's awesome! Can't wait to see it.

All this is getting way too complicated, and it's unclear exactly what effect all these different ways of breaking paragraphs are going to have on an instruction-tuned model...

So... I'm just gonna generate my data as before:

Paragraph 1

<EOS>Paragraph 2

<EOS>Paragraph 3

.
.
.
<EOS>Paragraph N-1

<EOS>Paragraph N

<EOS>

Then tokenise this, with the <EOS> tokens acting as hard split points so each paragraph (plus its 2 trailing newlines) gets tokenised as a whole.

Then use this to just output huge sequences of random paragraphs to train on:

<BOS>Paragraph 1

Paragraph 2

Paragraph 3

.
.
.
Paragraph N-1

Paragraph N

<EOS>
<EOS>
<EOS>

and completely mask out the <EOS> tokens in the same way as <PAD> would be.

It will likely confuse the model somewhat, but may actually be less confusing than attempting to use all these breaking tokens with an instruction-tuned model, and the distribution of newlines in real stories should be retained.

(If it does stop the model being able to output any special tokens, then I can deal with that by using a second dataset that is passed through the chat template but has everything except the special tokens masked out. Even if that second dataset is full of horrible slop-ridden stories, it should hopefully still fix the frequencies of the special tokens if needed...)
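A rough sketch of how this packing could look in code (again just my own illustration - the file name, repo id, separator string and sequence length are placeholders):

import random
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("CohereForAI/c4ai-command-r-08-2024")  # assumed repo id
EOS_ID = tok.convert_tokens_to_ids("<EOS_TOKEN>")
SEQ_LEN = 8192                                        # placeholder sequence length

raw = open("paragraphs.txt").read()                   # placeholder: the <EOS>-separated file above
chunks = [c for c in raw.split("<EOS_TOKEN>") if c.strip()]  # or whatever separator string the file uses

# Each chunk is "paragraph + 2 trailing newlines" and gets tokenised as a whole.
tokenised = [tok.encode(c, add_special_tokens=False) for c in chunks]
random.shuffle(tokenised)

input_ids, labels = [tok.bos_token_id], [-100]
for ids in tokenised:
    if len(input_ids) + len(ids) > SEQ_LEN:
        break
    input_ids += ids
    labels    += ids                                   # train on the paragraph text and newlines

# Pad the tail with <EOS> and mask it exactly as <PAD> would be.
pad_len = SEQ_LEN - len(input_ids)
input_ids += [EOS_ID] * pad_len
labels    += [-100] * pad_len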

It's a bit of a dodgy hack, but I've found a way to avoid screwing up the frequencies of the special tokens:

  "added_tokens_decoder": {
    "0": {
      "content": "<PAD>",
      "lstrip": false,
      "normalized": false,
      "rstrip": false,
      "single_word": false,
      "special": true
    },
    "1": {
      "content": "<UNK>",
      "lstrip": false,
      "normalized": false,
      "rstrip": false,
      "single_word": false,
      "special": true
    },
    "2": {
      "content": "<CLS>",
      "lstrip": false,
      "normalized": false,
      "rstrip": false,
      "single_word": false,
      "special": true
    },
    "3": {
      "content": "<SEP>",
      "lstrip": false,
      "normalized": false,
      "rstrip": false,
      "single_word": false,
      "special": true
    },
    "4": {
      "content": "<MASK_TOKEN>",
      "lstrip": false,
      "normalized": false,
      "rstrip": false,
      "single_word": false,
      "special": true
    },
    "5": {
      "content": "<BOS_TOKEN>",
      "lstrip": false,
      "normalized": false,
      "rstrip": false,
      "single_word": false,
      "special": true
    },
    "6": {
      "content": "<EOS_TOKEN>",
      "lstrip": false,
      "normalized": false,
      "rstrip": false,
      "single_word": false,
      "special": true
    },
    "7": {
      "content": "<EOP_TOKEN>",
      "lstrip": false,
      "normalized": false,
      "rstrip": false,
      "single_word": false,
      "special": true
    },
    "255000": {
      "content": "<|START_OF_TURN_TOKEN|>",
      "lstrip": false,
      "normalized": false,
      "rstrip": false,
      "single_word": false,
      "special": false
    },
    "255001": {
      "content": "<|END_OF_TURN_TOKEN|>",
      "lstrip": false,
      "normalized": false,
      "rstrip": false,
      "single_word": false,
      "special": true
    },
    "255002": {
      "content": "<|YES_TOKEN|>",
      "lstrip": false,
      "normalized": false,
      "rstrip": false,
      "single_word": false,
      "special": false
    },
    "255003": {
      "content": "<|NO_TOKEN|>",
      "lstrip": false,
      "normalized": false,
      "rstrip": false,
      "single_word": false,
      "special": false
    },
    "255004": {
      "content": "<|GOOD_TOKEN|>",
      "lstrip": false,
      "normalized": false,
      "rstrip": false,
      "single_word": false,
      "special": false
    },
    "255005": {
      "content": "<|BAD_TOKEN|>",
      "lstrip": false,
      "normalized": false,
      "rstrip": false,
      "single_word": false,
      "special": false
    },
    "255006": {
      "content": "<|USER_TOKEN|>",
      "lstrip": false,
      "normalized": false,
      "rstrip": false,
      "single_word": false,
      "special": false
    },
    "255007": {
      "content": "<|CHATBOT_TOKEN|>",
      "lstrip": false,
      "normalized": false,
      "rstrip": false,
      "single_word": false,
      "special": false
    },
    "255008": {
      "content": "<|SYSTEM_TOKEN|>",
      "lstrip": false,
      "normalized": false,
      "rstrip": false,
      "single_word": false,
      "special": false
    },
    "255009": {
      "content": "<|USER_0_TOKEN|>",
      "lstrip": false,
      "normalized": false,
      "rstrip": false,
      "single_word": false,
      "special": false
    },
    "255010": {
      "content": "<|USER_1_TOKEN|>",
      "lstrip": false,
      "normalized": false,
      "rstrip": false,
      "single_word": false,
      "special": false
    },
    "255011": {
      "content": "<|USER_2_TOKEN|>",
      "lstrip": false,
      "normalized": false,
      "rstrip": false,
      "single_word": false,
      "special": false
    },
    "255012": {
      "content": "<|USER_3_TOKEN|>",
      "lstrip": false,
      "normalized": false,
      "rstrip": false,
      "single_word": false,
      "special": false
    },
    "255013": {
      "content": "<|USER_4_TOKEN|>",
      "lstrip": false,
      "normalized": false,
      "rstrip": false,
      "single_word": false,
      "special": false
    },
    "255014": {
      "content": "<|USER_5_TOKEN|>",
      "lstrip": false,
      "normalized": false,
      "rstrip": false,
      "single_word": false,
      "special": false
    },
    "255015": {
      "content": "<|USER_6_TOKEN|>",
      "lstrip": false,
      "normalized": false,
      "rstrip": false,
      "single_word": false,
      "special": false
    },
    "255016": {
      "content": "<|USER_7_TOKEN|>",
      "lstrip": false,
      "normalized": false,
      "rstrip": false,
      "single_word": false,
      "special": false
    },
    "255017": {
      "content": "<|USER_8_TOKEN|>",
      "lstrip": false,
      "normalized": false,
      "rstrip": false,
      "single_word": false,
      "special": false
    },
    "255018": {
      "content": "<|USER_9_TOKEN|>",
      "lstrip": false,
      "normalized": false,
      "rstrip": false,
      "single_word": false,
      "special": false
    },
    "255019": {
      "content": "<|EXTRA_0_TOKEN|>",
      "lstrip": false,
      "normalized": false,
      "rstrip": false,
      "single_word": false,
      "special": false
    },
    "255020": {
      "content": "<|EXTRA_1_TOKEN|>",
      "lstrip": false,
      "normalized": false,
      "rstrip": false,
      "single_word": false,
      "special": false
    },
    "255021": {
      "content": "<|EXTRA_2_TOKEN|>",
      "lstrip": false,
      "normalized": false,
      "rstrip": false,
      "single_word": false,
      "special": false
    },
    "255022": {
      "content": "<|EXTRA_3_TOKEN|>",
      "lstrip": false,
      "normalized": false,
      "rstrip": false,
      "single_word": false,
      "special": false
    },
    "255023": {
      "content": "<|EXTRA_4_TOKEN|>",
      "lstrip": false,
      "normalized": false,
      "rstrip": false,
      "single_word": false,
      "special": false
    },
    "255024": {
      "content": "<|EXTRA_5_TOKEN|>",
      "lstrip": false,
      "normalized": false,
      "rstrip": false,
      "single_word": false,
      "special": false
    },
    "255025": {
      "content": "<|EXTRA_6_TOKEN|>",
      "lstrip": false,
      "normalized": false,
      "rstrip": false,
      "single_word": false,
      "special": false
    },
    "255026": {
      "content": "<|EXTRA_7_TOKEN|>",
      "lstrip": false,
      "normalized": false,
      "rstrip": false,
      "single_word": false,
      "special": false
    },
    "255027": {
      "content": "<|EXTRA_8_TOKEN|>",
      "lstrip": false,
      "normalized": false,
      "rstrip": false,
      "single_word": false,
      "special": false
    },
    "255028": {
      "content": "<|NEW_FILE|>",
      "lstrip": false,
      "normalized": false,
      "rstrip": false,
      "single_word": false,
      "special": true
    },
    "255029": {
      "content": "<|BEGINNING_OF_PREFIX_FIM_TOKEN|>",
      "lstrip": false,
      "normalized": false,
      "rstrip": false,
      "single_word": false,
      "special": true
    },
    "255030": {
      "content": "<|BEGINNING_OF_MIDDLE_FIM_TOKEN|>",
      "lstrip": false,
      "normalized": false,
      "rstrip": false,
      "single_word": false,
      "special": true
    },
    "255031": {
      "content": "<|BEGINNING_OF_SUFFIX_FIM_TOKEN|>",
      "lstrip": false,
      "normalized": false,
      "rstrip": false,
      "single_word": false,
      "special": true
    },
    "255032": {
      "content": "<|END_OF_MIDDLE_FIM_TOKEN|>",
      "lstrip": false,
      "normalized": false,
      "rstrip": false,
      "single_word": false,
      "special": true
    },
    "255033": {
      "content": "<|EXTRA_9_TOKEN|>",
      "lstrip": false,
      "normalized": false,
      "rstrip": false,
      "single_word": false,
      "special": false
    }
  },

by hacking the Triton kernel:



@triton.heuristics({
    "DO_LOGIT_SCALING": lambda args: args["DO_LOGIT_SCALING"],
})
@triton.jit
def _cross_entropy_backward(
    logits_ptr, logits_row_stride,
    dloss_ptr,   dloss_row_stride,
    logsumexp_ptr,
    labels_ptr,
    VOCAB_SIZE : tl.constexpr,
    BLOCK_SIZE : tl.constexpr,
    DO_LOGIT_SCALING : tl.constexpr,
    LOGIT_SCALE : tl.constexpr,
):
    """
        CE_i = -y log(P) = y * (log[sum(exp(x))] - x)
        dC/dx = d/dx (y * log[sum(exp(x))] - x * y)

        From https://en.wikipedia.org/wiki/LogSumExp
        d/dx logsumexp = exp(x) / sum(exp(x)) = softmax(x)

        dC/dx = y * exp(x) / sum(exp(x)) - d/dx (x * y)
        dC/dx = y * exp[ log[exp(x) / sum(exp(x))] ] using x = exp(log(x)) trick
        dC/dx = y * exp[x - logsumexp] - d/dx (x * y)

        If y == 0: dC/dx = 0
        If y == 1 and x == label: dC/dlabel = exp[x - logsumexp] - 1
        If y == 1 and x != label: dC/dx     = exp[x - logsumexp]
    """
    row_idx   = tl.program_id(0)
    block_idx = tl.program_id(1)

    logits_ptr += row_idx * logits_row_stride.to(tl.int64)
    dloss_ptr  += row_idx *  dloss_row_stride
    col_offsets = block_idx*BLOCK_SIZE + tl.arange(0, BLOCK_SIZE)
    mask = col_offsets < VOCAB_SIZE
    label_idx = tl.load(labels_ptr + row_idx).to(tl.int32)

    if label_idx != -100:
        dloss = tl.load(dloss_ptr)
    else:
        dloss = 0.0

    x = tl.load(logits_ptr + col_offsets, mask = mask, other = -float("inf")).to(tl.float32)
    if DO_LOGIT_SCALING:
        # d/dx [s * x] = s
        x = LOGIT_SCALE * x
    pass
    logsumexp = tl.load(logsumexp_ptr + row_idx)
    y = tl.exp(x - logsumexp)
    y = tl.where(
        col_offsets == label_idx,
        y - 1.0, # exp(x - logsumexp) - 1
        y,       # exp(x - logsumexp)
    )

    #######################################################
    # Zero out the gradients for the Cohere special tokens.
    y = tl.where(
        (col_offsets <= 7) | (col_offsets >= 255000),
        0.0,
        y,
    )
    #######################################################

    # If y == 0: dC/dx = 0 ==> we already masked it to be = 0, so dloss = 0.
    if DO_LOGIT_SCALING:
        # d/dx [s * x] = s
        y = LOGIT_SCALE * y
    pass
    tl.store(logits_ptr + col_offsets, dloss * y, mask = mask)
pass

so that gradient information isn't backpropagated for these tokens.

This should fix the problem of the frequencies of these tokens slowly going to zero due to having none of them in your training data!
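For anyone who'd rather not read Triton, here's the same idea in plain PyTorch (only for illustration - this is the naive version that touches the full logit gradients, which is exactly the extra-VRAM problem I mention below):

import torch

def zero_special_token_grads(logits_grad: torch.Tensor) -> torch.Tensor:
    # logits_grad: (tokens, vocab) gradient w.r.t. the logits.
    # Zero the columns for Cohere's special tokens (IDs 0-7 and 255000+), so
    # their probabilities never get pushed towards zero by training data that
    # contains none of them.
    ids = torch.arange(logits_grad.shape[-1], device=logits_grad.device)
    return logits_grad.masked_fill((ids <= 7) | (ids >= 255000), 0.0)

# e.g. attached as a hook on the logits tensor before the loss is computed:
# logits.register_hook(zero_special_token_grads)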

Now I just need to see what happens when we train on these massive files of:

<BOS>paragaph1

paragraph2

paragraph3

I've found out using:

https://huggingface.co./spaces/Xenova/the-tokenizer-playground

that the above tokenises to:

[5, 35, 138854, 37, 2385, 1786, 16599, 24, 206, 206, 95337, 25, 206, 206, 95337, 26]

with 206 being the newlines.

I'm hoping that by keeping these newlines we DO actually bias the frequency of these to be closer to actual authors' writing style, but if this fails I can also zero their gradient if need be.

Fingers crossed this works!

Sorry for the lack of updates, but I have still been progressing slowly with this:

  • I'm still getting the weird "extra step" at the end of every training run, but unless I use a cosine-annealed schedule it doesn't seem to make any difference.
  • I've found a much better way to initialise the LoRAs, which lets me run projected gradient descent on lora_A so it stays on the surface of a unit sphere, and then use weight-decay only on lora_B (see the sketch below).
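A minimal sketch of what I mean by the projection / weight-decay split (placeholder shapes and hyper-parameters, and the row-wise normalisation of lora_A is my assumption - purely an illustration of the idea):

import torch

hidden_size, rank = 4096, 16
lora_A = torch.nn.Parameter(torch.randn(rank, hidden_size))
lora_B = torch.nn.Parameter(torch.zeros(hidden_size, rank))

# Weight-decay only on lora_B; lora_A gets projected back onto the unit sphere
# after every optimiser step instead.
optimizer = torch.optim.AdamW(
    [
        {"params": [lora_A], "weight_decay": 0.0},
        {"params": [lora_B], "weight_decay": 0.01},
    ],
    lr=1e-4,
)

def project_lora_A():
    with torch.no_grad():
        lora_A.div_(lora_A.norm(dim=-1, keepdim=True).clamp_min(1e-8))

project_lora_A()  # start on the sphere
# ...then inside the training loop:
# loss.backward(); optimizer.step(); project_lora_A(); optimizer.zero_grad()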

I'll post more details and hopefully the v1.0 of command-r:32b before the new year.

I haven't tested it yet, but the new initialisation / optimisation may let me bump the entropy up even further than I could before. For now, though, I'm just using stock cross-entropy loss with no attempt to increase entropy until I get the hyper-parameters dialled in properly...

I'm still running on the 1.1M random paragraphs dataset and using the "hack" I posted above to avoid the special tokens getting nerfed:

https://github.com/tdrussell/qlora-pipe/discussions/41

I'll be buggered if I can make this work in pytorch without using 10GB extra VRAM (for no apparent reason - even using "chunking"???), but the Triton kernel modification works...

If anybody has any suggestions I'd be very grateful, as currently this dodgy hack will mean the code needs to be edited for every different model :/

Merry Christmas!

2ab07090374e9f9a78cbdf0e304dc8c8.jpg

Merry Christmas @jukofyork @ChuckMcSneed @gghfez and lurkers!

Merry Christmas!

https://huggingface.co./spaces/Xenova/the-tokenizer-playground

This looks useful. I've got a tokenizer issue to investigate myself. I've been using the standard approach, e.g.:

from transformers import AutoTokenizer
writer_tokenizer = AutoTokenizer.from_pretrained("gghfez/Writer-Large-2411-v2.1")
print(writer_tokenizer.encode("""<BOS>paragaph1

paragraph2

paragraph3"""))

So it looks like for command-r, 206 is 1 linefeed and 2126 is 2 linefeeds.

If anybody has any suggestions I'd be very grateful, as currently this dodgy hack will mean the code needs to be edited for every different model :/

Sorry, what you're doing is beyond my level right now.

Merry Christmas!

Not related to creative writing, but the new QVQ-72B model is insanely impressive:

  1. I gave it an obscure picture of a train-line map I took at a museum a few months ago: horrible photo, glare reflecting off the perspex in front of it, etc. Then I asked it to estimate the date and it absolutely nailed it by looking at the place names, the dates the lines were created and cut, the style of the fonts, and so on!
  2. I gave it a picture of my brother and his wife sitting in front of a waterfall in New Zealand and it looked at the foliage, lighting, water colour and so on to narrow it down and actually got the exact place!
  3. I gave it a picture of my confusing 3-phase electric meter and asked for the reading, and it managed to ignore all the distractions and read the exact value!

I think GeoGuessr will have to start working on their anti-cheat as it's likely better than 99% of the population!!!

Merry Christmas all! Have a great day!
