Previous Rendition:

Evolution 50

This was a minor change with the addition of just this one model.

Add EVA-UNIT-01/EVA-LLaMA-3.33-70B-v0.0

Added another RP model, this one based on Llama 3.3. It seemed to be a fairly substantial addition to the model's behavior. The prose is a bit different, and it's edging on overcompliance, but I do enjoy the writing style. Adding a 3.3 model in general seems to upgrade the instruct as well. I can't identify precisely what I liked about this addition, but oh well, it's one I liked and it's probably going to stick around.

Model Architecture

This is a Model Stock merge. Thanks to mergekit for making this really easy to do.

Base:

huihui-ai/Llama-3.1-Nemotron-70B-Instruct-HF-abliterated

Stock:

huihui-ai/Llama-3.1-Nemotron-70B-Instruct-HF-abliterated
Sao10K/L3.3-70B-Euryale-v2.3
rinna/llama-3-youko-70B
yentinglin/Llama-3-Taiwan-70B-Instruct
meta-llama/Meta-Llama-3-70B
PKU-Baichuan-MLSystemLab/Llama3-PBM-Nova-70B
tokyotech-llm/Llama-3.1-Swallow-70B-v0.1
Bllossom/llama-3-Korean-Bllossom-70B
WhiteRabbitNeo/Llama-3.1-WhiteRabbitNeo-2-70B
TsinghuaC3I/Llama-3-70B-UltraMedical
hitachi-nlp/Llama-3.1-70B-FLDx2
PKU-Alignment/ProgressGym-HistLlama3-70B-C013-pretrain-v0.1
EVA-UNIT-01/EVA-LLaMA-3.33-70B-v0.0
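For readers unfamiliar with what a Model Stock merge does with a base and a list of models like the one above, here is a minimal sketch of the per-layer interpolation as I understand it from the Model Stock paper. This is not mergekit's actual code: the function name `model_stock_layer` and the plain-list representation of weights are assumptions made for illustration, and the angle estimate here is a simple average over pairs.

```python
import math


def model_stock_layer(base, tuned):
    """Merge one flattened weight tensor.

    base  -- list of floats (the base model's weights for this layer)
    tuned -- list of lists of floats (the same layer from each stock model)
    Hypothetical helper for illustration; not part of mergekit.
    """
    k = len(tuned)
    if k < 2:
        raise ValueError("need at least two stock models")

    # Task vectors: how each stock model differs from the base.
    deltas = [[w - b for w, b in zip(t, base)] for t in tuned]

    def cosine(u, v):
        dot = sum(a * b for a, b in zip(u, v))
        norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
        return dot / norm if norm > 0 else 0.0

    # Average pairwise cosine similarity between task vectors.
    pairs = [(i, j) for i in range(k) for j in range(i + 1, k)]
    cos_theta = sum(cosine(deltas[i], deltas[j]) for i, j in pairs) / len(pairs)

    # Interpolation ratio: near 1 when the stock models agree on a direction,
    # near 0 (stay at the base) when their changes are unrelated.
    t = k * cos_theta / (1 + (k - 1) * cos_theta)

    avg = [sum(col) / k for col in zip(*tuned)]
    return [t * a + (1 - t) * b for a, b in zip(avg, base)]
```

The practical upshot of the sketch: when the stock models move away from the base in unrelated directions, the result stays close to the base, and when they agree, the result moves toward their average.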

The rest of this goes over my rationale for the approach, my goal, and the observations I've made, in the form of a totally unstructured series of rants.

Selection

The model selection was done using an evolutionary approach where each generation added or removed a model, and I tested the outcomes myself without any automation or benchmarks.
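One generation of this process can be sketched roughly as follows. This is only an illustration of the add-or-remove loop described above: the function `run_generation` and the `is_improvement` callback are hypothetical names, and in practice the "callback" was a person reading outputs, not code.

```python
def run_generation(current, proposal, is_improvement):
    """Try one add/remove change and keep it only if the judge prefers it.

    current        -- list of model names in the current merge
    proposal       -- ("add", name) or ("remove", name)
    is_improvement -- callable(old_list, new_list) -> bool; stands in for
                      the manual, human evaluation (an assumption here)
    """
    action, name = proposal
    if action == "add" and name not in current:
        trial = current + [name]
    elif action == "remove" and name in current:
        trial = [m for m in current if m != name]
    else:
        # Proposal does not change anything, so keep the current set.
        return current
    return trial if is_improvement(current, trial) else current
```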

Why are these models candidates for selection?

Long story short, there are few choices for large trained models that are very different from each other. Almost all RP models use similar datasets, and they often train exclusively on a narrow distribution of specific data. This often means they differ numerically very little from their root model, which makes merging less impactful. So for the actual selection, I took the most varied models available from the limited pool and let the evolutionary process decide the rest.

Base Models VS Instruct Tunes

A very important point that is often missed is that instruct is often where the moralizing, dehumanization, and overcompliance come in. This is not that surprising: instruct is designed to be compliant, and the nature of instruct tunes leads naturally to strong biases in the model's vocabulary and personality. So the goal was to use the base models in the merge, like youko and the original Llama 3 70B base, as regularization to pull back from some of these learned habits of the instruct models, which are not present (or perhaps I should say, not as prominent) in the base models.

Is Merging even Effective?

I am not a merge believer, but I have to concede the objective fact that model stock works very well. It's the only merging method I've had success with, and my experiments with it continue to impress me.

Why Nemotron Base for RP?

I'll discuss this later in more detail, but roleplay is more than raw knowledge. Having a diverse range, a strong vocabulary, and good understanding is important to making the world feel real, where actions have consequences that are understood by the model. I found Nemotron to have the best capacity to understand (zero shot) implications and hidden meanings. It's even quite good at explaining them if you ask. However, the model has many damage points: it tends to generate in increasingly incoherent ways as time goes on (in terms of rolling over the context window). This is fairly common, but it's quite bad and very obvious in Nemotron's case. With correct sampling parameters, this model performs okay on its own. Merging, however, solves the severe degradation problem and improves its prose while retaining its "smart" qualities. I found this to consistently be the best base model for the merge.

Sampling

My personal advice: use less sampling with big models. I use a very low or no repetition penalty, between 1 and 1.03. I've liked temperatures between 0.9 and 1.5. Adding min-p 0.1 might be helpful, although I personally like the rare distributions to come up sometimes and lead to wild things. An alternative method is to use 3.0 temp with 0.5 min-p and apply temperature last, which will give you near-greedy sampling except when the model is uncertain. You'll usually need a higher rep-pen with this, as the model starts to get repetitious if you aren't careful. Finally, almost all sampling setups and samplers just make the model worse. Temperature adds randomness, which in roleplay is more interesting, but it will also artificially introduce errors, which then cascade. A simple example: if the character has a red shirt and you ask the color of their shirt, bad sampling might force the model to spit out something random, like a maroon shirt. Basically, my recommendation is to use the least amount of sampling (as close to greedy as possible) that still gives interesting experiences.
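To make the "min-p first, temperature last" setup concrete, here is a minimal sketch in plain Python. It is not any particular backend's implementation: the function name and the raw-logit list input are assumptions for illustration, and real samplers work on tensors and may order or define these steps differently.

```python
import math
import random


def sample_min_p_then_temperature(logits, min_p=0.5, temperature=3.0, rng=random):
    """Pick a token index: filter with min-p, then apply temperature last."""
    # Softmax at temperature 1 gives the model's own probabilities.
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    probs = [e / total for e in exps]

    # min-p: keep only tokens at least min_p times as likely as the top token.
    cutoff = min_p * max(probs)
    kept = [i for i, p in enumerate(probs) if p >= cutoff]

    # Temperature last: flatten the distribution only among the survivors.
    scaled = [logits[i] / temperature for i in kept]
    m2 = max(scaled)
    weights = [math.exp(s - m2) for s in scaled]
    return rng.choices(kept, weights=weights, k=1)[0]
```

When one token dominates, it is the only survivor of the min-p filter and the high temperature has nothing to act on, so the result is effectively greedy. When several tokens are close in probability, they all survive and the high temperature makes the choice among them nearly uniform.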

The Goal

The overall goal of this model is a more human experience. A good roleplay model involves much more nuance than is often discussed, so I'll go over the pieces here.

Knowledge

Knowing things is what LLMs are good at, really good at. The issue is that they know things in a probability space, and under certain conditions the context and logits don't let them find the knowledge encoded in the weights. Paradoxically, this means LLMs can know and simultaneously not know things, or know things in multiple different ways. The reality is that people care about the perception of consistency. As long as the model appears to consistently represent knowledge, I call this a win. One of my foremost goals is that the model represents both in-distribution and in-context things in a fairly consistent way.

"Unspoken Understandings"

People sometimes say things using indirect language and rely on the listener to pick up the meaning. These are implications. They can also happen in terms of actions. Consequences share a similar idea. Language models really struggle with both of these categories. I would say this model has a moderate ability to grasp these things, but it has not met my standard of consistency. I imagine large book datasets and stories would be the most representative of these ideas, but consequences are often long-term dependencies in stories, which are even harder to represent in the short context windows we currently have. Likely, this will not massively improve without longer context windows and better comprehension over them.

Homogeneous Representation, Positions, Locations, Distance

Surprising nobody, language models that learn purely from text, with no sensory organs or actual world experience, don't have a great understanding of physics in a literal sense. Where things are and how they move are all things they learn exclusively through text. Models have picked up on some of this, but in a clearly pattern-based way. Even physically simple actions often confuse their ability to locate objects, keep positions sensible, and represent object proportions homogeneously. These are very immersion-breaking problems: if I swing a sword through the villain, I expect them to be cut. As you can see, this combines two problems, an understanding of physical actions and an understanding of their consequences. Luckily, for most tasks these models have picked up some ability to fake their way through, but I have made a conscious effort to select for this more than for other features.

Personality, Characterization, Separability

For a basic two-character roleplay, you expect each character to maintain a somewhat coherent persona. This is very difficult to shape with language models while keeping all other faculties alive, and it can result in dumber characters simply because they're acting in a role. A separate issue is that introducing more characters tends to blend their personalities together over time. This model is surprisingly good at keeping a few characters distinct, but they must have started out distinct. Two somewhat similar characters will almost always blend together after a while. This is almost certainly fixable with custom types of tunes, but it is not at all a trivial or easy fix.

The World

World consistency is very, very difficult. It is unsolvable in a true sense because of limited context windows and the models' lack of any other memory abilities, so I'll keep the scope to the world in-context. I've found that, among all things, the world tends to be quite accurate in-context provided there's a limited degree of change. This is sort of the case for most things: if a dragon destroyed the castle and you had 6K context of in-castle RP, you might find the castle is suddenly back somehow, undestroyed. This again goes back to consequences and consistency. Overall, though, I think the in-context world consistency is actually quite good, and I've found that even most small models can do this accurately.

There's a whole lot left unsaid here, but these are the main ideas. When testing models, these are the things I have in the back of my mind: how are the consistency, the world, the personalities, and so on? Other things I look for are general prose and the issues I outlined above with common LLM-isms. Selection was made to get the best of these qualities while still feeling "human-ish" and interesting.

Why not Fine-Tune instead?

Cost. I've fine-tuned over 200 small 7B and 13B models on various datasets, and I've had very good results. However, data selection is very tedious, and you often need many attempts to get what you're looking for. I would in fact love to do continued pre-training on the base model with my custom datasets, but the reality is that this is too expensive for me right now.

What else can be done?

Anthropic, the Claude people, published Scaling Monosemanticity a while back. (Similar to Abliteration: https://huggingface.co/blog/mlabonne/abliteration) It allows the manipulation of features without classic fine-tunes, via interventions. I've been conducting several experiments to see how these can be used not only to shape the model but also to dynamically change it with "feature sliders." Again, though, playing with large models is very expensive.

Blackroot/Mirai-70B-2.1