Smaller version to ease implementation experiments?

#12
by compilade - opened

Hi. I've worked on implementing Mamba support in llama.cpp before (see https://github.com/ggerganov/llama.cpp/pull/5328), and I'd like to eventually implement support for Jamba too.

However, for my hardware, this model is too big for quick experimentation, so I'd really appreciate it if you'd also release a smaller model with the same architecture. It doesn't need to be good (though some coherency is preferred). Ideally a Jamba model with less than 1B parameters would help a lot with this, if possible.

I second this. Loading the weights takes a really long time. A light version (pruned, perhaps?) would be great for quick test iterations, even if its output is not useful at all.

I trained a Jamba architecture model with some code data. It's very small and has some basic code generation capabilities. Might be useful for this.
https://huggingface.co./TechxGenus/Mini-Jamba

Nice! Unfortunately, there seems to be no Mamba+MoE layer(s) in your model. I only see Mamba+MLP layers alternated with Attention+MoE layers. The attn_layer_offset and attn_layer_period keys in config.json differ from those in the official Jamba-v0.1 model, and might have caused this at training time, I guess?

Ah, this is because I set expert_layer_offset and expert_layer_period to the same values as attn_layer_offset and attn_layer_period. For this version, I wanted to first test the results of using MoE only in the Attention layers.

Later, I will make a new version that includes all four combinations: Mamba+MoE, Mamba+MLP, Attention+MoE, and Attention+MLP.
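
For anyone following along, here is a minimal sketch of how the four config keys discussed above could determine each layer's type. It assumes a simple modulo rule (layer i is Attention when i % attn_layer_period == attn_layer_offset, and uses MoE when i % expert_layer_period == expert_layer_offset). The function name layer_layout and all numeric values are hypothetical and chosen only for illustration; they are not taken from either model's config.json.

```python
def layer_layout(num_layers, attn_period, attn_offset, expert_period, expert_offset):
    """Label each layer as '<mixer>+<feed-forward>' under the assumed modulo rule."""
    layout = []
    for i in range(num_layers):
        # Assumption: a layer is Attention when its index hits the attention offset.
        mixer = "Attention" if i % attn_period == attn_offset else "Mamba"
        # Assumption: a layer uses MoE when its index hits the expert offset.
        ffn = "MoE" if i % expert_period == expert_offset else "MLP"
        layout.append(f"{mixer}+{ffn}")
    return layout


# Hypothetical values: expert_* set equal to attn_*, so MoE lands only on Attention layers.
same = layer_layout(4, attn_period=2, attn_offset=1, expert_period=2, expert_offset=1)
print(same)

# Hypothetical values: different periods, so all four combinations appear.
mixed = layer_layout(6, attn_period=3, attn_offset=1, expert_period=2, expert_offset=1)
print(mixed)
```

With the first set of values, every Attention layer is also an MoE layer and every Mamba layer gets a plain MLP, which matches the pattern described above. Giving the two periods different values is one way to get Mamba+MoE and Attention+MLP layers as well.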

Hi, we uploaded this version for debugging and development purposes (random weights, no training whatsoever):
https://huggingface.co./ai21labs/Jamba-tiny-random
