---
license: apache-2.0
language:
- en
inference: false
---

# Model Card for TinyMixtral-x8-Clonebase-7b

This model is based on [TinyLlama-1.1B](https://huggingface.co/TinyLlama/TinyLlama-1.1B-intermediate-step-1195k-token-2.5T). The base model was converted to a Mistral model, and clones of it were then placed in a Mixtral model.

**This model was created experimentally for training a small Mixtral.**

# How it was made

First, since TinyLlama is a Llama model, I converted it to a Mistral model.

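A minimal sketch of what such a conversion might look like, assuming the Llama and Mistral implementations in transformers share the same parameter names, so the weights are reused unchanged and only the config metadata is rewritten. The function name `llama_config_to_mistral`, the toy config values, and the `sliding_window` default are illustrative assumptions, not taken from the notebook.

```python
import copy


def llama_config_to_mistral(llama_cfg: dict, sliding_window: int = 4096) -> dict:
    """Return a Mistral-style config dict derived from a Llama-style one.

    Assumption: tensors are kept as-is, only config fields change.
    """
    cfg = copy.deepcopy(llama_cfg)  # leave the input config untouched
    cfg["model_type"] = "mistral"
    cfg["architectures"] = ["MistralForCausalLM"]
    # sliding_window is a Mistral config field; the value here is a placeholder.
    cfg.setdefault("sliding_window", sliding_window)
    return cfg


# Toy Llama-style config; the sizes are illustrative only.
llama_cfg = {
    "model_type": "llama",
    "architectures": ["LlamaForCausalLM"],
    "hidden_size": 2048,
    "num_hidden_layers": 22,
}

mistral_cfg = llama_config_to_mistral(llama_cfg)
print(mistral_cfg["model_type"], mistral_cfg["architectures"])
```
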
After that, I cloned the FFN part and used the copies as the experts.

Since all experts hold the same tensors, the performance does not change.

All gates have the same value.

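Why identical experts leave the output unchanged can be checked with a toy example. The sketch below is plain Python, not the real Mixtral code: `ffn` is a stand-in for one feed-forward block, and `moe` is a simplified top-k router whose weights are normalized to sum to 1. Under that assumption, a weighted sum of identical expert outputs equals the single FFN output, whatever the router picks.

```python
import math


def ffn(x):
    """Stand-in for one feed-forward block (any deterministic function works)."""
    return [2.0 * v + 1.0 for v in x]


def moe(x, router_logits, experts, top_k):
    """Simplified top-k mixture of experts with softmax weights over the chosen experts."""
    chosen = sorted(range(len(experts)), key=lambda i: router_logits[i], reverse=True)[:top_k]
    exps = [math.exp(router_logits[i]) for i in chosen]
    total = sum(exps)
    weights = [e / total for e in exps]  # weights sum to 1
    out = [0.0] * len(x)
    for w, i in zip(weights, chosen):
        y = experts[i](x)
        out = [o + w * v for o, v in zip(out, y)]
    return out


x = [0.5, -1.0, 3.0]
experts = [ffn] * 8  # eight clones of the same FFN
logits = [0.3, -1.2, 0.8, 0.1, 2.0, -0.5, 0.0, 1.1]  # arbitrary router scores

dense = ffn(x)
for k in (1, 2, 8):
    sparse = moe(x, logits, experts, top_k=k)
    same = all(math.isclose(a, b, abs_tol=1e-12) for a, b in zip(sparse, dense))
    print(f"top_k={k}: matches single FFN = {same}")
```

Once the experts are trained separately and diverge, this equality no longer holds, which is the point of using the clone as a starting checkpoint.
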
# How To Convert

Use a Colab CPU runtime with high memory.

This model was created with experts=8, but since the experts are clones, you can create as many as you like.

[tinyllama_to_mixtral_clonebase.ipynb](https://huggingface.co/mmnga/TinyMixtral-x8-Clonebase-7b)

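The cloning step can be pictured as a renaming of state-dict keys. The sketch below is not the notebook's code. It assumes the Mixtral layout used by transformers where each expert stores `w1`, `w2`, `w3` under `block_sparse_moe.experts.{n}` (taken here to correspond to `gate_proj`, `down_proj`, `up_proj`) and the router lives at `block_sparse_moe.gate`. Other transformers versions may store experts differently. `clone_to_experts` and `make_router_weight` are hypothetical names, and the toy values are strings standing in for tensors.

```python
import re

# Assumed mapping from Mistral MLP projections to Mixtral expert weights.
PROJ_TO_EXPERT = {"gate_proj": "w1", "down_proj": "w2", "up_proj": "w3"}
MLP_KEY = re.compile(r"^(model\.layers\.\d+)\.mlp\.(gate_proj|down_proj|up_proj)\.weight$")


def clone_to_experts(state_dict, num_experts, make_router_weight):
    """Build a Mixtral-style state dict by copying each MLP weight into every expert.

    make_router_weight(layer_prefix) is a hypothetical callback returning the
    router ("gate") weight for one layer.
    """
    out = {}
    routers_done = set()
    for key, tensor in state_dict.items():
        match = MLP_KEY.match(key)
        if match is None:
            out[key] = tensor  # attention, norms, embeddings are kept as-is
            continue
        prefix, proj = match.groups()
        for n in range(num_experts):
            # With real tensors, use tensor.clone() so experts can diverge in training.
            out[f"{prefix}.block_sparse_moe.experts.{n}.{PROJ_TO_EXPERT[proj]}.weight"] = tensor
        if prefix not in routers_done:
            out[f"{prefix}.block_sparse_moe.gate.weight"] = make_router_weight(prefix)
            routers_done.add(prefix)
    return out


# Toy one-layer state dict with string placeholders instead of tensors.
toy = {
    "model.layers.0.self_attn.q_proj.weight": "Wq",
    "model.layers.0.mlp.gate_proj.weight": "Wg",
    "model.layers.0.mlp.up_proj.weight": "Wu",
    "model.layers.0.mlp.down_proj.weight": "Wd",
}
converted = clone_to_experts(toy, num_experts=4, make_router_weight=lambda prefix: "router")
for name in sorted(converted):
    print(name)
```

Because `num_experts` is just a loop bound, the same sketch works for any expert count. The config would also need matching `num_local_experts` and `num_experts_per_tok` values.
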
# Usage

~~~bash
pip install transformers --upgrade
pip install flash_attn
~~~

~~~python
from transformers import AutoTokenizer, MixtralForCausalLM
import torch

model_name_or_path = "mmnga/TinyMixtral-x8-Clonebase-7b"

tokenizer = AutoTokenizer.from_pretrained(model_name_or_path)
model = MixtralForCausalLM.from_pretrained(model_name_or_path, device_map="auto")

# Number of experts routed per token: try 1 or 2.
# Some transformers versions may read this value when the model is built;
# in that case pass num_experts_per_tok=... to from_pretrained instead.
model.config.num_experts_per_tok = 2

# Chat-style input message
messages = [
    {"role": "user", "content": "Tell me what's for dinner tonight."},
]

with torch.no_grad():
    # Tokenize the conversation with the tokenizer's chat template
    token_ids = tokenizer.apply_chat_template(messages, return_tensors="pt")
    output_ids = model.generate(
        token_ids.to(model.device),
        temperature=0.5,
        do_sample=True,
        top_p=0.95,
        top_k=40,
        max_new_tokens=128,
        repetition_penalty=1.5,
    )

# Decode only the newly generated tokens, skipping the prompt
output = tokenizer.decode(output_ids[0][token_ids.size(1):])
print(output)
~~~