---
license: apache-2.0
language:
- en
inference: false
---

# Model Card for TinyMixtral-x8-Clonebase-7b

This model is based on [TinyLlama-1.1B](https://huggingface.co/TinyLlama/TinyLlama-1.1B-intermediate-step-1195k-token-2.5T). It was converted to a Mistral model, and clones of its FFN were then placed into a Mixtral model as experts.
**This model was created experimentally for training a small Mixtral.**

# How it was made

First, since TinyLlama is a Llama model, I converted it to a Mistral model.
After that, I cloned the FFN part of each layer and used the clones as experts.
Since all experts hold the same tensors, the performance does not change.
All gates have the same value.

# How To Convert

Use a Colab CPU runtime with high memory.
This model was created with experts=8, but since every expert is a clone, you can create as many experts as you like.

[tinyllama_to_mixtral_clonebase.ipynb](https://huggingface.co/mmnga/TinyMixtral-x8-Clonebase-7b)

# Usage

~~~bash
pip install transformers --upgrade
pip install flash_attn
~~~

~~~python
from transformers import AutoTokenizer, MixtralForCausalLM
import torch

model_name_or_path = "mmnga/TinyMixtral-x8-Clonebase-7b"
tokenizer = AutoTokenizer.from_pretrained(model_name_or_path)
model = MixtralForCausalLM.from_pretrained(model_name_or_path, device_map="auto")

# Number of experts routed per token: 1 or 2
model.config.num_experts_per_tok = 2

# Chat-style input message
messages = [
    {"role": "user", "content": "Tell me what's for dinner tonight."},
]

with torch.no_grad():
    # Tokenize the conversation using the tokenizer's chat template
    token_ids = tokenizer.apply_chat_template(messages, return_tensors="pt")
    output_ids = model.generate(
        token_ids.to(model.device),
        temperature=0.5,
        do_sample=True,
        top_p=0.95,
        top_k=40,
        max_new_tokens=128,
        repetition_penalty=1.5
    )

# Decode only the newly generated tokens, skipping the prompt
output = tokenizer.decode(output_ids[0][token_ids.size(1):])
print(output)
~~~
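
The cloning step described under "How it was made" can be sketched as a remapping of state-dict keys. This is a minimal illustration, not the notebook's actual code. The function name `clone_ffn_to_experts` is hypothetical, and the key names (`mlp.gate_proj` / `down_proj` / `up_proj` mapped to `block_sparse_moe.experts.N.w1` / `w2` / `w3`) are an assumption based on the Hugging Face Transformers checkpoint layout, which may differ between library versions. The router weights (`block_sparse_moe.gate.weight`) are not created here; the text above only states that all gates have the same value.

```python
import re

# Assumed mapping from Llama/Mistral MLP projections to Mixtral expert weights.
MLP_TO_EXPERT = {"gate_proj": "w1", "down_proj": "w2", "up_proj": "w3"}

# Matches per-layer MLP weights such as "model.layers.3.mlp.up_proj.weight".
MLP_KEY = re.compile(
    r"^(model\.layers\.\d+)\.mlp\.(gate_proj|down_proj|up_proj)\.weight$"
)


def clone_ffn_to_experts(state_dict, num_experts=8):
    """Return a new dict in which every MLP weight is repeated once per expert.

    Hypothetical helper: values are passed through untouched, so all experts
    reference the same tensor object. Non-MLP entries are copied as-is.
    """
    new_state = {}
    for name, tensor in state_dict.items():
        match = MLP_KEY.match(name)
        if match is None:
            # Attention, embeddings, norms, etc. keep their original keys.
            new_state[name] = tensor
            continue
        layer_prefix, proj = match.groups()
        for expert in range(num_experts):
            key = (
                f"{layer_prefix}.block_sparse_moe.experts.{expert}."
                f"{MLP_TO_EXPERT[proj]}.weight"
            )
            new_state[key] = tensor
    return new_state
```

Because the helper only renames keys, changing `num_experts` is enough to produce a checkpoint with a different number of cloned experts. Before saving real weights you would likely want to copy each tensor (for example with `tensor.clone()`) so the experts can diverge during training.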