Depth Up-Scaling

#7
by mrfakename - opened

Hi,
Amazing model! Do you plan to open-source your methods?

This is a fantastic model. Could you please share how you built it, maybe in a post or a paper?

Thanks

upstage org

We will be submitting them to arXiv shortly. Thank you for your interest!

hunkim changed discussion status to closed

Thank you! Excited to see it!

Hi @hunkim , is this basically a merge of Mistral and Llama, trained on more tokens? Were the original Llama weights used, and if so, does the license apply?

upstage org

@mrfakename

We take the first 24 layers and the last 24 layers, initialized from the Mistral weights. (Yes, the license applies.) Then, we continue pre-training the depth-upscaled model.

In some sense, this can be considered depth up-scaling, while we see MoE as width up-scaling.
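For readers who want to see the stacking step concretely, here is a minimal sketch of depth up-scaling under stated assumptions: a 32-layer Mistral-7B base (`mistralai/Mistral-7B-v0.1`) loaded through the Hugging Face `transformers` API, with the first 24 and last 24 layers stacked as described above. This is an illustration, not Upstage's actual training code.

```python
# Minimal depth up-scaling sketch (illustrative; not Upstage's training code).
# Assumptions: a 32-layer decoder-only base model, stacked as the first 24
# layers followed by a copy of the last 24 layers, giving 48 layers total.
import copy

import torch
from transformers import AutoModelForCausalLM

base = AutoModelForCausalLM.from_pretrained(
    "mistralai/Mistral-7B-v0.1",  # assumed base checkpoint
    torch_dtype=torch.bfloat16,
)

layers = base.model.layers  # nn.ModuleList of 32 decoder layers

# First 24 layers (0-23) plus a deep copy of the last 24 layers (8-31).
# The middle layers 8-23 therefore appear twice; the deep copy keeps the
# two copies' parameters independent during continued pre-training.
first = [layers[i] for i in range(24)]
last = [copy.deepcopy(layers[i]) for i in range(8, 32)]

base.model.layers = torch.nn.ModuleList(first + last)
base.config.num_hidden_layers = len(base.model.layers)  # now 48
```

The continued pre-training step is what makes this viable: the raw stacked model presumably loses quality at the seam where layer 23 meets the copied layer 8, and further training lets the duplicated layers specialize again.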

@hunkim Thanks a lot for all this info. I have two questions:

  1. Did you use the first 24 layers from Llama and the last 24 layers from Mistral in the final merged model? Is there any logic behind selecting the order?
  2. What is the dataset you have used?

Thanks in advance :).
