Can you incorporate madlad400 training data?

#11
by cmp-nct

I've just run a test with Google's (Apache-2.0) madlad400 10B model, and its German grammar is flawless.
Nothing I tried on any other model produced such good translations; leo-hessianai is one of the best, but it regularly makes beginner-level mistakes in its translations.

madlad runs on a rarely used architecture (T5), is barely supported anywhere, and it cannot translate more than one sentence at a time, so it's not "the solution" to translation. After all, the translation of one sentence can depend on what was said in the previous one, and madlad totally fails here.
For example: "There is a display case." "The case is red." -> "Da ist eine Vitrine." "Der Fall ist rot." (the second "case" is mistranslated as "Fall", i.e. a legal case, because the display-case context from the first sentence is lost)

I just wondered, given the flawless grammar it produces, whether the training dataset would be a good addition to your LeoLM models.
Or maybe a pure Llama-based translation model could be created, fine-tuned on translation into German (or into all the languages madlad appears to handle)?

LAION LeoLM org

Hi and thanks for your interest in our models. The chat models, as well as the Leo base models, have not been trained for translation and this was not an explicit goal. I would assume that finetuning for document level translation would get you much closer to what you are looking for.
From my experience, our Mistral-based 7B model works best for this and can be fine-tuned easily on a consumer GPU with QLoRA. If you want to go down this path, you should check out the axolotl training library. I believe a few hundred or maybe a few thousand examples should already be enough to align the model.
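If you prefer plain Python over axolotl's config files, the same recipe can be sketched directly with transformers, peft and trl. Treat this as a rough starting point rather than a tested setup: the base checkpoint, the translation_pairs.jsonl file and all hyperparameters are placeholders, and exact argument names shift a little between trl versions.

```python
# Rough QLoRA fine-tuning sketch (placeholders, not a tested config).
import torch
from datasets import load_dataset
from peft import LoraConfig
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from trl import SFTConfig, SFTTrainer

base_model = "LeoLM/leo-mistral-hessianai-7b"  # the Mistral-based 7B mentioned above

# Load the base model in 4-bit so training fits on a consumer GPU.
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)
model = AutoModelForCausalLM.from_pretrained(
    base_model, quantization_config=bnb_config, device_map="auto"
)

# A few hundred to a few thousand EN->DE pairs, e.g. one JSON object per line
# with a "text" field like "Translate to German: <source>\n<target>".
dataset = load_dataset("json", data_files="translation_pairs.jsonl", split="train")

peft_config = LoraConfig(
    r=16,
    lora_alpha=32,
    lora_dropout=0.05,
    task_type="CAUSAL_LM",
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
)

trainer = SFTTrainer(
    model=model,
    train_dataset=dataset,
    peft_config=peft_config,  # SFTTrainer wraps the model with the LoRA adapter
    args=SFTConfig(
        output_dir="leo-mistral-7b-translate-qlora",
        dataset_text_field="text",
        per_device_train_batch_size=2,
        gradient_accumulation_steps=8,
        num_train_epochs=3,
        learning_rate=2e-4,
    ),
)
trainer.train()  # the tokenizer is loaded automatically from the base checkpoint
```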
Good luck with this and feel free to share any thoughts or results :)
