Fine-tuning for a Dutch forum

#21
by Ziizu - opened

Hello there,

Firstly, I want to commend you on your outstanding benchmark scores and the quality of information you generously share.

I am reaching out for guidance on the most suitable fine-tuning method before I embark on the process myself, and I am more than willing to share my results with you in exchange for feedback.

My specific goal is to further fine-tune an e5 model to enhance its capabilities in the Dutch language. I'm working on a RAG system for a forum with over 200k comments across a multitude of topics. I can generate weakly supervised datasets from Dutch news articles, papers, and forums. Additionally, I can obtain supervised datasets such as Dutch SQuAD (https://gitlab.com/niels.rouws/dutch-squad-v2.0), among others.

My specific aim is to train an embedding model to return relevant document chunks given a query or unlabelled sub-topic. I have the following questions:

  1. How should I approach the fine-tuning process: do you recommend pre-training + fine-tuning, or fine-tuning only?
  2. If only fine-tuning is required: section 4.2 of your paper mentions that the MS-MARCO and NQ formats would be ideal formats to convert my corpus into. Is my understanding correct, and are there any other considerations or factors I should take into account when building my tuning set?
  3. Which model do you recommend I start with: this multilingual model or the unsupervised bases?

Hi @Ziizu ,

Thanks for the interest in this model.

About your questions:

  1. How should I approach the fine-tuning process: do you recommend pre-training + fine-tuning, or fine-tuning only?
    The multilingual-e5-* models have already gone through extensive multilingual pre-training, so fine-tuning alone should get you good performance. Pre-training + fine-tuning involves a more complicated pipeline, and the improvements are likely marginal.

  2. If only fine-tuning is required: section 4.2 of your paper mentions that the MS-MARCO and NQ formats would be ideal formats to convert my corpus into. Is my understanding correct, and are there any other considerations or factors I should take into account when building my tuning set?
    Yes, your understanding is correct. You need to prepare queries, positive documents, and hard negative documents for training (see the sketch after this list).

  3. Which model do you recommend I start with: this multilingual model or the unsupervised bases?
    Do not use the unsupervised models; they are English-only. Please use the multilingual-e5-* models and choose the size (small / base / large) based on your needs.
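
For concreteness, here is a minimal sketch of what the data and training loop could look like with the sentence-transformers library. This is not the official E5 training recipe; the file name, JSONL layout, and hyperparameters are illustrative assumptions, and the "query: " / "passage: " prefixes are the ones the multilingual-e5-* models expect.

```python
# Minimal fine-tuning sketch (not the official E5 recipe), using sentence-transformers.
# Assumes an MS-MARCO-style JSONL file (hypothetical name), one example per line:
#   {"query": "...", "positive": "...", "negatives": ["...", "..."]}
import json
from torch.utils.data import DataLoader
from sentence_transformers import SentenceTransformer, InputExample, losses

model = SentenceTransformer("intfloat/multilingual-e5-base")

train_examples = []
with open("train_nl.jsonl") as f:
    for line in f:
        ex = json.loads(line)
        # E5 models expect "query: " / "passage: " prefixes on their inputs.
        train_examples.append(InputExample(texts=[
            "query: " + ex["query"],
            "passage: " + ex["positive"],
            "passage: " + ex["negatives"][0],  # one hard negative per example
        ]))

loader = DataLoader(train_examples, shuffle=True, batch_size=32)
# In-batch negatives plus the provided hard negative (contrastive, InfoNCE-style loss).
loss = losses.MultipleNegativesRankingLoss(model)

model.fit(
    train_objectives=[(loader, loss)],
    epochs=1,
    warmup_steps=100,
    output_path="multilingual-e5-base-dutch",
)
```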

Best,
Liang

Hi @Ziizu ,

May I ask how you came to choose e5? I have been following it for a while as well. The results on Hugging Face are certainly impressive, but in my testing, I get better results from RoBERTa. What am I missing?
Especially for Retrieval Augmented Generation, I would have thought a larger LLM would be better. If we stay in the Microsoft universe, would PHI-2 (launched at Microsoft Ignite: https://the-decoder.com/microsofts-tiny-but-mighty-phi-2-shows-dramatic-improvements) be a good alternative?

Thank you,
Lars

Could you share the script for fine-tuning multilingual-e5?

Hey @larsskaug

We chose e5 based on the benchmark scores too; the other motivating factor was that the fine-tuning process is well laid out and fits the data we have available, meaning there's a clear route to improvement for our use case.

At the time of posting, I had not yet methodically benchmarked e5 or other models, but I share your experience of seeing differences between benchmark scores and real-world performance, which is why we've been developing our own benchmark/test that reflects our use case.
We are now in the process of collecting baseline scores for the models we're interested in (e5, GTE and Instructor); after that, we will attempt to fine-tune them and compare again.

Regarding RAG: from my understanding, PHI-2 is a generative pre-trained transformer (GPT) model, primarily used to generate textual output, whereas E5 (and the other models on the MTEB leaderboard) are embedding models that generate rich embeddings. These embeddings can be used for a variety of tasks such as information retrieval, semantic textual similarity, text re-ranking, etc. In the context of a RAG system, an embedding model is primarily used to retrieve (i.e. filter and pass) relevant information to a GPT model, which then generates a textual response based on the information the embedding model has filtered.

We use Mistral and GPT-3.5/4 for our generator, but any half-decent GPT will give good-quality answers, so I'm sure PHI-2 will give sufficient answers for most use cases. The biggest impact on the answer quality of a RAG system is the quality of the information passed to the generator, which depends on the embedding model.
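
To make that concrete, here is a rough sketch of the retrieval side; the chunks and the Dutch query are made-up placeholders, and the final generator call is left as a plain prompt string:

```python
# Rough sketch of the retrieval half of a RAG pipeline (placeholder data).
from sentence_transformers import SentenceTransformer, util

embedder = SentenceTransformer("intfloat/multilingual-e5-base")

# Forum comments, chunked beforehand; E5 expects the "passage: " prefix.
chunks = ["passage: " + c for c in ["chunk 1 ...", "chunk 2 ...", "chunk 3 ..."]]
chunk_emb = embedder.encode(chunks, normalize_embeddings=True)

# Embed the user query with the "query: " prefix.
query = "Hoe vaak moet ik mijn cv-ketel laten onderhouden?"
query_emb = embedder.encode(["query: " + query], normalize_embeddings=True)

# Cosine similarity (dot product on normalized vectors); keep the top-k chunks.
hits = util.semantic_search(query_emb, chunk_emb, top_k=2)[0]
context = "\n".join(chunks[h["corpus_id"]] for h in hits)

# The retrieved context is what gets passed to the generator (GPT-3.5/4, Mistral, PHI-2, ...).
prompt = f"Answer using only this context:\n{context}\n\nQuestion: {query}"
```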

Feel free to correct anything mentioned as I'm still getting up to speed on this stuff :)

Extremely interesting topic. @Ziizu, as some time has passed, have you had any success with your plans?

Am I misunderstanding, or is your plan to have an architecture like Vector DB - E5 model - GPT-3.5/4 linked together?
Any success in creating this pipeline?
I am using llama-index, and for English content it works very well (with Mixtral 8x7B Instruct as the LLM - I am using the hosted version on together.ai, which has unbeatable price/performance), but when I tried it with Croatian content, this setup completely falls apart.
Looking forward to hearing about your experience.
Thanx,
D

Your understanding of my plan is correct: my architecture is a standard Retrieval-Augmented Generation (RAG) pipeline (vector DB - embedding model - GPT model). You can look it up online to figure out how to link the pieces together; a rough sketch is also included below.
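
For what it's worth, here is a very rough sketch of how the pieces could be wired together, with FAISS standing in for the vector DB; the corpus, the question, and the prompt are placeholders, and the actual generator call (GPT-3.5/4, Mixtral via together.ai, llama-index, ...) is left out:

```python
# Very rough sketch: embedding model -> vector DB (FAISS here) -> prompt for the generator.
import faiss
import numpy as np
from sentence_transformers import SentenceTransformer

embedder = SentenceTransformer("intfloat/multilingual-e5-base")

# 1) Index the corpus once (placeholder chunks; in practice, your chunked forum comments).
forum_chunks = ["eerste stukje forumtekst ...", "tweede stukje forumtekst ..."]
passages = ["passage: " + c for c in forum_chunks]
emb = embedder.encode(passages, normalize_embeddings=True)
index = faiss.IndexFlatIP(emb.shape[1])  # inner product = cosine on normalized vectors
index.add(np.asarray(emb, dtype="float32"))

# 2) At query time: embed the question and fetch the top-k chunks.
question = "Wat is hier het beste antwoord op?"  # placeholder question
q_emb = embedder.encode(["query: " + question], normalize_embeddings=True)
_, ids = index.search(np.asarray(q_emb, dtype="float32"), k=2)
context = "\n\n".join(passages[i] for i in ids[0])

# 3) Build the prompt and hand it to whichever generator you use.
prompt = f"Beantwoord de vraag op basis van de context.\n\nContext:\n{context}\n\nVraag: {question}"
```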

I have created the pipeline, and it works great with the Dutch fine-tuned embedding model + GPT-3.5/4. However, I have had a similar experience to yours: performance drops significantly when I try to use a Mixtral/Mistral model.

I'm going to attempt to follow this guide on fine-tuning open-source models; the steps should also work for fine-tuning a Croatian Mixtral model.

Hello everyone, this subject is really interesting.
I'm wondering how much data would be needed to further fine-tune the model.

@Ziizu How many pairs did you provide for the further fine-tuning, and did you achieve better performance in Dutch without degrading any other performance (other languages or query/passage similarity)?

In my case, I'd like to specialise the model in a specific vocabulary domain. Any idea how many training pairs are needed? I find it hard to find information on the web.

Have a good day.
