Nvidia's org: https://huggingface.co./nvidia
Enterprise hub: https://huggingface.co./enterprise
Not from scratch, as our technique preserves most of the model weights. But you do have to continue pre-training to get most of the benefits, yes. You can read more about it in our preview paper.
We are in the process of releasing a library for replicating this easily, but are not ready to share this yet.
TLDR:
BioLORD-2023 is a series of semantic language models for the biomedical domain, capable of representing clinical concepts and sentences in a semantic space aligned with human preferences. Our new multilingual version supports 50+ languages and is further fine-tuned on 7 European languages. These models were trained contrastively and through distillation, using a corpus unifying in the same latent space the names of biomedical concepts and their descriptions. For concepts which didn't have a description written by humans in UMLS, we use the information contained in the SnomedCT knowledge graph and the capabilities of ChatGPT to generate synthetic data and improve our results.
Trained on X+EN pairs, where X represents the target language and EN stays fixed, these models specialize in both monolingual tasks and cross-lingual retrieval tasks, crossing from X to EN.
(My other thought is that you should increase the KL divergence penalty if your DPO model diverges too much from your initial model, but I think making the negative examples better is a stronger first step to take.)
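For reference, the KL penalty in DPO is the β coefficient in the loss: raising β penalizes divergence from the reference model more strongly. A minimal sketch of the per-pair loss from sequence log-probabilities (the function name and signature are mine, not from any particular library):

```python
import math

def dpo_loss(pi_chosen, pi_rejected, ref_chosen, ref_rejected, beta=0.1):
    # Per-pair DPO loss from sequence log-probs under the policy (pi_*)
    # and the frozen reference model (ref_*). beta scales the implicit
    # KL penalty: larger beta keeps the policy closer to the reference.
    margin = beta * ((pi_chosen - ref_chosen) - (pi_rejected - ref_rejected))
    # -log sigmoid(margin), written as log1p(exp(-margin)) for stability
    return math.log1p(math.exp(-margin))
```

The loss shrinks as the policy's preference margin for the chosen answer grows, and β controls how hard each unit of margin pulls the policy away from the reference.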
Not an expert, but I think you should create your negative examples in a way where the first few tokens are not enough to differentiate between good and bad.
One easy way to do this would be to first sample the GPT-4 examples, then keep n tokens (with n sampled from 0 to the length of the answer), then generate the rest of the answer with the other (worse) model.
That way, the DPO model cannot just ignore everything after the first few tokens, because the branching can happen at any point.
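The construction above can be sketched in a few lines. `weak_continue` is a hypothetical hook standing in for whatever weaker model you use to finish the answer; everything here is illustrative, not an existing API.

```python
import random

def make_hard_negative(strong_tokens, weak_continue, rng=None):
    # Build a DPO "rejected" answer whose first n tokens match the
    # "chosen" (strong-model) answer, so the preference cannot be read
    # off from the opening tokens alone.
    #
    # strong_tokens: the chosen answer as a list of tokens.
    # weak_continue(prefix): asks the weaker model to finish the answer
    #   given that prefix (swap in your own generator).
    rng = rng or random.Random()
    n = rng.randint(0, len(strong_tokens))  # branch point, anywhere in 0..len
    prefix = strong_tokens[:n]
    return prefix + weak_continue(prefix)
```

Because n is resampled per pair, the divergence point is uniformly spread over the answer, which is exactly what forces the reward signal to depend on the whole sequence.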