Mistral-Nemo-NT-Ko-12B-dpo

Description

Mistral-Nemo-NT-Ko-12B-dpo is a shallowly DPO-trained version of werty1248/Mistral-Nemo-NT-Ko-12B-sft.

According to the Hermes 3 Technical Report, DPO yielded only negligible performance improvements for their models. I therefore followed the approach described in the report and applied DPO with LoRA adapters rather than full fine-tuning.

LoRA r = 32
LoRA alpha = 16
lr = 3e-6
neftune alpha = 5
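
As a quick illustration of what these LoRA values imply: in the standard LoRA formulation, the low-rank update is multiplied by alpha / r before it is added to the frozen weight. The sketch below assumes that standard scaling (not a variant such as rsLoRA); the function name is hypothetical.

```python
def lora_scaling(lora_alpha: float, lora_r: int) -> float:
    """Scaling factor applied to the low-rank update in standard LoRA (alpha / r)."""
    return lora_alpha / lora_r


# Values taken from the hyperparameters listed above.
scale = lora_scaling(lora_alpha=16, lora_r=32)
print(scale)  # 0.5
```

With alpha smaller than r, the adapter update is scaled down, which fits the "shallow" DPO stage described here.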

The datasets used are as follows:

(En) HuggingFaceH4/ultrafeedback_binarized
(Ko, translated from En) sionic/ko-dpo-mix-7k-translation-exclude
(Ko, translated from En) kuotient/orca-math-korean-dpo-pairs
(Zh) zake7749/kyara-chinese-preference-rl-dpo-s0-30K

I've been looking for native Korean/Japanese DPO datasets, but haven't found any whose quantity and quality I'm personally satisfied with.

From each dataset, I sampled a subset based on the scores assigned by a reward model. In the end, I used about 13K training samples per language.
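
The exact selection rule is not stated above. As a rough sketch of one possible rule, the snippet below keeps the preference pairs with the largest gap between the reward-model scores of the chosen and rejected responses. The field names (`chosen_score`, `rejected_score`), the function name, and the toy data are all assumptions for illustration, not the actual pipeline.

```python
from typing import Dict, List


def select_by_reward_margin(pairs: List[Dict], k: int) -> List[Dict]:
    """Keep the k pairs with the largest (chosen - rejected) reward-model score gap."""
    ranked = sorted(
        pairs,
        key=lambda p: p["chosen_score"] - p["rejected_score"],
        reverse=True,
    )
    return ranked[:k]


# Toy data with hypothetical scores; real scores would come from a reward model.
pairs = [
    {"prompt": "a", "chosen_score": 0.9, "rejected_score": 0.1},
    {"prompt": "b", "chosen_score": 0.6, "rejected_score": 0.5},
    {"prompt": "c", "chosen_score": 0.8, "rejected_score": 0.3},
]
subset = select_by_reward_margin(pairs, k=2)
print([p["prompt"] for p in subset])  # ['a', 'c']
```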

Features

The base model supports a context length of 128K, while I fine-tuned this model with an 8K context size.
This model works well for multi-turn conversations, and tends to strongly reflect the previous conversation.

Evaluation

LogicKor

Cot-1-shot

Model	Method	Reasoning	Math	Writing	Coding	Understanding	Grammar	Single-turn	Multi-turn	Overall
Mistral-Nemo-NT-Ko-12B-sft	cot-1-shot	7.36	6.57	8.71	8.57	9.57	6.43	7.81	7.93	7.87
Mistral-Nemo-NT-Ko-12B-dpo	cot-1-shot	6.79	6.43	9.43	9.79	9.43	5.29	7.71	8.00	7.86
Mistral Nemo	cot-1-shot	5.43	6.86	6.07	7.57	5.86	7.57	7.50	5.62	6.56

1-shot

Model	Method	Reasoning	Math	Writing	Coding	Understanding	Grammar	Single-turn	Multi-turn	Overall
Mistral-Nemo-NT-Ko-12B-dpo	1-shot	8.14	5.50	9.36	8.57	9.50	4.71	7.38	7.88	7.63
Mistral-Nemo-NT-Ko-12B-sft	1-shot	9.00	5.71	7.93	8.29	7.93	5.21	7.29	7.40	7.35
Mistral Nemo	1-shot	5.00	6.50	6.86	8.07	7.64	8.43	7.60	6.57	7.08

Default

Model	Method	Reasoning	Math	Writing	Coding	Understanding	Grammar	Single-turn	Multi-turn	Overall
Mistral-Nemo-NT-Ko-12B-dpo	default	6.21	5.79	8.00	8.36	9.43	5.43	7.17	7.24	7.20
Mistral-Nemo-NT-Ko-12B-sft	default	6.00	4.93	5.43	7.14	9.71	4.00	6.45	5.95	6.20
Mistral Nemo	default	0.43	7.64	6.21	7.14	6.79	7.21	6.26	5.55	5.90

Language-Confusion

Model	Language	Monolingual-LPR	Monolingual-WPR	Crosslingual-LPR	Crosslingual-WPR
Mistral-Nemo-NT-Ko-12B-dpo	ko	100.00%	97.96%	85.63%	96.93%
Mistral-Nemo-NT-Ko-12B-sft	ko	100.00%	99.00%	87.51%	96.96%
Mistral-Nemo-Instruct-2407	ko	90.72%	93.18%	46.75%	92.84%
Meta-Llama-3.1-8B-Instruct	ko	99.00%	96.97%	91.45%	93.01%
gemma-2-9b-it	ko	100.00%	98.00%	87.93%	95.58%
---	---	---	---	---	---
Mistral-Nemo-NT-Ko-12B-dpo	zh	99.00%	99.50%	80.52%	97.51%
Mistral-Nemo-Instruct-2407	zh	97.50%	98.98%	53.43%	93.58%
---	---	---	---	---	---
Mistral-Nemo-NT-Ko-12B-dpo	ja	100.00%	100.00%	86.89%	95.41%
Mistral-Nemo-Instruct-2407	ja	94.00%	98.94%	50.27%	96.05%

Template

<|im_start|>system
You are a helpful AI assistant.<|im_end|>
<|im_start|>user
{prompt}<|im_end|>
<|im_start|>assistant

I trained Mistral-Nemo-NT-Ko-12B with various system prompts from dozens of datasets. You can chat with or without your own system prompt.
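
The template above can be assembled by hand as in the following minimal sketch. The helper name `build_chatml_prompt` is hypothetical; if the tokenizer shipped with the model includes a chat template, `tokenizer.apply_chat_template` from transformers would normally be the preferred route.

```python
from typing import Dict, List, Optional


def build_chatml_prompt(messages: List[Dict[str, str]], system: Optional[str] = None) -> str:
    """Format a conversation in the ChatML layout shown above.

    The system block is emitted only when a system prompt is given, and the
    prompt ends with an open assistant turn for the model to complete.
    """
    parts = []
    if system is not None:
        parts.append(f"<|im_start|>system\n{system}<|im_end|>\n")
    for message in messages:
        parts.append(f"<|im_start|>{message['role']}\n{message['content']}<|im_end|>\n")
    parts.append("<|im_start|>assistant\n")
    return "".join(parts)


prompt = build_chatml_prompt(
    [{"role": "user", "content": "Hello"}],
    system="You are a helpful AI assistant.",
)
print(prompt)
```

For multi-turn use, earlier assistant replies are passed as additional messages with the role `assistant`.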

Dataset

zake7749/kyara-chinese-preference-rl-dpo-s0-30K
sionic/ko-dpo-mix-7k-trl-style
kuotient/orca-math-korean-dpo-pairs
HuggingFaceH4/ultrafeedback_binarized

Training Details

GPU: 2xA100
epoch: 1
total batch size: 32
learning rate: 3e-6
neftune_noise_alpha: 5
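
The total batch size follows from the GPU count together with the `micro_batch_size` and `gradient_accumulation_steps` values in the axolotl config. A small sketch of the arithmetic (the function name is hypothetical):

```python
def effective_batch_size(num_gpus: int, micro_batch_size: int, grad_accum_steps: int) -> int:
    """Number of samples contributing to each optimizer step under data parallelism."""
    return num_gpus * micro_batch_size * grad_accum_steps


# 2 x A100, micro_batch_size: 1, gradient_accumulation_steps: 16
total = effective_batch_size(num_gpus=2, micro_batch_size=1, grad_accum_steps=16)
print(total)  # 32
```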

See axolotl config

axolotl version: 0.4.1

base_model: werty1248/Mistral-Nemo-NT-Ko-12B-sft
model_type: MistralForCausalLM
tokenizer_type: AutoTokenizer

load_in_8bit: false
load_in_4bit: false
strict: false

adapter: lora
lora_model_dir:
lora_r: 32
lora_alpha: 16
lora_dropout: 0.05
lora_target_linear: true
lora_fan_in_fan_out:

dpo_beta: 0.1
rl: dpo

datasets:
  - path: werty1248/NT-dpo
    split: train
    type: chatml.prompt_pairs

dataset_prepared_path: /workspace/data/prepared_datasets
output_dir: /workspace/data
save_steps: 500

sequence_len: 8192
sample_packing: false
pad_to_sequence_len: true
gradient_accumulation_steps: 16
micro_batch_size: 1
num_epochs: 1
optimizer: rmsprop
weight_decay: 0.0
learning_rate: 0.000003
lr_scheduler: linear
neftune_noise_alpha: 5

train_on_inputs: false
group_by_length: false

#wandb_project:
#wandb_entity:
#wandb_watch:
#wandb_name:
#wandb_log_model:

bf16: true
fp16: false
tf32: false

gradient_checkpointing: true
flash_attention: true
warmup_steps: 9

eval_steps:
val_set_size: 0
early_stopping_patience:
logging_steps: 1

special_tokens:
  pad_token: <pad>

reward margin
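
For context, the reward margin usually reported during DPO training is the difference between the implicit rewards of the chosen and rejected responses, where each implicit reward is beta times the log-probability ratio between the policy and the reference model. The sketch below illustrates that standard definition with `dpo_beta: 0.1` from the config; the function names and the log-probability values are made up for illustration and are not taken from this training run.

```python
import math


def dpo_reward_margin(
    policy_chosen_logp: float,
    policy_rejected_logp: float,
    ref_chosen_logp: float,
    ref_rejected_logp: float,
    beta: float = 0.1,
) -> float:
    """Implicit reward of the chosen response minus that of the rejected one."""
    chosen_reward = beta * (policy_chosen_logp - ref_chosen_logp)
    rejected_reward = beta * (policy_rejected_logp - ref_rejected_logp)
    return chosen_reward - rejected_reward


def dpo_loss(margin: float) -> float:
    """Per-pair sigmoid DPO loss, -log(sigmoid(margin)), computed stably."""
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))


# Hypothetical sequence log-probabilities for a single preference pair.
margin = dpo_reward_margin(-10.0, -14.0, -12.0, -12.0, beta=0.1)
loss = dpo_loss(margin)
print(round(margin, 3), round(loss, 3))
```

A positive margin means the policy prefers the chosen response more strongly than the reference model does, and the loss drops below log 2 as the margin grows.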
