Solar-Ko-Recovery-11B 🌟❤️‍🩹

Solar-Ko-Recovery-11B aims to recover Solar's Korean capability by rearranging the embeddings and LM head. It features an expanded vocabulary and is trained on a Korean+English corpus for better representation.

Model Details

Model Developers: Junbum Lee (Beomi)

Variations: Solar-Ko-Recovery is available in one parameter size: 11B (10.99B 🤣).

Input: The model accepts only text input.

Output: The model produces text output exclusively.

Model Architecture:

Solar-Ko-Recovery is an auto-regressive language model that leverages an optimized transformer architecture derived from Llama-2.

Model	Training Data	Parameters	Context Length	GQA	Tokens	Learning Rate
Solar-Ko-Recovery	A curated mix of Korean+English corpora	11B (10.99B)	4k	O	>100B*	5e-5

NOTE: Training proceeded in two steps:

1. Only the embedding layer and the LM head layer are trained.

2. All parameters are trained.
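
The two steps above can be sketched as a helper that toggles `requires_grad` per parameter. This is a minimal illustration, not the author's training code: the name fragments `embed_tokens` and `lm_head` are assumed from the usual Llama-style Hugging Face layout, and the stand-in parameter objects exist only so the sketch runs without PyTorch.

```python
from types import SimpleNamespace

# Assumed parameter-name fragments for a Llama-style Hugging Face checkpoint;
# check them against model.named_parameters() before relying on them.
EMBED_AND_HEAD = ("embed_tokens", "lm_head")


def set_trainable(named_params, stage):
    """Toggle requires_grad for a two-stage schedule.

    stage 1: only embedding and LM-head parameters are trainable.
    stage 2: every parameter is trainable.
    Returns the names of the parameters left trainable.
    """
    if stage not in (1, 2):
        raise ValueError("stage must be 1 or 2")
    trainable = []
    for name, param in named_params:
        param.requires_grad = stage == 2 or any(k in name for k in EMBED_AND_HEAD)
        if param.requires_grad:
            trainable.append(name)
    return trainable


# Stand-in parameters (hypothetical) so the sketch runs without PyTorch.
params = [
    ("model.embed_tokens.weight", SimpleNamespace(requires_grad=True)),
    ("model.layers.0.self_attn.q_proj.weight", SimpleNamespace(requires_grad=True)),
    ("lm_head.weight", SimpleNamespace(requires_grad=True)),
]

stage1 = set_trainable(params, stage=1)
stage2 = set_trainable(params, stage=2)
print(stage1)
print(stage2)
```

With a real PyTorch model, the same helper could be called as `set_trainable(model.named_parameters(), stage=1)` before building the optimizer for the first step.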

Vocab Expansion

Vocab expansion was performed on an edited version of upstage/solar-1-mini-tokenizer, which is a superset of the Solar tokenizer.

Model Name	Vocabulary Size	Description
Original Solar	32000	SentencePiece BPE
solar-1-mini-tokenizer	64000	SentencePiece BPE. Added Ko/JP vocab
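
One way to picture what rearranging embeddings for a superset vocabulary could involve is sketched below. It is a hypothetical illustration, not the procedure used for this model: tokens shared by both vocabularies keep their trained rows (possibly at a new id), and tokens that are new get a placeholder row. The mean-of-old-rows initializer and the toy vocabularies are assumptions.

```python
def remap_embeddings(old_vocab, old_rows, new_vocab):
    """Build an embedding table for new_vocab from rows trained on old_vocab.

    old_vocab / new_vocab: dict mapping token string -> id.
    old_rows: list of embedding vectors indexed by old id.
    """
    dim = len(old_rows[0])
    # Mean of the old rows: an assumed initializer for tokens with no old row.
    mean_row = [sum(row[d] for row in old_rows) / len(old_rows) for d in range(dim)]
    new_rows = [None] * len(new_vocab)
    for token, new_id in new_vocab.items():
        if token in old_vocab:
            # Shared token: copy its trained row to the new position.
            new_rows[new_id] = list(old_rows[old_vocab[token]])
        else:
            # New token: start from the placeholder row.
            new_rows[new_id] = list(mean_row)
    return new_rows


# Toy vocabularies (hypothetical); the new one is a superset with shifted ids.
old_vocab = {"<s>": 0, "▁Meet": 1, "!": 2}
old_rows = [[0.0, 0.0], [1.0, 2.0], [5.0, 4.0]]
new_vocab = {"<s>": 0, "▁안녕하세요": 1, "▁Meet": 2, "!": 3}

new_rows = remap_embeddings(old_vocab, old_rows, new_vocab)
print(new_rows)
```

The same remapping would apply to the LM head, since its rows are indexed by the same token ids.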

Tokenizing "안녕하세요, 오늘은 날씨가 좋네요."

SOLAR-10.7B: 26 tokens
Solar-Ko-Recovery: 7 tokens

Model	Tokens
SOLAR-10.7B	`['▁', '안', '<0xEB>', '<0x85>', '<0x95>', '하', '세', '요', ',', '▁', '오', '<0xEB>', '<0x8A>', '<0x98>', '은', '▁', '날', '<0xEC>', '<0x94>', '<0xA8>', '가', '▁', '좋', '네', '요', '.']`
Solar-Ko-Recovery	`['▁안녕하세요', ',', '▁오늘은', '▁날씨가', '▁좋', '네요', '.']`
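
The `<0x..>` entries in the SOLAR-10.7B row are byte-fallback tokens: a syllable missing from the vocabulary is emitted as its UTF-8 bytes, three tokens per Hangul syllable. The short sketch below reproduces those byte tokens for the three syllables that fall back in the table; the helper name `byte_fallback` is made up for illustration.

```python
def byte_fallback(piece):
    """Render a string as SentencePiece-style byte-fallback tokens."""
    return [f"<0x{b:02X}>" for b in piece.encode("utf-8")]


# The three syllables that appear as byte tokens in the SOLAR-10.7B row.
for syllable in "녕늘씨":
    print(syllable, byte_fallback(syllable))
```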

Tokenizing "Meet 10.7B Solar: Elevating Performance with Upstage Depth UP Scaling!"

SOLAR-10.7B: 22 tokens
Solar-Ko-Recovery: 22 tokens

Model	Tokens
SOLAR-10.7B	`['▁Meet', '▁', '1', '0', '.', '7', 'B', '▁Solar', ':', '▁E', 'lev', 'ating', '▁Performance', '▁with', '▁Up', 'stage', '▁Dep', 'th', '▁UP', '▁Scal', 'ing', '!']`
Solar-Ko-Recovery	`['▁Meet', '▁', '1', '0', '.', '7', 'B', '▁Solar', ':', '▁E', 'lev', 'ating', '▁Performance', '▁with', '▁Up', 'stage', '▁Dep', 'th', '▁UP', '▁Scal', 'ing', '!']`
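
A comparison like the two above could be reproduced with a small helper around any tokenizer that exposes a `tokenize` method, as Hugging Face tokenizers do. The repository ids in `load_tokenizers` are assumptions (they are not stated in this section), and that function is defined but not called, since it needs the `transformers` package and network access.

```python
def count_tokens(tokenizers, text):
    """Return {name: number of tokens} for each tokenizer in the mapping."""
    return {name: len(tok.tokenize(text)) for name, tok in tokenizers.items()}


def load_tokenizers():
    """Load both tokenizers from the Hub (repo ids are assumed, not confirmed)."""
    from transformers import AutoTokenizer

    return {
        "SOLAR-10.7B": AutoTokenizer.from_pretrained("upstage/SOLAR-10.7B-v1.0"),
        "Solar-Ko-Recovery": AutoTokenizer.from_pretrained("beomi/Solar-Ko-Recovery-11B"),
    }


# Example usage (requires transformers and network access):
# counts = count_tokens(load_tokenizers(), "안녕하세요, 오늘은 날씨가 좋네요.")
# print(counts)
```

For the Korean sentence, the counts reported above (26 versus 7) correspond to roughly 3.7 times fewer tokens, while the English sentence tokenizes identically under both.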

LICENSE

Apache 2.0

Model Benchmark

LM Eval Harness - Korean

Evaluated with EleutherAI's lm-evaluation-harness.
All scores are 5-shot.
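
A run of this kind could be configured through the harness's Python entry point `lm_eval.simple_evaluate`. The sketch below only assembles the keyword arguments. The harness version, the `hf` backend, and the repository id are assumptions; the task names are taken from the results table, and the exact settings behind the reported scores are not stated here.

```python
# Task and group names as they appear in the results table.
TASKS = [
    "haerae",
    "kmmlu_direct",
    "kobest_boolq",
    "kobest_copa",
    "kobest_hellaswag",
    "kobest_sentineg",
]


def build_eval_kwargs(repo_id, num_fewshot=5):
    """Assemble keyword arguments for lm_eval.simple_evaluate."""
    return {
        "model": "hf",  # assumed backend name (lm-evaluation-harness 0.4+)
        "model_args": f"pretrained={repo_id}",
        "tasks": list(TASKS),
        "num_fewshot": num_fewshot,
    }


def run_eval(repo_id):
    """Run the evaluation; needs lm_eval installed and enough memory for the model."""
    import lm_eval

    return lm_eval.simple_evaluate(**build_eval_kwargs(repo_id))


kwargs = build_eval_kwargs("beomi/Solar-Ko-Recovery-11B")  # repo id assumed
print(kwargs)
```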

Tasks	Metric	Value		Stderr
haerae	acc_norm	0.7874	±	0.0118
- haerae_general_knowledge	acc	0.5000	±	0.0378
- haerae_history	acc	0.8723	±	0.0244
- haerae_loan_word	acc	0.8402	±	0.0283
- haerae_rare_word	acc	0.8346	±	0.0185
- haerae_standard_nomenclature	acc	0.8301	±	0.0305
kmmlu_direct	exact_match	0.4205	±	0.0026
- kmmlu_direct_accounting	exact_match	0.3700	±	0.0485
- kmmlu_direct_agricultural_sciences	exact_match	0.3140	±	0.0147
- kmmlu_direct_aviation_engineering_and_maintenance	exact_match	0.3870	±	0.0154
- kmmlu_direct_biology	exact_match	0.3510	±	0.0151
- kmmlu_direct_chemical_engineering	exact_match	0.3910	±	0.0154
- kmmlu_direct_chemistry	exact_match	0.4000	±	0.0200
- kmmlu_direct_civil_engineering	exact_match	0.4010	±	0.0155
- kmmlu_direct_computer_science	exact_match	0.6520	±	0.0151
- kmmlu_direct_construction	exact_match	0.3080	±	0.0146
- kmmlu_direct_criminal_law	exact_match	0.3100	±	0.0328
- kmmlu_direct_ecology	exact_match	0.4660	±	0.0158
- kmmlu_direct_economics	exact_match	0.5385	±	0.0439
- kmmlu_direct_education	exact_match	0.6200	±	0.0488
- kmmlu_direct_electrical_engineering	exact_match	0.3000	±	0.0145
- kmmlu_direct_electronics_engineering	exact_match	0.4740	±	0.0158
- kmmlu_direct_energy_management	exact_match	0.3560	±	0.0151
- kmmlu_direct_environmental_science	exact_match	0.2980	±	0.0145
- kmmlu_direct_fashion	exact_match	0.4470	±	0.0157
- kmmlu_direct_food_processing	exact_match	0.3690	±	0.0153
- kmmlu_direct_gas_technology_and_engineering	exact_match	0.3000	±	0.0145
- kmmlu_direct_geomatics	exact_match	0.3820	±	0.0154
- kmmlu_direct_health	exact_match	0.5700	±	0.0498
- kmmlu_direct_industrial_engineer	exact_match	0.3830	±	0.0154
- kmmlu_direct_information_technology	exact_match	0.6090	±	0.0154
- kmmlu_direct_interior_architecture_and_design	exact_match	0.5440	±	0.0158
- kmmlu_direct_korean_history	exact_match	0.3800	±	0.0488
- kmmlu_direct_law	exact_match	0.4670	±	0.0158
- kmmlu_direct_machine_design_and_manufacturing	exact_match	0.3960	±	0.0155
- kmmlu_direct_management	exact_match	0.5030	±	0.0158
- kmmlu_direct_maritime_engineering	exact_match	0.4283	±	0.0202
- kmmlu_direct_marketing	exact_match	0.7460	±	0.0138
- kmmlu_direct_materials_engineering	exact_match	0.4020	±	0.0155
- kmmlu_direct_math	exact_match	0.2867	±	0.0262
- kmmlu_direct_mechanical_engineering	exact_match	0.3490	±	0.0151
- kmmlu_direct_nondestructive_testing	exact_match	0.3760	±	0.0153
- kmmlu_direct_patent	exact_match	0.3700	±	0.0485
- kmmlu_direct_political_science_and_sociology	exact_match	0.5300	±	0.0289
- kmmlu_direct_psychology	exact_match	0.4470	±	0.0157
- kmmlu_direct_public_safety	exact_match	0.3520	±	0.0151
- kmmlu_direct_railway_and_automotive_engineering	exact_match	0.3220	±	0.0148
- kmmlu_direct_real_estate	exact_match	0.4350	±	0.0351
- kmmlu_direct_refrigerating_machinery	exact_match	0.3240	±	0.0148
- kmmlu_direct_social_welfare	exact_match	0.4970	±	0.0158
- kmmlu_direct_taxation	exact_match	0.3800	±	0.0344
- kmmlu_direct_telecommunications_and_wireless_technology	exact_match	0.5480	±	0.0157
kobest_boolq	acc	0.9202	±	0.0072
	f1	0.9202	±	N/A
kobest_copa	acc	0.8680	±	0.0107
	f1	0.8678	±	N/A
kobest_hellaswag	acc	0.5560	±	0.0222
	f1	0.5520	±	N/A
	acc_norm	0.6540	±	0.0213
kobest_sentineg	acc	0.9824	±	0.0066
	f1	0.9824	±	N/A
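
The Stderr column can be read as the sample standard error of a mean over 0/1 outcomes. The sketch below assumes that formula and a hypothetical test-set size of 100 items for `kmmlu_direct_accounting`; neither is stated in the table, but together they give a value consistent with the reported 0.3700 ± 0.0485.

```python
import math


def accuracy_stderr(acc, n):
    """Sample standard error of an accuracy computed over n binary outcomes."""
    return math.sqrt(acc * (1.0 - acc) / (n - 1))


# n=100 is an assumed item count, used only for illustration.
stderr = accuracy_stderr(0.37, 100)
print(round(stderr, 4))
```

Under this reading, the subject scores with larger error bars are simply the ones computed over fewer test items.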

Citation

TBD

Acknowledgements

Training support was provided by the TPU Research Cloud program.
