---
datasets:
- zetavg/ShareGPT-Processed
- zetavg/coct-en-zh-tw-translations-twp-300k
- zetavg/zh-tw-wikipedia
- zetavg/tw-sinica-corpus-word-frequency
- RyokoAI/ShareGPT52K
language:
- zh
- en
---
# TW-Pythia-6.9B-Chat
Taiwanese Mandarin Pythia Language Model, instruction-tuned for dialogue.
Version 0.2
## Model Details
The TW-Pythia model is derived from the Apache-2.0-licensed Pythia language model, with 8,000 new Traditional Chinese tokens added and the embedding layers resized and re-trained.
### Basics
- Developed by: @zetavg, based on EleutherAI's Pythia language model.
- Model type: Transformer-based GPT-NeoX Causal Language Model
- Languages: English, Traditional Chinese
- License: Unknown, as the usage license of some of the training data is unconfirmed
- Derived from model: EleutherAI/pythia-6.9b
### Model Sources
- Repository: https://github.com/zetavg/twlm
- Demo: https://hackmd.io/@z/twlm-demo
## Uses
Without further training, this model has not yet demonstrated practical value for Traditional Chinese processing, but it does show some basic Chinese-English translation capability.
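A minimal, hedged example of trying the translation capability with `transformers`. The Hub id below is a placeholder and the dialogue prompt template is an assumption; check the repository for the template actually used during instruction tuning:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# Replace with the actual Hub id or a local path to this model (placeholder).
model_id = "path/to/tw-pythia-6.9b-chat"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto")

# Assumed dialogue-style prompt; the real instruction template may differ.
prompt = "Human: 請將「今天天氣很好」翻譯成英文。\nAssistant:"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=64)
print(tokenizer.decode(outputs[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True))
```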
## Training Details
### Training Data
- 200k English ↔ Traditional Chinese sentence pairs from the COCT database.
- ~8k mixed English and Traditional Chinese ShareGPT conversations.
### Training Procedure
First, we build a BPE tokenizer based on the original Pythia tokenizer, with 8,000 new Traditional Chinese tokens added.
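As a rough sketch, the token extension can be approximated with the `transformers` API. Note that `add_tokens` registers whole added tokens rather than extending the BPE merge table, so this is a simplification of the actual tokenizer build, and `zh_tw_tokens.txt` is a hypothetical file holding the ~8,000 new tokens:

```python
from transformers import AutoTokenizer

# Start from the original Pythia tokenizer.
tokenizer = AutoTokenizer.from_pretrained("EleutherAI/pythia-6.9b")

# Hypothetical file listing the new Traditional Chinese tokens,
# e.g. mined from a zh-TW corpus by a separately trained BPE model.
with open("zh_tw_tokens.txt", encoding="utf-8") as f:
    new_tokens = [line.strip() for line in f if line.strip()]

# Only tokens not already in the vocabulary are added.
num_added = tokenizer.add_tokens(new_tokens)
print(f"Added {num_added} tokens; vocabulary size is now {len(tokenizer)}")
```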
Then, we resize the embedding layers of the pythia-6.9b model to accommodate the new vocabulary size, and train only the input/output embedding layers so the model can learn the new Traditional Chinese words and phrases.
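A sketch of this phase, assuming the GPT-NeoX module names that Pythia uses in `transformers` (`gpt_neox.embed_in` for the input embedding, `embed_out` for the output projection) and continuing from the tokenizer above:

```python
from transformers import GPTNeoXForCausalLM

model = GPTNeoXForCausalLM.from_pretrained("EleutherAI/pythia-6.9b")

# Grow both embedding matrices to match the extended vocabulary.
model.resize_token_embeddings(len(tokenizer))

# Freeze the whole network, then unfreeze only the input/output
# embeddings so the transformer body stays fixed during this phase.
for param in model.parameters():
    param.requires_grad = False
for param in model.gpt_neox.embed_in.parameters():
    param.requires_grad = True
for param in model.embed_out.parameters():
    param.requires_grad = True
```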
Finally, LoRA weights are added to the model and fine-tuned for instruction following.
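A minimal sketch of the LoRA phase with the `peft` library; the rank, alpha, dropout, and target modules below are illustrative assumptions, not the values used for this model (see the config linked under Training Hyperparameters):

```python
from peft import LoraConfig, TaskType, get_peft_model

# Illustrative settings; the actual hyperparameters live in the repo config.
lora_config = LoraConfig(
    task_type=TaskType.CAUSAL_LM,
    r=8,
    lora_alpha=16,
    lora_dropout=0.05,
    # GPT-NeoX fuses the Q/K/V projections into one "query_key_value" module.
    target_modules=["query_key_value"],
)

model = get_peft_model(model, lora_config)
model.print_trainable_parameters()
```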
### Training Hyperparameters
- Training regime: fp32
- See: https://github.com/zetavg/twlm/blob/main/configs/ta01_p7b.yaml
### Hardware
- 1x H100 80GB GPU on Lambda Cloud (provisioned with SkyPilot), about 20h in total.