---
datasets:
  - zetavg/ShareGPT-Processed
  - zetavg/coct-en-zh-tw-translations-twp-300k
  - zetavg/zh-tw-wikipedia
  - zetavg/tw-sinica-corpus-word-frequency
  - RyokoAI/ShareGPT52K
language:
  - zh
  - en
---

TW-Pythia-6.9B-Chat

Taiwanese Mandarin Pythia Language Model, instruction-tuned for dialogue.

Version 0.2

Model Details

The TW-Pythia model is derived from the Apache-2.0-licensed Pythia language model, with 8000 new Traditional Chinese tokens added to the vocabulary and the embedding layers resized and re-trained.

Basics

  • Developed by: @zetavg based on EleutherAI's Pythia language model.
  • Model type: Transformer-based GPT-NeoX Causal Language Model
  • Languages: English, Traditional Chinese
  • License: Undetermined, as the usage license of part of the training data is unconfirmed
  • Derived from model: EleutherAI/pythia-6.9b

Model Sources

Uses

Currently, this model has not demonstrated practical value for Traditional Chinese processing without further training, but it does possess some basic Chinese-English translation capability.
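
As a rough illustration, the model can be queried with the Hugging Face transformers API as sketched below. The repository id and the prompt format are placeholders, not values confirmed by this card.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# "zetavg/tw-pythia-6.9b-chat" is a placeholder repository id; replace it
# with the actual repository name of this model. The prompt below is
# likewise only illustrative.
repo_id = "zetavg/tw-pythia-6.9b-chat"

tokenizer = AutoTokenizer.from_pretrained(repo_id)
model = AutoModelForCausalLM.from_pretrained(repo_id, device_map="auto")

prompt = "Translate the following sentence into Traditional Chinese: How is the weather today?"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=64)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```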

Training Details

Training Data

Training Procedure

First, we build a BPE tokenizer based on the original Pythia tokenizer with 8000 new Traditional Chinese tokens added.
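
A minimal sketch of what extending the tokenizer could look like with the transformers API is shown below. Note that the actual work builds a new BPE tokenizer, whereas this sketch simply registers extra tokens with `add_tokens`, which is a simplification; the token list is a placeholder for the 8000 tokens actually added.

```python
from transformers import AutoTokenizer

# Start from the original Pythia tokenizer.
tokenizer = AutoTokenizer.from_pretrained("EleutherAI/pythia-6.9b")

# Placeholder list standing in for the ~8000 Traditional Chinese tokens
# selected from a Traditional Chinese corpus (the real list is not shown here).
new_zh_tokens = ["台灣", "語言模型", "資料集"]

num_added = tokenizer.add_tokens(new_zh_tokens)
print(f"Added {num_added} tokens; vocabulary size is now {len(tokenizer)}")
```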

Then, we resize the embedding layers of the pythia-6.9b model to accommodate the new vocabulary size, and train only the input and output embedding layers so the model can learn the new Traditional Chinese words and phrases.
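
The resize-and-freeze step might look roughly like the following sketch, assuming the standard transformers causal LM class; the actual training code is not part of this card.

```python
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained("EleutherAI/pythia-6.9b")

# Grow the input/output embedding matrices to match the extended tokenizer.
model.resize_token_embeddings(len(tokenizer))

# Freeze all weights, then unfreeze only the input and output embeddings so
# that this stage trains just the embedding layers.
for param in model.parameters():
    param.requires_grad = False
for param in model.get_input_embeddings().parameters():
    param.requires_grad = True
for param in model.get_output_embeddings().parameters():
    param.requires_grad = True
```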

Finally, LoRA weights are added to the model and fine-tuned for instruction following.
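
A hedged sketch of adding LoRA adapters with the peft library follows; the rank, alpha, dropout, and target modules shown are illustrative assumptions, not the values actually used for this model.

```python
from peft import LoraConfig, get_peft_model

# Illustrative LoRA settings; the hyperparameters actually used for
# TW-Pythia are not stated in this card.
lora_config = LoraConfig(
    r=8,
    lora_alpha=16,
    lora_dropout=0.05,
    target_modules=["query_key_value"],  # the fused attention projection in GPT-NeoX
    task_type="CAUSAL_LM",
)

model = get_peft_model(model, lora_config)
model.print_trainable_parameters()
```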

Training Hyperparameters

Hardware

  • 1x H100 80GB GPU on Lambda Cloud (launched with SkyPilot), about 20 hours in total.