Releasing Hindi ELECTRA model

This is a first attempt at a Hindi language model trained with Google Research's ELECTRA.

As of 2022, I recommend Google's MuRIL model, which was trained on English, Hindi, and other major Indian languages in both their native scripts and Latin transliteration: https://huggingface.co./google/muril-base-cased and https://huggingface.co./google/muril-large-cased

For causal language models, I would suggest https://huggingface.co./sberbank-ai/mGPT, though this is a large model.

Tokenization and training CoLab

I originally used a modified ELECTRA for finetuning, but now use SimpleTransformers.
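
A minimal SimpleTransformers fine-tuning sketch (the tiny dataframe and training arguments below are illustrative placeholders, not values from the original notebooks):

import pandas as pd
from simpletransformers.classification import ClassificationModel

# Placeholder two-example dataset; replace with a real Hindi classification dataframe
train_df = pd.DataFrame(
    [["यह फिल्म बहुत अच्छी थी", 1], ["यह फिल्म खराब थी", 0]],
    columns=["text", "labels"],
)

model = ClassificationModel(
    "electra",
    "monsoon-nlp/hindi-bert",
    num_labels=2,
    args={"num_train_epochs": 1, "overwrite_output_dir": True},
)
model.train_model(train_df)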

Blog post that greatly influenced this work: https://huggingface.co./blog/how-to-train

Example Notebooks

This small model achieves results comparable to Multilingual BERT on BBC Hindi news classification and on Hindi movie reviews / sentiment analysis (using SimpleTransformers).

You can reach higher accuracy with ktrain by adjusting the learning rate (you also need to change model_type in config.json; this is an open issue with ktrain): https://colab.research.google.com/drive/1mSeeSfVSOT7e-dVhPlmSsQRvpn6xC05w?usp=sharing
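
A rough ktrain sketch of that setup (the toy data, class names, learning rate, and epoch count are illustrative assumptions, not values from the linked notebook):

import ktrain
from ktrain import text

x_train = ["यह फिल्म बहुत अच्छी थी", "यह फिल्म खराब थी"]  # placeholder Hindi reviews
y_train = [1, 0]  # 1 = positive, 0 = negative

t = text.Transformer("monsoon-nlp/hindi-bert", maxlen=128, class_names=["negative", "positive"])
trn = t.preprocess_train(x_train, y_train)
learner = ktrain.get_learner(t.get_classifier(), train_data=trn, batch_size=2)
learner.fit_onecycle(5e-5, 1)  # adjust the learning rate here for better accuracy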

Question-answering on MLQA dataset: https://colab.research.google.com/drive/1i6fidh2tItf_-IDkljMuaIGmEU6HT2Ar#scrollTo=IcFoAHgKCUiQ
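
A hedged sketch of extractive QA fine-tuning with SimpleTransformers (not necessarily what the linked notebook uses; the single SQuAD-format record below is a placeholder, not MLQA data):

from simpletransformers.question_answering import QuestionAnsweringModel

# Placeholder SQuAD-format record; MLQA provides data in this same format
train_data = [
    {
        "context": "ताजमहल आगरा में स्थित है।",
        "qas": [
            {
                "id": "0",
                "question": "ताजमहल कहाँ स्थित है?",
                "answers": [{"text": "आगरा", "answer_start": 7}],
                "is_impossible": False,
            }
        ],
    }
]

model = QuestionAnsweringModel(
    "electra",
    "monsoon-nlp/hindi-bert",
    args={"num_train_epochs": 1, "overwrite_output_dir": True},
)
model.train_model(train_data)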

A larger model (Hindi-TPU-Electra), using the ELECTRA base size, outperforms both models on Hindi movie reviews / sentiment analysis, but does not perform as well on the BBC news classification task.

Corpus

Download: https://drive.google.com/drive/folders/1SXzisKq33wuqrwbfp428xeu_hDxXVUUu?usp=sharing

The corpus consists of two files, included in the download above.

Bonus notes:

  • Adding English Wikipedia text or a parallel corpus could help with cross-lingual tasks and training

Vocabulary

https://drive.google.com/file/d/1-6tXrii3tVxjkbrpSJE9MOG_HhbvP66V/view?usp=sharing

Bonus notes:

  • Created with HuggingFace Tokenizers; you can increase the vocabulary size and re-train; remember to change vocab_size in the ELECTRA config (see the sketch below)
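
A minimal sketch of re-training the vocabulary with HuggingFace Tokenizers (the corpus filename and vocab_size below are placeholder assumptions):

from tokenizers import BertWordPieceTokenizer

# Train a WordPiece vocabulary on the raw Hindi corpus
tokenizer = BertWordPieceTokenizer(lowercase=False, strip_accents=False)  # keep Devanagari combining marks intact
tokenizer.train(
    files=["hindi_corpus.txt"],
    vocab_size=30000,
    special_tokens=["[PAD]", "[UNK]", "[CLS]", "[SEP]", "[MASK]"],
)
tokenizer.save_model(".")  # writes vocab.txt in the current directory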

Training

Structure your files as follows, with the data dir named "trainer" in this example:

trainer
- vocab.txt
- pretrain_tfrecords
-- (all .tfrecord... files)
- models
-- modelname
--- checkpoint
--- graph.pbtxt
--- model.*

The CoLab notebook gives examples of GPU vs. TPU setup.

Pretraining hyperparameters (model size, vocab_size, number of training steps) are set in configure_pretraining.py and can be overridden on the command line.
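
Under those conventions, launching a run from Google's ELECTRA repo looks roughly like this (the paths, sequence length, and hparams overrides are illustrative, not the exact values used for this model):

git clone https://github.com/google-research/electra
cd electra

# Build pretraining tfrecords from the plain-text corpus
python3 build_pretraining_dataset.py \
  --corpus-dir ../corpus \
  --vocab-file ../trainer/vocab.txt \
  --output-dir ../trainer/pretrain_tfrecords \
  --max-seq-length 128 \
  --num-processes 4

# Start (or resume) pretraining; defaults come from configure_pretraining.py, overridden via --hparams
python3 run_pretraining.py \
  --data-dir ../trainer \
  --model-name modelname \
  --hparams '{"model_size": "small", "vocab_size": 30000}'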

Conversion

Use this process to convert an in-progress or completed ELECTRA checkpoint to a Transformers-ready model:

git clone https://github.com/huggingface/transformers
python ./transformers/src/transformers/convert_electra_original_tf_checkpoint_to_pytorch.py \
  --tf_checkpoint_path=./models/checkpointdir \
  --config_file=config.json \
  --pytorch_dump_path=pytorch_model.bin \
  --discriminator_or_generator=discriminator

Then, in a Python session, export a TensorFlow version from the PyTorch dump:

from transformers import TFElectraForPreTraining
model = TFElectraForPreTraining.from_pretrained("./dir_with_pytorch", from_pt=True)
model.save_pretrained("tf")

Once you have assembled one directory containing config.json, pytorch_model.bin, tf_model.h5, special_tokens_map.json, tokenizer_config.json, and vocab.txt at the same level, run:

transformers-cli upload directory
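
Before uploading, a quick sanity check that the assembled directory loads cleanly in Transformers ("directory" here is whatever folder you assembled above):

from transformers import ElectraForPreTraining, ElectraTokenizerFast

tokenizer = ElectraTokenizerFast.from_pretrained("directory")
model = ElectraForPreTraining.from_pretrained("directory")
print(model.config.model_type, tokenizer.vocab_size)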