marianMT_hin_eng_cs

This model is a fine-tuned version of Helsinki-NLP/opus-mt-mul-en on the ar5entum/hindi-english-code-mixed dataset. It achieves the following results on the evaluation set (a rough reproduction sketch follows the list):

  • Loss: 0.1450
  • Bleu: 77.8649
  • Gen Len: 74.8945
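
The BLEU figure above can be reproduced, approximately, with the evaluate library's sacrebleu metric. The sketch below is not from the original card: the split name and the column names are assumptions, so check the dataset card before running it.

import evaluate
from datasets import load_dataset
from transformers import MarianMTModel, MarianTokenizer

# Minimal scoring sketch. The split ("test") and column names ("hindi", "code_mixed")
# are assumptions, not confirmed by this model card.
tokenizer = MarianTokenizer.from_pretrained("ar5entum/marianMT_hin_eng_cs")
model = MarianMTModel.from_pretrained("ar5entum/marianMT_hin_eng_cs")
bleu = evaluate.load("sacrebleu")

ds = load_dataset("ar5entum/hindi-english-code-mixed", split="test")
predictions, references = [], []
for row in ds:
    inputs = tokenizer(row["hindi"], return_tensors="pt")
    output = model.generate(**inputs)
    predictions.append(tokenizer.decode(output[0], skip_special_tokens=True))
    references.append([row["code_mixed"]])  # sacrebleu expects a list of references per example

print(bleu.compute(predictions=predictions, references=references)["score"])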

Model description

The model translates Hindi text written in Devanagari script into a code-switched format in which Hindi words are kept in Devanagari while English loanwords are rendered in Roman script. It handles the mechanics of code-switching, deciding which words to transliterate and leaving the rest of the sentence untouched.

Example:

Hindi | Hindi + English CS
तो वो टोटली मेरे घर के प्लान पे डिपेंड करता है | to वो totally मेरे घर के plan पे depend करता है
मांग लो भाई बहुत नेसेसरी है | मांग लो भाई बहुत necessary है
टेलीविज़न में क्या प्रोग्राम चल रहा है? | television में क्या program चल रहा है?
Usage example:

from transformers import MarianMTModel, MarianTokenizer

class HinEngCS:
    def __init__(self, model_name='ar5entum/marianMT_hin_eng_cs'):
        # Load the fine-tuned checkpoint and its tokenizer from the Hub.
        self.model_name = model_name
        self.tokenizer = MarianTokenizer.from_pretrained(model_name)
        self.model = MarianMTModel.from_pretrained(model_name)

    def predict(self, input_text):
        # Tokenize the Devanagari input, generate, and decode the code-mixed output.
        tokenized_text = self.tokenizer(input_text, return_tensors='pt')
        translated = self.model.generate(**tokenized_text)
        translated_text = self.tokenizer.decode(translated[0], skip_special_tokens=True)
        return translated_text

model = HinEngCS()

input_text = "आज मैं नानयांग टेक्नोलॉजिकल यूनिवर्सिटी में अनेक समझौते होते हुए देखूंगा जो कि उच्च शिक्षा साइंस टेक्नोलॉजी और इनोवेशन में हमारे सहयोग को और बढ़ाएंगे।"
model.predict(input_text)
# आज मैं नानयांग technological university में अनेक समझौते होते हुए देखूंगा जो कि उच्च शिक्षा science technology और innovation में हमारे सहयोग को और बढ़ाएंगे।
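
For translating many sentences at once, batched generation with padding is usually faster than calling predict in a loop. The sketch below is illustrative and not part of the original card; the beam size and token limit are arbitrary choices.

# Batched inference sketch, reusing the HinEngCS instance defined above.
texts = [
    "तो वो टोटली मेरे घर के प्लान पे डिपेंड करता है",
    "टेलीविज़न में क्या प्रोग्राम चल रहा है?",
]
batch = model.tokenizer(texts, return_tensors="pt", padding=True)
outputs = model.model.generate(**batch, num_beams=4, max_new_tokens=128)  # illustrative settings
print(model.tokenizer.batch_decode(outputs, skip_special_tokens=True))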

Training Procedure

Training hyperparameters

The following hyperparameters were used during training (a sketch mapping them onto training arguments follows the list):

  • learning_rate: 5e-05
  • train_batch_size: 50
  • eval_batch_size: 50
  • seed: 42
  • distributed_type: multi-GPU
  • num_devices: 2
  • total_train_batch_size: 100
  • total_eval_batch_size: 100
  • optimizer: Adam with betas=(0.9,0.999) and epsilon=1e-08
  • lr_scheduler_type: linear
  • num_epochs: 30.0
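
As a rough, non-authoritative sketch, the hyperparameters above map onto Transformers Seq2SeqTrainingArguments as follows; output_dir and predict_with_generate are assumptions not stated in the card, and the listed Adam settings are the Trainer's optimizer defaults.

from transformers import Seq2SeqTrainingArguments

# Sketch of the listed hyperparameters as Seq2SeqTrainingArguments.
# Batch sizes are per device; with num_devices=2 they give the listed totals of 100.
# Adam betas=(0.9, 0.999) and epsilon=1e-08 match the Trainer's defaults.
training_args = Seq2SeqTrainingArguments(
    output_dir="marianMT_hin_eng_cs",   # assumption: not stated in the card
    learning_rate=5e-05,
    per_device_train_batch_size=50,
    per_device_eval_batch_size=50,
    seed=42,
    num_train_epochs=30.0,
    lr_scheduler_type="linear",
    predict_with_generate=True,         # assumption: needed to compute BLEU/Gen Len during eval
)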

Training results

Training Loss | Epoch | Step | Bleu | Gen Len | Validation Loss
1.5823 | 1.0 | 1118 | 11.6257 | 77.1622 | 1.1778
0.921 | 2.0 | 2236 | 33.2917 | 76.1459 | 0.6357
0.6472 | 3.0 | 3354 | 47.3533 | 75.9194 | 0.4504
0.5246 | 4.0 | 4472 | 55.2169 | 75.6871 | 0.3579
0.4228 | 5.0 | 5590 | 60.8262 | 75.5777 | 0.3041
0.3745 | 6.0 | 6708 | 64.8987 | 75.4424 | 0.2693
0.3552 | 7.0 | 7826 | 67.7607 | 75.2438 | 0.2455
0.3324 | 8.0 | 8944 | 69.635 | 75.1036 | 0.2274
0.2912 | 9.0 | 10062 | 71.3086 | 75.0326 | 0.2117
0.2591 | 10.0 | 11180 | 72.392 | 74.9607 | 0.2001
0.2471 | 11.0 | 12298 | 73.4758 | 74.9251 | 0.1899
0.236 | 12.0 | 13416 | 74.4219 | 74.833 | 0.1822
0.2265 | 13.0 | 14534 | 75.1435 | 74.9069 | 0.1745
0.2152 | 14.0 | 15652 | 75.7614 | 74.7409 | 0.1695
0.2078 | 15.0 | 16770 | 76.2353 | 74.7092 | 0.1641
0.2048 | 16.0 | 17888 | 76.7381 | 74.7274 | 0.1593
0.1975 | 17.0 | 19006 | 76.9954 | 74.7217 | 0.1559
0.1943 | 18.0 | 20124 | 77.421 | 74.6641 | 0.1524
0.1987 | 19.0 | 21242 | 77.8231 | 74.6833 | 0.1495
0.1855 | 20.0 | 22360 | 78.0784 | 74.6804 | 0.1472

Framework versions

  • Transformers 4.45.0.dev0
  • Pytorch 2.4.0+cu121
  • Datasets 2.21.0
  • Tokenizers 0.19.1
Model size: 77.1M parameters (tensor type F32, safetensors)