
French to English Machine Translation

French-to-English language translation using a sequence-to-sequence encoder-decoder model.
View Demo

Table of Contents
  1. About The Project
  2. Getting Started
  3. Usage
  4. Contributing
  5. License
  6. Contact

About The Project


This project aims to develop a machine translation system for translating French text into English. The system applies neural sequence-to-sequence techniques from natural language processing (NLP) to translate French sentences into their corresponding English equivalents.

(back to top)

Built With

  • Python
  • TensorFlow
  • Keras
  • NumPy
  • Pandas

(back to top)

Getting Started

Please follow these simple steps to set up this project locally.

Dependencies

Here is a list of all the libraries, packages, and other dependencies that need to be installed to run this project.

Each can be installed with conda, for example:

  • TensorFlow 2.16.1
    conda install -c conda-forge tensorflow
    
  • Keras 2.15.0
    conda install -c conda-forge keras
    
  • Gradio 4.24.0
    conda install -c conda-forge gradio
    
  • NumPy 1.26.4
    conda install -c conda-forge numpy
    

Alternative: Export Environment

Alternatively, export the existing conda environment to a file so the project can be recreated elsewhere with all of its dependencies:

conda env export > requirements.txt

Recreate it using:

conda env create -f requirements.txt

Installation

# clone project   
git clone https://huggingface.co./spaces/KameliaZaman/French-to-English-Translation

# go inside the project directory 
cd French-to-English-Translation

# install the required packages
pip install -r requirements.txt

# run the gradio app
python app.py 

(back to top)

Usage

Dataset

The dataset comes from Kaggle (https://www.kaggle.com/datasets/devicharith/language-translation-englishfrench). It contains two columns: one with English words/sentences and the other with the corresponding French words/sentences.
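
For a quick sanity check, the CSV can be inspected with pandas. This is a minimal sketch, assuming the file has been downloaded as eng_-french.csv, the filename used in the deployment code below:

    import pandas as pd

    # peek at the column names and the first few rows
    df = pd.read_csv("./eng_-french.csv", nrows=5)
    print(df.columns.tolist())
    # ['English words/sentences', 'French words/sentences']
    print(df.head())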

Model Architecture

The model is an encoder-decoder Long Short-Term Memory (LSTM) network with an embedding layer, built on the sequence-to-sequence neural machine translation framework: the encoder LSTM compresses the source sentence into a fixed-length vector, and the decoder LSTM unrolls that vector into the target sentence.

[Model architecture diagram]

Data Preparation

  • The parallel corpus containing French and English sentences is preprocessed.
  • Text is tokenized and converted into numerical representations suitable for input to the neural network.
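
  • As a minimal, self-contained sketch of this step (the same Keras utilities appear in the full code under Deployment), assuming a couple of toy cleaned sentences:

    from keras.preprocessing.text import Tokenizer
    from keras.preprocessing.sequence import pad_sequences

    sentences = ["je suis etudiant", "il fait beau"]  # toy cleaned input

    tokenizer = Tokenizer()
    tokenizer.fit_on_texts(sentences)                    # build the vocabulary
    sequences = tokenizer.texts_to_sequences(sentences)  # words -> integer ids
    padded = pad_sequences(sequences, maxlen=6, padding='post')  # zero-pad to a fixed length
    print(padded.shape)  # (2, 6): one row per sentence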

Model Training

  • The sequence-to-sequence model is constructed, comprising an encoder and decoder.

  • Training data is fed into the model, and parameters are optimized using backpropagation and gradient descent algorithms.

from keras.models import Sequential
    from keras.layers import LSTM, Dense, Embedding, RepeatVector, TimeDistributed
    from keras.callbacks import EarlyStopping

    def create_model(src_vocab, tar_vocab, src_timesteps, tar_timesteps, n_units):
        # encoder-decoder LSTM: embed the source, encode it to a fixed vector,
        # repeat that vector for each target timestep, decode, and project
        # onto the target vocabulary
        model = Sequential()
        model.add(Embedding(src_vocab, n_units, input_length=src_timesteps, mask_zero=True))
        model.add(LSTM(n_units))
        model.add(RepeatVector(tar_timesteps))
        model.add(LSTM(n_units, return_sequences=True))
        model.add(TimeDistributed(Dense(tar_vocab, activation='softmax')))
        return model

    model = create_model(src_vocab_size, tar_vocab_size, src_length, tar_length, 256)
    model.compile(optimizer='adam', loss='categorical_crossentropy')
    
history = model.fit(trainX,
                        trainY,
                        epochs=20,
                        batch_size=64,
                        validation_split=0.1,
                        verbose=1,
                        callbacks=[
                            EarlyStopping(
                                monitor='val_loss',
                                patience=10,
                                restore_best_weights=True
                            )
                        ])
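
  • Note that categorical_crossentropy expects one-hot targets, so trainY must be one-hot encoded before fitting. A minimal sketch, assuming the encode_sequences helper and tokenizers shown in the Deployment section below (encode_output is a hypothetical helper name, not from the original code):

    from keras.utils import to_categorical
    import numpy as np

    def encode_output(sequences, vocab_size):
        # one-hot encode each integer-encoded target sequence
        ylist = [to_categorical(seq, num_classes=vocab_size) for seq in sequences]
        y = np.array(ylist)
        return y.reshape(sequences.shape[0], sequences.shape[1], vocab_size)

    # trainY = encode_output(encode_sequences(tar_tokenizer, tar_length, raw_targets), tar_vocab_size)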
    

Model Evaluation

  • The trained model is evaluated on the test set to measure its accuracy.

  • Metrics such as the BLEU score are used to quantify the quality of the translations.

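  • A minimal sketch of a BLEU evaluation with NLTK, assuming the predict_seq helper from the Deployment section below and held-out arrays testX (encoded sources) and test_refs (raw reference sentences):

    from nltk.translate.bleu_score import corpus_bleu

    def evaluate_bleu(model, tokenizer, sources, raw_targets):
        references, hypotheses = [], []
        for i, source in enumerate(sources):
            source = source.reshape((1, source.shape[0]))  # batch of one
            translation = predict_seq(model, tokenizer, source)
            references.append([raw_targets[i].split()])    # list of reference token lists
            hypotheses.append(translation.split())
        print('BLEU-1: %.4f' % corpus_bleu(references, hypotheses, weights=(1.0, 0, 0, 0)))
        print('BLEU-4: %.4f' % corpus_bleu(references, hypotheses, weights=(0.25, 0.25, 0.25, 0.25)))

    # evaluate_bleu(model, tar_tokenizer, testX, test_refs)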

Deployment

  • Gradio is utilized for deploying the trained model.

  • Users can input a French text, and the model will translate it to English.

import re
    from string import punctuation

    import numpy as np
    import pandas as pd
    import gradio as gr

    from keras.preprocessing.text import Tokenizer
    from keras.preprocessing.sequence import pad_sequences
    from tensorflow.keras.models import load_model
    
    total_sentences = 10000
    
    dataset = pd.read_csv("./eng_-french.csv", nrows = total_sentences)
    
def clean(text):
        # normalize and lowercase the text
        text = text.replace("\u202f", " ")  # replace narrow no-break space with a space
        text = text.lower()

        # delete punctuation, guillemets, and digits
        for p in punctuation + "«»" + "0123456789":
            text = text.replace(p, " ")

        text = re.sub(r'\s+', ' ', text)  # collapse repeated whitespace
        text = text.strip()

        return text
    
    dataset = dataset.sample(frac=1, random_state=0)
    dataset["English words/sentences"] = dataset["English words/sentences"].apply(lambda x: clean(x))
    dataset["French words/sentences"] = dataset["French words/sentences"].apply(lambda x: clean(x))
    
    dataset = dataset.values
    dataset = dataset[:total_sentences]
    
    source_str, target_str = "French", "English"
    idx_src, idx_tar = 1, 0
    
    def create_tokenizer(lines):
        # fit a tokenizer
        tokenizer = Tokenizer()
        tokenizer.fit_on_texts(lines)
        return tokenizer
    
    def max_len(lines):
        # max sentence length
        return max(len(line.split()) for line in lines)
    
    def encode_sequences(tokenizer, length, lines):
        # encode and pad sequences
        X = tokenizer.texts_to_sequences(lines) # integer encode sequences
        X = pad_sequences(X, maxlen=length, padding='post') # pad sequences with 0 values
        return X
    
    def word_for_id(integer, tokenizer):
        # map an integer to a word
        for word, index in tokenizer.word_index.items():
            if index == integer:
                return word
        return None
    
    def predict_seq(model, tokenizer, source):
        # generate target from a source sequence
        prediction = model.predict(source, verbose=0)[0]
        integers = [np.argmax(vector) for vector in prediction]
        target = list()
        for i in integers:
            word = word_for_id(i, tokenizer)
            if word is None:
                break
            target.append(word)
        return ' '.join(target)
    
    src_tokenizer = create_tokenizer(dataset[:, idx_src])
    src_vocab_size = len(src_tokenizer.word_index) + 1
    src_length = max_len(dataset[:, idx_src])
    tar_tokenizer = create_tokenizer(dataset[:, idx_tar])
    
    model = load_model('./french_to_english_translator.h5')
    
    def translate_french_english(french_sentence):
        # Clean the input sentence
        french_sentence = clean(french_sentence)
        # Tokenize and pad the input sentence
        input_sequence = encode_sequences(src_tokenizer, src_length, [french_sentence])
        # Generate the translation
        english_translation = predict_seq(model, tar_tokenizer, input_sequence)
        return english_translation
    
    gr.Interface(
        fn=translate_french_english,
        inputs="text",
        outputs="text",
        title="French to English Translator",
        description="Translate French sentences to English."
    ).launch()
    

(back to top)

Contributing

Contributions are what make the open source community such an amazing place to learn, inspire, and create. Any contributions you make are greatly appreciated.

If you have a suggestion that would make this better, please fork the repo and create a pull request. You can also simply open an issue with the tag "enhancement". Don't forget to give the project a star! Thanks again!

  1. Fork the Project
  2. Create your Feature Branch (git checkout -b feature/AmazingFeature)
  3. Commit your Changes (git commit -m 'Add some AmazingFeature')
  4. Push to the Branch (git push origin feature/AmazingFeature)
  5. Open a Pull Request

(back to top)

License

Distributed under the MIT License. See the LICENSE file for more information.

(back to top)

Contact

Kamelia Zaman Moon - [email protected]

Project Link: https://huggingface.co./spaces/KameliaZaman/French-to-English-Translation

(back to top)
