fasttext-med-en-zh-identification

This model is an intermediate artifact of the EPCD (Easy-Data-Clean-Pipeline) project. It classifies samples in medical pretraining datasets as Chinese or English, and it is built with fastText.

Data Composition

General Chinese Pretraining Dataset

Medical Chinese Pretraining Dataset

General English Pretraining Dataset

Medical English Pretraining Datasets

The datasets above are high-quality and open source, which saves a lot of data-cleaning effort. Many thanks to the developers for their contributions to the open-source data community!

Data Cleaning Process

  • Initial dataset processing:

    • For the Chinese training datasets, the pretraining corpus is split by \n, and any leading or trailing spaces are removed.
    • For the English training datasets, the pretraining corpus is split by \n, all letters are converted to lowercase, and any leading or trailing spaces are removed.
  • Word count statistics:

    • For Chinese, the jieba package is used for tokenization, and stopwords and non-Chinese characters are further filtered using jionlp.
    • For English, the nltk package is used for tokenization, with built-in stopwords for filtering.
  • Sample filtering based on word count (heuristic thresholds):

    • For Chinese: Keep only samples with more than 5 words.
    • For English: Keep only samples with more than 5 words.
  • Dataset splitting: 90% of the data is used for training and 10% for testing.
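The cleaning steps above can be sketched roughly as follows. This is a minimal illustration using only the standard library, not the project's actual code: `tokenize_en`, `tokenize_zh`, and `EN_STOPWORDS` are hypothetical stand-ins for the nltk and jieba/jionlp tokenization and stopword filtering, and the shuffle seed is an assumption.

```python
import random
import re

# Hypothetical, tiny stopword set standing in for nltk's built-in English list.
EN_STOPWORDS = {"the", "a", "an", "of", "and", "is", "in", "to"}


def clean_lines(corpus, lowercase=False):
    """Split a corpus on newlines, strip spaces, and optionally lowercase."""
    lines = []
    for line in corpus.split("\n"):
        line = line.strip()
        if lowercase:
            line = line.lower()
        if line:  # empty lines would fail the word-count filter anyway
            lines.append(line)
    return lines


def tokenize_en(line):
    """Stand-in for nltk: alphabetic runs with stopwords removed."""
    words = re.findall(r"[a-z]+", line.lower())
    return [w for w in words if w not in EN_STOPWORDS]


def tokenize_zh(line):
    """Stand-in for jieba + jionlp: one token per Chinese character,
    which also drops non-Chinese characters."""
    return re.findall(r"[\u4e00-\u9fff]", line)


def keep_sample(tokens, min_words=5):
    """Heuristic filter: keep samples with more than `min_words` words."""
    return len(tokens) > min_words


def build_samples(corpus, tokenizer, lowercase=False):
    """Clean a corpus and keep only the lines that pass the filter."""
    return [
        line
        for line in clean_lines(corpus, lowercase)
        if keep_sample(tokenizer(line))
    ]


def split_train_test(samples, test_ratio=0.1, seed=0):
    """Shuffle, then hold out `test_ratio` of the samples for testing."""
    shuffled = samples[:]
    random.Random(seed).shuffle(shuffled)
    n_test = int(len(shuffled) * test_ratio)
    return shuffled[n_test:], shuffled[:n_test]
```

In a real run, the stand-in tokenizers would be swapped for the jieba and nltk tokenizers named above; the filter and split logic stay the same.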

Model Performance

Dataset    Accuracy
Train      0.9994
Test       0.9998
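For reference, accuracy here is taken to mean the fraction of samples whose predicted language matches the true one. A minimal sketch of that calculation, assuming the gold and predicted labels are available as plain lists (the label strings in the usage note are hypothetical):

```python
def accuracy(gold_labels, predicted_labels):
    """Fraction of samples whose predicted label equals the gold label."""
    if len(gold_labels) != len(predicted_labels):
        raise ValueError("gold and predicted label lists differ in length")
    if not gold_labels:
        raise ValueError("cannot compute accuracy on an empty dataset")
    correct = sum(g == p for g, p in zip(gold_labels, predicted_labels))
    return correct / len(gold_labels)
```

For example, `accuracy(["zh", "en", "en", "zh"], ["zh", "en", "zh", "zh"])` gives 0.75, since three of the four predictions match.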

Usage Example

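import fasttext
from huggingface_hub import hf_hub_download

# Download the model file from the Hub; this returns the local cached path.
model_path = hf_hub_download(repo_id="ytzfhqs/fasttext-med-en-zh-identification", filename="model.bin")
# Load the model from the downloaded path.
model = fasttext.load_model(model_path)
# predict returns a tuple of (labels, probabilities).
print(model.predict("Hello, world!"))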
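When predicting on raw text, two details are worth handling. fastText's `predict` processes one line at a time and raises an error if the input contains a newline, and the English training data was lowercased, so lowercasing the input keeps it consistent with training. The helpers below are a hedged sketch: `prepare_text` and `top_label` are hypothetical names, and the code assumes the model uses fastText's default `__label__` prefix.

```python
LABEL_PREFIX = "__label__"  # fastText's default; assumed to apply to this model


def prepare_text(text):
    """Collapse all whitespace (including newlines) and lowercase the text."""
    return " ".join(text.split()).lower()


def top_label(model, text):
    """Return (label, probability) for the most likely language.

    `model` is any object with a fastText-style predict(text) method
    that returns (labels, probabilities).
    """
    labels, probs = model.predict(prepare_text(text))
    label = labels[0]
    if label.startswith(LABEL_PREFIX):
        label = label[len(LABEL_PREFIX):]
    return label, float(probs[0])
```

With a loaded model, `top_label(model, "Hello,\nworld!")` returns the predicted label with the prefix stripped, together with its probability.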