Turkish WordPiece Tokenizer
This repository contains a WordPiece tokenizer trained on 1 billion Turkish sentences, making it well suited for natural language processing (NLP) tasks in Turkish. The tokenizer was built with the `tokenizers` library and includes both cased and uncased versions for flexibility.
Repository Structure
| File Name | Description |
|---|---|
| `special_tokens_map.json` | Maps special tokens such as `[UNK]`, `[PAD]`, `[CLS]`, and `[SEP]` to their respective identifiers. |
| `tokenizer_config.json` | Contains configuration details for the tokenizer, including model type and special token settings. |
| `turkish_wordpiece_tokenizer.json` | The primary WordPiece tokenizer trained on 1 billion Turkish sentences (cased). |
| `turkish_wordpiece_tokenizer_uncased.json` | The uncased version of the WordPiece tokenizer. |
| `turkish_wordpiece_tokenizer_post_token_uncased.json` | The post-tokenization configuration for the uncased tokenizer. |
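The files above can also be pulled directly from the Hub instead of cloning the repository. A minimal sketch using `huggingface_hub`, assuming the repository id given in the citation below:

```python
from huggingface_hub import hf_hub_download

# Download the uncased tokenizer file and get its local path.
tokenizer_path = hf_hub_download(
    repo_id="mertcobanov/turkish-wordpiece-tokenizer",
    filename="turkish_wordpiece_tokenizer_uncased.json",
)
print(tokenizer_path)
```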
Features
- WordPiece Tokenization: Breaks words into subword units for better handling of rare or unseen words.
- Support for Cased and Uncased Text: Includes separate tokenizers for preserving case sensitivity and ignoring case.
- Optimized for Turkish: Trained on a large-scale Turkish dataset (1 billion sentences), ensuring strong coverage of Turkish vocabulary and morphology.
- Special Tokens: Includes commonly used tokens such as:
  - `[UNK]` (unknown token)
  - `[PAD]` (padding token)
  - `[CLS]` (classification token)
  - `[SEP]` (separator token)
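As a quick check that these special tokens are present in the vocabulary, they can be looked up by id after loading a tokenizer file. A minimal sketch, assuming a local copy of the uncased tokenizer at the path shown:

```python
from tokenizers import Tokenizer

tokenizer = Tokenizer.from_file("path/to/turkish_wordpiece_tokenizer_uncased.json")

# Each special token should resolve to a vocabulary id; None would indicate it is missing.
for token in ["[UNK]", "[PAD]", "[CLS]", "[SEP]"]:
    print(token, tokenizer.token_to_id(token))
```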
Usage
To use the tokenizer, you can load it with the Hugging Face `transformers` library or the `tokenizers` library.

Loading with `tokenizers`:
```python
from tokenizers import Tokenizer

# Load the uncased tokenizer
tokenizer = Tokenizer.from_file("path/to/turkish_wordpiece_tokenizer_uncased.json")

# Tokenize a sentence
output = tokenizer.encode("Merhaba dünya!")
print(output.tokens)
```
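The same tokenizer file can also be wrapped for use with `transformers`. A minimal sketch, assuming the same local file path; the special-token arguments mirror the tokens listed above:

```python
from transformers import PreTrainedTokenizerFast

# Wrap the tokenizers-format file in a fast tokenizer usable by transformers pipelines.
hf_tokenizer = PreTrainedTokenizerFast(
    tokenizer_file="path/to/turkish_wordpiece_tokenizer_uncased.json",
    unk_token="[UNK]",
    pad_token="[PAD]",
    cls_token="[CLS]",
    sep_token="[SEP]",
)

encoded = hf_tokenizer("Merhaba dünya!")
print(encoded.input_ids)
print(hf_tokenizer.convert_ids_to_tokens(encoded.input_ids))
```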
Tokenizer Training Details
- Dataset: 1 billion Turkish sentences, sourced from diverse domains (news, social media, literature, etc.).
- Model: WordPiece tokenizer, trained with a vocabulary size suitable for the Turkish language.
- Uncased Variant: Lowercases all text during tokenization to ignore case distinctions.
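To illustrate the uncased behavior, a minimal comparison sketch, assuming both tokenizer files are available at the paths shown (adjust as needed):

```python
from tokenizers import Tokenizer

cased = Tokenizer.from_file("path/to/turkish_wordpiece_tokenizer.json")
uncased = Tokenizer.from_file("path/to/turkish_wordpiece_tokenizer_uncased.json")

sentence = "İstanbul'da hava çok güzel."

# The cased tokenizer preserves capitalization; the uncased one lowercases before splitting.
print(cased.encode(sentence).tokens)
print(uncased.encode(sentence).tokens)
```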
Applications
- Text Classification
- Machine Translation
- Question Answering
- Text Summarization
- Named Entity Recognition (NER)
Citation
If you use this tokenizer in your research or applications, please cite it as follows:
```bibtex
@misc{turkish_wordpiece_tokenizer,
  title={Turkish WordPiece Tokenizer},
  author={Mert Cobanov},
  year={2024},
  url={https://huggingface.co./mertcobanov/turkish-wordpiece-tokenizer}
}
```
Contributions
Contributions are welcome! If you have suggestions or improvements, please create an issue or submit a pull request.