---
language:
- zh
library_name: transformers
base_model: fnlp/bart-base-chinese
tags:
- BART
- Chinese
- Traditional Chinese
- Cantonese
---

## BertTokenizer-based tokenizer that segments Chinese/Cantonese sentences into phrases

In addition to the original 51,271 tokens inherited from the base tokenizer, 194,020 Chinese vocabulary entries (words and phrases) have been added to this tokenizer.

Usage:

```
from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained('raptorkwok/wordseg-tokenizer')
```

### Examples:

Cantonese Example 1

```
tokenizer.tokenize("我哋今日去睇陳奕迅演唱會")
# Output: ['我哋', '今日', '去', '睇', '陳奕迅', '演唱會']
```

Cantonese Example 2

```
tokenizer.tokenize("再嘈我打爆你個嘴!")
# Output: ['再', '嘈', '我', '打爆', '你', '個', '嘴', '!']
```

Chinese Example 1

```
tokenizer.tokenize("你很肥胖呢,要開始減肥了。")
# Output: ['你', '很', '肥胖', '呢', ',', '要', '開始', '減肥', '了', '。']
```

Chinese Example 2

```
tokenizer.tokenize("案件現由大嶼山警區重案組接手調查。")
# Output: ['案件', '現', '由', '大嶼山', '警區', '重案組', '接手', '調查', '。']
```

## Questions?

Please feel free to leave a message in the Community tab.
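
## Using the tokenizer with the base model

Because the expanded vocabulary is much larger than the 51,271-token embedding table of the base model, pairing this tokenizer with `fnlp/bart-base-chinese` would likely require resizing the model's embeddings before fine-tuning. The snippet below is a minimal sketch of that step, not a documented workflow of this repository; the newly added embedding rows are randomly initialised and need to be trained.

```
from transformers import BartForConditionalGeneration, BertTokenizer

# Load the word-segmentation tokenizer and the base BART model.
tokenizer = BertTokenizer.from_pretrained("raptorkwok/wordseg-tokenizer")
model = BartForConditionalGeneration.from_pretrained("fnlp/bart-base-chinese")

# Grow the embedding (and output) matrix to cover the added vocabulary.
# len(tokenizer) counts the original tokens plus all added entries.
model.resize_token_embeddings(len(tokenizer))
```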