Converting to native Transformers

#81
opened by cyrilvallez (HF staff)
No description provided.
cyrilvallez changed pull request title from Upload folder using huggingface_hub to Converting to native Transformers

This PR converts the model to be used natively within Transformers (see https://github.com/huggingface/transformers/pull/33823)

Will this break compatibility for implementations like llama.cpp?

Knowledge Engineering Group (KEG) & Data Mining at Tsinghua University org

Thank you for your support. I will take a look in the next few days, and if it works properly, we will adopt this set of specifications for merging into transformers.

Knowledge Engineering Group (KEG) & Data Mining at Tsinghua University org

I found that this code does not run properly. Should it be modified like this?

def _pad(
        self,
        encoded_inputs: Union[Dict[str, EncodedInput], BatchEncoding],
        max_length: Optional[int] = None,
        padding_side: str = "left",  # Add this code
        padding_strategy: PaddingStrategy = PaddingStrategy.DO_NOT_PAD,
        pad_to_multiple_of: Optional[int] = None,
        return_attention_mask: Optional[bool] = None,
) -> dict:

Additionally, the apply_chat_template function has been deprecated, and you can use the one provided by transformers directly. Can this comment be deleted?
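For illustration, a minimal sketch of calling the built-in apply_chat_template from transformers (the revision shown is this PR's, as used later in this thread; the chat template itself is assumed to come from tokenizer_config.json):

from transformers import AutoTokenizer

# Load the tokenizer from this PR's revision; the chat template is read from
# tokenizer_config.json, so no custom apply_chat_template method is needed.
tokenizer = AutoTokenizer.from_pretrained("THUDM/glm-4-9b-chat", revision="refs/pr/81")

messages = [{"role": "user", "content": "Hello"}]
input_ids = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
)
print(input_ids.shape)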
A complete version of the code could perhaps look like this:

import regex as re
import base64
import os
import tiktoken
from typing import List, Optional, Union, Dict
from transformers import PreTrainedTokenizer
from transformers.utils import PaddingStrategy
from transformers.tokenization_utils_base import EncodedInput, BatchEncoding


class ChatGLM4Tokenizer(PreTrainedTokenizer):
    vocab_files_names = {"vocab_file": "tokenizer.model"}
    model_input_names = ["input_ids", "attention_mask", "position_ids"]

    def __init__(
            self,
            vocab_file,
            clean_up_tokenization_spaces=False,
            **kwargs
    ):
        self.name = "GLM4Tokenizer"
        self.vocab_file = vocab_file
        pat_str = "(?i:'s|'t|'re|'ve|'m|'ll|'d)|[^\\r\\n\\p{L}\\p{N}]?\\p{L}+|\\p{N}{1,3}| ?[^\\s\\p{L}\\p{N}]+[\\r\\n]*|\\s*[\\r\\n]+|\\s+(?!\\S)|\\s+"
        self.pat_str = re.compile(pat_str)

        mergeable_ranks = {}
        with open(vocab_file) as f:
            for line in f:
                token, rank = line.strip().split()
                rank = int(rank)
                token = base64.b64decode(token)
                mergeable_ranks[token] = rank

        self.mergeable_ranks = mergeable_ranks

        self.tokenizer = tiktoken.Encoding(
            name="my_tokenizer",
            pat_str=pat_str,
            mergeable_ranks=mergeable_ranks,
            special_tokens={}
        )
        self.decoder = {rank: token for token, rank in mergeable_ranks.items()}
        self.n_words = len(self.decoder)

        super().__init__(
            clean_up_tokenization_spaces=clean_up_tokenization_spaces,
            **kwargs
        )

    @property
    def vocab_size(self):
        return self.n_words

    def get_vocab(self):
        """ Returns vocab as a dict """
        vocab = {self._convert_id_to_token(i): i for i in range(self.vocab_size)}
        vocab.update(self.added_tokens_encoder)
        return vocab

    def convert_tokens_to_string(self, tokens: List[Union[bytes, str, int]]) -> str:
        """
        Converts a sequence of tokens in a single string.
        """
        text = ""
        temp = b""
        for t in tokens:
            if isinstance(t, int):
                t = chr(t)
            if isinstance(t, str):
                if temp:
                    text += temp.decode("utf-8", errors="replace")
                    temp = b""
                text += t
            elif isinstance(t, bytes):
                temp += t
            else:
                raise TypeError("token should only be of type int, bytes or str")
        if temp:
            text += temp.decode("utf-8", errors="replace")
        return text

    def _tokenize(self, text, **kwargs):
        tokens = []
        ids = self.tokenizer.encode(text)
        for t in ids:
            tokens.append(self.decoder[t])
        return tokens

    def _convert_token_to_id(self, token):
        """ Converts a token (str) in an id using the vocab. """
        return self.mergeable_ranks[token]

    def _convert_id_to_token(self, index):
        """Converts an index (integer) in a token (str) using the vocab."""
        return self.decoder.get(index, "")

    def save_vocabulary(self, save_directory, filename_prefix=None):
        """
        Save the vocabulary and special tokens file to a directory.

        Args:
            save_directory (`str`):
                The directory in which to save the vocabulary.
            filename_prefix (`str`, *optional*):
                An optional prefix to add to the named of the saved files.

        Returns:
            `Tuple(str)`: Paths to the files saved.
        """
        if os.path.isdir(save_directory):
            vocab_file = os.path.join(
                save_directory, self.vocab_files_names["vocab_file"]
            )
        else:
            vocab_file = save_directory

        with open(self.vocab_file, 'rb') as fin:
            proto_str = fin.read()

        with open(vocab_file, "wb") as writer:
            writer.write(proto_str)

        return (vocab_file,)

    def get_prefix_tokens(self):
        prefix_tokens = [self.convert_tokens_to_ids("[gMASK]"), self.convert_tokens_to_ids("<sop>")]
        return prefix_tokens

    def build_single_message(self, role, metadata, message, tokenize=True):
        assert role in ["system", "user", "assistant", "observation"], role
        if tokenize:
            role_tokens = [self.convert_tokens_to_ids(f"<|{role}|>")] + self.tokenizer.encode(f"{metadata}\n",
                                                                                              disallowed_special=())
            message_tokens = self.tokenizer.encode(message, disallowed_special=())
            tokens = role_tokens + message_tokens
            return tokens
        else:
            return str(f"<|{role}|>{metadata}\n{message}")

    def build_inputs_with_special_tokens(
            self, token_ids_0: List[int], token_ids_1: Optional[List[int]] = None
    ) -> List[int]:
        """
        Build model inputs from a sequence or a pair of sequence for sequence classification tasks by concatenating and
        adding special tokens. A BERT sequence has the following format:

        - single sequence: `[CLS] X [SEP]`
        - pair of sequences: `[CLS] A [SEP] B [SEP]`

        Args:
            token_ids_0 (`List[int]`):
                List of IDs to which the special tokens will be added.
            token_ids_1 (`List[int]`, *optional*):
                Optional second list of IDs for sequence pairs.

        Returns:
            `List[int]`: List of [input IDs](../glossary#input-ids) with the appropriate special tokens.
        """
        prefix_tokens = self.get_prefix_tokens()
        token_ids_0 = prefix_tokens + token_ids_0
        if token_ids_1 is not None:
            token_ids_0 = token_ids_0 + token_ids_1 + [self.convert_tokens_to_ids("<eos>")]
        return token_ids_0

    def _pad(
            self,
            encoded_inputs: Union[Dict[str, EncodedInput], BatchEncoding],
            max_length: Optional[int] = None,
            padding_side: str = "left",
            padding_strategy: PaddingStrategy = PaddingStrategy.DO_NOT_PAD,
            pad_to_multiple_of: Optional[int] = None,
            return_attention_mask: Optional[bool] = None,
    ) -> dict:
        """
        Pad encoded inputs (on left/right and up to predefined length or max length in the batch)

        Args:
            encoded_inputs:
                Dictionary of tokenized inputs (`List[int]`) or batch of tokenized inputs (`List[List[int]]`).
            max_length: maximum length of the returned list and optionally padding length (see below).
                Will truncate by taking into account the special tokens.
            padding_strategy: PaddingStrategy to use for padding.

                - PaddingStrategy.LONGEST Pad to the longest sequence in the batch
                - PaddingStrategy.MAX_LENGTH: Pad to the max length (default)
                - PaddingStrategy.DO_NOT_PAD: Do not pad
                The tokenizer padding sides are defined in self.padding_side:

                    - 'left': pads on the left of the sequences
                    - 'right': pads on the right of the sequences
            pad_to_multiple_of: (optional) Integer if set will pad the sequence to a multiple of the provided value.
                This is especially useful to enable the use of Tensor Core on NVIDIA hardware with compute capability
                `>= 7.5` (Volta).
            return_attention_mask:
                (optional) Set to False to avoid returning attention mask (default: set to model specifics)
        """
        # Load from model defaults

        required_input = encoded_inputs[self.model_input_names[0]]
        seq_length = len(required_input)

        if padding_strategy == PaddingStrategy.LONGEST:
            max_length = len(required_input)

        if max_length is not None and pad_to_multiple_of is not None and (max_length % pad_to_multiple_of != 0):
            max_length = ((max_length // pad_to_multiple_of) + 1) * pad_to_multiple_of

        needs_to_be_padded = padding_strategy != PaddingStrategy.DO_NOT_PAD and len(required_input) != max_length

        # Initialize attention mask if not present.
        if "attention_mask" not in encoded_inputs:
            encoded_inputs["attention_mask"] = [1] * seq_length

        if "position_ids" not in encoded_inputs:
            encoded_inputs["position_ids"] = list(range(seq_length))

        if needs_to_be_padded:
            difference = max_length - len(required_input)

            if "attention_mask" in encoded_inputs:
                encoded_inputs["attention_mask"] = [0] * difference + encoded_inputs["attention_mask"]
            if "position_ids" in encoded_inputs:
                encoded_inputs["position_ids"] = [0] * difference + encoded_inputs["position_ids"]
            encoded_inputs[self.model_input_names[0]] = [self.pad_token_id] * difference + required_input

        return encoded_inputs
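For reference, a minimal usage sketch of the class above (paths are illustrative, assuming tokenizer.model sits in the working directory):

# Illustrative only: instantiate the slow tokenizer defined above directly from
# the tiktoken vocab file and round-trip a short string.
tokenizer = ChatGLM4Tokenizer(vocab_file="tokenizer.model")
ids = tokenizer.encode("Hello, GLM-4!", add_special_tokens=False)
tokens = tokenizer.convert_ids_to_tokens(ids)
print(ids)
print(tokenizer.convert_tokens_to_string(tokens))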

Hi @zRzRzRzRzRzRzR ! I'm not sure how you tested this, but the _pad issue actually comes from your own version. With the new model added in transformers, nothing relies on your custom .py files anymore. Since it is not merged yet (it will be soon, we are just correcting issues from our automatic file converter internally), you need to install transformers from the correct branch for now: pip install git+https://github.com/huggingface/transformers.git@glm. Then, to try it out, specify the revision of this PR on the hub when loading the model:

from transformers import AutoTokenizer, AutoModelForCausalLM
import torch

device = 3

tokenizer = AutoTokenizer.from_pretrained('THUDM/glm-4-9b-chat', revision="refs/pr/81")
model = AutoModelForCausalLM.from_pretrained('THUDM/glm-4-9b-chat', torch_dtype=torch.float16, revision="refs/pr/81").to(device)

sequence = 'Hello I am doing'
inputs = tokenizer.encode(sequence, return_tensors='pt').to(device)
out = model.generate(inputs, do_sample=False, max_new_tokens=50)
print(tokenizer.batch_decode(out, skip_special_tokens=True)[0])

Let me know if you still experience issues when doing this. From my tests, everything runs smoothly.

PS: I would advise you to wait until https://github.com/huggingface/transformers/pull/33823 is merged into transformers before merging this PR on the hub. That way, users will only need to install transformers from main to have the model available.
PPS: I'm not exactly sure how llama.cpp works, but if a correct transformers version (containing the model definition) is installed in the environment, I don't see any reason why it would not work properly.

Knowledge Engineering Group (KEG) & Data Mining at Tsinghua University org

Yes, I also noticed this issue, so I fixed the padding problem in the main branch yesterday. There is no need to modify your PR; I have already uploaded the modified tokenizer_chatglm.py code to the main branch of this repository.

Knowledge Engineering Group (KEG) & Data Mining at Tsinghua University org
edited Oct 7

Regarding the installation, I have successfully installed the GLM branch and debugged the generation part, and it works properly. Once this PR is merged into the transformers main branch and a release is published, I will proceed with merging this PR (you can merge the changes I made to tokenizer_chatglm.py yesterday into this PR).

Could you be more specific about the changes you want to the tokenizer? As I said, with this PR no code relies on your .py files (which means you could delete them all in this repo; I forgot to do so when opening this PR). The tokenizer is one of our PreTrainedTokenizerFast classes (created from your tokenizer.model), in which I added a post processor to always add your two BOS tokens ([gMASK]<sop>) automatically. If you want to change this and/or the chat template, let me know; otherwise, the inner workings of the tokenizer now rely on our own PreTrainedTokenizerFast (and you can easily change these small settings of the tokenizer yourself after you merge this).
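For illustration, roughly how such a post processor can be attached to a PreTrainedTokenizerFast with the tokenizers library (the actual configuration in this PR lives in tokenizer.json; the token ids are looked up rather than hard-coded):

from transformers import AutoTokenizer
from tokenizers.processors import TemplateProcessing

tok = AutoTokenizer.from_pretrained("THUDM/glm-4-9b-chat", revision="refs/pr/81")

# Look up the ids of the two BOS tokens instead of hard-coding them.
gmask_id = tok.convert_tokens_to_ids("[gMASK]")
sop_id = tok.convert_tokens_to_ids("<sop>")

# Post processor that always prepends [gMASK]<sop> to encoded sequences.
tok.backend_tokenizer.post_processor = TemplateProcessing(
    single="[gMASK] <sop> $A",
    pair="[gMASK] <sop> $A $B",
    special_tokens=[("[gMASK]", gmask_id), ("<sop>", sop_id)],
)

print(tok("Hello").input_ids[:2])  # starts with the [gMASK] and <sop> ids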

Knowledge Engineering Group (KEG) & Data Mining at Tsinghua University org
edited Oct 7

Oh, I understand what you mean now. No modifications are needed, and any changes you’ve made on GitHub don’t require further adjustments.

The content here https://github.com/huggingface/transformers/pull/33823 doesn’t need any changes. It’s perfectly fine, and I sincerely apologize for the misunderstanding! The chat template doesn’t need any modifications either.

Also, the code should look like this, using PreTrainedTokenizerFast instead of AutoTokenizer, right?

Here is a simple example of dialogue code:

from transformers import PreTrainedTokenizerFast, GlmForCausalLM

device = 3

tokenizer = PreTrainedTokenizerFast.from_pretrained('glm-4-9b-chat')
model = GlmForCausalLM.from_pretrained('glm-4-9b-chat').to(device)

message = [
    {
        "role": "system",
        "content": "Answer the following question."
    },
    {
        "role": "user",
        "content": "How many legs does a cat have?"
    },
    {
        "role": "assistant",
        "content": "A cat has four legs."
    },
    {
        "role": "user",
        "content": "Is the animal I just asked about a mammal?"
    }
]

inputs = tokenizer.apply_chat_template(
    message,
    return_tensors='pt',
    add_generation_prompt=True,
    return_dict=True
).to(device)

input_len = inputs['input_ids'].shape[1]
generate_kwargs = {
    "input_ids": inputs['input_ids'],
    "attention_mask": inputs['attention_mask'],
    "max_new_tokens": 128,
    "do_sample": False,
}
out = model.generate(**generate_kwargs)
print(tokenizer.decode(out[0][input_len:], skip_special_tokens=True))

The model repository only contains the following files:

.

├── config.json
├── configuration.json
├── generation_config.json
├── LICENSE
├── model-00001-of-00004.safetensors
├── model-00002-of-00004.safetensors
├── model-00003-of-00004.safetensors
├── model-00004-of-00004.safetensors
├── model.safetensors.index.json
├── README_en.md
├── README.md
├── tokenizer_config.json
└── tokenizer.json

It runs normally. I checked the template and it is correct, so no modification is needed.
I believe the misunderstanding has been resolved.
After the code is merged, none of the files beyond those listed here need to be kept; they are no longer necessary.

Yes, exactly, you are correct about the files! However, you can still use AutoModelForCausalLM and AutoTokenizer to load everything; they will automatically point to the correct classes (if you check config.json and tokenizer_config.json, you will see that they have a field pointing to the correct class 🤗). They are used to load all models/tokenizers the same way, independently of the architecture!
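For illustration, a small sketch of how the Auto classes resolve the concrete classes from the repo metadata (using this PR's revision; the printed class names depend on the installed transformers version):

from transformers import AutoConfig, AutoTokenizer, AutoModelForCausalLM

# config.json carries the architecture name and tokenizer_config.json carries
# the tokenizer class, so the Auto classes can dispatch without custom code.
config = AutoConfig.from_pretrained("THUDM/glm-4-9b-chat", revision="refs/pr/81")
print(config.architectures)

tokenizer = AutoTokenizer.from_pretrained("THUDM/glm-4-9b-chat", revision="refs/pr/81")
model = AutoModelForCausalLM.from_pretrained("THUDM/glm-4-9b-chat", revision="refs/pr/81")
print(type(tokenizer).__name__, type(model).__name__)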

Knowledge Engineering Group (KEG) & Data Mining at Tsinghua University org

Yes, I saw it; this solution is compatible with my original implementation!

@cyrilvallez @zRzRzRzRzRzRzR Hi both, since the change has been merged into transformers, how can we use the config in this PR to load the native HF version of GLM with from_pretrained?

Knowledge Engineering Group (KEG) & Data Mining at Tsinghua University org

I noticed that version 4.46 of transformers has not been released yet. If I directly overwrite the main branch with the new version, will it prevent the old version of transformers (4.44) from being used? I have not confirmed this with testing.

@zRzRzRzRzRzRzR Thanks for the prompt reply! But I wonder, can we use what @cyrilvallez has made in this repo with from_pretrained, just for testing purposes?

Knowledge Engineering Group (KEG) & Data Mining at Tsinghua University org

The branch provided by @cyrilvallez works fine with version 4.46; I will do the final confirmation at the office tomorrow.

Knowledge Engineering Group (KEG) & Data Mining at Tsinghua University org
edited 22 days ago

We found that this code cannot run properly on transformers versions below 4.45.2, so we are preparing to create a new repository specifically for your version. The new repository will require users to run inference with transformers version 4.46 or above. Considering that a very large number of open-source frameworks are not yet compatible with transformers 4.46, we have decided not to make any changes to the main branch for now.
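For illustration, a guard one could place in downstream code to avoid loading the native version with an incompatible transformers release (the 4.46 threshold is the one discussed here; the check itself is a hypothetical sketch, not part of either repository):

import transformers
from packaging import version

# The native Glm implementation discussed here requires transformers >= 4.46.
if version.parse(transformers.__version__) < version.parse("4.46.0"):
    raise RuntimeError(
        f"This checkpoint needs transformers >= 4.46, found {transformers.__version__}; "
        "older environments should keep using the original repository."
    )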

@zRzRzRzRzRzRzR That would be great. Actually, you could create a branch in this repo, like hf4.46, so we can avoid the potential confusion another repo could lead to.

Knowledge Engineering Group (KEG) & Data Mining at Tsinghua University org

Yes, I proposed two options to my colleagues: creating a new branch or creating a new repository. We ultimately decided on a new repository, named glm-4-9b-hf. This old repository will cease maintenance and will carry a prominent notice suggesting the switch to the new repository.

Ready to merge
This branch is ready to get merged automatically.
