Accept tokens instead of strings, and a question regarding tokenizer behaviour

#35
by MH1P - opened

Hi all,

I usually prefer passing tokens directly to an embedding model (it makes more sense to me, as the max sequence length is expressed in tokens, not in string length). It looks to me like this is not possible with the .encode() method you provide.

Attempting to do so, I noticed that the tokenizer is calling .strip() and .lower(). Is the model oblivious to capital letters? Did you quantify the impact of lowercasing when capital letters carry meaning, e.g. 'NY' (New York) vs 'ny'? Could you see an impact on the ability to distinguish between proper names and other words?

Anyway, if anyone else is looking to feed tokens instead of strings, here is some sample code.

Cheers!

from transformers import AutoModel, AutoTokenizer
import torch


def mean_pooling(token_embeddings: torch.Tensor, attention_mask: torch.Tensor):
    # Broadcast the mask over the embedding dimension so padded positions contribute zero
    input_mask_expanded = attention_mask.unsqueeze(-1).expand(token_embeddings.size()).float()
    return torch.sum(token_embeddings * input_mask_expanded, 1) / torch.clamp(input_mask_expanded.sum(1), min=1e-9)

tokenizer = AutoTokenizer.from_pretrained("jinaai/jina-embeddings-v3")
print(type(tokenizer))

texts = ["Hello, world!", "How are you today?"]
# Follow the transformations applied in custom_st.py
texts = [s.strip() for s in texts]
texts = [s.lower() for s in texts]

batch_tokenized = tokenizer(texts, return_tensors='pt', padding=True, truncation="longest_first")
print(batch_tokenized)
print(type(batch_tokenized))

model = AutoModel.from_pretrained("jinaai/jina-embeddings-v3", trust_remote_code=True)
print(type(model))

# Index 0 of the model output holds the token embeddings; pool them into one vector per text
embs = model(**batch_tokenized)[0]
embs = mean_pooling(embs, batch_tokenized['attention_mask'])
print(type(embs))
print(embs)

print("----")

# Reference result from the model's own encode() method
embs2 = model.encode(texts, normalize_embeddings=False)
print(type(embs2))
print(embs2)
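
In case it is unclear what mean_pooling does with the padded positions, here is the same arithmetic for a single sequence, written with plain Python lists so it runs without torch. This is only an illustrative sketch: masked_mean is a hypothetical helper and the toy numbers are made up.

```python
def masked_mean(token_embeddings, attention_mask):
    """Average the token vectors of one sequence, skipping padded positions."""
    dim = len(token_embeddings[0])
    sums = [0.0] * dim
    for vector, keep in zip(token_embeddings, attention_mask):
        if keep:
            for i, value in enumerate(vector):
                sums[i] += value
    # Same guard as torch.clamp(..., min=1e-9): avoid dividing by zero
    count = max(sum(attention_mask), 1e-9)
    return [s / count for s in sums]


# Toy sequence: two real tokens followed by one padding position
tokens = [[1.0, 2.0], [3.0, 4.0], [100.0, 100.0]]
mask = [1, 1, 0]
pooled = masked_mean(tokens, mask)
print(pooled)  # [2.0, 3.0]
```

The padding vector [100.0, 100.0] has no influence on the result because its mask entry is 0.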
Jina AI org

Hi @MH1P ,

Attempting to do so, I noticed that the tokenizer is calling .strip() and .lower(). Is the model oblivious to capital letters?

No, our tokenizer is case sensitive. Where did you notice that behavior?

Here's an example:

from transformers import AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained('jinaai/jina-embeddings-v3')

txt = 'New York'
print(tokenizer(txt))
print(tokenizer(txt.lower()))

Output:

{'input_ids': [0, 2356, 5753, 2], 'attention_mask': [1, 1, 1, 1]}
{'input_ids': [0, 3525, 70662, 92, 2], 'attention_mask': [1, 1, 1, 1, 1]}
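
To make the difference explicit, the two outputs can also be compared programmatically. The sketch below just hardcodes the ids printed above instead of re-running the tokenizer, so it works without downloading anything.

```python
# Outputs copied from the printout above; nothing is re-tokenized here
cased = {'input_ids': [0, 2356, 5753, 2], 'attention_mask': [1, 1, 1, 1]}
lowercased = {'input_ids': [0, 3525, 70662, 92, 2], 'attention_mask': [1, 1, 1, 1, 1]}

# Different ids for the two casings means the tokenizer is case sensitive
same_ids = cased['input_ids'] == lowercased['input_ids']
print(same_ids)  # False

# Same number of characters, different number of tokens
print(len('New York'), len('New York'.lower()))  # 8 8
print(len(cased['input_ids']), len(lowercased['input_ids']))  # 4 5
```

Both sequences share the first and last ids (0 and 2), which look like the usual start and end special tokens; only the ids in between change with casing.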

Also, while your code looks good, it doesn't make use of the LoRA adapters, which are a very important part of jina-embeddings-v3. I recommend taking a look at our implementation of the encode() function.
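
As a rough sketch of what using the adapters looks like from the caller's side: this assumes encode() accepts a task keyword that selects the adapter, with the task names listed on the model card. The encode_with_adapter wrapper and the KNOWN_TASKS tuple are hypothetical helpers for illustration, not part of the model code.

```python
# Task names as listed on the jina-embeddings-v3 model card (an assumption here)
KNOWN_TASKS = (
    "retrieval.query",
    "retrieval.passage",
    "separation",
    "classification",
    "text-matching",
)


def encode_with_adapter(model, texts, task="text-matching"):
    """Hypothetical wrapper: check the task name, then let encode() pick the adapter."""
    if task not in KNOWN_TASKS:
        raise ValueError(f"Unknown task {task!r}; expected one of {KNOWN_TASKS}")
    return model.encode(texts, task=task)


# Usage, assuming the model was loaded as in the snippets in this thread:
# model = AutoModel.from_pretrained("jinaai/jina-embeddings-v3", trust_remote_code=True)
# embs = encode_with_adapter(model, ["Hello, world!"], task="text-matching")
```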

Hi, and thank you for your quick answer!

My apologies, it seems I had stale state in my notebook, and I then got confused while exploring the code in https://huggingface.co./jinaai/jina-embeddings-v3/blob/main/custom_st.py (line :

        # strip
        to_tokenize = [[str(s).strip() for s in col] for col in to_tokenize]

        # Lowercase
        if self.do_lower_case:
            to_tokenize = [[s.lower() for s in col] for col in to_tokenize]

So what is the purpose of custom_st.py?

Here's updated comparison code that indeed shows the same results.

from transformers import AutoModel, AutoTokenizer
import torch


def mean_pooling(token_embeddings: torch.Tensor, attention_mask: torch.Tensor):
    # Broadcast the mask over the embedding dimension so padded positions contribute zero
    input_mask_expanded = attention_mask.unsqueeze(-1).expand(token_embeddings.size()).float()
    return torch.sum(token_embeddings * input_mask_expanded, 1) / torch.clamp(input_mask_expanded.sum(1), min=1e-9)


# Initialize the tokenizer and the model
tokenizer = AutoTokenizer.from_pretrained("jinaai/jina-embeddings-v3", trust_remote_code=True)
model = AutoModel.from_pretrained("jinaai/jina-embeddings-v3", trust_remote_code=True)


def tokenize_model(texts):
    # Tokenize manually, run the model, then pool the token embeddings
    batch_tokenized = tokenizer(texts, return_tensors='pt', padding=True, truncation="longest_first")
    embs = model(**batch_tokenized)[0]
    embs = mean_pooling(embs, batch_tokenized['attention_mask'])
    return embs.detach().numpy()


def direct_model(texts):
    # Let the model's own encode() handle tokenization and pooling
    return model.encode(texts, normalize_embeddings=False)


texts = ["Hello, world!", "How are you today?"]
print(tokenize_model(texts))
print(direct_model(texts))

Going back to the root of my exploration: is there a reason not to allow a sequence of tokens (or a batch of sequences) as input to the encode method?
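
To illustrate what I have in mind, here is a minimal sketch of the preparation step for token input: right-padding a batch of id lists and building the matching attention mask. pad_token_batch is a hypothetical helper, and pad_id=1 is an assumption for the example; in practice the value should come from tokenizer.pad_token_id.

```python
def pad_token_batch(batch_ids, pad_id):
    """Right-pad lists of token ids to a common length and build the attention mask."""
    longest = max(len(ids) for ids in batch_ids)
    input_ids, attention_mask = [], []
    for ids in batch_ids:
        padding = longest - len(ids)
        input_ids.append(list(ids) + [pad_id] * padding)
        attention_mask.append([1] * len(ids) + [0] * padding)
    return {"input_ids": input_ids, "attention_mask": attention_mask}


# Token ids taken from the tokenizer output earlier in this thread
batch = pad_token_batch([[0, 2356, 5753, 2], [0, 3525, 70662, 92, 2]], pad_id=1)
print(batch["input_ids"])       # [[0, 2356, 5753, 2, 1], [0, 3525, 70662, 92, 2]]
print(batch["attention_mask"])  # [[1, 1, 1, 1, 0], [1, 1, 1, 1, 1]]
```

Wrapping both lists in torch.tensor(...) would give something that can be passed to model(**batch) and then to mean_pooling, although, as pointed out above, that path does not use the LoRA adapters.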

Thank you!
