What is missing from my mental model about semantics and this embedding model?

#20
by xhrud

Hi there,

I am trying to get the semantic similarity between these four words: "attack", "fight", "battle", and "walk". But the results look like this (full code is at the bottom):

$ python jina.py
Attack vs fight: 0.8896375298500061
Fight vs battle: 0.9226689338684082
Attack vs battle: 0.8903734683990479
Attack vs walk: 0.8300195932388306
...

I thought the jina embeddings would put more distance between the first three words and the last one. I'm surprised to see that the cosine similarity between attack and walk is still "high", ~0.83. What's wrong with my mental model here?

I see that there are references to text-matching (https://jina.ai/news/jina-embeddings-v3-a-frontier-multilingual-embedding-model/) and other references that classify text (https://medium.com/@anoopjohny2000/unraveling-textual-semantic-similarity-jina-embeddings-vs-llama-models-implementations-c375461d1d0e), but no code samples for that term.

Is it a good rule of thumb to say an item is semantically close if it has a similarity score of 0.88 or above? Or, say, 0.85? This seems totally arbitrary to me.

How should I think about using this embedding model in this way? I also tried longer sentences, and there is more spread between them, but now the cutoff seems to be 0.78 or so.

I'm looking for a generalized way to see if sentences could be compared semantically against a set of actions. For example, I want to see how a sentence like "Attack Baldur the Great" matches up against keywords like "attack", "discuss", "journey", "investigate". I would like to categorize the sentence as matching the "attack" keyword, but would like the same thing to happen if the phrase was "Battle with Baldur the Great" or "Engage in combat with Baldur the Great." Hopefully you can see where I'm going with this!
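
In other words, I am imagining something like this. (categorize is just an illustrative name, nothing I have working; it ranks the keywords by similarity and takes the best one rather than applying a fixed cutoff.)

import numpy as np
from numpy.linalg import norm
from transformers import AutoModel

model = AutoModel.from_pretrained('jinaai/jina-embeddings-v2-small-en', trust_remote_code=True)

def categorize(sentence, keywords):
    # Rank the keywords by cosine similarity to the sentence and take the
    # best one, instead of testing against an absolute threshold.
    vecs = np.asarray(model.encode([sentence] + keywords))
    vecs = vecs / norm(vecs, axis=1, keepdims=True)
    sims = vecs[1:] @ vecs[0]
    return keywords[int(np.argmax(sims))]

print(categorize("Attack Baldur the Great",
                 ["attack", "discuss", "journey", "investigate"]))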

jina.py:

from transformers import AutoModel
from numpy.linalg import norm

# Cosine similarity between two embedding vectors (1.0 = same direction)
cos_sim = lambda a,b: (a @ b.T) / (norm(a)*norm(b))
model = AutoModel.from_pretrained('jinaai/jina-embeddings-v2-small-en', trust_remote_code=True) # trust_remote_code is needed to use the encode method

embeddings = model.encode(
    ['Attack',
     'Fight',
     'Battle',
     'Walk'
     ]
    )
print(f"Attack vs fight: {cos_sim(embeddings[0], embeddings[1])}")
print(f"Fight vs battle: {cos_sim(embeddings[1], embeddings[2])}")
print(f"Attach vs battle: {cos_sim(embeddings[0], embeddings[2])}")
print(f"Attack vs walk: {cos_sim(embeddings[0], embeddings[3])}")

print("\n")

embeddings = model.encode(
    ['Attack the orc',
     'Fight baldur the great',
     'Battle with minos the invisible',
     "Walk to the store",
     ])
print(f"Sentence attack vs fight: {cos_sim(embeddings[0], embeddings[1])}")
print(f"Sentence fight vs battle: {cos_sim(embeddings[1], embeddings[2])}")
print(f"Sentence attack vs battle: {cos_sim(embeddings[0], embeddings[2])}")
print(f"Sentence attack vs go to the store: {cos_sim(embeddings[0], embeddings[3])}")

print("\n")

def compute_similarity(sentence1, sentence2):
    embeddings = model.encode([sentence1, sentence2])
    result = cos_sim(embeddings[0], embeddings[1])
    return result

similarity1 = compute_similarity("I love apple.", "I like Banana.")
similarity2 = compute_similarity("I like OpenAI.", "I like OpenAI.")
similarity3 = compute_similarity("I like OpenAI.", "I don't like OpenAI.")

print(f"Apples vs banana: {similarity1}")
print(f"OpenAI = OpenAI: {similarity2}")
print(f"Like vs don't like: {similarity3}")

Results:

Attack vs fight: 0.8896375298500061
Fight vs battle: 0.9226689338684082
Attack vs battle: 0.8903734683990479
Attack vs walk: 0.8300195932388306


Sentence attack vs fight: 0.7887848615646362
Sentence fight vs battle: 0.8090698719024658
Sentence attack vs battle: 0.7822835445404053
Sentence attack vs walk to the store: 0.7188634276390076


Apples vs banana: 0.8035626411437988
OpenAI = OpenAI: 1.0
Like vs don't like: 0.9356719851493835

I decided to refactor to compare keywords against phrases.

It seems strange to see this:

discuss vs Ask the peasant about the disturbances: 0.787786066532135
discuss vs Explore the area: 0.8331437706947327

Those scores seem backwards.

Or, this:

attack vs Battle with minos the invisible: 0.8007676601409912
attack vs Talk to the villager: 0.7797062397003174

Those seem too close for comfort.

Full results:

$ python /app/simple-jina.py
attack vs Attack the orc: 0.9002262353897095
attack vs Fight baldur the great: 0.8177021741867065
attack vs Battle with minos the invisible: 0.8007676601409912
attack vs Talk to the villager: 0.7797062397003174
attack vs Ask the peasant about the disturbances: 0.7647866010665894
attack vs Explore the area: 0.8125302791595459
attack vs Walk about the town: 0.7898878455162048
investigate vs Attack the orc: 0.8023191690444946
investigate vs Fight baldur the great: 0.7434965968132019
investigate vs Battle with minos the invisible: 0.761911153793335
investigate vs Talk to the villager: 0.7961578965187073
investigate vs Ask the peasant about the disturbances: 0.8098406791687012
investigate vs Explore the area: 0.8753792643547058
investigate vs Walk about the town: 0.7892642021179199
discuss vs Attack the orc: 0.7844048142433167
discuss vs Fight baldur the great: 0.7628872394561768
discuss vs Battle with minos the invisible: 0.7507494688034058
discuss vs Talk to the villager: 0.8287510871887207
discuss vs Ask the peasant about the disturbances: 0.787786066532135
discuss vs Explore the area: 0.8331437706947327
discuss vs Walk about the town: 0.8039745092391968

The script is:

$ cat jina.py
from transformers import AutoModel
from numpy.linalg import norm

cos_sim = lambda a,b: (a @ b.T) / (norm(a)*norm(b))
model = AutoModel.from_pretrained('jinaai/jina-embeddings-v2-small-en', trust_remote_code=True) # trust_remote_code is needed to use the encode method

keywords = ['attack', 'investigate', 'discuss' ]
phrases = [
    'Attack the orc',
    'Fight baldur the great',
    'Battle with minos the invisible',
    'Talk to the villager',
    'Ask the peasant about the disturbances',
    "Explore the area",
    'Walk about the town',
]

for keyword in keywords:
    for phrase in phrases:
        embeddings = model.encode( [ keyword, phrase ])
        print(f"{keyword} vs {phrase}: {cos_sim(embeddings[0], embeddings[1])}")

Hello!

I'm not affiliated with jina, but I do have experience with these types of models.
What I see a lot is that these models perform a bit unexpectedly when the input "type" is different from what the model has seen during training. And this model has (probably) never dealt with single words before, so it may behave unexpectedly there. Perhaps you'll get stronger results with exclusively short sentences instead of individual words?
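
For example, you could wrap each bare keyword in a short template so the inputs look more like sentences. A rough sketch (the template itself is just a guess, not something I've tuned):

from transformers import AutoModel

model = AutoModel.from_pretrained('jinaai/jina-embeddings-v2-small-en', trust_remote_code=True)

# Template the bare keywords into short sentences before encoding, so the
# inputs resemble the sentence-like text the model likely saw in training.
keywords = ['attack', 'investigate', 'discuss']
templated = [f"I want to {kw} something." for kw in keywords]
embeddings = model.encode(templated)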

  • Tom Aarsen

@tomaarsen Thanks so much. I am experimenting with longer phrases and that seems to help.

I tried applying softmax to the results, and the rankings look right to me. The raw similarities still seem off, but perhaps in context things work correctly.
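
One thing I realized while doing this: softmax is monotonic, so it never changes which candidate wins; it only rescales the raw scores into something probability-like. A quick sanity check (the scores here are made up):

import numpy as np

def softmax(x, temperature=1.0):
    # A temperature below 1 sharpens the distribution, but the argmax
    # stays the same because softmax preserves ordering.
    z = np.asarray(x, dtype=float) / temperature
    e = np.exp(z - np.max(z))
    return e / e.sum()

scores = [0.80, 0.76, 0.81]                  # made-up cosine similarities
print(softmax(scores))                       # nearly uniform
print(softmax(scores, temperature=0.05))     # much sharper, same winner
print(np.argmax(scores) == np.argmax(softmax(scores)))  # True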

I appreciate your comment.

It seems like this code chooses the right phrase:

Phrase Attack the orc matches attack the ogre
Phrase Explore the area matches investigate the area
Phrase Talk to the villager matches discuss with a merchant

My code:

from transformers import AutoModel
from numpy.linalg import norm
import numpy as np

cos_sim = lambda a,b: (a @ b.T) / (norm(a)*norm(b))
model = AutoModel.from_pretrained('jinaai/jina-embeddings-v2-small-en', trust_remote_code=True) # trust_remote_code is needed to use the encode method

def softmax(x):
    """
    Compute softmax values for each set of scores in x.

    >>> s = softmax([ 0.1, 0.3, 0.999 ])
    >>> s
    array([0.21374155, 0.26106452, 0.52519393])
    >>> np.argmax(s)
    2
    """
    e_x = np.exp(x - np.max(x))
    return e_x / e_x.sum()

keywords = ['attack the ogre', 'investigate the area', 'discuss with a merchant' ]
phrases = [
    'Attack the orc',
    'Fight baldur the great',
    'Battle with minos the invisible',
    'Talk to the villager',
    'Ask the peasant about the disturbances',
    "Explore the area",
    'Walk about the town',
]

for keyword in keywords:
    k = []
    for phrase in phrases:
        embeddings = model.encode( [ keyword, phrase ])
        sim = cos_sim(embeddings[0], embeddings[1])
        k.append(sim)
    maxes = softmax(k)  # softmax is monotonic, so this does not change the argmax
    idx = np.argmax(maxes)
    print(f"Phrase {phrases[idx]} matches {keyword}")
    # print(f"{keyword} vs {phrase}: {sim}")

If I swap the iterators, it does not look correct in some cases. Oh well.

$ python /app/simple-jina.py 

Checking phrase: Attack the orc
Cosine similarity to attack the ogre is 0.8540812134742737
Cosine similarity to investigate the area is 0.7925515174865723
Cosine similarity to discuss with a merchant is 0.7492237687110901
Keyword attack the ogre matches Attack the orc
Checking phrase: Fight baldur the great
Cosine similarity to attack the ogre is 0.8274071216583252
Cosine similarity to investigate the area is 0.7529822587966919
Cosine similarity to discuss with a merchant is 0.7617365717887878
Keyword attack the ogre matches Fight baldur the great
Checking phrase: Battle with minos the invisible
Cosine similarity to attack the ogre is 0.8110106587409973
Cosine similarity to investigate the area is 0.7597004771232605
Cosine similarity to discuss with a merchant is 0.7424607872962952
Keyword attack the ogre matches Battle with minos the invisible
Checking phrase: Talk to the villager
Cosine similarity to attack the ogre is 0.7795507907867432
Cosine similarity to investigate the area is 0.8309686779975891
Cosine similarity to discuss with a merchant is 0.8261212706565857
Keyword investigate the area matches Talk to the villager
Checking phrase: Ask the peasant about the disturbances
Cosine similarity to attack the ogre is 0.7807489633560181
Cosine similarity to investigate the area is 0.8291046619415283
Cosine similarity to discuss with a merchant is 0.7955068349838257
Keyword investigate the area matches Ask the peasant about the disturbances
Checking phrase: Explore the area
Cosine similarity to attack the ogre is 0.8111925721168518
Cosine similarity to investigate the area is 0.9507258534431458
Cosine similarity to discuss with a merchant is 0.8034010529518127
Keyword investigate the area matches Explore the area
Checking phrase: Walk about the town
Cosine similarity to attack the ogre is 0.7734066843986511
Cosine similarity to investigate the area is 0.8358470797538757
Cosine similarity to discuss with a merchant is 0.7842782735824585
Keyword investigate the area matches Walk about the town

$ cat /app/simple-jina.py
from transformers import AutoModel
from numpy.linalg import norm
import numpy as np

cos_sim = lambda a,b: (a @ b.T) / (norm(a)*norm(b))
model = AutoModel.from_pretrained('jinaai/jina-embeddings-v2-small-en', trust_remote_code=True) # trust_remote_code is needed to use the encode method

def softmax(x):
    """
    Compute softmax values for each set of scores in x.

    >>> s = softmax([ 0.1, 0.3, 0.999 ])
    >>> s
    array([0.21374155, 0.26106452, 0.52519393])
    >>> np.argmax(s)
    2
    """
    e_x = np.exp(x - np.max(x))
    return e_x / e_x.sum()

keywords = ['attack the ogre', 'investigate the area', 'discuss with a merchant' ]
phrases = [
    'Attack the orc',
    'Fight baldur the great',
    'Battle with minos the invisible',
    'Talk to the villager',
    'Ask the peasant about the disturbances',
    "Explore the area",
    'Walk about the town',
]

for phrase in phrases:
    print(f"Checking phrase: {phrase}")
    p = []
    for keyword in keywords:
        embeddings = model.encode( [ keyword, phrase ])
        sim = cos_sim(embeddings[0], embeddings[1])
        print(f"Cosine distance from {keyword} is {sim}")
        p.append(sim)
    maxes = softmax(p)
    idx = np.argmax(maxes)
    print(f"Phrase {keywords[idx]} matches {phrase}")
