multi-qa-MiniLM-distill-onnx-L6-cos-v1

This is a sentence-transformers model: It maps sentences & paragraphs to a 384 dimensional dense vector space and was designed for semantic search. It has been trained on 215M (question, answer) pairs from diverse sources. For an introduction to semantic search, have a look at: SBERT.net - Semantic Search

Usage (ONNX runtime)

Using optimum

from optimum.onnxruntime import ORTModelForFeatureExtraction
from transformers import AutoTokenizer

from transformers import Pipeline
import torch.nn.functional as F
import torch 

# copied from the model card
def mean_pooling(model_output, attention_mask):
    token_embeddings = model_output[0] #First element of model_output contains all token embeddings
    input_mask_expanded = attention_mask.unsqueeze(-1).expand(token_embeddings.size()).float()
    return torch.sum(token_embeddings * input_mask_expanded, 1) / torch.clamp(input_mask_expanded.sum(1), min=1e-9)


class SentenceEmbeddingPipeline(Pipeline):
    def _sanitize_parameters(self, **kwargs):
        # we don't have any hyperameters to sanitize
        preprocess_kwargs = {}
        return preprocess_kwargs, {}, {}
      
    def preprocess(self, inputs):
        encoded_inputs = self.tokenizer(inputs, padding=True, truncation=True, return_tensors='pt')
        return encoded_inputs

    def _forward(self, model_inputs):
        outputs = self.model(**model_inputs)
        return {"outputs": outputs, "attention_mask": model_inputs["attention_mask"]}

    def postprocess(self, model_outputs):
        # Perform pooling
        sentence_embeddings = mean_pooling(model_outputs["outputs"], model_outputs['attention_mask'])
        # Normalize embeddings
        sentence_embeddings = F.normalize(sentence_embeddings, p=2, dim=1)
        return sentence_embeddings

# load optimized model
model_name = "rawsh/multi-qa-MiniLM-distill-onnx-L6-cos-v1"
model = ORTModelForFeatureExtraction.from_pretrained(model_name, file_name="model_quantized.onnx")

# create optimized pipeline
tokenizer = AutoTokenizer.from_pretrained(model_name, use_fast=True)
optimized_emb = SentenceEmbeddingPipeline(model=model, tokenizer=tokenizer)
pred1 = optimized_emb("Hello world!")
pred2 = optimized_emb("I hate everything.")

print(pred1[0].dot(pred2[0]))

Usage (Sentence-Transformers)

Using this model becomes easy when you have sentence-transformers installed:

pip install -U sentence-transformers

Then you can use the model like this:

from sentence_transformers import SentenceTransformer, util

query = "How many people live in London?"
docs = ["Around 9 Million people live in London", "London is known for its financial district"]

#Load the model
model = SentenceTransformer('sentence-transformers/multi-qa-MiniLM-L6-cos-v1')

#Encode query and documents
query_emb = model.encode(query)
doc_emb = model.encode(docs)

#Compute dot score between query and all document embeddings
scores = util.dot_score(query_emb, doc_emb)[0].cpu().tolist()

#Combine docs & scores
doc_score_pairs = list(zip(docs, scores))

#Sort by decreasing score
doc_score_pairs = sorted(doc_score_pairs, key=lambda x: x[1], reverse=True)

#Output passages & scores
for doc, score in doc_score_pairs:
    print(score, doc)

PyTorch Usage (HuggingFace Transformers)

Without sentence-transformers, you can use the model like this: First, you pass your input through the transformer model, then you have to apply the correct pooling-operation on-top of the contextualized word embeddings.

from transformers import AutoTokenizer, AutoModel
import torch
import torch.nn.functional as F

#Mean Pooling - Take average of all tokens
def mean_pooling(model_output, attention_mask):
    token_embeddings = model_output.last_hidden_state
    input_mask_expanded = attention_mask.unsqueeze(-1).expand(token_embeddings.size()).float()
    return torch.sum(token_embeddings * input_mask_expanded, 1) / torch.clamp(input_mask_expanded.sum(1), min=1e-9)


#Encode text
def encode(texts):
    # Tokenize sentences
    encoded_input = tokenizer(texts, padding=True, truncation=True, return_tensors='pt')

    # Compute token embeddings
    with torch.no_grad():
        model_output = model(**encoded_input, return_dict=True)

    # Perform pooling
    embeddings = mean_pooling(model_output, encoded_input['attention_mask'])

    # Normalize embeddings
    embeddings = F.normalize(embeddings, p=2, dim=1)
    
    return embeddings


# Sentences we want sentence embeddings for
query = "How many people live in London?"
docs = ["Around 9 Million people live in London", "London is known for its financial district"]

# Load model from HuggingFace Hub
tokenizer = AutoTokenizer.from_pretrained("sentence-transformers/multi-qa-MiniLM-L6-cos-v1")
model = AutoModel.from_pretrained("sentence-transformers/multi-qa-MiniLM-L6-cos-v1")

#Encode query and docs
query_emb = encode(query)
doc_emb = encode(docs)

#Compute dot score between query and all document embeddings
scores = torch.mm(query_emb, doc_emb.transpose(0, 1))[0].cpu().tolist()

#Combine docs & scores
doc_score_pairs = list(zip(docs, scores))

#Sort by decreasing score
doc_score_pairs = sorted(doc_score_pairs, key=lambda x: x[1], reverse=True)

#Output passages & scores
for doc, score in doc_score_pairs:
    print(score, doc)

TensorFlow Usage (HuggingFace Transformers)

Similarly to the PyTorch example above, to use the model with TensorFlow you pass your input through the transformer model, then you have to apply the correct pooling-operation on-top of the contextualized word embeddings.

from transformers import AutoTokenizer, TFAutoModel
import tensorflow as tf

#Mean Pooling - Take attention mask into account for correct averaging
def mean_pooling(model_output, attention_mask):
    token_embeddings = model_output.last_hidden_state
    input_mask_expanded = tf.cast(tf.tile(tf.expand_dims(attention_mask, -1), [1, 1, token_embeddings.shape[-1]]), tf.float32)
    return tf.math.reduce_sum(token_embeddings * input_mask_expanded, 1) / tf.math.maximum(tf.math.reduce_sum(input_mask_expanded, 1), 1e-9)


#Encode text
def encode(texts):
    # Tokenize sentences
    encoded_input = tokenizer(texts, padding=True, truncation=True, return_tensors='tf')

    # Compute token embeddings
    model_output = model(**encoded_input, return_dict=True)

    # Perform pooling
    embeddings = mean_pooling(model_output, encoded_input['attention_mask'])

    # Normalize embeddings
    embeddings = tf.math.l2_normalize(embeddings, axis=1)

    return embeddings


# Sentences we want sentence embeddings for
query = "How many people live in London?"
docs = ["Around 9 Million people live in London", "London is known for its financial district"]

# Load model from HuggingFace Hub
tokenizer = AutoTokenizer.from_pretrained("sentence-transformers/multi-qa-MiniLM-L6-cos-v1")
model = TFAutoModel.from_pretrained("sentence-transformers/multi-qa-MiniLM-L6-cos-v1")

#Encode query and docs
query_emb = encode(query)
doc_emb = encode(docs)

#Compute dot score between query and all document embeddings
scores = (query_emb @ tf.transpose(doc_emb))[0].numpy().tolist()

#Combine docs & scores
doc_score_pairs = list(zip(docs, scores))

#Sort by decreasing score
doc_score_pairs = sorted(doc_score_pairs, key=lambda x: x[1], reverse=True)

#Output passages & scores
for doc, score in doc_score_pairs:
    print(score, doc)

Technical Details

In the following some technical details how this model must be used:

Setting Value
Dimensions 384
Produces normalized embeddings Yes
Pooling-Method Mean pooling
Suitable score functions dot-product (util.dot_score), cosine-similarity (util.cos_sim), or euclidean distance

Note: When loaded with sentence-transformers, this model produces normalized embeddings with length 1. In that case, dot-product and cosine-similarity are equivalent. dot-product is preferred as it is faster. Euclidean distance is proportional to dot-product and can also be used.


Background

The project aims to train sentence embedding models on very large sentence level datasets using a self-supervised contrastive learning objective. We use a contrastive learning objective: given a sentence from the pair, the model should predict which out of a set of randomly sampled other sentences, was actually paired with it in our dataset.

We developped this model during the Community week using JAX/Flax for NLP & CV, organized by Hugging Face. We developped this model as part of the project: Train the Best Sentence Embedding Model Ever with 1B Training Pairs. We benefited from efficient hardware infrastructure to run the project: 7 TPUs v3-8, as well as intervention from Googles Flax, JAX, and Cloud team member about efficient deep learning frameworks.

Intended uses

Our model is intented to be used for semantic search: It encodes queries / questions and text paragraphs in a dense vector space. It finds relevant documents for the given passages.

Note that there is a limit of 512 word pieces: Text longer than that will be truncated. Further note that the model was just trained on input text up to 250 word pieces. It might not work well for longer text.

Training procedure

The full training script is accessible in this current repository: train_script.py.

Pre-training

We use the pretrained nreimers/MiniLM-L6-H384-uncased model. Please refer to the model card for more detailed information about the pre-training procedure.

Training

We use the concatenation from multiple datasets to fine-tune our model. In total we have about 215M (question, answer) pairs. We sampled each dataset given a weighted probability which configuration is detailed in the data_config.json file.

The model was trained with MultipleNegativesRankingLoss using Mean-pooling, cosine-similarity as similarity function, and a scale of 20.

Dataset Number of training tuples
WikiAnswers Duplicate question pairs from WikiAnswers 77,427,422
PAQ Automatically generated (Question, Paragraph) pairs for each paragraph in Wikipedia 64,371,441
Stack Exchange (Title, Body) pairs from all StackExchanges 25,316,456
Stack Exchange (Title, Answer) pairs from all StackExchanges 21,396,559
MS MARCO Triplets (query, answer, hard_negative) for 500k queries from Bing search engine 17,579,773
GOOAQ: Open Question Answering with Diverse Answer Types (query, answer) pairs for 3M Google queries and Google featured snippet 3,012,496
Amazon-QA (Question, Answer) pairs from Amazon product pages 2,448,839
Yahoo Answers (Title, Answer) pairs from Yahoo Answers 1,198,260
Yahoo Answers (Question, Answer) pairs from Yahoo Answers 681,164
Yahoo Answers (Title, Question) pairs from Yahoo Answers 659,896
SearchQA (Question, Answer) pairs for 140k questions, each with Top5 Google snippets on that question 582,261
ELI5 (Question, Answer) pairs from Reddit ELI5 (explainlikeimfive) 325,475
Stack Exchange Duplicate questions pairs (titles) 304,525
Quora Question Triplets (Question, Duplicate_Question, Hard_Negative) triplets for Quora Questions Pairs dataset 103,663
Natural Questions (NQ) (Question, Paragraph) pairs for 100k real Google queries with relevant Wikipedia paragraph 100,231
SQuAD2.0 (Question, Paragraph) pairs from SQuAD2.0 dataset 87,599
TriviaQA (Question, Evidence) pairs 73,346
Total 214,988,242
Downloads last month
17
Inference Examples
This model does not have enough activity to be deployed to Inference API (serverless) yet. Increase its social visibility and check back later, or deploy to Inference Endpoints (dedicated) instead.

Datasets used to train rawsh/multi-qa-MiniLM-distill-onnx-L6-cos-v1