Update README.md
README.md CHANGED
@@ -3115,11 +3115,6 @@ model-index:
 `jina-embeddings-v2-base-de` is a German/English bilingual text **embedding model** supporting **8192 sequence length**.
 It is based on a BERT architecture (JinaBERT) that supports the symmetric bidirectional variant of [ALiBi](https://arxiv.org/abs/2108.12409) to allow longer sequence lengths.
 We have designed it for high performance in monolingual & cross-language applications and trained it specifically to support mixed German-English input without bias.
-
-The embedding model was trained using 512 sequence length, but extrapolates to 8k sequence length (or even longer) thanks to ALiBi.
-This makes our model useful for a range of use cases, especially when processing long documents is needed, including long document retrieval, semantic textual similarity, text reranking, recommendation, RAG and LLM-based generative search, etc.
-
-With a standard size of 161 million parameters, the model enables fast inference while delivering better performance than our small model. It is recommended to use a single GPU for inference.
 Additionally, we provide the following embedding models:
 
 - [`jina-embeddings-v2-small-en`](https://huggingface.co/jinaai/jina-embeddings-v2-small-en): 33 million parameters.
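Since ALiBi extrapolation is the point of the 8192-token window described in this hunk, a short usage sketch may help. The model id and `trust_remote_code=True` are taken from the hunks below; the `max_length=8192` value simply mirrors the advertised window and the placeholder document is illustrative:

```python
import torch
from transformers import AutoModel, AutoTokenizer

# trust_remote_code=True is needed because JinaBERT ships its modeling code on the Hub.
tokenizer = AutoTokenizer.from_pretrained('jinaai/jina-embeddings-v2-base-de')
model = AutoModel.from_pretrained('jinaai/jina-embeddings-v2-base-de', trust_remote_code=True)

long_document = 'Ein sehr langes Dokument ... ' * 1000  # placeholder long German/English text

# ALiBi lets the model extrapolate past its 512-token training length,
# so inputs can use the full advertised 8192-token window.
inputs = tokenizer(long_document, truncation=True, max_length=8192, return_tensors='pt')
with torch.no_grad():
    outputs = model(**inputs)
```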
@@ -3157,8 +3152,8 @@ def mean_pooling(model_output, attention_mask):
 
 sentences = ['How is the weather today?', 'What is the current weather like today?']
 
-tokenizer = AutoTokenizer.from_pretrained('jinaai/jina-embeddings-v2-
-model = AutoModel.from_pretrained('jinaai/jina-embeddings-v2-
+tokenizer = AutoTokenizer.from_pretrained('jinaai/jina-embeddings-v2-base-de')
+model = AutoModel.from_pretrained('jinaai/jina-embeddings-v2-base-de', trust_remote_code=True)
 
 encoded_input = tokenizer(sentences, padding=True, truncation=True, return_tensors='pt')
 
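The hunk header above points into the README's `transformers` example, whose `mean_pooling` helper body falls outside the diff. For reference, a minimal end-to-end sketch, assuming the helper follows the standard attention-mask-weighted pooling pattern:

```python
import torch
import torch.nn.functional as F
from transformers import AutoModel, AutoTokenizer

def mean_pooling(model_output, attention_mask):
    # Average token embeddings, ignoring padding positions via the attention mask.
    token_embeddings = model_output[0]
    input_mask_expanded = attention_mask.unsqueeze(-1).expand(token_embeddings.size()).float()
    return torch.sum(token_embeddings * input_mask_expanded, 1) / torch.clamp(input_mask_expanded.sum(1), min=1e-9)

sentences = ['How is the weather today?', 'What is the current weather like today?']

tokenizer = AutoTokenizer.from_pretrained('jinaai/jina-embeddings-v2-base-de')
model = AutoModel.from_pretrained('jinaai/jina-embeddings-v2-base-de', trust_remote_code=True)

encoded_input = tokenizer(sentences, padding=True, truncation=True, return_tensors='pt')

with torch.no_grad():
    model_output = model(**encoded_input)

# Pool token embeddings into one vector per sentence, then L2-normalize.
embeddings = mean_pooling(model_output, encoded_input['attention_mask'])
embeddings = F.normalize(embeddings, p=2, dim=1)
```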
@@ -3179,8 +3174,8 @@ from transformers import AutoModel
 from numpy.linalg import norm
 
 cos_sim = lambda a,b: (a @ b.T) / (norm(a)*norm(b))
-model = AutoModel.from_pretrained('jinaai/jina-embeddings-v2-base-
-embeddings = model.encode(['How is the weather today?', '
+model = AutoModel.from_pretrained('jinaai/jina-embeddings-v2-base-de', trust_remote_code=True)  # trust_remote_code is needed to use the encode method
+embeddings = model.encode(['How is the weather today?', 'Wie ist das Wetter heute?'])
 print(cos_sim(embeddings[0], embeddings[1]))
 ```
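Since the added lines pair an English sentence with its German counterpart, a small cross-language retrieval sketch may be useful. The document strings here are illustrative, and `encode` returning one array-like vector per input is assumed from the cosine-similarity lambda above:

```python
import numpy as np
from transformers import AutoModel

model = AutoModel.from_pretrained('jinaai/jina-embeddings-v2-base-de', trust_remote_code=True)

query = 'Wie ist das Wetter heute?'           # German query
documents = [                                  # English candidates (illustrative)
    'The weather is sunny and warm today.',
    'The quarterly report is due on Friday.',
]

query_vec = np.asarray(model.encode([query]))[0]
doc_vecs = np.asarray(model.encode(documents))

# Rank candidates by cosine similarity against the query.
scores = doc_vecs @ query_vec / (np.linalg.norm(doc_vecs, axis=1) * np.linalg.norm(query_vec))
print(documents[int(np.argmax(scores))])       # expected: the weather sentence
```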
@@ -3208,8 +3203,9 @@ According to the latest blog post from [LLamaIndex](https://blog.llamaindex.ai/b
 
 ## Plans
 
-
-
+1. Bilingual embedding models supporting more European & Asian languages, including Spanish, French, Italian and Japanese.
+2. Multimodal embedding models to enable multimodal RAG applications.
+3. High-performance rerankers.
 
 ## Contact
 