Update README.md
README.md CHANGED
@@ -3115,11 +3115,6 @@ model-index:
 `jina-embeddings-v2-base-de` is a German/English bilingual text **embedding model** supporting **8192 sequence length**.
 It is based on a BERT architecture (JinaBERT) that supports the symmetric bidirectional variant of [ALiBi](https://arxiv.org/abs/2108.12409) to allow longer sequence lengths.
 We have designed it for high performance in monolingual & cross-language applications and trained it specifically to support mixed German-English input without bias.
-
-The embedding model was trained using 512 sequence length, but extrapolates to 8k sequence length (or even longer) thanks to ALiBi.
-This makes our model useful for a range of use cases, especially when processing long documents is needed, including long document retrieval, semantic textual similarity, text reranking, recommendation, RAG and LLM-based generative search, etc.
-
-With a standard size of 161 million parameters, the model enables fast inference while delivering better performance than our small model. It is recommended to use a single GPU for inference.
 Additionally, we provide the following embedding models:
 
 - [`jina-embeddings-v2-small-en`](https://huggingface.co/jinaai/jina-embeddings-v2-small-en): 33 million parameters.
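Since ALiBi extrapolation is the point of the 8192-token window described in this hunk, a short usage sketch may help. The model id and `trust_remote_code=True` are taken from the hunks below; the `max_length=8192` value simply mirrors the advertised window and the placeholder document is illustrative:

```python
import torch
from transformers import AutoModel, AutoTokenizer

# trust_remote_code=True is needed because JinaBERT ships its modeling code on the Hub.
tokenizer = AutoTokenizer.from_pretrained('jinaai/jina-embeddings-v2-base-de')
model = AutoModel.from_pretrained('jinaai/jina-embeddings-v2-base-de', trust_remote_code=True)

long_document = 'Ein sehr langes Dokument ... ' * 1000  # placeholder long German/English text

# ALiBi lets the model extrapolate past its 512-token training length,
# so inputs can use the full advertised 8192-token window.
inputs = tokenizer(long_document, truncation=True, max_length=8192, return_tensors='pt')
with torch.no_grad():
    outputs = model(**inputs)
```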
@@ -3157,8 +3152,8 @@ def mean_pooling(model_output, attention_mask):
 
 sentences = ['How is the weather today?', 'What is the current weather like today?']
 
-tokenizer = AutoTokenizer.from_pretrained('jinaai/jina-embeddings-v2-
-model = AutoModel.from_pretrained('jinaai/jina-embeddings-v2-
+tokenizer = AutoTokenizer.from_pretrained('jinaai/jina-embeddings-v2-base-de')
+model = AutoModel.from_pretrained('jinaai/jina-embeddings-v2-base-de', trust_remote_code=True)
 
 encoded_input = tokenizer(sentences, padding=True, truncation=True, return_tensors='pt')
 
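The hunk header above points into the README's `transformers` example, whose `mean_pooling` helper body falls outside the diff. For reference, a minimal end-to-end sketch, assuming the helper follows the standard attention-mask-weighted pooling pattern:

```python
import torch
import torch.nn.functional as F
from transformers import AutoModel, AutoTokenizer

def mean_pooling(model_output, attention_mask):
    # Average token embeddings, ignoring padding positions via the attention mask.
    token_embeddings = model_output[0]
    input_mask_expanded = attention_mask.unsqueeze(-1).expand(token_embeddings.size()).float()
    return torch.sum(token_embeddings * input_mask_expanded, 1) / torch.clamp(input_mask_expanded.sum(1), min=1e-9)

sentences = ['How is the weather today?', 'What is the current weather like today?']

tokenizer = AutoTokenizer.from_pretrained('jinaai/jina-embeddings-v2-base-de')
model = AutoModel.from_pretrained('jinaai/jina-embeddings-v2-base-de', trust_remote_code=True)

encoded_input = tokenizer(sentences, padding=True, truncation=True, return_tensors='pt')

with torch.no_grad():
    model_output = model(**encoded_input)

# Pool token embeddings into one vector per sentence, then L2-normalize.
embeddings = mean_pooling(model_output, encoded_input['attention_mask'])
embeddings = F.normalize(embeddings, p=2, dim=1)
```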
@@ -3179,8 +3174,8 @@ from transformers import AutoModel
 from numpy.linalg import norm
 
 cos_sim = lambda a,b: (a @ b.T) / (norm(a)*norm(b))
-model = AutoModel.from_pretrained('jinaai/jina-embeddings-v2-base-
-embeddings = model.encode(['How is the weather today?', '
+model = AutoModel.from_pretrained('jinaai/jina-embeddings-v2-base-de', trust_remote_code=True)  # trust_remote_code is needed to use the encode method
+embeddings = model.encode(['How is the weather today?', 'Wie ist das Wetter heute?'])
 print(cos_sim(embeddings[0], embeddings[1]))
 ```
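Since the added lines pair an English sentence with its German counterpart, a small cross-language retrieval sketch may be useful. The document strings here are illustrative, and `encode` returning one array-like vector per input is assumed from the cosine-similarity lambda above:

```python
import numpy as np
from transformers import AutoModel

model = AutoModel.from_pretrained('jinaai/jina-embeddings-v2-base-de', trust_remote_code=True)

query = 'Wie ist das Wetter heute?'           # German query
documents = [                                  # English candidates (illustrative)
    'The weather is sunny and warm today.',
    'The quarterly report is due on Friday.',
]

query_vec = np.asarray(model.encode([query]))[0]
doc_vecs = np.asarray(model.encode(documents))

# Rank candidates by cosine similarity against the query.
scores = doc_vecs @ query_vec / (np.linalg.norm(doc_vecs, axis=1) * np.linalg.norm(query_vec))
print(documents[int(np.argmax(scores))])       # expected: the weather sentence
```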
@@ -3208,8 +3203,9 @@ According to the latest blog post from [LLamaIndex](https://blog.llamaindex.ai/b
 
 ## Plans
 
-
-
+1. Bilingual embedding models supporting more European & Asian languages, including Spanish, French, Italian and Japanese.
+2. Multimodal embedding models to enable multimodal RAG applications.
+3. High-performance rerankers.
 
 ## Contact
 