Alibaba-NLP/gte-multilingual-base · HERE IS HOW YOU USE THIS WITH TEI OR INFERENCE ENDPOINTS

remove token cls architecture (86ebdef6)

nbroad

Aug 16, 2024

testing to see if this works for TEI

delete comma (aff34970)

Upload model.safetensors with huggingface_hub (f2c9fcdf)

nbroad

Aug 16, 2024

•

edited Sep 3, 2024

I can confirm this PR does work for TEI.

If using Inference Endpoints:

go to https://ui.endpoints.huggingface.co/new
put Alibaba-NLP/gte-multilingual-base as the model repository
(optional) set the endpoint name
choose cloud provider, device (CPU/GPU)
select "Advanced Configuration"
select "sentence embeddings" in the "Task" dropdown
put refs/pr/7 in the "Revision" box
In "Environment Variables", set MODEL_ID=/repository

Or if you'd like to use the python client to create the endpoint, you can use the following:

from huggingface_hub import create_inference_endpoint

endpoint = create_inference_endpoint(
    "my-endpoint-name",
    repository="Alibaba-NLP/gte-multilingual-base",
    revision="refs/pr/7",
    framework="pytorch",
    task="sentence-embeddings",
    custom_image={
        "health_route": "/health",
        "env": {"MODEL_ID": "/repository",},
        "url": "ghcr.io/huggingface/text-embeddings-inference:1.5",
    },
    accelerator="gpu",
    vendor="aws",
    region="us-east-1",
    type="protected",
    instance_size="x1",
    instance_type="nvidia-l4",
    token="hf_token_with_write_permissions"
)

If launching from a command line, then you can use

model=Alibaba-NLP/gte-multilingual-base
volume=$PWD/data
revision=refs/pr/7

docker run --gpus all -p 8080:80 -v $volume:/data --pull always ghcr.io/huggingface/text-embeddings-inference:1.5 --model-id $model --revision=$revision
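Once the container is up, embeddings can be requested over HTTP. The following is a minimal sketch, assuming the server from the command above is reachable at http://localhost:8080 and exposes TEI's /embed route; the helper names build_embed_request and embed are illustrative and not part of any library.

```python
import json
import urllib.request

# Assumption: the container started above is listening on localhost:8080.
TEI_URL = "http://localhost:8080"


def build_embed_request(texts, base_url=TEI_URL):
    """Build a POST request for the /embed route (no network access here)."""
    payload = json.dumps({"inputs": texts}).encode("utf-8")
    return urllib.request.Request(
        f"{base_url}/embed",
        data=payload,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def embed(texts, base_url=TEI_URL, timeout=30):
    """Send the request and return one embedding (list of floats) per input."""
    request = build_embed_request(texts, base_url)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))
```

Calling embed(["What is deep learning?"]) against a running server should return a list containing one vector per input text.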

nbroad changed pull request title from remove token cls architecture to HERE IS HOW YOU USE THIS WITH TEI OR INFERENCE ENDPOINTS Aug 16, 2024

nbroad

Aug 16, 2024

•

edited Aug 16, 2024

In short, I had to:

remove NewModelForTokenClassification from architectures in config.json
rename the keys in the safetensors file so that they no longer start with "new" (compare the new keys with the old keys)
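The two changes above can be sketched roughly as follows. This is not the exact script used for the PR: the prefix "new." (with a trailing dot) and the function names are assumptions made for illustration, and convert_checkpoint needs the safetensors and torch packages installed.

```python
def strip_key_prefix(state_dict, prefix="new."):
    """Return a copy of state_dict whose keys no longer start with `prefix`.

    Assumption: the unwanted prefix is "new." (including the dot).
    """
    renamed = {}
    for key, tensor in state_dict.items():
        new_key = key[len(prefix):] if key.startswith(prefix) else key
        if new_key in renamed:
            raise ValueError(f"duplicate key after renaming: {new_key}")
        renamed[new_key] = tensor
    return renamed


def drop_architecture(config, name="NewModelForTokenClassification"):
    """Return a copy of config with `name` removed from `architectures`."""
    updated = dict(config)
    updated["architectures"] = [
        arch for arch in config.get("architectures", []) if arch != name
    ]
    return updated


def convert_checkpoint(src_path, dst_path, prefix="new."):
    """Rewrite a safetensors file with the prefix stripped from every key."""
    # Imported lazily so the pure helpers above work without these packages.
    from safetensors.torch import load_file, save_file

    tensors = strip_key_prefix(load_file(src_path), prefix)
    save_file(tensors, dst_path, metadata={"format": "pt"})
```

The helpers return new objects instead of mutating their inputs, so the original config and key names stay available for comparison.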

izhx

Alibaba-NLP org Aug 17, 2024

Huge thanks!
But we would prefer to keep ForTokenClassification in config.json for sparse weight prediction, in case it is needed for automatic model loading via AutoModelForTokenClassification.
I will try to make the existing structure work with TEI, if it is possible.

Will get back to you.

nbroad

Aug 17, 2024

You don’t need to merge this. People can use this branch for TEI or inference endpoints

sigridjineth

Aug 17, 2024

•

edited Aug 17, 2024

@nbroad @izhx I want to run https://huggingface.co/Alibaba-NLP/gte-multilingual-reranker-base/tree/refs%2Fpr%2F3 and am trying to set id2label and label2id, but it is still a work in progress. Have you tried it?

And do you mean that with this branch the model will NOT be able to infer sparse weights?

(I am running some experiments here, but nothing fruitful has come of them yet: https://huggingface.co/Alibaba-NLP/gte-multilingual-reranker-base/discussions/3)

izhx pinned discussion Aug 17, 2024

Maku319

Aug 20, 2024

•

edited Aug 20, 2024

I'm really sorry to bother you. I’ve tried running TEI using both Docker and Cargo, but in Docker it keeps saying that the ONNX model is missing.

docker run -p 8080:80 -v $volume:/data ${local-image} --model-id $model --revision=$revision
2024-08-20T09:32:14.646243Z  INFO text_embeddings_router: router/src/main.rs:175: Args { model_id: "Ali****-***/***-************-*ase", revision: Some("refs/pr/7"), tokenization_workers: None, dtype: None, pooling: None, max_concurrent_requests: 512, max_batch_tokens: 16384, max_batch_requests: None, max_client_batch_size: 32, auto_truncate: false, default_prompt_name: None, default_prompt: None, hf_api_token: None, hostname: "127b8c571d1b", port: 80, uds_path: "/tmp/text-embeddings-inference-server", huggingface_hub_cache: Some("/data"), payload_limit: 2000000, api_key: None, json_output: false, otlp_endpoint: None, otlp_service_name: "text-embeddings-inference.server", cors_allow_origin: None }
2024-08-20T09:32:14.646508Z  INFO hf_hub: /usr/local/cargo/registry/src/index.crates.io-6f17d22bba15001f/hf-hub-0.3.2/src/lib.rs:55: Token file not found "/root/.cache/huggingface/token"
2024-08-20T09:32:14.678359Z  INFO download_pool_config: text_embeddings_core::download: core/src/download.rs:38: Downloading `1_Pooling/config.json`
2024-08-20T09:32:17.348216Z  INFO download_new_st_config: text_embeddings_core::download: core/src/download.rs:62: Downloading `config_sentence_transformers.json`
2024-08-20T09:32:17.704659Z  INFO download_artifacts: text_embeddings_core::download: core/src/download.rs:21: Starting download
2024-08-20T09:32:17.704698Z  INFO download_artifacts: text_embeddings_core::download: core/src/download.rs:23: Downloading `config.json`
2024-08-20T09:32:18.507387Z  INFO download_artifacts: text_embeddings_core::download: core/src/download.rs:26: Downloading `tokenizer.json`
2024-08-20T09:32:22.272394Z  INFO download_artifacts: text_embeddings_backend: backends/src/lib.rs:368: Downloading `model.onnx`
2024-08-20T09:32:22.635404Z  WARN download_artifacts: text_embeddings_backend: backends/src/lib.rs:372: Could not download `model.onnx`: request error: HTTP status client error (404 Not Found) for url (https://huggingface.co/Alibaba-NLP/gte-multilingual-base/resolve/refs%2Fpr%2F7/model.onnx)
2024-08-20T09:32:22.635437Z  INFO download_artifacts: text_embeddings_backend: backends/src/lib.rs:373: Downloading `onnx/model.onnx`
thread 'main' panicked at /usr/src/backends/src/lib.rs:316:17:
failed to download `model.onnx` or `model.onnx_data`. Check the onnx file exists in the repository. request error: HTTP status client error (404 Not Found) for url (https://huggingface.co/Alibaba-NLP/gte-multilingual-base/resolve/refs%2Fpr%2F7/onnx/model.onnx)
note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace
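The log shows the server looking for model.onnx and then onnx/model.onnx in the requested revision. Whether either file exists can be checked before starting the container. A minimal sketch, assuming the huggingface_hub package is installed for the network call; the helper names are illustrative only.

```python
# The two paths the log above shows being requested, in that order.
ONNX_CANDIDATES = ("model.onnx", "onnx/model.onnx")


def find_onnx_weights(repo_files):
    """Return the candidate ONNX paths that are present in repo_files."""
    present = set(repo_files)
    return [path for path in ONNX_CANDIDATES if path in present]


def list_revision_files(repo_id, revision):
    """Fetch the file listing for one revision (needs network access)."""
    # Imported lazily so find_onnx_weights works without huggingface_hub.
    from huggingface_hub import list_repo_files

    return list_repo_files(repo_id, revision=revision)
```

For example, find_onnx_weights(list_revision_files("Alibaba-NLP/gte-multilingual-base", "refs/pr/7")) returning an empty list would be consistent with the 404 errors above.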

nbroad

Aug 20, 2024

This is how I made it work:

model=Alibaba-NLP/gte-multilingual-base
revision=refs/pr/7
volume=/tmp

docker run --gpus all -p 8080:80 -v $volume:/data --pull always ghcr.io/huggingface/text-embeddings-inference:1.5 --model-id $model --revision $revision

@Maku319 , what is ${local-image}?

Maku319

Aug 21, 2024

@nbroad Thank you for your reply! Since TEI doesn't provide an image version for the M series chip Macs, I built the image locally using the official TEI repository, and that's the local-image.

nbroad

Aug 21, 2024

•

edited Aug 21, 2024

@Maku319 ,

I'm not sure if there is a solution that works on Mac chips yet. The simplest option to get embeddings quickly would probably be to create an endpoint using Inference Endpoints. You can use the UI here or use the following code to create an endpoint.

from huggingface_hub import create_inference_endpoint

endpoint = create_inference_endpoint(
    "my-endpoint-name",
    repository="Alibaba-NLP/gte-multilingual-base",
    revision="refs/pr/7",
    framework="pytorch",
    task="sentence-embeddings",
    custom_image={
        "health_route": "/health",
        "env": {"MODEL_ID": "/repository",},
        "url": "ghcr.io/huggingface/text-embeddings-inference:1.5",
    },
    accelerator="gpu",
    vendor="aws",
    region="us-east-1",
    type="protected",
    instance_size="x1",
    instance_type="nvidia-l4",
    token="hf_token_with_write_permissions"
)

Maku319

Aug 22, 2024

@nbroad Thank you so much for your reply. I think the ONNX model is only necessary when running on the CPU. When I switched to the GPU, everything seemed to work fine, but now I need to figure out the issue with the container not recognizing CUDA after it starts up. Thanks again!

Maku319

Aug 23, 2024

@nbroad Thank you for your patient guidance. The images for both GPU and CPU versions have been successfully deployed and are accepting requests. However, I have a question: my ONNX model was converted based on the configuration from the main branch, so why is it able to run with the configuration from the pr/7 version you provided?
The command I ran was: docker run -p 9090:80 -v ${PWD}:/data --pull always ghcr.io/huggingface/text-embeddings-inference:cpu-1.5-grpc --model-id /data/gte-multilingual-base. The converted ONNX model is located at: gte-multilingual-base\onnx\.

Also, why does it fail when I use the repository from the main branch instead?
The error message is as follows:

INFO text_embeddings_router: router/src/main.rs:175: Args { model_id: "/dat*/***-************-*ase", revision: None, 
tokenization_workers: None, dtype: None, pooling: None, max_concurrent_requests: 512, max_batch_tokens: 16384, max_batch_requests: None, max_client_batch_size: 32, auto_truncate: false, default_prompt_name: None, default_prompt: None, hf_api_token: None, hostname: "bbf17dcff344", port: 80, uds_path: "/tmp/text-embeddings-inference-server", huggingface_hub_cache: Some("/data"), payload_limit: 2000000, api_key: None, json_output: false, otlp_endpoint: None, otlp_service_name: "text-embeddings-inference.server", cors_allow_origin: None }
Error: `config.json` does not contain `id2label`

nbroad

Aug 25, 2024

I think it's because of the architectures listed in the config file: the main branch still lists NewModelForTokenClassification, which this PR removes.
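A config.json can be checked for this condition with a few lines of Python. The sketch below rests on an assumption that is not confirmed in this thread: that the server treats a model as a classifier when any entry in architectures ends with "Classification", and then requires id2label. The function name is hypothetical.

```python
def classifier_config_problems(config):
    """List reasons `config` might be rejected if it is treated as a classifier.

    Assumption (not confirmed in this thread): an architecture whose name
    ends with "Classification" makes the server expect `id2label`.
    """
    classifier_archs = [
        arch
        for arch in config.get("architectures", [])
        if arch.endswith("Classification")
    ]
    if not classifier_archs:
        # No classification architecture listed, so nothing to report.
        return []
    problems = []
    if "id2label" not in config:
        problems.append(
            f"{classifier_archs[0]} is listed but `id2label` is missing"
        )
    return problems
```

Loading config.json with json.load and passing the resulting dict to this function gives a quick hint about whether the architectures entry is the likely cause of the error.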

alvarobartt

Jan 17

Hi @Maku319, as a small follow-up: TEI 1.6.0 re-introduced the Intel backend for CPU inference, which means that if the ONNX weights are not there, it will fall back to the safetensors weights. You should therefore be able to run Alibaba-NLP/gte-multilingual-base on CPU with no issues:

docker run -p 8080:80 --pull always ghcr.io/huggingface/text-embeddings-inference:cpu-1.6 --model-id Alibaba-NLP/gte-multilingual-base --revision refs/pr/7

If you happen to be using an MPS device, or don't want to run it over Docker, you can also clone https://github.com/huggingface/text-embeddings-inference and install the router with Metal support:

cargo install --path router --features metal

Then run:

text-embeddings-router --model-id Alibaba-NLP/gte-multilingual-base --revision refs/pr/7 --port 8080

More information on the latter is available at https://github.com/huggingface/text-embeddings-inference?tab=readme-ov-file#local-install.

maxcodefaster

13 days ago

@nbroad Thank you for your work. Can you clarify how one would get only the sparse embeddings and which dimensions they would need?