Update Readme on usage with Infinity
#36 opened by michaelfeil
Tested with large, base, and v2.
docker run --gpus all -v $PWD/data:/app/.cache -e HF_TOKEN=$HF_TOKEN -p "7993":"7997" michaelf34/infinity:0.0.68 v2 --model-id BAAI/bge-reranker-base --revision "main" --dtype float16 --batch-size 32 --engine torch --port 7997
INFO: Started server process [1]
INFO: Waiting for application startup.
INFO 2024-11-13 00:59:19,095 infinity_emb INFO: infinity_server.py:89
Creating 1engines:
engines=['BAAI/bge-reranker-base']
INFO 2024-11-13 00:59:19,099 infinity_emb INFO: Anonymized telemetry.py:30
telemetry can be disabled via environment variable
`DO_NOT_TRACK=1`.
INFO 2024-11-13 00:59:19,106 infinity_emb INFO: select_model.py:64
model=`BAAI/bge-reranker-base` selected, using
engine=`torch` and device=`None`
INFO 2024-11-13 01:00:12,731 CrossEncoder.py:125
sentence_transformers.cross_encoder.CrossEncoder
INFO: Use pytorch device: cuda
INFO 2024-11-13 01:00:13,625 infinity_emb INFO: Adding acceleration.py:56
optimizations via Huggingface optimum.
The class `optimum.bettertransformers.transformation.BetterTransformer` is deprecated and will be removed in a future release.
The BetterTransformer implementation does not support padding during training, as the fused kernels do not support attention masks. Beware that passing padded batched data during training may result in unexpected outputs. Please refer to https://huggingface.co/docs/optimum/bettertransformer/overview for more details.
INFO 2024-11-13 01:00:13,635 infinity_emb INFO: Switching to torch.py:71
half() precision (cuda: fp16).
/app/.venv/lib/python3.10/site-packages/optimum/bettertransformer/models/encoder_models.py:301: UserWarning: The PyTorch API of nested tensors is in prototype stage and will change in the near future. (Triggered internally at ../aten/src/ATen/NestedTensorImpl.cpp:178.)
hidden_states = torch._nested_tensor_from_mask(hidden_states, ~attention_mask)
INFO 2024-11-13 01:00:13,949 infinity_emb INFO: Getting select_model.py:97
timings for batch_size=32 and avg tokens per
sentence=3
2.71 ms tokenization
8.11 ms inference
0.00 ms post-processing
10.83 ms total
embeddings/sec: 2954.28
INFO 2024-11-13 01:00:14,149 infinity_emb INFO: Getting select_model.py:103
timings for batch_size=32 and avg tokens per
sentence=512
28.06 ms tokenization
24.17 ms inference
0.01 ms post-processing
52.25 ms total
embeddings/sec: 612.50
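As a sanity check on the two warm-up measurements above, the reported throughput appears to be roughly the batch size divided by the total time per batch. That relationship is my reading of the numbers, not something the log states, and the helper name `throughput_per_second` is made up for this sketch:

```python
def throughput_per_second(batch_size: int, total_ms: float) -> float:
    """Items processed per second, assuming one batch takes total_ms milliseconds."""
    return batch_size / (total_ms / 1000.0)


# (avg tokens per sentence, total ms per batch, throughput reported in the log)
measurements = [
    (3, 10.83, 2954.28),
    (512, 52.25, 612.50),
]

for avg_tokens, total_ms, reported in measurements:
    estimate = throughput_per_second(32, total_ms)
    # Small differences are expected because the log rounds the timings to 2 decimals.
    print(f"avg tokens={avg_tokens}: estimated {estimate:.2f}/s, reported {reported:.2f}/s")
```

Both estimates land within a fraction of a percent of the logged values, which is consistent with the log rounding the millisecond timings before printing them.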
@Shitao Can you review?
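For anyone trying the container above, here is a minimal client sketch using only the Python standard library. It assumes the server is reachable on the local machine at host port 7993 (from the `-p "7993":"7997"` mapping) and that the rerank route is `/rerank` with `model`, `query`, and `documents` fields. Please check the route and field names against the OpenAPI docs served by the running container before relying on them. The helper names `build_rerank_request` and `rerank` are made up for this example.

```python
import json
import urllib.request

# Assumption: the container runs locally; 7993 is the host side of the port mapping.
BASE_URL = "http://localhost:7993"


def build_rerank_request(query, documents, model="BAAI/bge-reranker-base"):
    """Build a POST request for the rerank route (path and field names are assumptions)."""
    payload = {"model": model, "query": query, "documents": documents}
    return urllib.request.Request(
        url=f"{BASE_URL}/rerank",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def rerank(query, documents, timeout=30.0):
    """Send the request to the running server and return the decoded JSON body."""
    request = build_rerank_request(query, documents)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))
```

With the server up, calling `rerank("what is a panda?", ["hi", "The giant panda is a bear species endemic to China."])` should return a JSON object with per-document scores. The exact response shape is not shown in this thread, so the sketch returns the raw dictionary instead of parsing specific keys.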