Update example with infinity
#9 by michaelfeil - opened
Thanks for accepting the PR here: https://huggingface.co./vidore/colpali-v1.2-merged/commit/cd80ee4200c591b788a9c4e21bb5d549d4a04637
I made ColPali run with Infinity. I would recommend using `encoding=base64` when sending requests.
Wrote a unit test to assert identical behaviour: https://github.com/michaelfeil/infinity/blob/774530bd54bfb98e3db70d3248140bac99baa938/libs/infinity_emb/tests/unit_test/transformer/vision/test_torch_vision.py#L50
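For example, once the server below is running, a query could look like this. This is a minimal sketch, not the authoritative client: it assumes Infinity's OpenAI-compatible `/embeddings` route, that `encoding_format` is the field selecting base64 output, and that the payload decodes to a flat float32 buffer; the Swagger UI at `/docs` has the exact schema.

```python
import base64

import numpy as np
import requests

# Sketch: OpenAI-style embeddings request with base64-encoded output.
# Field names (`encoding_format`) and the response shape are assumptions;
# verify against http://0.0.0.0:7997/docs.
resp = requests.post(
    "http://0.0.0.0:7997/embeddings",
    json={
        "model": "vidore/colpali-v1.2-merged",
        "input": ["Which page mentions quarterly revenue?"],
        "encoding_format": "base64",  # much smaller payload than a JSON float list
    },
)
resp.raise_for_status()
payload = resp.json()["data"][0]["embedding"]
vec = np.frombuffer(base64.b64decode(payload), dtype=np.float32)
print(vec.shape)
```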
```bash
docker run --gpus all -v $PWD/data:/app/.cache -p "7997":"7997" michaelf34/infinity:0.0.69 v2 --model-id vidore/colpali-v1.2-merged --revision "cd80ee4200c591b788a9c4e21bb5d549d4a04637" --dtype bfloat16 --batch-size 8 --device cuda --engine torch --port 7997
```
```
INFO: Started server process [1]
INFO: Waiting for application startup.
INFO 2024-11-15 19:49:10,249 infinity_emb INFO: infinity_server.py:89
Creating 1engines:
engines=['vidore/colpali-v1.2-merged']
INFO 2024-11-15 19:49:10,260 infinity_emb INFO: select_model.py:64
model=`vidore/colpali-v1.2-merged` selected, using
engine=`torch` and device=`cuda`
The argument `trust_remote_code` is to be used with Auto classes. It has no effect here and is ignored.
`config.hidden_act` is ignored, you should use `config.hidden_activation` instead.
Gemma's activation function will be set to `gelu_pytorch_tanh`. Please, use
`config.hidden_activation` if you want to override this behaviour.
See https://github.com/huggingface/transformers/pull/29402 for more details.
Starting from v4.46, the `logits` model output will have the same type as the model (except at train time, where it will always be FP32)
INFO 2024-11-15 19:49:23,588 infinity_emb INFO: Getting select_model.py:97
timings for batch_size=8 and avg tokens per
sentence=1028
22.27 ms tokenization
186.14 ms inference
600.23 ms post-processing
808.64 ms total
embeddings/sec: 4.95
INFO 2024-11-15 19:49:25,783 infinity_emb INFO: Getting select_model.py:103
timings for batch_size=8 and avg tokens per
sentence=1044
17.65 ms tokenization
455.30 ms inference
590.60 ms post-processing
1063.54 ms total
embeddings/sec: 3.76
INFO 2024-11-15 19:49:25,785 infinity_emb INFO: model select_model.py:104
warmed up, between 3.76-4.95 embeddings/sec at
batch_size=4
INFO 2024-11-15 19:49:25,786 infinity_emb INFO: batch_handler.py:386
creating batching engine
INFO 2024-11-15 19:49:25,788 infinity_emb INFO: ready batch_handler.py:453
to batch requests.
INFO 2024-11-15 19:49:25,789 infinity_emb INFO: infinity_server.py:104
♾️ Infinity - Embedding Inference Server
MIT License; Copyright (c) 2023-now Michael Feil
Version 0.0.69
Open the Docs via Swagger UI:
http://0.0.0.0:7997/docs
Access all deployed models via 'GET':
curl http://0.0.0.0:7997/models
Visit the docs for more information:
https://michaelfeil.github.io/infinity
INFO: Application startup complete.
INFO: Uvicorn running on http://0.0.0.0:7997 (Press CTRL+C to quit)
```
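Since ColPali embeds document pages as images, image inputs go through Infinity's image route. A hedged sketch, assuming an `/embeddings_image` endpoint that accepts image URLs (the URL below is a hypothetical placeholder); again, `/docs` has the authoritative schema.

```python
import requests

# Assumption: vision models are served under an /embeddings_image route
# taking image URLs as input; the URL here is a placeholder, not real data.
resp = requests.post(
    "http://0.0.0.0:7997/embeddings_image",
    json={
        "model": "vidore/colpali-v1.2-merged",
        "input": ["https://example.com/scanned_page.png"],
    },
)
resp.raise_for_status()
print(len(resp.json()["data"]))  # one embedding entry per input image
```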