Serving with TGI or vLLM?
Can the models be served with either of these containers - I did not see any HQQ support in either.
What CUDA compute capability is necessary? AWQ on vLLM, for example, is currently only available on Turing or newer GPUs (compute capability 7.5+).
Being able to serve Mixtral models from a single 32GB V100 (compute capability 7.0) would be a big plus for making use of those older GPUs, too.
Thanks for your comment @kno10 !
Since our implementation is pure PyTorch, it should work on older GPUs. To use the compile backend (recommended for faster inference), you'd need the minimum compute capability required by Torch Dynamo, which I think is 7.0. I tried it on a Titan RTX and a 2080 Ti and it works fine, so it should work on the V100 as well.
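For reference, switching between the pure-PyTorch backend and the compiled one is a single call. Here's a minimal sketch; the exact class and argument names may differ slightly across HQQ versions, and the model id is just an example:

```python
from hqq.engine.hf import HQQModelForCausalLM
from hqq.core.quantize import BaseQuantizeConfig, HQQLinear, HQQBackend

model_id = "mistralai/Mixtral-8x7B-Instruct-v0.1"  # example model id

# 4-bit quantization config
quant_config = BaseQuantizeConfig(nbits=4, group_size=64)

# Load the fp16 model and quantize it with HQQ
model = HQQModelForCausalLM.from_pretrained(model_id)
model.quantize_model(quant_config=quant_config)

# PYTORCH works on any GPU; PYTORCH_COMPILE needs Torch Dynamo support (~7.0+)
HQQLinear.set_backend(HQQBackend.PYTORCH_COMPILE)
```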
Regarding vLLM serving, we have a Llama2 example: https://github.com/mobiusml/hqq/blob/master/examples/vllm/llama2_example.py. The issue, however, is that the vLLM folks recently did a major refactoring, so this only works with vllm <= 0.2.2.
The main challenge with vLLM is that everything is hard-coded to load/run on the GPU. To make HQQ work without forking the whole library, we have to copy the entire model architecture and apply a bunch of hacks to stop vLLM from loading the original weights onto the GPU, and this has to be done separately for each architecture.
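To give a rough idea of the pattern (this is not the actual vLLM patch from the example above): the core step is walking the copied architecture and replacing each nn.Linear with an HQQ-quantized layer, so only the quantized weights end up on the GPU. A minimal sketch, assuming the fp16 model has first been materialized on CPU and that the HQQLinear constructor matches the current HQQ release; the real hacks go further and prevent vLLM from ever loading the original fp16 weights onto the GPU:

```python
import torch.nn as nn
from hqq.core.quantize import BaseQuantizeConfig, HQQLinear

def swap_linears_with_hqq(module: nn.Module, quant_config: dict):
    """Recursively replace nn.Linear layers with HQQ-quantized layers,
    so only the quantized weights are kept (and moved to the GPU)."""
    for name, child in module.named_children():
        if isinstance(child, nn.Linear):
            # del_orig=True drops the original fp16 weights after quantization
            setattr(module, name, HQQLinear(child, quant_config, del_orig=True))
        else:
            swap_linears_with_hqq(child, quant_config)

quant_config = BaseQuantizeConfig(nbits=4, group_size=64)
# `model` would be the copied vLLM architecture, with weights loaded on CPU first:
# swap_linears_with_hqq(model, quant_config)
```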
The cleanest way would be to have HQQ officially integrated into vLLM; since they have already integrated AWQ, adding HQQ shouldn't be too complicated.