Serving with TGI or vLLM?
Can the models be served with either of these containers - I did not see any HQQ support in either.
What CUDA compute capability is necessary? AWQ on vLLM, for example, is currently only available on Turing or newer GPUs (compute capability 7.5+).
Being able to serve Mixtral models from a single 32GB V100 (compute capability 7.0) would be a big plus for making use of those older GPUs, too.
Thanks for your comment @kno10 !
Since our implementation is pure PyTorch, it should work on older GPUs. To use the compile backend (recommended for faster inference), you'd need the minimum compute capability required by Torch Dynamo, which I think is 7.0. I tried it on a Titan RTX and a 2080 Ti and it works fine, so it should work on the V100 as well.
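For reference, switching between the pure-PyTorch backend and the compiled one is a single call. Here's a minimal sketch; the exact class and argument names may differ slightly across HQQ versions, and the model id is just an example:

```python
from hqq.engine.hf import HQQModelForCausalLM
from hqq.core.quantize import BaseQuantizeConfig, HQQLinear, HQQBackend

model_id = "mistralai/Mixtral-8x7B-Instruct-v0.1"  # example model id

# 4-bit quantization config
quant_config = BaseQuantizeConfig(nbits=4, group_size=64)

# Load the fp16 model and quantize it with HQQ
model = HQQModelForCausalLM.from_pretrained(model_id)
model.quantize_model(quant_config=quant_config)

# PYTORCH works on any GPU; PYTORCH_COMPILE needs Torch Dynamo support (~7.0+)
HQQLinear.set_backend(HQQBackend.PYTORCH_COMPILE)
```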
Regarding vLLM serving, we have a Llama2 example: https://github.com/mobiusml/hqq/blob/master/examples/vllm/llama2_example.py. The issue, however, is that the vLLM folks recently did a major refactoring, so this only works with vllm <= 0.2.2.
The main challenge with vLLM is that everything is hard-coded to load/run on the GPU. To make HQQ work without forking the whole library, we have to copy the entire model architecture and apply a bunch of hacks to stop vLLM from loading the original weights onto the GPU, and this has to be done separately for each architecture.
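To give a rough idea of the pattern (this is not the actual vLLM patch from the example above): the core step is walking the copied architecture and replacing each nn.Linear with an HQQ-quantized layer, so only the quantized weights end up on the GPU. A minimal sketch, assuming the fp16 model has first been materialized on CPU and that the HQQLinear constructor matches the current HQQ release; the real hacks go further and prevent vLLM from ever loading the original fp16 weights onto the GPU:

```python
import torch.nn as nn
from hqq.core.quantize import BaseQuantizeConfig, HQQLinear

def swap_linears_with_hqq(module: nn.Module, quant_config: dict):
    """Recursively replace nn.Linear layers with HQQ-quantized layers,
    so only the quantized weights are kept (and moved to the GPU)."""
    for name, child in module.named_children():
        if isinstance(child, nn.Linear):
            # del_orig=True drops the original fp16 weights after quantization
            setattr(module, name, HQQLinear(child, quant_config, del_orig=True))
        else:
            swap_linears_with_hqq(child, quant_config)

quant_config = BaseQuantizeConfig(nbits=4, group_size=64)
# `model` would be the copied vLLM architecture, with weights loaded on CPU first:
# swap_linears_with_hqq(model, quant_config)
```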
The cleanest way would be to have HQQ officially integrated into vLLM; since they have already integrated AWQ, adding HQQ shouldn't be too complicated.