MPS support and quantization
I'm trying to run this with the transformers library on an M1 MacBook Pro.
With bfloat16, I get:
"TypeError: BFloat16 is not supported on MPS"
With float16, I get:
"NotImplementedError: The operator 'aten::isin.Tensor_Tensor_out' is not currently implemented for the MPS device. If you want this op to be added in priority during the prototype phase of this feature, please comment on https://github.com/pytorch/pytorch/issues/77764. As a temporary fix, you can set the environment variable PYTORCH_ENABLE_MPS_FALLBACK=1
to use the CPU as a fallback for this op. WARNING: this will be slower than running natively on MPS."
Is there a quantized model somewhere that I should be using instead? Any chance of running this model on the Apple GPU with the Hugging Face libraries?
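For anyone who wants to try the CPU fallback that the error message mentions, here is a minimal sketch. It assumes the variable has to be in the environment before torch is imported, which is the usual advice but not something confirmed in this thread. It only works around the missing operator; it does nothing for the bfloat16 error.

```python
import os

# Assumption: the flag is read when PyTorch loads, so set it before the import.
os.environ["PYTORCH_ENABLE_MPS_FALLBACK"] = "1"

try:
    import torch
except ImportError:
    # Keeps the sketch runnable on a machine without PyTorch installed.
    torch = None

if torch is not None:
    # torch.backends.mps exists in recent PyTorch releases; guard for older ones.
    mps = getattr(torch.backends, "mps", None)
    print("MPS available:", bool(mps and mps.is_available()))
```

Setting `PYTORCH_ENABLE_MPS_FALLBACK=1` in the shell before launching Python should have the same effect. As the warning says, any op that falls back runs on the CPU and will be slower.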
Curious, did you ever get this working?
Hi @tonimelisma,
For running quantized Llama on Apple devices, I'd advise using MLX: https://huggingface.co./collections/mlx-community/llama-3-662156b069a5d33b3328603c (cc @awni @prince-canuma)
Yes, MLX and llama.cpp work fine. I was inquiring whether Hugging Face would work, too.
For MPS you need to use torch.float32.
A lot of other things need to be changed elsewhere, but this solves this particular issue. It's probably safe to assume that you need llama.cpp to run on a Mac.
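The torch.float32 suggestion above could look roughly like the sketch below. The helper names (`choose_device_and_dtype`, `load_model`) and the `model_id` argument are hypothetical, not from this thread, and the bfloat16 choice for CUDA is only an assumption for contrast. It covers the dtype selection only, not the other changes mentioned.

```python
from typing import Tuple


def choose_device_and_dtype(cuda_available: bool, mps_available: bool) -> Tuple[str, str]:
    """Pick a device and dtype name; MPS gets float32, as suggested above."""
    if cuda_available:
        return "cuda", "bfloat16"  # assumption: bfloat16 is fine on CUDA
    if mps_available:
        return "mps", "float32"    # bfloat16 raised a TypeError on MPS
    return "cpu", "float32"


def load_model(model_id: str):
    """Hypothetical loader; model_id is whatever checkpoint you are using."""
    # Imported lazily so the selection logic above works without PyTorch.
    import torch
    from transformers import AutoModelForCausalLM

    device, dtype_name = choose_device_and_dtype(
        torch.cuda.is_available(), torch.backends.mps.is_available()
    )
    model = AutoModelForCausalLM.from_pretrained(
        model_id, torch_dtype=getattr(torch, dtype_name)
    )
    return model.to(device)
```

Keep in mind that float32 weights take twice the memory of float16, so a model that fits in half precision may not fit this way on a laptop.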