Not running on Apple Silicon with device type 'mps' (no acceleration)
Hi community,
is anybody out there running the model on Apple Silicon with acceleration?
I only managed to run it in 'cpu' mode, so performance is far behind what the hardware could deliver.
The trouble seems to be in torch.autocast.
I found some posts pointing in this direction for other models, but no solution.
Any help would be welcome.
Thanks a lot
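For reference, my test script roughly follows the standard Molmo quickstart; the sketch below is approximate (image path and prompt are placeholders, and the exact arguments may differ from the model card):

from PIL import Image
import torch
from transformers import AutoModelForCausalLM, AutoProcessor, GenerationConfig

# Rough sketch of the loading code: 'mps' is where acceleration should come
# from, but only 'cpu' actually works for me.
device = "mps" if torch.backends.mps.is_available() else "cpu"

processor = AutoProcessor.from_pretrained(
    "allenai/Molmo-7B-D-0924", trust_remote_code=True, torch_dtype=torch.float32
)
model = AutoModelForCausalLM.from_pretrained(
    "allenai/Molmo-7B-D-0924", trust_remote_code=True, torch_dtype=torch.float32
).to(device)

# "example.jpg" is a placeholder image path.
inputs = processor.process(images=[Image.open("example.jpg")], text="Describe this image.")
inputs = {k: v.to(device).unsqueeze(0) for k, v in inputs.items()}

# On device type 'mps' this call is where the failure shows up.
output = model.generate_from_batch(
    inputs,
    GenerationConfig(max_new_tokens=200, stop_strings="<|endoftext|>"),
    tokenizer=processor.tokenizer,
)
print(processor.tokenizer.decode(output[0, inputs["input_ids"].size(1):], skip_special_tokens=True))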
Yep, same problem here. This is the error:
Traceback (most recent call last):
File "/Users/x/test/machine-learning/molmo/test.py", line 47, in <module>
output = model.generate_from_batch(
^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/Users/x/test/machine-learning/testenv/lib/python3.12/site-packages/torch/utils/_contextlib.py", line 116, in decorate_context
return func(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^
File "/Users/x/.cache/huggingface/modules/transformers_modules/allenai/Molmo-7B-D-0924/9a41170cfeabb13467ece5a6a5826d7fd68cbe52/modeling_molmo.py", line 2507, in generate_from_batch
out = super().generate(
^^^^^^^^^^^^^^^^^
File "/Users/x/test/machine-learning/testenv/lib/python3.12/site-packages/torch/utils/_contextlib.py", line 116, in decorate_context
return func(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^
File "/Users/x/test/machine-learning/testenv/lib/python3.12/site-packages/transformers/generation/utils.py", line 2139, in generate
result = self._sample(
^^^^^^^^^^^^^
File "/Users/x/test/machine-learning/testenv/lib/python3.12/site-packages/transformers/generation/utils.py", line 3099, in _sample
outputs = self(**model_inputs, return_dict=True)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/Users/x/test/machine-learning/testenv/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1553, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/Users/x/test/machine-learning/testenv/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1562, in _call_impl
return forward_call(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/Users/x/.cache/huggingface/modules/transformers_modules/allenai/Molmo-7B-D-0924/9a41170cfeabb13467ece5a6a5826d7fd68cbe52/modeling_molmo.py", line 2400, in forward
outputs = self.model.forward(
^^^^^^^^^^^^^^^^^^^
File "/Users/x/.cache/huggingface/modules/transformers_modules/allenai/Molmo-7B-D-0924/9a41170cfeabb13467ece5a6a5826d7fd68cbe52/modeling_molmo.py", line 2179, in forward
attention_bias = get_causal_attention_bias(
^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/Users/x/.cache/huggingface/modules/transformers_modules/allenai/Molmo-7B-D-0924/9a41170cfeabb13467ece5a6a5826d7fd68cbe52/modeling_molmo.py", line 1753, in get_causal_attention_bias
with torch.autocast(device.type, enabled=False):
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/Users/x/test/machine-learning/testenv/lib/python3.12/site-packages/torch/amp/autocast_mode.py", line 229, in __init__
dtype = torch.get_autocast_dtype(device_type)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
RuntimeError: unsupported scalarType
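For what it's worth, the failure reproduces without loading the model at all; the snippet below mirrors the autocast call in get_causal_attention_bias and raises the same RuntimeError on my torch build:

import torch

# Mirrors get_causal_attention_bias in modeling_molmo.py: constructing an
# autocast context for the 'mps' device type fails in torch.get_autocast_dtype.
device = torch.device("mps")
with torch.autocast(device.type, enabled=False):  # RuntimeError: unsupported scalarType
    pass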
I'm currently looking for a solution; I'll let you know if I find something.
Same for me with allenai/Molmo-7B-O-0924.
Hi, I just wanted to ask whether this might have been solved in the meantime, e.g. in another thread? The problem does not seem limited to this model; it affects other ones as well. Has anybody successfully run accelerated LLMs on Apple Silicon?
Thanks
Hello @Troubadix, I tried to replicate and resolve this, but it seems to be an issue with PyTorch itself, as other models throw the same error. This thread discusses float16 support for individual operations; however, from my investigation, torch.autocast only works with the CPU and CUDA device types, and the same issue occurs with the 'meta' device. I'll look into this further and update you if I find anything. A few useful threads.. [1 2].
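The general shape of a workaround would be to only enter autocast for device types the installed torch accepts and to fall back to a no-op context otherwise. This is just a sketch of the idea, not something I have wired into Molmo's modeling code:

import contextlib
import torch

def maybe_autocast(device_type: str, enabled: bool = False, dtype=None):
    # Only enter torch.autocast for device types it supports here (cpu/cuda);
    # on 'mps' or 'meta' fall back to a no-op context manager instead.
    if device_type in ("cpu", "cuda"):
        return torch.autocast(device_type, enabled=enabled, dtype=dtype)
    return contextlib.nullcontext()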
@amanrangapur or anyone else on this thread, can you paste some example code that shows the problem?
Here you go, @dirkgr:
from hf_olmo import OLMoForCausalLM, OLMoTokenizerFast
import torch

# Pick the best available device: MPS on Apple Silicon, then CUDA, else CPU.
if torch.backends.mps.is_available():
    device = torch.device("mps")
    dtype = torch.float16
elif torch.cuda.is_available():
    device = torch.device("cuda")
    dtype = torch.float16
else:
    device = torch.device("cpu")
    dtype = torch.float32

print(f"Using device: {device} with dtype: {dtype}")

olmo = OLMoForCausalLM.from_pretrained("allenai/OLMo-1B", torch_dtype=dtype)
tokenizer = OLMoTokenizerFast.from_pretrained("allenai/OLMo-1B")
olmo = olmo.to(device)

message = ["Language modeling is"]
inputs = tokenizer(message, return_tensors='pt', return_token_type_ids=False).to(device)

if device.type == 'mps':
    # torch.autocast rejects the 'mps' device type, so skip it entirely.
    with torch.no_grad():
        response = olmo.generate(**inputs, max_new_tokens=100, do_sample=True, top_k=50, top_p=0.95)
else:
    with torch.autocast(device_type=device.type, dtype=dtype):
        response = olmo.generate(**inputs, max_new_tokens=100, do_sample=True, top_k=50, top_p=0.95)

print(tokenizer.batch_decode(response, skip_special_tokens=True)[0])
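For Molmo itself the failing autocast call sits inside the downloaded modeling_molmo.py, so it cannot be guarded at the call site the way the OLMo snippet above does. A blunt, untested workaround sketch (the names below are my own, not part of the Molmo code) is to shim torch.autocast before generating so that unsupported device types get a no-op context:

import contextlib
import torch

_real_autocast = torch.autocast

def _autocast_or_noop(device_type, *args, **kwargs):
    # Untested shim: modeling_molmo.py enters torch.autocast(device.type, ...),
    # so hand back a no-op context for device types this torch build rejects.
    if device_type in ("cpu", "cuda"):
        return _real_autocast(device_type, *args, **kwargs)
    return contextlib.nullcontext()

torch.autocast = _autocast_or_noop  # apply before calling model.generate_from_batch(...)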