Model not running on Apple Silicon with device type mps and acceleration

#21
by Troubadix - opened

Hi community,
is anybody out there running the model on Apple Silicon with acceleration?
I only managed to run it in 'cpu' mode, so performance is far below what the hardware could deliver.
The trouble seems to be in torch.autocast.
I found some posts pointing in this direction for other models, but no solution.
Any help would be welcome.

Thanks a lot

Yep, same problem here. This is the error:

Traceback (most recent call last):
  File "/Users/x/test/machine-learning/molmo/test.py", line 47, in <module>
    output = model.generate_from_batch(
             ^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/x/test/machine-learning/testenv/lib/python3.12/site-packages/torch/utils/_contextlib.py", line 116, in decorate_context
    return func(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^
  File "/Users/x/.cache/huggingface/modules/transformers_modules/allenai/Molmo-7B-D-0924/9a41170cfeabb13467ece5a6a5826d7fd68cbe52/modeling_molmo.py", line 2507, in generate_from_batch
    out = super().generate(
          ^^^^^^^^^^^^^^^^^
  File "/Users/x/test/machine-learning/testenv/lib/python3.12/site-packages/torch/utils/_contextlib.py", line 116, in decorate_context
    return func(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^
  File "/Users/x/test/machine-learning/testenv/lib/python3.12/site-packages/transformers/generation/utils.py", line 2139, in generate
    result = self._sample(
             ^^^^^^^^^^^^^
  File "/Users/x/test/machine-learning/testenv/lib/python3.12/site-packages/transformers/generation/utils.py", line 3099, in _sample
    outputs = self(**model_inputs, return_dict=True)
              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/x/test/machine-learning/testenv/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1553, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/x/test/machine-learning/testenv/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1562, in _call_impl
    return forward_call(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/x/.cache/huggingface/modules/transformers_modules/allenai/Molmo-7B-D-0924/9a41170cfeabb13467ece5a6a5826d7fd68cbe52/modeling_molmo.py", line 2400, in forward
    outputs = self.model.forward(
              ^^^^^^^^^^^^^^^^^^^
  File "/Users/x/.cache/huggingface/modules/transformers_modules/allenai/Molmo-7B-D-0924/9a41170cfeabb13467ece5a6a5826d7fd68cbe52/modeling_molmo.py", line 2179, in forward
    attention_bias = get_causal_attention_bias(
                     ^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/x/.cache/huggingface/modules/transformers_modules/allenai/Molmo-7B-D-0924/9a41170cfeabb13467ece5a6a5826d7fd68cbe52/modeling_molmo.py", line 1753, in get_causal_attention_bias
    with torch.autocast(device.type, enabled=False):
         ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/x/test/machine-learning/testenv/lib/python3.12/site-packages/torch/amp/autocast_mode.py", line 229, in __init__
    dtype = torch.get_autocast_dtype(device_type)
            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
RuntimeError: unsupported scalarType

I'm currently looking for a solution; I'll let you know if I find something.
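For reference, the context manager seems to fail while it is being constructed, even with enabled=False, so on an affected torch build this should be reproducible without loading the model at all (a minimal sketch, untested across torch releases):

import torch

# Sketch: on torch builds where autocast does not know the "mps" device type,
# even a *disabled* autocast context fails during construction, because
# torch.get_autocast_dtype("mps") raises "unsupported scalarType".
device = torch.device("mps")
with torch.autocast(device.type, enabled=False):
    pass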

same for me with allenai/Molmo-7B-O-0924

Hi, I just wanted to ask if this might have been solved in the meantime, e.g. based on another thread? The problem seems to affect not only this model but other ones as well. Has anybody had success running accelerated LLMs on Apple Silicon?
Thanks

Hello @Troubadix, I tried to replicate and resolve this, but it seems to be an issue with Torch, as other models throw the same error. This thread discusses float16 support for operations. However, from my investigation, autocast only works with the CPU and CUDA device types. I'll look into this further and update you if I find anything. The same issue also occurs with the 'meta' device type. A few useful threads: [1 2].
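In the meantime, one possible workaround (just a sketch, not an official fix) is to monkey-patch torch.autocast before running the model, so that device types this torch build does not support fall back to a no-op context manager. The failing call in modeling_molmo.py only enters the context to disable autocast, so this assumes nothing in the process actually needs autocast to be active on mps:

import contextlib
import torch

# Workaround sketch: for device types this torch build's autocast does not
# support (e.g. "mps"), return a no-op context manager instead, so that calls
# like torch.autocast(device.type, enabled=False) in modeling_molmo.py do not
# crash. Assumption: no code in the process relies on autocast actually being
# enabled for the mps device.
_original_autocast = torch.autocast

def _patched_autocast(device_type, *args, **kwargs):
    if device_type in ("cpu", "cuda"):
        return _original_autocast(device_type, *args, **kwargs)
    return contextlib.nullcontext()

torch.autocast = _patched_autocast

The patch has to run before the first forward pass; newer torch releases may support autocast on mps natively, in which case it should be unnecessary.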

Ai2 org

@amanrangapur or anyone else on this thread, can you paste some example code that shows the problem?

Here you go @dirkgr

from hf_olmo import OLMoForCausalLM, OLMoTokenizerFast
import torch

# Prefer MPS on Apple Silicon, then CUDA, otherwise fall back to CPU.
if torch.backends.mps.is_available():
    device = torch.device("mps")
    dtype = torch.float16
elif torch.cuda.is_available():
    device = torch.device("cuda")
    dtype = torch.float16
else:
    device = torch.device("cpu")
    dtype = torch.float32
print(f"Using device: {device} with dtype: {dtype}")


olmo = OLMoForCausalLM.from_pretrained("allenai/OLMo-1B", torch_dtype=dtype)
tokenizer = OLMoTokenizerFast.from_pretrained("allenai/OLMo-1B")

olmo = olmo.to(device)
message = ["Language modeling is"]
inputs = tokenizer(message, return_tensors='pt', return_token_type_ids=False).to(device)
if device.type == 'mps':
    # On MPS, skip torch.autocast entirely: constructing it raises
    # "RuntimeError: unsupported scalarType" as shown in the traceback above.
    with torch.no_grad():
        response = olmo.generate(**inputs, max_new_tokens=100, do_sample=True, top_k=50, top_p=0.95)
else:
    # On CPU/CUDA, autocast works as expected.
    with torch.autocast(device_type=device.type, dtype=dtype):
        response = olmo.generate(**inputs, max_new_tokens=100, do_sample=True, top_k=50, top_p=0.95)

print(tokenizer.batch_decode(response, skip_special_tokens=True)[0])
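And for the model this thread is about, a Molmo-specific sketch adapted from the model card's example (the image URL and generation settings are the model card defaults; untested here beyond the expectation that generate_from_batch hits the same get_causal_attention_bias error on mps):

import requests
import torch
from PIL import Image
from transformers import AutoModelForCausalLM, AutoProcessor, GenerationConfig

device = torch.device("mps")

processor = AutoProcessor.from_pretrained(
    "allenai/Molmo-7B-D-0924", trust_remote_code=True
)
model = AutoModelForCausalLM.from_pretrained(
    "allenai/Molmo-7B-D-0924", trust_remote_code=True, torch_dtype=torch.float16
).to(device)

# Example image from the model card.
image = Image.open(requests.get("https://picsum.photos/id/237/536/354", stream=True).raw)
inputs = processor.process(images=[image], text="Describe this image.")

# Move inputs to the device and make a batch of size 1.
inputs = {k: v.to(device).unsqueeze(0) for k, v in inputs.items()}

# On affected torch builds this fails inside get_causal_attention_bias
# with "RuntimeError: unsupported scalarType".
output = model.generate_from_batch(
    inputs,
    GenerationConfig(max_new_tokens=200, stop_strings="<|endoftext|>"),
    tokenizer=processor.tokenizer,
)
print(processor.tokenizer.decode(output[0, inputs["input_ids"].size(1):], skip_special_tokens=True))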
