Error running the example code
I am trying to run the example code in a multi-gpu setting but it's failing :(
from transformers import T5ForConditionalGeneration, AutoTokenizer
import torch
model = T5ForConditionalGeneration.from_pretrained("google/flan-ul2", device_map="auto", load_in_8bit=True)
tokenizer = AutoTokenizer.from_pretrained("google/flan-ul2")
input_string = "Answer the following question by reasoning step by step. The cafeteria had 23 apples. If they used 20 for lunch, and bought 6 more, how many apple do they have?"
inputs = tokenizer(input_string, return_tensors="pt").input_ids.to("cuda")
outputs = model.generate(inputs, max_length=200)
print(tokenizer.decode(outputs[0]))
Output:
RuntimeError: Expected all tensors to be on the same device, but found at least two devices, cuda:0 and cuda:1! (when checking argument for argument mat2 in method wrapper_mm)
Hm, very weird. Can you try the latest versions of transformers and accelerate?
pip install --upgrade accelerate
pip install --upgrade git+https://github.com/huggingface/transformers.git@main
@will33am
You need to play around with device_map, since UL2 includes T5 blocks with residual connections, which causes an error if a block is split across multiple GPUs (Ref: https://github.com/huggingface/blog/blob/main/accelerate-large-models.md).
Use no_split_module_classes=["T5Block"] and also map the lm_head to the same device as the embedding layer.
Here's an example script that works on my env with 3 low memory (16GB) GPUs.
https://github.com/akkikiki/huggingface_examples/blob/main/examples/load_flan_ul2.py
@ybelkada
This is the same issue as what we encountered the other day when you were working on fixing multi-gpu settings for BLIP-2 :)
https://github.com/huggingface/transformers/pull/21707
EDIT: Fixed some grammatical mistakes.
@akkikiki
When I try your script load_flan_ul2.py on a single 16GB GPU I get this error:
ValueError: If you want to offload some keys to cpu or disk, you need to set load_in_8bit_fp32_cpu_offload=True. Note that these modules will not be converted to 8-bit but kept in 32-bit.
@SamuelAzran Yeah, that script assumes you have four 16 GB RAM GPUs and you need to offload it to CPU when you have only one.
@diegomontoya The reason for setting max_memory[0]=10GiB is that lm_head is moved to GPU 0 in an ad-hoc way (and the input tensor is loaded to GPU 0 before running the forward pass). Otherwise, you'll encounter the same RuntimeError: Expected all tensors to be on the same device, when you run model.generate.
You can play around with this max memory (it does not have to be 10GiB, and there may be smarter ways of doing this), but without it, Accelerate does not account for the ad-hoc move of the lm_head, which causes an OOM on GPU 0.
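To make the max_memory idea above concrete, here is a minimal sketch that builds a dict in the {device_index: "NGiB"} form Accelerate accepts, with a lower cap on GPU 0 to leave headroom for the lm_head and the inputs. The helper name build_max_memory and the numbers are illustrative assumptions, not taken from the linked script.

```python
def build_max_memory(num_gpus, per_gpu_gib, gpu0_gib):
    """Return a max_memory dict with a reduced budget on GPU 0.

    Hypothetical helper for illustration; the values are placeholders
    and should be tuned for the actual hardware.
    """
    if num_gpus < 1:
        raise ValueError("num_gpus must be at least 1")
    if gpu0_gib > per_gpu_gib:
        raise ValueError("gpu0_gib should not exceed per_gpu_gib")
    # Same budget for every GPU to start with.
    max_memory = {i: f"{per_gpu_gib}GiB" for i in range(num_gpus)}
    # Reserve headroom on GPU 0 for modules moved there by hand.
    max_memory[0] = f"{gpu0_gib}GiB"
    return max_memory


max_memory = build_max_memory(num_gpus=3, per_gpu_gib=16, gpu0_gib=10)
print(max_memory)
```

The resulting dict can be passed as the max_memory argument when the device map is inferred. A per-GPU budget somewhat below the physical memory is usually the safer choice.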
@akkikiki Thanks for sharing an example script to run flan-ul2 on multiple GPUs. I've tried it on an instance with 4 V100 GPUs (each has 16 GB memory). It didn't throw any error, but the output didn't look correct to me either.
I got the following output when I ran your script (without changing anything except the file name):
python flan_ul2_runbook.py
/opt/conda/envs/flanul2/lib/python3.8/site-packages/bitsandbytes/cuda_setup/paths.py:98: UserWarning: /opt/conda/envs/flanul2 did not contain libcudart.so as expected! Searching further paths...
warn(
CUDA SETUP: CUDA path found: /usr/local/cuda/targets/x86_64-linux/lib/libcudart.so
CUDA_SETUP: Detected CUDA version 116
CUDA_SETUP: Loading binary /opt/conda/envs/flanul2/lib/python3.8/site-packages/bitsandbytes/libbitsandbytes_cuda116_nocublaslt.so...
<pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad>
In case it helps, here is my setup:
>>> torch.__version__
'1.13.1+cu116'
>>> transformers.__version__
'4.26.1'
>>> accelerate.__version__
'0.17.1'
Do you have any idea what the problem might be?
@cyt78
Have a look at https://github.com/huggingface/transformers/issues/21987 :)
TL;DR: Play around with N in BitsAndBytesConfig(llm_int8_threshold=N)
@cyt78
N=5 worked for me. Basically, it's a trade-off between memory usage and accuracy (on V100, which does not have int8 support at the hardware level; I believe it's a different story for A100 and others).
If you are talking about https://github.com/akkikiki/huggingface_examples/blob/main/examples/load_flan_ul2.py#L17 , then it does load in int8 with load_in_8bit=True
@akkikiki My account was created today and therefore I can't post any more comments today. So, I'll reply with this new account :).
Yes, I was talking about the script that you pointed out, and I realised that it does indeed use int8. My bad! I've tried a couple of different N values ranging from 1.0 to 10.0, including 5.0, and I got a CUDA out of memory error every single time. I found this interesting since you mentioned that you managed to run it on 3 GPUs with 16 GB memory each. I'm trying to run the exact same script on 4 GPUs with 16 GB memory each. Can you think of any possible reason for the out of memory error in my case?
Here are the changes I made to the script to integrate your suggestion:
from transformers import BitsAndBytesConfig
quantization_config = BitsAndBytesConfig(llm_int8_threshold=5.0)
model = T5ForConditionalGeneration.from_pretrained(model_id, device_map=device_map, load_in_8bit=True, quantization_config=quantization_config)
@cyt79
Yeah, it looks like memory usage goes up when the quantization threshold is lower (not 100% sure why), so the 3-GPU example is the one without BitsAndBytesConfig(llm_int8_threshold=5.0) set. It's best to play around with lowering max_memory to avoid the OOM (and use CPU offloading if needed).
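One way to build intuition for the threshold: LLM.int8() treats hidden-state values whose magnitude exceeds llm_int8_threshold as outliers and handles those features in fp16 rather than int8, so a lower N tends to push more features onto the fp16 path. The toy sketch below only counts which feature columns would qualify as outliers; it is a simplified illustration with made-up numbers, not the bitsandbytes implementation, and outlier_columns is a hypothetical helper.

```python
def outlier_columns(hidden_states, threshold):
    """Indices of feature columns holding at least one value whose
    magnitude exceeds the threshold (toy model of outlier detection)."""
    num_cols = len(hidden_states[0])
    return [
        col
        for col in range(num_cols)
        if any(abs(row[col]) > threshold for row in hidden_states)
    ]


# Made-up activations: two tokens, four features.
hidden_states = [
    [0.3, -7.2, 1.1, 5.5],
    [0.9, 2.0, -0.4, 4.0],
]

for n in (10.0, 6.0, 5.0):
    # Lower thresholds flag more columns as outliers.
    print(n, outlier_columns(hidden_states, n))
```

In this toy data, the number of columns flagged as outliers grows as N drops from 10.0 to 5.0, which is consistent with the higher memory usage reported above.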
@akkikiki
Thanks for sharing the example code and all the explanations! I would like to ask a related question about specifying the no_split_module_classes parameter, since there's no official documentation of infer_auto_device_map. I looked into the source code, and it seems that it gets the components of the model by checking model.named_parameters(). However, I didn't find anything named T5Block in those parameters of flan_ul2. I wonder if it's defined elsewhere, and how you figured out how to specify this parameter properly?
I also tried to run the original T5 model by replacing flan_ul2 with t5-large, but it triggered the Expected all tensors to be on the same device error too, and this time specifying no_split_module_classes=["T5Block"] didn't help, and neither did moving lm_head to gpu:0. Does that mean that each time we want to run a model, we'll have to check the source code, look for the residual connections, and preserve them by specifying no_split_module_classes? Thanks!
EDIT: There is no problem doing inference with t5-large by swapping flan_ul2 with t5-large. The multiple device error was caused by the fine-tuning part of my code.
@YzyLmc
I believe you should share your t5-large script to give more context, since I did not have any trouble swapping flan_ul2 with t5-large. As long as the model is in the same T5 family, no_split_module_classes is not the issue and the cause is something else.
How I found out was basically "connecting the dots": the Accelerate documentation (https://github.com/huggingface/blog/blob/main/accelerate-large-models.md) on OPTDecoderLayer, the BLOOM blog, esp. the naive pipeline parallelism section (https://huggingface.co./blog/bloom-megatron-deepspeed#pipeline-parallelism), to understand the basic assumption that layers should not be dispatched across multiple GPUs, my experience with the fix for loading Blip-2 Flan-T5-XL on multiple GPUs https://github.com/huggingface/transformers/pull/21707 (taking a hint from the warning ybelkada@ raised in his PR), and getting my hands dirty by actually debugging and printing out the named_parameters.
I believe that with other types of multi-GPU parallelism (e.g., Tensor Parallelism or TP), where we do not have to assume that the same layer (or specifically the weight tensor associated with that layer) is on the same GPU (I guess; I'm not an expert in TP, so somebody correct me if I'm wrong), we probably do not have to care much about no_split_module_classes, but somebody would have to teach us how in a simple way :)
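On the earlier question of where T5Block comes from: the entries in no_split_module_classes are module class names, not parameter names, which is why the string does not show up in named_parameters(). A minimal sketch of listing the class names follows. It works on any object exposing a modules() iterator, as torch.nn.Module does; the Toy* classes are hypothetical stand-ins so the snippet runs without PyTorch.

```python
def module_class_names(model):
    """Sorted class names of the model and all of its submodules."""
    return sorted({type(m).__name__ for m in model.modules()})


# Hypothetical stand-ins that mimic torch.nn.Module.modules().
class ToyLinear:
    def modules(self):
        yield self


class ToyBlock:
    def __init__(self):
        self.layer = ToyLinear()

    def modules(self):
        yield self
        yield from self.layer.modules()


class ToyModel:
    def __init__(self):
        self.blocks = [ToyBlock(), ToyBlock()]

    def modules(self):
        yield self
        for block in self.blocks:
            yield from block.modules()


print(module_class_names(ToyModel()))  # ['ToyBlock', 'ToyLinear', 'ToyModel']
```

Calling module_class_names on a loaded T5-family model should list T5Block among the results. Transformers models also tend to carry a _no_split_modules attribute with the recommended class names, though that is a private attribute and may change between versions.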
@akkikiki
Thanks for your quick response! I just ran more tests, and you are totally correct that no_split_module_classes wasn't the issue in my case. I was trying to fine-tune the model, and it was the fine-tuning part that caused this error, which is a separate issue; inference worked perfectly after swapping flan_ul2 with t5-large. I'll edit my earlier post. Sorry about that!
Also thank you for sharing your experience and insights. I can imagine how much effort you have put into this to make it work, and I hope huggingface people will make clear documentation on this to make it less burdensome.
@akkikiki Many thanks for your reply! I've moved to a bigger instance which has 8 V100 GPUs (each has 32GB memory). Here, I could run the official example code from the model card tab in bfloat16 and got the expected result. Then, I tried to run your script again. This time, I got a different error:
Traceback (most recent call last):
File "flan_runbook.py", line 18, in <module>
device_map['lm_head'] = device_map["decoder.embed_tokens"]
KeyError: 'decoder.embed_tokens'
I guess "decoder.embed_tokens" has been renamed since you wrote this script. Do you know where I can check its current name?
@akkikiki When possible, can you give me some pointers on how to fix the above error? Many thanks in advance!
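One possible explanation for the KeyError above (an assumption, not confirmed in this thread): a device map can use coarser keys when a whole submodule fits on one device, so with larger GPUs the map may contain a single "decoder" entry instead of "decoder.embed_tokens". Printing device_map shows the actual keys. The sketch below, using a hypothetical resolve_device helper, looks up a module's device by falling back to its parent prefixes.

```python
def resolve_device(device_map, module_name):
    """Return the device for module_name, walking up to parent
    prefixes ("a.b.c" -> "a.b" -> "a" -> "") until a key matches."""
    name = module_name
    while True:
        if name in device_map:
            return device_map[name]
        if not name:
            # No prefix matched, not even the root entry "".
            raise KeyError(module_name)
        # Drop the last dotted component; a bare name becomes "".
        name = name.rpartition(".")[0]


# Illustrative coarse map in which the decoder was placed as a whole.
coarse_map = {"shared": 0, "encoder": 0, "decoder": 1, "lm_head": 0}
coarse_map["lm_head"] = resolve_device(coarse_map, "decoder.embed_tokens")
print(coarse_map["lm_head"])  # 1
```

With this lookup, the lm_head assignment keeps working whether the map lists "decoder.embed_tokens" explicitly or only its parent.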
@akkikiki I ran inference on blip2_flant5xxl model in a two 3090 environment. Following (https://github.com/huggingface/transformers/pull/21707), I use
configuration = Blip2Config.from_pretrained("Salesforce/blip2-flan-t5-xxl")
with init_empty_weights():
    model = Blip2ForConditionalGeneration(configuration)
device_map = infer_auto_device_map(model, no_split_module_classes=["T5Block"], max_memory={0: "24GiB", 1: "24GiB"})
device_map['language_model.lm_head'] = device_map["language_model.decoder.embed_tokens"]  # keep the generated tokens and input_ids on the same device
model = Blip2ForConditionalGeneration(configuration).from_pretrained("Salesforce/blip2-flan-t5-xxl", torch_dtype=torch.float16, device_map=device_map, cache_dir="/mnt/14T-disk/code/HF_model/hub")
and device_map is:
{'query_tokens': 0, 'vision_model': 0, 'qformer': 0, 'language_projection': 0, 'language_model.shared': 0, 'language_model.decoder.embed_tokens': 0, 'language_model.encoder': 0, 'language_model.decoder.block.0': 0, 'language_model.decoder.block.1': 1, 'language_model.decoder.block.2': 1, 'language_model.decoder.block.3': 1, 'language_model.decoder.block.4': 1, 'language_model.decoder.block.5': 1, 'language_model.decoder.block.6': 1, 'language_model.decoder.block.7': 1, 'language_model.decoder.block.8': 1, 'language_model.decoder.block.9': 1, 'language_model.decoder.block.10': 1, 'language_model.decoder.block.11': 1, 'language_model.decoder.block.12': 1, 'language_model.decoder.block.13': 1, 'language_model.decoder.block.14': 1, 'language_model.decoder.block.15': 1, 'language_model.decoder.block.16': 1, 'language_model.decoder.block.17': 1, 'language_model.decoder.block.18': 1, 'language_model.decoder.block.19': 1, 'language_model.decoder.block.20': 1, 'language_model.decoder.block.21': 1, 'language_model.decoder.block.22': 1, 'language_model.decoder.block.23': 1, 'language_model.decoder.final_layer_norm': 1, 'language_model.decoder.dropout': 1, 'language_model.lm_head': 0}
However, when I ran inference, I received the following error message but still got an inference result from the model. I don't understand why this happened. Is the result reliable in this case? Thank you.
--- Logging error ---
Traceback (most recent call last):
File "/home/whl/anaconda3/envs/blip2/lib/python3.10/logging/__init__.py", line 1100, in emit
msg = self.format(record)
File "/home/whl/anaconda3/envs/blip2/lib/python3.10/logging/__init__.py", line 943, in format
return fmt.format(record)
File "/home/whl/anaconda3/envs/blip2/lib/python3.10/logging/__init__.py", line 678, in format
record.message = record.getMessage()
File "/home/whl/anaconda3/envs/blip2/lib/python3.10/logging/__init__.py", line 368, in getMessage
msg = msg % self.args
TypeError: not all arguments converted during string formatting
Call stack:
File "/home/whl/anaconda3/envs/blip2/lib/python3.10/runpy.py", line 196, in _run_module_as_main
return _run_code(code, main_globals, None,
File "/home/whl/anaconda3/envs/blip2/lib/python3.10/runpy.py", line 86, in _run_code
exec(code, run_globals)
File "/home/whl/anaconda3/envs/blip2/lib/python3.10/site-packages/ipykernel_launcher.py", line 17, in <module>
app.launch_new_instance()
File "/home/whl/anaconda3/envs/blip2/lib/python3.10/site-packages/traitlets/config/application.py", line 992, in launch_instance
app.start()
File "/home/whl/anaconda3/envs/blip2/lib/python3.10/site-packages/ipykernel/kernelapp.py", line 711, in start
self.io_loop.start()
File "/home/whl/anaconda3/envs/blip2/lib/python3.10/site-packages/tornado/platform/asyncio.py", line 215, in start
self.asyncio_loop.run_forever()
File "/home/whl/anaconda3/envs/blip2/lib/python3.10/asyncio/base_events.py", line 603, in run_forever
self._run_once()
File "/home/whl/anaconda3/envs/blip2/lib/python3.10/asyncio/base_events.py", line 1906, in _run_once
handle._run()
File "/home/whl/anaconda3/envs/blip2/lib/python3.10/asyncio/events.py", line 80, in _run
self._context.run(self._callback, *self._args)
File "/home/whl/anaconda3/envs/blip2/lib/python3.10/site-packages/ipykernel/kernelbase.py", line 510, in dispatch_queue
await self.process_one()
File "/home/whl/anaconda3/envs/blip2/lib/python3.10/site-packages/ipykernel/kernelbase.py", line 499, in process_one
await dispatch(*args)
File "/home/whl/anaconda3/envs/blip2/lib/python3.10/site-packages/ipykernel/kernelbase.py", line 406, in dispatch_shell
await result
File "/home/whl/anaconda3/envs/blip2/lib/python3.10/site-packages/ipykernel/kernelbase.py", line 729, in execute_request
reply_content = await reply_content
File "/home/whl/anaconda3/envs/blip2/lib/python3.10/site-packages/ipykernel/ipkernel.py", line 411, in do_execute
res = shell.run_cell(
File "/home/whl/anaconda3/envs/blip2/lib/python3.10/site-packages/ipykernel/zmqshell.py", line 531, in run_cell
return super().run_cell(*args, **kwargs)
File "/home/whl/anaconda3/envs/blip2/lib/python3.10/site-packages/IPython/core/interactiveshell.py", line 2961, in run_cell
result = self._run_cell(
File "/home/whl/anaconda3/envs/blip2/lib/python3.10/site-packages/IPython/core/interactiveshell.py", line 3016, in _run_cell
result = runner(coro)
File "/home/whl/anaconda3/envs/blip2/lib/python3.10/site-packages/IPython/core/async_helpers.py", line 129, in pseudo_sync_runner
coro.send(None)
File "/home/whl/anaconda3/envs/blip2/lib/python3.10/site-packages/IPython/core/interactiveshell.py", line 3221, in run_cell_async
has_raised = await self.run_ast_nodes(code_ast.body, cell_name,
File "/home/whl/anaconda3/envs/blip2/lib/python3.10/site-packages/IPython/core/interactiveshell.py", line 3400, in run_ast_nodes
if await self.run_code(code, result, async_=asy):
File "/home/whl/anaconda3/envs/blip2/lib/python3.10/site-packages/IPython/core/interactiveshell.py", line 3460, in run_code
exec(code_obj, self.user_global_ns, self.user_ns)
File "/tmp/ipykernel_1716702/309436947.py", line 11, in
out = model.generate(**inputs)
File "/home/whl/anaconda3/envs/blip2/lib/python3.10/site-packages/torch/autograd/grad_mode.py", line 27, in decorate_context
return func(*args, **kwargs)
File "/home/whl/anaconda3/envs/blip2/lib/python3.10/site-packages/transformers/models/blip_2/modeling_blip_2.py", line 1805, in generate
self._preprocess_accelerate()
File "/home/whl/anaconda3/envs/blip2/lib/python3.10/site-packages/transformers/models/blip_2/modeling_blip_2.py", line 1607, in _preprocess_accelerate
logger.warning(
Message: 'The language_model is not in the hf_device_map dictionary and you are running your script in a multi-GPU environment. this may lead to unexpected behavior when using accelerate. Please pass a device_map that contains language_model to remove this warning. Please refer to https://github.com/huggingface/blog/blob/main/accelerate-large-models.md for'
Arguments: (' more details on creating a device_map for large models.',)
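Reading the traceback above, the TypeError is raised inside the logging module while it formats the warning, not inside generate itself: the log call received a message plus one extra string argument, and the message has no % placeholder to consume it. That would explain why generation still produced a result. The sketch below reproduces the same formatting failure with placeholder strings. The helper format_like_logging is hypothetical and only mimics the `msg % args` step shown in the traceback.

```python
def format_like_logging(msg, args):
    """Apply %-formatting only when args are given, mirroring the
    `msg = msg % self.args` step from the traceback."""
    return msg % args if args else msg


try:
    # A message with no placeholder plus one extra argument.
    format_like_logging("first half of the warning for", (" second half.",))
    error = None
except TypeError as exc:
    error = str(exc)

print(error)  # not all arguments converted during string formatting
```

The warning text itself is still worth a look: the device_map above only has keys under language_model (such as language_model.encoder), not a top-level language_model entry, which appears to be what the check complains about.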
@WHL95 did you get any solution for this? I am also getting the same error.
The following solution worked for me. The error was due to the language model layers being split across the available GPUs.
import torch
from transformers import (
    Blip2VisionConfig,
    Blip2QFormerConfig,
    OPTConfig,
    Blip2Config,
    Blip2ForConditionalGeneration,
    Blip2Processor,
)
from accelerate import init_empty_weights, infer_auto_device_map
from accelerate.utils import get_balanced_memory
model_id = "Salesforce/blip2-opt-6.7b"
config = Blip2Config.from_pretrained(model_id)
processor = Blip2Processor.from_pretrained(model_id)
# Modules of these classes are kept whole on a single device.
no_split = ["OPTDecoderLayer", "Attention", "MLP", "LayerNorm", "Linear"]
# Build the model skeleton without allocating real weights.
with init_empty_weights():
    model = Blip2ForConditionalGeneration(config)
max_memory = get_balanced_memory(model, max_memory=None, no_split_module_classes=no_split, dtype=torch.float16, low_zero=False)
device_map = infer_auto_device_map(model, no_split_module_classes=no_split, dtype=torch.float16, max_memory=max_memory)
# Keep lm_head on the same device as the input embeddings.
device_map['language_model.lm_head'] = device_map['language_model.model.decoder.embed_tokens']
model = Blip2ForConditionalGeneration.from_pretrained(model_id, device_map=device_map, torch_dtype=torch.float16)