Error: prefill failed
I am observing an error during inference-time prefill, using a single A10G and sample_client.py. Any idea what is happening with the queue for paged attention here?
docker run -it --runtime=nvidia --gpus all \
-p 8033:8033 \
-v $HF_HUB_CACHE:/models \
-e HF_HUB_CACHE=/models \
-e TRANSFORMERS_CACHE=/models \
-e MODEL_NAME=instructlab/granite-7b-lab \
-e SPECULATOR_NAME=ibm/granite-7b-lab-accelerator \
-e FLASH_ATTENTION=true \
-e PAGED_ATTENTION=true \
-e DTYPE=float16 \
$TGIS_IMAGE
with the following traceback:
2024-05-07T15:04:59.091453Z INFO text_generation_router::grpc_server: src/grpc_server.rs:79: gRPC server started on port 8033
2024-05-07T15:05:01.092499Z INFO text_generation_router::server: src/server.rs:484: HTTP server started on port 3000
2024-05-07T15:06:00.284156Z INFO text_generation_router::queue: src/queue.rs:410: Chose 1 out of 1 requests from buffer, total now 1
2024-05-07T15:06:00.284229Z INFO text_generation_router::batcher: src/batcher.rs:581: New or updated batch #1 of size 1 (35 total toks), max new toks = 100
Shard 0: ERROR:root:Prefill failed
Shard 0: Traceback (most recent call last):
Shard 0: File "/opt/tgis/lib/python3.11/site-packages/text_generation_server/server.py", line 43, in func_with_log
Shard 0: return await func(*args, **kwargs)
Shard 0: ^^^^^^^^^^^^^^^^^^^^^^^^^^^
Shard 0: File "/opt/tgis/lib/python3.11/site-packages/text_generation_server/server.py", line 145, in Prefill
Shard 0: output_tokens, input_token_info, decode_errors, forward_time_ns = self.model.generate_token(
Shard 0: ^^^^^^^^^^^^^^^^^^^^^^^^^^
Shard 0: File "/opt/tgis/lib/python3.11/site-packages/text_generation_server/models/paged_causal_lm.py", line 595, in generate_token
Shard 0: t_forward_ns = self._prefill(
Shard 0: ^^^^^^^^^^^^^^
Shard 0: File "/opt/tgis/lib/python3.11/site-packages/text_generation_server/models/paged_causal_lm.py", line 408, in _prefill
Shard 0: input_ids, position_ids, cache_data = prepare_inputs_for_prefill(
Shard 0: ^^^^^^^^^^^^^^^^^^^^^^^^^^^
Shard 0: File "/opt/tgis/lib/python3.11/site-packages/text_generation_server/utils/paged.py", line 117, in prepare_inputs_for_prefill
Shard 0: cache_data = kv_cache_manager.allocate_tokens(num_tokens_per_sequence)
Shard 0: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
Shard 0: File "/opt/tgis/lib/python3.11/site-packages/fms_extras/utils/cache/paged.py", line 1211, in allocate_tokens
Shard 0: return self._allocate_prompt_tokens(num_tokens_per_sequence)
Shard 0: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
Shard 0: File "/opt/tgis/lib/python3.11/site-packages/fms_extras/utils/cache/paged.py", line 1220, in _allocate_prompt_tokens
Shard 0: sequence_ids = self._get_unassigned_sequence_ids(len(num_tokens_per_sequence))
Shard 0: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
Shard 0: File "/opt/tgis/lib/python3.11/site-packages/fms_extras/utils/cache/paged.py", line 1097, in _get_unassigned_sequence_ids
Shard 0: return [self.unused_keys.get_nowait() for _ in range(num_sequences)]
Shard 0: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
Shard 0: File "/opt/tgis/lib/python3.11/site-packages/fms_extras/utils/cache/paged.py", line 1097, in <listcomp>
Shard 0: return [self.unused_keys.get_nowait() for _ in range(num_sequences)]
Shard 0: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
Shard 0: File "/opt/tgis/lib/python3.11/queue.py", line 199, in get_nowait
Shard 0: return self.get(block=False)
Shard 0: ^^^^^^^^^^^^^^^^^^^^^
2024-05-07T15:06:00.313019Z INFO text_generation_router::batcher: src/batcher.rs:641: Prefill took 28.776335ms for 1 inputs, 35 total tokens
2024-05-07T15:06:00.313066Z ERROR generate_stream{input="Below is an instruction that des..." prefix_id=None correlation_id="<none>" input_bytes=144 params=Some(Parameters { method: Greedy, sampling: None, stopping: Some(StoppingCriteria { max_new_tokens: 100, min_new_tokens: 100, time_limit_millis: 0, stop_sequences: [], include_stop_sequence: None }), response: None, decoding: None, truncate_input_tokens: 0 })}: text_generation_router::grpc_server: src/grpc_server.rs:296: Streaming response failed after 0 tokens, output so far: '""': Request failed during generation: Unexpected <class '_queue.Empty'>
Shard 0: File "/opt/tgis/lib/python3.11/queue.py", line 168, in get
Shard 0: raise Empty
Shard 0: _queue.Empty
Shard 0: ERROR:grpc._cython.cygrpc:Unexpected [Empty] raised by servicer method [/generate.v1.TextGenerationService/Prefill]
Shard 0: Traceback (most recent call last):
Shard 0: File "src/python/grpcio/grpc/_cython/_cygrpc/aio/server.pyx.pxi", line 689, in grpc._cython.cygrpc._handle_exceptions
Shard 0: File "src/python/grpcio/grpc/_cython/_cygrpc/aio/server.pyx.pxi", line 821, in _handle_rpc
Shard 0: File "src/python/grpcio/grpc/_cython/_cygrpc/aio/server.pyx.pxi", line 554, in _handle_unary_unary_rpc
Shard 0: File "src/python/grpcio/grpc/_cython/_cygrpc/aio/server.pyx.pxi", line 408, in _finish_handler_with_unary_response
Shard 0: File "/opt/tgis/lib/python3.11/site-packages/text_generation_server/server.py", line 43, in func_with_log
Shard 0: return await func(*args, **kwargs)
Shard 0: ^^^^^^^^^^^^^^^^^^^^^^^^^^^
Shard 0: File "/opt/tgis/lib/python3.11/site-packages/text_generation_server/server.py", line 145, in Prefill
Shard 0: output_tokens, input_token_info, decode_errors, forward_time_ns = self.model.generate_token(
Shard 0: ^^^^^^^^^^^^^^^^^^^^^^^^^^
Shard 0: File "/opt/tgis/lib/python3.11/site-packages/text_generation_server/models/paged_causal_lm.py", line 595, in generate_token
Shard 0: t_forward_ns = self._prefill(
Shard 0: ^^^^^^^^^^^^^^
Shard 0: File "/opt/tgis/lib/python3.11/site-packages/text_generation_server/models/paged_causal_lm.py", line 408, in _prefill
Shard 0: input_ids, position_ids, cache_data = prepare_inputs_for_prefill(
Shard 0: ^^^^^^^^^^^^^^^^^^^^^^^^^^^
Shard 0: File "/opt/tgis/lib/python3.11/site-packages/text_generation_server/utils/paged.py", line 117, in prepare_inputs_for_prefill
Shard 0: cache_data = kv_cache_manager.allocate_tokens(num_tokens_per_sequence)
Shard 0: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
Shard 0: File "/opt/tgis/lib/python3.11/site-packages/fms_extras/utils/cache/paged.py", line 1211, in allocate_tokens
Shard 0: return self._allocate_prompt_tokens(num_tokens_per_sequence)
Shard 0: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
Shard 0: File "/opt/tgis/lib/python3.11/site-packages/fms_extras/utils/cache/paged.py", line 1220, in _allocate_prompt_tokens
Shard 0: sequence_ids = self._get_unassigned_sequence_ids(len(num_tokens_per_sequence))
Shard 0: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
Shard 0: File "/opt/tgis/lib/python3.11/site-packages/fms_extras/utils/cache/paged.py", line 1097, in _get_unassigned_sequence_ids
Shard 0: return [self.unused_keys.get_nowait() for _ in range(num_sequences)]
Shard 0: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
Shard 0: File "/opt/tgis/lib/python3.11/site-packages/fms_extras/utils/cache/paged.py", line 1097, in <listcomp>
Shard 0: return [self.unused_keys.get_nowait() for _ in range(num_sequences)]
Shard 0: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
Shard 0: File "/opt/tgis/lib/python3.11/queue.py", line 199, in get_nowait
Shard 0: return self.get(block=False)
Shard 0: ^^^^^^^^^^^^^^^^^^^^^
Shard 0: File "/opt/tgis/lib/python3.11/queue.py", line 168, in get
Shard 0: raise Empty
Shard 0: _queue.Empty
I have a feeling this has to do with not having enough memory to allocate KV cache blocks. I have reproduced this on an L4 machine and will update here when I have more info.
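To make the queue behavior concrete, here is a minimal sketch of what the traceback shows. The class and constructor below are simplified assumptions, not the actual fms_extras implementation; only unused_keys and _get_unassigned_sequence_ids are taken from the traceback. Sequence ids come from a fixed pool, so if the pool starts empty, the very first prefill raises _queue.Empty:

import queue

class KVCacheManagerSketch:
    # Hypothetical, simplified stand-in for the fms_extras paged KV
    # cache manager: the pool of free sequence ids is sized by the
    # number of GPU blocks pre-allocated at load time.
    def __init__(self, num_gpu_blocks):
        self.unused_keys = queue.Queue()
        for seq_id in range(num_gpu_blocks):
            self.unused_keys.put(seq_id)

    def _get_unassigned_sequence_ids(self, num_sequences):
        # Same pattern as fms_extras/utils/cache/paged.py line 1097:
        # get_nowait() raises _queue.Empty once the pool runs dry.
        return [self.unused_keys.get_nowait() for _ in range(num_sequences)]

manager = KVCacheManagerSketch(num_gpu_blocks=0)
manager._get_unassigned_sequence_ids(1)  # raises queue.Empty, as in the log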
We have pushed a new image, quay.io/wxpe/text-gen-server:main.e87d462, which logs the number of GPU blocks that are pre-allocated at load time. It turns out that with granite-7b, 0 blocks were being allocated, which resulted in the error above. For now, you can work around this by setting KV_CACHE_MANAGER_NUM_GPU_BLOCKS=150. This number can be increased further, but 150 should be enough to try out this sample (feel free to raise it if your GPU can handle more blocks; we tested this on an L4 GPU). We are also looking to add TP support, which should help with using multiple GPUs.
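Concretely, the docker run command from the top of this issue becomes the following; only the image tag and the KV_CACHE_MANAGER_NUM_GPU_BLOCKS override are new, everything else is unchanged:

docker run -it --runtime=nvidia --gpus all \
-p 8033:8033 \
-v $HF_HUB_CACHE:/models \
-e HF_HUB_CACHE=/models \
-e TRANSFORMERS_CACHE=/models \
-e MODEL_NAME=instructlab/granite-7b-lab \
-e SPECULATOR_NAME=ibm/granite-7b-lab-accelerator \
-e FLASH_ATTENTION=true \
-e PAGED_ATTENTION=true \
-e DTYPE=float16 \
-e KV_CACHE_MANAGER_NUM_GPU_BLOCKS=150 \
quay.io/wxpe/text-gen-server:main.e87d462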
Thanks Joshua, I can confirm this new container + env variable allows successful inference with 24GB VRAM. Looking forward to sharding support!
@ulrichkr
We have pushed out a new image (quay.io/wxpe/text-gen-server:main.ddc56ee) which enables TP support. We are still investigating an issue with the number of blocks that are created automatically, so please continue to set KV_CACHE_MANAGER_NUM_GPU_BLOCKS manually (adjust this number based on available GPU memory). To enable TP support, set NUM_GPUS=2 and NUM_SHARD=2 (depending on the number of GPUs you want to use); see the sketch below.
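A sketch of the two-GPU invocation under those settings, assuming the same setup as the original command above (the TP variables and image tag are the only changes; KV_CACHE_MANAGER_NUM_GPU_BLOCKS is still set manually as noted, and the value of 150 here is just the earlier workaround number):

docker run -it --runtime=nvidia --gpus all \
-p 8033:8033 \
-v $HF_HUB_CACHE:/models \
-e HF_HUB_CACHE=/models \
-e TRANSFORMERS_CACHE=/models \
-e MODEL_NAME=instructlab/granite-7b-lab \
-e SPECULATOR_NAME=ibm/granite-7b-lab-accelerator \
-e FLASH_ATTENTION=true \
-e PAGED_ATTENTION=true \
-e DTYPE=float16 \
-e KV_CACHE_MANAGER_NUM_GPU_BLOCKS=150 \
-e NUM_GPUS=2 \
-e NUM_SHARD=2 \
quay.io/wxpe/text-gen-server:main.ddc56ee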