Shard 0 never ready when given the speculator option?
If I use the model card's docker command (after using the model card commands to download the models and pull the image TGIS_IMAGE=quay.io/wxpe/text-gen-server:main.ee927a4), I continuously get "INFO text_generation_launcher: Waiting for shard 0 to be ready...". The command and log output follow:
docker run -d --rm --gpus all \
--name my-tgis-server \
-p 8033:8033 \
-v $HF_HUB_CACHE:/models \
-e HF_HUB_CACHE=/models \
-e TRANSFORMERS_CACHE=/models \
-e MODEL_NAME=meta-llama/Meta-Llama-3-8B-Instruct \
-e SPECULATOR_NAME=ibm-fms/llama3-8b-accelerator \
-e FLASH_ATTENTION=true \
-e PAGED_ATTENTION=true \
-e DTYPE=float16 \
$TGIS_IMAGE
2024-05-06T17:35:31.667361Z INFO text_generation_launcher: TGIS Commit hash: ee927a407a27e831d6eb12f564b8e8a23fc33759
2024-05-06T17:35:31.667386Z INFO text_generation_launcher: Launcher args: Args { model_name: "meta-llama/Meta-Llama-3-8B-Instruct", revision: None, deployment_framework: "hf_transformers", dtype: Some("float16"), dtype_str: None, quantize: None, num_shard: None, max_concurrent_requests: 512, max_sequence_length: None, max_new_tokens: 1024, max_batch_size: 12, max_prefill_padding: 0.2, batch_safety_margin: 20, max_waiting_tokens: 24, port: 3000, grpc_port: 8033, shard_uds_path: "/tmp/text-generation-server", master_addr: "localhost", master_port: 29500, json_output: false, tls_cert_path: None, tls_key_path: None, tls_client_ca_cert_path: None, output_special_tokens: false, cuda_process_memory_fraction: 1.0, default_include_stop_seqs: true, otlp_endpoint: None, otlp_service_name: None }
2024-05-06T17:35:31.667398Z INFO text_generation_launcher: Inferring num_shard = 1 from CUDA_VISIBLE_DEVICES/NVIDIA_VISIBLE_DEVICES
2024-05-06T17:35:31.667434Z INFO text_generation_launcher: Saving fast tokenizer for `meta-llama/Meta-Llama-3-8B-Instruct` to `/tmp/94d1d104-6e33-45ef-a420-d8359b872014`
/opt/tgis/lib/python3.11/site-packages/transformers/utils/hub.py:124: FutureWarning: Using `TRANSFORMERS_CACHE` is deprecated and will be removed in v5 of Transformers. Use `HF_HOME` instead.
warnings.warn(
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
2024-05-06T17:35:33.892187Z INFO text_generation_launcher: Loaded max_sequence_length from model config.json: 8192
2024-05-06T17:35:33.892208Z INFO text_generation_launcher: Setting PYTORCH_CUDA_ALLOC_CONF to default value: expandable_segments:True
2024-05-06T17:35:33.892401Z INFO text_generation_launcher: Starting shard 0
Shard 0: /opt/tgis/lib/python3.11/site-packages/transformers/utils/hub.py:124: FutureWarning: Using `TRANSFORMERS_CACHE` is deprecated and will be removed in v5 of Transformers. Use `HF_HOME` instead.
Shard 0: warnings.warn(
Shard 0: HAS_BITS_AND_BYTES=False, HAS_GPTQ_CUDA=True, EXLLAMA_VERSION=2, GPTQ_CUDA_TYPE=exllama
Shard 0: /opt/tgis/lib/python3.11/site-packages/transformers/utils/hub.py:124: FutureWarning: Using `TRANSFORMERS_CACHE` is deprecated and will be removed in v5 of Transformers. Use `HF_HOME` instead.
Shard 0: warnings.warn(
Shard 0: HAS_BITS_AND_BYTES=False, HAS_GPTQ_CUDA=True, EXLLAMA_VERSION=2, GPTQ_CUDA_TYPE=exllama
Shard 0: Using Flash Attention V2: True
Shard 0: WARNING: Using deployment engine tgis_native rather than hf_transformers because FLASH_ATTENTION is enabled
Shard 0: Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
2024-05-06T17:35:43.901669Z INFO text_generation_launcher: Waiting for shard 0 to be ready...
Shard 0: Prefix cache disabled, using all available memory
Shard 0: Baseline: 16060547072, Free memory: 6999703552
Shard 0: Validating the upper bound
2024-05-06T17:35:53.911076Z INFO text_generation_launcher: Waiting for shard 0 to be ready...
2024-05-06T17:36:03.919521Z INFO text_generation_launcher: Waiting for shard 0 to be ready...
2024-05-06T17:36:13.927673Z INFO text_generation_launcher: Waiting for shard 0 to be ready...
Shard 0: Looking for the linear part
2024-05-06T17:36:23.935499Z INFO text_generation_launcher: Waiting for shard 0 to be ready...
2024-05-06T17:36:33.943447Z INFO text_generation_launcher: Waiting for shard 0 to be ready...
Shard 0: >> fitted model:
Shard 0: >> free_memory: 6999703552
Shard 0: >> linear_fit_params: [519972.74918647]
Shard 0: >> quadratic_fit_params: [0.0, 0.0]
Shard 0: >> next_token_param: [263457.72946989 280219.21707222]
Shard 0: Using Paged Attention
Shard 0: WARNING: Using deployment engine tgis_native rather than hf_transformers because PAGED_ATTENTION is enabled
Shard 0: Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
2024-05-06T17:36:43.953251Z INFO text_generation_launcher: Waiting for shard 0 to be ready...
Shard 0: Loading speculator model from: /models/models--ibm-fms--llama3-8b-accelerator/snapshots/132ff564da081b9fd92735d9d27998dc24948093
Shard 0: Speculation will be enabled up to batch size 16
Loading checkpoint shards: 0%| | 0/2 [00:00<?, ?it/s]
2024-05-06T17:36:53.961553Z INFO text_generation_launcher: Waiting for shard 0 to be ready...
2024-05-06T17:37:03.969460Z INFO text_generation_launcher: Waiting for shard 0 to be ready...
2024-05-06T17:37:13.977366Z INFO text_generation_launcher: Waiting for shard 0 to be ready...
2024-05-06T17:37:23.985230Z INFO text_generation_launcher: Waiting for shard 0 to be ready...
But if I just remove the -e SPECULATOR_NAME=ibm-fms/llama3-8b-accelerator line, it starts and I can get good responses.
This is with an AWS g5.12xlarge instance (4 x A10G GPUs). It seems to load on one device only (1 shard). Does it require more memory than that to get the accelerator loaded?
I also observe this issue when loading on a single A10G. If I set -e NUM_SHARD=2 (or 4), it loads successfully. However, at inference time I observe the following run-time error:
Shard 1: File "/opt/tgis/lib/python3.11/site-packages/fms_extras/utils/cache/paged.py", line 340, in store
Shard 1: key_to_cache = keys.view(-1, self.kv_heads, self.head_size)
Shard 1: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
Shard 1: RuntimeError: view size is not compatible with input tensor's size and stride (at least one dimension spans across two contiguous subspaces). Use .reshape(...) instead.
Does this model not support sharding, and does it require more than 24GB of VRAM?
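For reference, the configuration that loads but then hits the error above is essentially the model-card command from the top of this issue with the shard count added (a sketch; all other flags are unchanged):
docker run -d --rm --gpus all \
--name my-tgis-server \
-p 8033:8033 \
-v $HF_HUB_CACHE:/models \
-e HF_HUB_CACHE=/models \
-e TRANSFORMERS_CACHE=/models \
-e MODEL_NAME=meta-llama/Meta-Llama-3-8B-Instruct \
-e SPECULATOR_NAME=ibm-fms/llama3-8b-accelerator \
-e FLASH_ATTENTION=true \
-e PAGED_ATTENTION=true \
-e DTYPE=float16 \
-e NUM_SHARD=2 \
$TGIS_IMAGE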
We are currently working on TP (tensor parallel) enablement for this model, so we will update you when that is available. We are also trying to reproduce the loading issue you hit, and will keep you updated in this thread.
I've reproduced the above loading issue on an L4 machine; I will keep this thread updated.
Is an A100 the basic requirement then, in your own experience?
All of our testing has been on A100, though we have also tried L40 and that works too. I don't believe an A100 is a hard requirement, but we are now looking into the issue on smaller GPUs as well.
I believe we may be hitting the GPU memory limit here for the Llama 3 8B + speculator model (tested this out on an L4 machine). We are looking into this further to see if anything else can be done.
Also, we are in the process of adding TP support, which should solve this issue. We will keep you updated when it is available.
@mhill4980
We have pushed out a new image (quay.io/wxpe/text-gen-server:main.ddc56ee) which enables TP support. We are still investigating an issue with the number of blocks that are created automatically, so you can try adjusting the number of starting blocks by setting KV_CACHE_MANAGER_NUM_GPU_BLOCKS=300 (adjust this number based on the available space). To enable TP support, set NUM_GPUS=2 and NUM_SHARD=2 (depending on the number of GPUs you want to use).
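For example, something along these lines should work on the g5.12xlarge (a sketch only: it reuses the cache mounts and flags from the original command at the top of this issue; adjust NUM_GPUS/NUM_SHARD and the block count for your setup):
docker run -d --rm --gpus all \
--name my-tgis-server \
-p 8033:8033 \
-v $HF_HUB_CACHE:/models \
-e HF_HUB_CACHE=/models \
-e TRANSFORMERS_CACHE=/models \
-e MODEL_NAME=meta-llama/Meta-Llama-3-8B-Instruct \
-e SPECULATOR_NAME=ibm-fms/llama3-8b-accelerator \
-e FLASH_ATTENTION=true \
-e PAGED_ATTENTION=true \
-e DTYPE=float16 \
-e NUM_GPUS=2 \
-e NUM_SHARD=2 \
-e KV_CACHE_MANAGER_NUM_GPU_BLOCKS=300 \
quay.io/wxpe/text-gen-server:main.ddc56ee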