Deployment framework
#2 opened by xro7
What framework did you use to deploy the model? I tried vLLM on 8xH100 but got the following error.
```
2025-01-22T13:22:49.476492425Z (VllmWorkerProcess pid=362) INFO 01-22 05:22:49 model_runner_base.py:120] Writing input of failed execution to /tmp/err_execute_model_input_20250122-052249.pkl...
2025-01-22T13:22:49.477126901Z (VllmWorkerProcess pid=363) INFO 01-22 05:22:49 model_runner_base.py:120] Writing input of failed execution to /tmp/err_execute_model_input_20250122-052249.pkl...
2025-01-22T13:22:49.477129206Z (VllmWorkerProcess pid=361) INFO 01-22 05:22:49 model_runner_base.py:120] Writing input of failed execution to /tmp/err_execute_model_input_20250122-052249.pkl...
```
Can you provide the full log and your startup command?
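When pasting a long log, it can help to strip the terminal color codes and list the ERROR lines first. This is a rough sketch, not something vLLM ships: `clean_log` and `error_lines` are made-up helper names, and the sample lines are abbreviated from the kind of output vLLM workers print.

```python
import re

# Matches SGR color codes such as ESC[1;36m. The ESC byte is optional
# because copy-paste often drops it and leaves a bare "[1;36m".
# Assumption: the log has no legitimate text of the form "[<digits;>m".
ANSI_COLOR = re.compile(r"\x1b?\[[0-9;]*m")


def clean_log(text):
    """Return the log text with color codes removed."""
    return ANSI_COLOR.sub("", text)


def error_lines(text):
    """Return only the cleaned lines logged at ERROR level."""
    return [line for line in clean_log(text).splitlines() if " ERROR " in line]


# Abbreviated sample: one line with the ESC byte lost, one with it intact.
sample = (
    "[1;36m(VllmWorkerProcess pid=362)[0;0m INFO 01-22 05:22:49 "
    "model_runner_base.py:120] Writing input of failed execution\n"
    "\x1b[1;36m(VllmWorkerProcess pid=362)\x1b[0;0m ERROR 01-22 05:21:25 "
    "multiproc_worker_utils.py:236] Traceback (most recent call last):\n"
)

print(clean_log(sample))
print(error_lines(sample))
```

In practice you would read the saved log file into a string and pass it to the same two functions.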
I only kept the logs from my 4xH200 experiment, but I got the same error on 8xH100.
vLLM parameters: --host 0.0.0.0 --port 8000 --model cognitivecomputations/DeepSeek-R1-AWQ --gpu-memory-utilization 0.95 --tensor-parallel-size=4 --trust_remote_code
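For anyone trying to reproduce this, the flags above can be assembled into a full command along the lines of the sketch below. The entry point is an assumption (the log mentions `api_server.py`, but the post does not say how the server was started), and `build_vllm_command` is a hypothetical helper. The args dump in the log also reports `max_model_len=30000` and `enforce_eager=True`, which are not in the list above, so they are passed only as optional extras.

```python
import shlex

# Assumption: the OpenAI-compatible server module is the entry point
# (the log mentions api_server.py); the post does not state it.
ENTRYPOINT = ["python", "-m", "vllm.entrypoints.openai.api_server"]

MODEL = "cognitivecomputations/DeepSeek-R1-AWQ"


def build_vllm_command(tensor_parallel_size, extra_flags=()):
    """Return the argv list for one launch (hypothetical helper)."""
    flags = [
        "--host", "0.0.0.0",
        "--port", "8000",
        "--model", MODEL,
        "--gpu-memory-utilization", "0.95",
        "--tensor-parallel-size", str(tensor_parallel_size),
        "--trust_remote_code",
    ]
    return ENTRYPOINT + flags + list(extra_flags)


# 4xH200 run, flags exactly as listed in the post.
cmd_4gpu = build_vllm_command(4)

# Same run with the two extra settings seen in the args dump; passing
# them explicitly is an assumption about how the server was started.
cmd_4gpu_full = build_vllm_command(4, ["--max-model-len", "30000", "--enforce-eager"])

print(shlex.join(cmd_4gpu))
print(shlex.join(cmd_4gpu_full))
```

Building the command as a list and printing it with `shlex.join` keeps the quoting unambiguous when the command is copied into a bug report.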
Logs:
2025-01-22T13:07:50.421133598Z INFO 01-22 05:07:50 api_server.py:712] vLLM API server version 0.6.6.post1
2025-01-22T13:07:50.421303357Z INFO 01-22 05:07:50 api_server.py:713] args: Namespace(host='0.0.0.0', port=8000, uvicorn_log_level='info', allow_credentials=False, allowed_origins=['*'], allowed_methods=['*'], allowed_headers=['*'], api_key='', lora_modules=None, prompt_adapters=None, chat_template=None, chat_template_content_format='auto', response_role='assistant', ssl_keyfile=None, ssl_certfile=None, ssl_ca_certs=None, ssl_cert_reqs=0, root_path=None, middleware=[], return_tokens_as_token_ids=False, disable_frontend_multiprocessing=False, enable_request_id_headers=False, enable_auto_tool_choice=False, tool_call_parser=None, tool_parser_plugin='', model='cognitivecomputations/DeepSeek-R1-AWQ', task='auto', tokenizer=None, skip_tokenizer_init=False, revision=None, code_revision=None, tokenizer_revision=None, tokenizer_mode='auto', trust_remote_code=True, allowed_local_media_path=None, download_dir=None, load_format='auto', config_format=<ConfigFormat.AUTO: 'auto'>, dtype='auto', kv_cache_dtype='auto', quantization_param_path=None, max_model_len=30000, guided_decoding_backend='xgrammar', logits_processor_pattern=None, distributed_executor_backend=None, worker_use_ray=False, pipeline_parallel_size=1, tensor_parallel_size=4, max_parallel_loading_workers=None, ray_workers_use_nsight=False, block_size=None, enable_prefix_caching=None, disable_sliding_window=False, use_v2_block_manager=True, num_lookahead_slots=0, seed=0, swap_space=4, cpu_offload_gb=0, gpu_memory_utilization=0.95, num_gpu_blocks_override=None, max_num_batched_tokens=None, max_num_seqs=None, max_logprobs=20, disable_log_stats=False, quantization=None, rope_scaling=None, rope_theta=None, hf_overrides=None, enforce_eager=True, max_seq_len_to_capture=8192, disable_custom_all_reduce=False, tokenizer_pool_size=0, tokenizer_pool_type='ray', tokenizer_pool_extra_config=None, limit_mm_per_prompt=None, mm_processor_kwargs=None, disable_mm_preprocessor_cache=False, enable_lora=False, enable_lora_bias=False, 
max_loras=1, max_lora_rank=16, lora_extra_vocab_size=256, lora_dtype='auto', long_lora_scaling_factors=None, max_cpu_loras=None, fully_sharded_loras=False, enable_prompt_adapter=False, max_prompt_adapters=1, max_prompt_adapter_token=0, device='auto', num_scheduler_steps=1, multi_step_stream_outputs=True, scheduler_delay_factor=0.0, enable_chunked_prefill=None, speculative_model=None, speculative_model_quantization=None, num_speculative_tokens=None, speculative_disable_mqa_scorer=False, speculative_draft_tensor_parallel_size=None, speculative_max_model_len=None, speculative_disable_by_batch_size=None, ngram_prompt_lookup_max=None, ngram_prompt_lookup_min=None, spec_decoding_acceptance_method='rejection_sampler', typical_acceptance_sampler_posterior_threshold=None, typical_acceptance_sampler_posterior_alpha=None, disable_logprobs_during_spec_decoding=None, model_loader_extra_config=None, ignore_patterns=[], preemption_mode=None, served_model_name=None, qlora_adapter_name_or_path=None, otlp_traces_endpoint=None, collect_detailed_traces=None, disable_async_output_proc=False, scheduling_policy='fcfs', override_neuron_config=None, override_pooler_config=None, compilation_config=None, kv_transfer_config=None, worker_cls='auto', generation_config=None, disable_log_requests=False, max_log_len=None, disable_fastapi_docs=False, enable_prompt_tokens_details=False)
2025-01-22T13:07:50.430906961Z INFO 01-22 05:07:50 api_server.py:199] Started engine process with PID 89
2025-01-22T13:07:50.643475046Z INFO 01-22 05:07:50 config.py:131] Replacing legacy 'type' key with 'rope_type'
2025-01-22T13:07:53.969666636Z INFO 01-22 05:07:53 config.py:131] Replacing legacy 'type' key with 'rope_type'
2025-01-22T13:07:55.208259634Z INFO 01-22 05:07:55 config.py:510] This model supports multiple tasks: {'score', 'generate', 'reward', 'classify', 'embed'}. Defaulting to 'generate'.
2025-01-22T13:07:55.844302051Z INFO 01-22 05:07:55 awq_marlin.py:109] The model is convertible to awq_marlin during runtime. Using awq_marlin kernel.
2025-01-22T13:07:55.888077160Z INFO 01-22 05:07:55 config.py:1310] Defaulting to use mp for distributed inference
2025-01-22T13:07:55.888171171Z WARNING 01-22 05:07:55 cuda.py:98] To see benefits of async output processing, enable CUDA graph. Since, enforce-eager is enabled, async output processor cannot be used
2025-01-22T13:07:55.888191894Z WARNING 01-22 05:07:55 config.py:642] Async output processing is not supported on the current platform type cuda.
2025-01-22T13:07:58.487429442Z INFO 01-22 05:07:58 config.py:510] This model supports multiple tasks: {'embed', 'reward', 'classify', 'generate', 'score'}. Defaulting to 'generate'.
2025-01-22T13:07:59.106749422Z INFO 01-22 05:07:59 awq_marlin.py:109] The model is convertible to awq_marlin during runtime. Using awq_marlin kernel.
2025-01-22T13:07:59.150778826Z INFO 01-22 05:07:59 config.py:1310] Defaulting to use mp for distributed inference
2025-01-22T13:07:59.150878529Z WARNING 01-22 05:07:59 cuda.py:98] To see benefits of async output processing, enable CUDA graph. Since, enforce-eager is enabled, async output processor cannot be used
2025-01-22T13:07:59.150900534Z WARNING 01-22 05:07:59 config.py:642] Async output processing is not supported on the current platform type cuda.
2025-01-22T13:07:59.173686852Z INFO 01-22 05:07:59 llm_engine.py:234] Initializing an LLM engine (v0.6.6.post1) with config: model='cognitivecomputations/DeepSeek-R1-AWQ', speculative_config=None, tokenizer='cognitivecomputations/DeepSeek-R1-AWQ', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, override_neuron_config=None, tokenizer_revision=None, trust_remote_code=True, dtype=torch.bfloat16, max_seq_len=30000, download_dir=None, load_format=LoadFormat.AUTO, tensor_parallel_size=4, pipeline_parallel_size=1, disable_custom_all_reduce=False, quantization=awq_marlin, enforce_eager=True, kv_cache_dtype=auto, quantization_param_path=None, device_config=cuda, decoding_config=DecodingConfig(guided_decoding_backend='xgrammar'), observability_config=ObservabilityConfig(otlp_traces_endpoint=None, collect_model_forward_time=False, collect_model_execute_time=False), seed=0, served_model_name=cognitivecomputations/DeepSeek-R1-AWQ, num_scheduler_steps=1, multi_step_stream_outputs=True, enable_prefix_caching=False, chunked_prefill_enabled=False, use_async_output_proc=False, disable_mm_preprocessor_cache=False, mm_processor_kwargs=None, pooler_config=None, compilation_config={"splitting_ops":["vllm.unified_attention","vllm.unified_attention_with_output"],"candidate_compile_sizes":[],"compile_sizes":[],"capture_sizes":[],"max_capture_size":0}, use_cached_outputs=True,
2025-01-22T13:07:59.578249195Z WARNING 01-22 05:07:59 multiproc_worker_utils.py:312] Reducing Torch parallelism from 96 threads to 1 to avoid unnecessary CPU contention. Set OMP_NUM_THREADS in the external environment to tune this value as needed.
2025-01-22T13:07:59.583556350Z INFO 01-22 05:07:59 custom_cache_manager.py:17] Setting Triton cache manager to: vllm.triton_utils.custom_cache_manager:CustomCacheManager
2025-01-22T13:07:59.644588714Z INFO 01-22 05:07:59 selector.py:120] Using Flash Attention backend.
2025-01-22T13:07:59.700087505Z (VllmWorkerProcess pid=361) INFO 01-22 05:07:59 selector.py:120] Using Flash Attention backend.
2025-01-22T13:07:59.700196810Z (VllmWorkerProcess pid=361) INFO 01-22 05:07:59 multiproc_worker_utils.py:222] Worker ready; awaiting tasks
2025-01-22T13:07:59.719623814Z (VllmWorkerProcess pid=363) INFO 01-22 05:07:59 selector.py:120] Using Flash Attention backend.
2025-01-22T13:07:59.719626424Z (VllmWorkerProcess pid=362) INFO 01-22 05:07:59 selector.py:120] Using Flash Attention backend.
2025-01-22T13:07:59.719717955Z (VllmWorkerProcess pid=362) INFO 01-22 05:07:59 multiproc_worker_utils.py:222] Worker ready; awaiting tasks
2025-01-22T13:07:59.719719661Z (VllmWorkerProcess pid=363) INFO 01-22 05:07:59 multiproc_worker_utils.py:222] Worker ready; awaiting tasks
2025-01-22T13:08:03.041911052Z (VllmWorkerProcess pid=361) INFO 01-22 05:08:03 utils.py:918] Found nccl from library libnccl.so.2
2025-01-22T13:08:03.041943685Z INFO 01-22 05:08:03 utils.py:918] Found nccl from library libnccl.so.2
2025-01-22T13:08:03.042058901Z (VllmWorkerProcess pid=361) INFO 01-22 05:08:03 pynccl.py:69] vLLM is using nccl==2.21.5
2025-01-22T13:08:03.042067625Z INFO 01-22 05:08:03 pynccl.py:69] vLLM is using nccl==2.21.5
2025-01-22T13:08:03.042084177Z (VllmWorkerProcess pid=363) INFO 01-22 05:08:03 utils.py:918] Found nccl from library libnccl.so.2
2025-01-22T13:08:03.042089699Z (VllmWorkerProcess pid=362) INFO 01-22 05:08:03 utils.py:918] Found nccl from library libnccl.so.2
2025-01-22T13:08:03.042269576Z (VllmWorkerProcess pid=363) INFO 01-22 05:08:03 pynccl.py:69] vLLM is using nccl==2.21.5
2025-01-22T13:08:03.042297664Z (VllmWorkerProcess pid=362) INFO 01-22 05:08:03 pynccl.py:69] vLLM is using nccl==2.21.5
2025-01-22T13:08:04.762844790Z INFO 01-22 05:08:04 custom_all_reduce_utils.py:204] generating GPU P2P access cache in /root/.cache/vllm/gpu_p2p_access_cache_for_0,1,2,3.json
2025-01-22T13:08:19.714251803Z INFO 01-22 05:08:19 custom_all_reduce_utils.py:242] reading GPU P2P access cache from /root/.cache/vllm/gpu_p2p_access_cache_for_0,1,2,3.json
2025-01-22T13:08:19.714368438Z (VllmWorkerProcess pid=361) INFO 01-22 05:08:19 custom_all_reduce_utils.py:242] reading GPU P2P access cache from /root/.cache/vllm/gpu_p2p_access_cache_for_0,1,2,3.json
2025-01-22T13:08:19.714371653Z (VllmWorkerProcess pid=362) INFO 01-22 05:08:19 custom_all_reduce_utils.py:242] reading GPU P2P access cache from /root/.cache/vllm/gpu_p2p_access_cache_for_0,1,2,3.json
2025-01-22T13:08:19.714609456Z (VllmWorkerProcess pid=363) INFO 01-22 05:08:19 custom_all_reduce_utils.py:242] reading GPU P2P access cache from /root/.cache/vllm/gpu_p2p_access_cache_for_0,1,2,3.json
2025-01-22T13:08:19.747454967Z INFO 01-22 05:08:19 shm_broadcast.py:255] vLLM message queue communication handle: Handle(connect_ip='127.0.0.1', local_reader_ranks=[1, 2, 3], buffer_handle=(3, 4194304, 6, 'psm_ed9c7126'), local_subscribe_port=53933, remote_subscribe_port=None)
2025-01-22T13:08:19.783713362Z INFO 01-22 05:08:19 model_runner.py:1094] Starting to load model cognitivecomputations/DeepSeek-R1-AWQ...
2025-01-22T13:08:19.783863117Z (VllmWorkerProcess pid=361) INFO 01-22 05:08:19 model_runner.py:1094] Starting to load model cognitivecomputations/DeepSeek-R1-AWQ...
2025-01-22T13:08:19.784442981Z (VllmWorkerProcess pid=362) INFO 01-22 05:08:19 model_runner.py:1094] Starting to load model cognitivecomputations/DeepSeek-R1-AWQ...
2025-01-22T13:08:19.784445640Z (VllmWorkerProcess pid=363) INFO 01-22 05:08:19 model_runner.py:1094] Starting to load model cognitivecomputations/DeepSeek-R1-AWQ...
2025-01-22T13:08:20.194644565Z Cache shape torch.Size([163840, 64])
2025-01-22T13:08:20.194662173Z INFO 01-22 05:08:20 weight_utils.py:251] Using model weights format ['*.safetensors']
2025-01-22T13:08:20.234273784Z (VllmWorkerProcess pid=361) Cache shape torch.Size([163840, 64])
2025-01-22T13:08:20.234294554Z (VllmWorkerProcess pid=361) INFO 01-22 05:08:20 weight_utils.py:251] Using model weights format ['*.safetensors']
2025-01-22T13:08:20.234579652Z (VllmWorkerProcess pid=362) Cache shape torch.Size([163840, 64])
2025-01-22T13:08:20.234583739Z (VllmWorkerProcess pid=362) INFO 01-22 05:08:20 weight_utils.py:251] Using model weights format ['*.safetensors']
2025-01-22T13:08:20.243174528Z (VllmWorkerProcess pid=363) Cache shape torch.Size([163840, 64])
2025-01-22T13:08:20.243179760Z (VllmWorkerProcess pid=363) INFO 01-22 05:08:20 weight_utils.py:251] Using model weights format ['*.safetensors']
2025-01-22T13:20:31.182095071Z
Loading safetensors checkpoint shards: 0% Completed | 0/74 [00:00<?, ?it/s]
...
2025-01-22T13:21:05.153431782Z
Loading safetensors checkpoint shards: 100% Completed | 74/74 [00:33<00:00, 2.18it/s]
2025-01-22T13:21:22.200528061Z (VllmWorkerProcess pid=361) INFO 01-22 05:21:22 model_runner.py:1099] Loading model weights took 85.5053 GB
2025-01-22T13:21:23.235285583Z (VllmWorkerProcess pid=362) INFO 01-22 05:21:23 model_runner.py:1099] Loading model weights took 85.5053 GB
2025-01-22T13:21:23.662143488Z (VllmWorkerProcess pid=363) INFO 01-22 05:21:23 model_runner.py:1099] Loading model weights took 85.5053 GB
2025-01-22T13:21:24.130898012Z INFO 01-22 05:21:24 model_runner.py:1099] Loading model weights took 85.5053 GB
2025-01-22T13:21:25.970200306Z (VllmWorkerProcess pid=362) INFO 01-22 05:21:25 model_runner_base.py:120] Writing input of failed execution to /tmp/err_execute_model_input_20250122-052125.pkl...
2025-01-22T13:21:25.970911363Z INFO 01-22 05:21:25 model_runner_base.py:120] Writing input of failed execution to /tmp/err_execute_model_input_20250122-052125.pkl...
2025-01-22T13:21:25.973483812Z (VllmWorkerProcess pid=363) INFO 01-22 05:21:25 model_runner_base.py:120] Writing input of failed execution to /tmp/err_execute_model_input_20250122-052125.pkl...
2025-01-22T13:21:25.973539062Z (VllmWorkerProcess pid=361) INFO 01-22 05:21:25 model_runner_base.py:120] Writing input of failed execution to /tmp/err_execute_model_input_20250122-052125.pkl...
2025-01-22T13:21:25.978851389Z (VllmWorkerProcess pid=362) INFO 01-22 05:21:25 model_runner_base.py:149] Completed writing input of failed execution to /tmp/err_execute_model_input_20250122-052125.pkl.
2025-01-22T13:21:25.979807043Z INFO 01-22 05:21:25 model_runner_base.py:149] Completed writing input of failed execution to /tmp/err_execute_model_input_20250122-052125.pkl.
2025-01-22T13:21:25.981052641Z (VllmWorkerProcess pid=362) ERROR 01-22 05:21:25 multiproc_worker_utils.py:236] Exception in worker VllmWorkerProcess while processing method determine_num_available_blocks.
2025-01-22T13:21:25.981054883Z (VllmWorkerProcess pid=362) ERROR 01-22 05:21:25 multiproc_worker_utils.py:236] Traceback (most recent call last):
2025-01-22T13:21:25.981057016Z (VllmWorkerProcess pid=362) ERROR 01-22 05:21:25 multiproc_worker_utils.py:236] File "/usr/local/lib/python3.12/dist-packages/vllm/worker/model_runner_base.py", line 116, in _wrapper
2025-01-22T13:21:25.981058711Z (VllmWorkerProcess pid=362) ERROR 01-22 05:21:25 multiproc_worker_utils.py:236] return func(*args, **kwargs)
2025-01-22T13:21:25.981060648Z (VllmWorkerProcess pid=362) ERROR 01-22 05:21:25 multiproc_worker_utils.py:236] ^^^^^^^^^^^^^^^^^^^^^
2025-01-22T13:21:25.981061614Z (VllmWorkerProcess pid=362) ERROR 01-22 05:21:25 multiproc_worker_utils.py:236] File "/usr/local/lib/python3.12/dist-packages/vllm/worker/model_runner.py", line 1691, in execute_model
2025-01-22T13:21:25.981062571Z (VllmWorkerProcess pid=362) ERROR 01-22 05:21:25 multiproc_worker_utils.py:236] hidden_or_intermediate_states = model_executable(
2025-01-22T13:21:25.981064343Z (VllmWorkerProcess pid=362) ERROR 01-22 05:21:25 multiproc_worker_utils.py:236] ^^^^^^^^^^^^^^^^^
2025-01-22T13:21:25.981067242Z (VllmWorkerProcess pid=362) ERROR 01-22 05:21:25 multiproc_worker_utils.py:236] File "/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py", line 1736, in _wrapped_call_impl
2025-01-22T13:21:25.981068486Z (VllmWorkerProcess pid=362) ERROR 01-22 05:21:25 multiproc_worker_utils.py:236] return self._call_impl(*args, **kwargs)
2025-01-22T13:21:25.981069736Z (VllmWorkerProcess pid=362) ERROR 01-22 05:21:25 multiproc_worker_utils.py:236] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
2025-01-22T13:21:25.981070984Z (VllmWorkerProcess pid=362) ERROR 01-22 05:21:25 multiproc_worker_utils.py:236] File "/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py", line 1747, in _call_impl
2025-01-22T13:21:25.981072436Z (VllmWorkerProcess pid=362) ERROR 01-22 05:21:25 multiproc_worker_utils.py:236] return forward_call(*args, **kwargs)
2025-01-22T13:21:25.981080706Z (VllmWorkerProcess pid=362) ERROR 01-22 05:21:25 multiproc_worker_utils.py:236] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
2025-01-22T13:21:25.981082054Z (VllmWorkerProcess pid=362) ERROR 01-22 05:21:25 multiproc_worker_utils.py:236] File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/models/deepseek_v3.py", line 527, in forward
2025-01-22T13:21:25.981083090Z (VllmWorkerProcess pid=362) ERROR 01-22 05:21:25 multiproc_worker_utils.py:236] hidden_states = self.model(input_ids, positions, kv_caches,
2025-01-22T13:21:25.981084195Z (VllmWorkerProcess pid=362) ERROR 01-22 05:21:25 multiproc_worker_utils.py:236] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
2025-01-22T13:21:25.981085485Z (VllmWorkerProcess pid=362) ERROR 01-22 05:21:25 multiproc_worker_utils.py:236] File "/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py", line 1736, in _wrapped_call_impl
2025-01-22T13:21:25.981086427Z (VllmWorkerProcess pid=362) ERROR 01-22 05:21:25 multiproc_worker_utils.py:236] return self._call_impl(*args, **kwargs)
2025-01-22T13:21:25.981087552Z (VllmWorkerProcess pid=362) ERROR 01-22 05:21:25 multiproc_worker_utils.py:236] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
2025-01-22T13:21:25.981088523Z (VllmWorkerProcess pid=362) ERROR 01-22 05:21:25 multiproc_worker_utils.py:236] File "/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py", line 1747, in _call_impl
2025-01-22T13:21:25.981089556Z (VllmWorkerProcess pid=362) ERROR 01-22 05:21:25 multiproc_worker_utils.py:236] return forward_call(*args, **kwargs)
2025-01-22T13:21:25.981090564Z (VllmWorkerProcess pid=362) ERROR 01-22 05:21:25 multiproc_worker_utils.py:236] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
2025-01-22T13:21:25.981091589Z (VllmWorkerProcess pid=362) ERROR 01-22 05:21:25 multiproc_worker_utils.py:236] File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/models/deepseek_v3.py", line 483, in forward
2025-01-22T13:21:25.981092501Z (VllmWorkerProcess pid=362) ERROR 01-22 05:21:25 multiproc_worker_utils.py:236] hidden_states, residual = layer(positions, hidden_states,
2025-01-22T13:21:25.981093914Z (VllmWorkerProcess pid=362) ERROR 01-22 05:21:25 multiproc_worker_utils.py:236] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
2025-01-22T13:21:25.981094910Z (VllmWorkerProcess pid=362) ERROR 01-22 05:21:25 multiproc_worker_utils.py:236] File "/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py", line 1736, in _wrapped_call_impl
2025-01-22T13:21:25.981096149Z (VllmWorkerProcess pid=362) ERROR 01-22 05:21:25 multiproc_worker_utils.py:236] return self._call_impl(*args, **kwargs)
2025-01-22T13:21:25.981097296Z (VllmWorkerProcess pid=362) ERROR 01-22 05:21:25 multiproc_worker_utils.py:236] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
2025-01-22T13:21:25.981098229Z (VllmWorkerProcess pid=362) ERROR 01-22 05:21:25 multiproc_worker_utils.py:236] File "/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py", line 1747, in _call_impl
2025-01-22T13:21:25.981099158Z (VllmWorkerProcess pid=362) ERROR 01-22 05:21:25 multiproc_worker_utils.py:236] return forward_call(*args, **kwargs)
2025-01-22T13:21:25.981100096Z (VllmWorkerProcess pid=362) ERROR 01-22 05:21:25 multiproc_worker_utils.py:236] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
2025-01-22T13:21:25.981101044Z (VllmWorkerProcess pid=362) ERROR 01-22 05:21:25 multiproc_worker_utils.py:236] File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/models/deepseek_v3.py", line 412, in forward
2025-01-22T13:21:25.981101969Z (VllmWorkerProcess pid=362) ERROR 01-22 05:21:25 multiproc_worker_utils.py:236] hidden_states = self.mlp(hidden_states)
2025-01-22T13:21:25.981104922Z (VllmWorkerProcess pid=362) ERROR 01-22 05:21:25 multiproc_worker_utils.py:236] ^^^^^^^^^^^^^^^^^^^^^^^
2025-01-22T13:21:25.981105903Z (VllmWorkerProcess pid=362) ERROR 01-22 05:21:25 multiproc_worker_utils.py:236] File "/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py", line 1736, in _wrapped_call_impl
2025-01-22T13:21:25.981106883Z (VllmWorkerProcess pid=362) ERROR 01-22 05:21:25 multiproc_worker_utils.py:236] return self._call_impl(*args, **kwargs)
2025-01-22T13:21:25.981107795Z (VllmWorkerProcess pid=362) ERROR 01-22 05:21:25 multiproc_worker_utils.py:236] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
2025-01-22T13:21:25.981108916Z (VllmWorkerProcess pid=362) ERROR 01-22 05:21:25 multiproc_worker_utils.py:236] File "/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py", line 1747, in _call_impl
2025-01-22T13:21:25.981109861Z (VllmWorkerProcess pid=362) ERROR 01-22 05:21:25 multiproc_worker_utils.py:236] return forward_call(*args, **kwargs)
2025-01-22T13:21:25.981110821Z (VllmWorkerProcess pid=362) ERROR 01-22 05:21:25 multiproc_worker_utils.py:236] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
2025-01-22T13:21:25.981111777Z (VllmWorkerProcess pid=362) ERROR 01-22 05:21:25 multiproc_worker_utils.py:236] File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/models/deepseek_v3.py", line 158, in forward
2025-01-22T13:21:25.981112700Z (VllmWorkerProcess pid=362) ERROR 01-22 05:21:25 multiproc_worker_utils.py:236] final_hidden_states = self.experts(
2025-01-22T13:21:25.981113637Z (VllmWorkerProcess pid=362) ERROR 01-22 05:21:25 multiproc_worker_utils.py:236] ^^^^^^^^^^^^^
2025-01-22T13:21:25.981114804Z (VllmWorkerProcess pid=362) ERROR 01-22 05:21:25 multiproc_worker_utils.py:236] File "/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py", line 1736, in _wrapped_call_impl
2025-01-22T13:21:25.981115921Z (VllmWorkerProcess pid=362) ERROR 01-22 05:21:25 multiproc_worker_utils.py:236] return self._call_impl(*args, **kwargs)
2025-01-22T13:21:25.981117012Z (VllmWorkerProcess pid=362) ERROR 01-22 05:21:25 multiproc_worker_utils.py:236] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
2025-01-22T13:21:25.981118081Z (VllmWorkerProcess pid=362) ERROR 01-22 05:21:25 multiproc_worker_utils.py:236] File "/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py", line 1747, in _call_impl
2025-01-22T13:21:25.981119209Z (VllmWorkerProcess pid=362) ERROR 01-22 05:21:25 multiproc_worker_utils.py:236] return forward_call(*args, **kwargs)
2025-01-22T13:21:25.981120307Z (VllmWorkerProcess pid=362) ERROR 01-22 05:21:25 multiproc_worker_utils.py:236] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
2025-01-22T13:21:25.981121683Z (VllmWorkerProcess pid=362) ERROR 01-22 05:21:25 multiproc_worker_utils.py:236] File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/layers/fused_moe/layer.py", line 522, in forward
2025-01-22T13:21:25.981123129Z (VllmWorkerProcess pid=362) ERROR 01-22 05:21:25 multiproc_worker_utils.py:236] final_hidden_states = self.quant_method.apply(
2025-01-22T13:21:25.981124109Z (VllmWorkerProcess pid=362) ERROR 01-22 05:21:25 multiproc_worker_utils.py:236] ^^^^^^^^^^^^^^^^^^^^^^^^
2025-01-22T13:21:25.981125040Z (VllmWorkerProcess pid=362) ERROR 01-22 05:21:25 multiproc_worker_utils.py:236] File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/layers/quantization/awq_marlin.py", line 463, in apply
2025-01-22T13:21:25.981126118Z (VllmWorkerProcess pid=362) ERROR 01-22 05:21:25 multiproc_worker_utils.py:236] return torch.ops.vllm.fused_marlin_moe(
2025-01-22T13:21:25.981127065Z (VllmWorkerProcess pid=362) ERROR 01-22 05:21:25 multiproc_worker_utils.py:236] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
2025-01-22T13:21:25.981129560Z (VllmWorkerProcess pid=362) ERROR 01-22 05:21:25 multiproc_worker_utils.py:236] File "/usr/local/lib/python3.12/dist-packages/torch/_ops.py", line 1116, in __call__
2025-01-22T13:21:25.981130829Z (VllmWorkerProcess pid=362) ERROR 01-22 05:21:25 multiproc_worker_utils.py:236] return self._op(*args, **(kwargs or {}))
2025-01-22T13:21:25.981132369Z [1;36m(VllmWorkerProcess pid=362)[0;0m ERROR 01-22 05:21:25 multiproc_worker_utils.py:236] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
2025-01-22T13:21:25.981133322Z [1;36m(VllmWorkerProcess pid=362)[0;0m ERROR 01-22 05:21:25 multiproc_worker_utils.py:236] File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/layers/fused_moe/fused_marlin_moe.py", line 202, in fused_marlin_moe
2025-01-22T13:21:25.981134990Z [1;36m(VllmWorkerProcess pid=362)[0;0m ERROR 01-22 05:21:25 multiproc_worker_utils.py:236] assert hidden_states.dtype == torch.float16
2025-01-22T13:21:25.981135915Z [1;36m(VllmWorkerProcess pid=362)[0;0m ERROR 01-22 05:21:25 multiproc_worker_utils.py:236] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
2025-01-22T13:21:25.981136986Z [1;36m(VllmWorkerProcess pid=362)[0;0m ERROR 01-22 05:21:25 multiproc_worker_utils.py:236] AssertionError
2025-01-22T13:21:25.981138483Z [1;36m(VllmWorkerProcess pid=362)[0;0m ERROR 01-22 05:21:25 multiproc_worker_utils.py:236]
2025-01-22T13:21:25.981140056Z [1;36m(VllmWorkerProcess pid=362)[0;0m ERROR 01-22 05:21:25 multiproc_worker_utils.py:236] The above exception was the direct cause of the following exception:
2025-01-22T13:21:25.981141343Z [1;36m(VllmWorkerProcess pid=362)[0;0m ERROR 01-22 05:21:25 multiproc_worker_utils.py:236]
2025-01-22T13:21:25.981142564Z [1;36m(VllmWorkerProcess pid=362)[0;0m ERROR 01-22 05:21:25 multiproc_worker_utils.py:236] Traceback (most recent call last):
2025-01-22T13:21:25.981143687Z [1;36m(VllmWorkerProcess pid=362)[0;0m ERROR 01-22 05:21:25 multiproc_worker_utils.py:236] File "/usr/local/lib/python3.12/dist-packages/vllm/executor/multiproc_worker_utils.py", line 230, in _run_worker_process
2025-01-22T13:21:25.981144781Z [1;36m(VllmWorkerProcess pid=362)[0;0m ERROR 01-22 05:21:25 multiproc_worker_utils.py:236] output = executor(*args, **kwargs)
2025-01-22T13:21:25.981145931Z [1;36m(VllmWorkerProcess pid=362)[0;0m ERROR 01-22 05:21:25 multiproc_worker_utils.py:236] ^^^^^^^^^^^^^^^^^^^^^^^^^
2025-01-22T13:21:25.981147028Z [1;36m(VllmWorkerProcess pid=362)[0;0m ERROR 01-22 05:21:25 multiproc_worker_utils.py:236] File "/usr/local/lib/python3.12/dist-packages/torch/utils/_contextlib.py", line 116, in decorate_context
2025-01-22T13:21:25.981148165Z [1;36m(VllmWorkerProcess pid=362)[0;0m ERROR 01-22 05:21:25 multiproc_worker_utils.py:236] return func(*args, **kwargs)
2025-01-22T13:21:25.981149122Z [1;36m(VllmWorkerProcess pid=362)[0;0m ERROR 01-22 05:21:25 multiproc_worker_utils.py:236] ^^^^^^^^^^^^^^^^^^^^^
2025-01-22T13:21:25.981150069Z [1;36m(VllmWorkerProcess pid=362)[0;0m ERROR 01-22 05:21:25 multiproc_worker_utils.py:236] File "/usr/local/lib/python3.12/dist-packages/vllm/worker/worker.py", line 202, in determine_num_available_blocks
2025-01-22T13:21:25.981150994Z [1;36m(VllmWorkerProcess pid=362)[0;0m ERROR 01-22 05:21:25 multiproc_worker_utils.py:236] self.model_runner.profile_run()
2025-01-22T13:21:25.981151936Z [1;36m(VllmWorkerProcess pid=362)[0;0m ERROR 01-22 05:21:25 multiproc_worker_utils.py:236] File "/usr/local/lib/python3.12/dist-packages/torch/utils/_contextlib.py", line 116, in decorate_context
2025-01-22T13:21:25.981152905Z [1;36m(VllmWorkerProcess pid=362)[0;0m ERROR 01-22 05:21:25 multiproc_worker_utils.py:236] return func(*args, **kwargs)
2025-01-22T13:21:25.981154018Z [1;36m(VllmWorkerProcess pid=362)[0;0m ERROR 01-22 05:21:25 multiproc_worker_utils.py:236] ^^^^^^^^^^^^^^^^^^^^^
2025-01-22T13:21:25.981155223Z [1;36m(VllmWorkerProcess pid=362)[0;0m ERROR 01-22 05:21:25 multiproc_worker_utils.py:236] File "/usr/local/lib/python3.12/dist-packages/vllm/worker/model_runner.py", line 1331, in profile_run
2025-01-22T13:21:25.981158094Z [1;36m(VllmWorkerProcess pid=362)[0;0m ERROR 01-22 05:21:25 multiproc_worker_utils.py:236] self.execute_model(model_input, kv_caches, intermediate_tensors)
2025-01-22T13:21:25.981159193Z [1;36m(VllmWorkerProcess pid=362)[0;0m ERROR 01-22 05:21:25 multiproc_worker_utils.py:236] File "/usr/local/lib/python3.12/dist-packages/torch/utils/_contextlib.py", line 116, in decorate_context
2025-01-22T13:21:25.981160159Z [1;36m(VllmWorkerProcess pid=362)[0;0m ERROR 01-22 05:21:25 multiproc_worker_utils.py:236] return func(*args, **kwargs)
2025-01-22T13:21:25.981161245Z [1;36m(VllmWorkerProcess pid=362)[0;0m ERROR 01-22 05:21:25 multiproc_worker_utils.py:236] ^^^^^^^^^^^^^^^^^^^^^
2025-01-22T13:21:25.981162372Z [1;36m(VllmWorkerProcess pid=362)[0;0m ERROR 01-22 05:21:25 multiproc_worker_utils.py:236] File "/usr/local/lib/python3.12/dist-packages/vllm/worker/model_runner_base.py", line 152, in _wrapper
2025-01-22T13:21:25.981179430Z [1;36m(VllmWorkerProcess pid=362)[0;0m ERROR 01-22 05:21:25 multiproc_worker_utils.py:236] raise type(err)(
2025-01-22T13:21:25.981180548Z [1;36m(VllmWorkerProcess pid=362)[0;0m ERROR 01-22 05:21:25
Add --dtype float16, or use the new moe_wna16 kernel, which needs to be built from source.
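For reference, here is a rough sketch of the first option. The traceback above ends at `assert hidden_states.dtype == torch.float16` in `fused_marlin_moe`, and the logged args show `dtype='auto'`, so presumably the activations were not float16. The flags below are copied from the startup parameters posted earlier, with `--dtype float16` as the only addition. Launching through `python3 -m vllm.entrypoints.openai.api_server` is an assumption on my side; adapt it if you start the server some other way (e.g. through the Docker image).

```shell
# Flags copied from the post above; --dtype float16 is the only addition.
VLLM_ARGS="--host 0.0.0.0 --port 8000 \
  --model cognitivecomputations/DeepSeek-R1-AWQ \
  --gpu-memory-utilization 0.95 \
  --tensor-parallel-size 4 \
  --trust-remote-code \
  --dtype float16"

# Print the full command so it can be reviewed before launching.
echo "python3 -m vllm.entrypoints.openai.api_server $VLLM_ARGS"

# Uncomment to actually start the server (needs the GPUs and a vLLM install).
# python3 -m vllm.entrypoints.openai.api_server $VLLM_ARGS
```

After startup, the `args: Namespace(...)` line in the log should show `dtype='float16'` instead of `dtype='auto'`.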