'DeepseekV3ForCausalLM' object has no attribute 'get_embed_and_head'
I am trying to run the shared command but get the following error.
[2025-03-05 10:36:11 TP4] Scheduler hit an exception: Traceback (most recent call last):
File "/sgl-workspace/sglang/python/sglang/srt/managers/scheduler.py", line 1816, in run_scheduler_process
scheduler = Scheduler(server_args, port_args, gpu_id, tp_rank, dp_rank)
File "/sgl-workspace/sglang/python/sglang/srt/managers/scheduler.py", line 252, in __init__
self.draft_worker = EAGLEWorker(
File "/sgl-workspace/sglang/python/sglang/srt/speculative/eagle_worker.py", line 68, in __init__
embed, head = self.target_worker.model_runner.model.get_embed_and_head()
File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1931, in __getattr__
raise AttributeError(
AttributeError: 'DeepseekV3ForCausalLM' object has no attribute 'get_embed_and_head'
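For context, the EAGLE worker apparently expects the target model to hand over its embedding and LM-head weights through that hook so the draft model can share them. A toy sketch of what it calls (attribute names assumed from the usual sglang causal-LM layout, not the actual DeepSeek-V3 code):

from torch import nn

class CausalLMWithSharedHead(nn.Module):
    # Toy stand-in for the hook EAGLEWorker calls on the target model.
    def __init__(self, vocab_size: int = 32, hidden_size: int = 8):
        super().__init__()
        self.embed_tokens = nn.Embedding(vocab_size, hidden_size)
        self.lm_head = nn.Linear(hidden_size, vocab_size, bias=False)

    def get_embed_and_head(self):
        # The draft (NextN/EAGLE) model reuses these tensors instead of
        # loading its own embedding and output projection.
        return self.embed_tokens.weight, self.lm_head.weight

embed, head = CausalLMWithSharedHead().get_embed_and_head()
print(embed.shape, head.shape)  # torch.Size([32, 8]) torch.Size([32, 8])

So it looks like the DeepseekV3ForCausalLM shipped in that image simply didn't have the hook yet.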
Hey @ispobock. I tried using the latest docker image, but it seems to be 2 weeks old, so this fix wasn't included. Let me try again and let you know.
A few questions:
- Do you know if a newer image could be pushed?
- What's the best way to build the sglang docker image? I'm going to try this one: https://github.com/sgl-project/sglang/blob/main/docker/Dockerfile
- How was this model created?
Hey @ispobock. I built the docker image from master and it worked. However, the throughput on H200 is less than half of what I was getting before: 15t/s instead of 37t/s. Any idea why that could be?
Also, would you mind sharing the expected speed from speculative decoding?
One extra note: I benchmarked sglang master (the image I built) vs the latest from dockerhub with only the --tp 8 argument on H200, and it seems sglang got slightly slower (36.3 tok/s instead of 37.1 tok/s).
I am getting 15t/s instead of 37t/s. Any idea why that could be?
@tchaton Could you share your full commands? For DeepSeek-R1 with speculative decoding on H200, the expected TPS is about 67t/s. Could you run the test several times? It may need warm-up.
One extra note: I benchmarked sglang master (the image I built) vs the latest from dockerhub with only the --tp 8 argument on H200, and it seems sglang got slightly slower (36.3 tok/s instead of 37.1 tok/s).
The results may fluctuate a bit. You can run the test a few more times.
Here were the steps:
- Clone sglang master and build the associated docker image: https://github.com/sgl-project/sglang/blob/main/docker/Dockerfile
- Download the full R1 weights from HF
docker run --network host --gpus all --ipc=host -v /teamspace/studios/this_studio/DeepSeek-R1:/DeepSeek-R1 sglang-master --model-path /DeepSeek-R1 --tp 8 --speculative-algo EAGLE --speculative-draft-model lmsys/DeepSeek-R1-NextN --speculative-num-steps 2 --speculative-eagle-topk 4 --speculative-num-draft-tokens 4
This gave me 15t/s when benchmarking it. And it was warmed up, as I ran the benchmark multiple times.
@ispobock I just realised my benchmark script sends random text to the model to compute a lower bound on generation speed, but that most likely won't play well with speculative decoding. So let me try with real text.
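To illustrate why that matters, here is a rough sketch I'd use (assuming the server launched above is reachable on port 30000 via sglang's native /generate endpoint; the meta_info field is my assumption, hence the fallback) that times a natural prompt against random characters. The draft model can only guess plausible text, so acceptance should collapse on the random input:

import random
import string
import time

import requests

URL = "http://127.0.0.1:30000/generate"  # sglang's native endpoint (assumed reachable)

def tok_per_sec(prompt: str, max_new_tokens: int = 256) -> float:
    start = time.time()
    resp = requests.post(
        URL,
        json={
            "text": prompt,
            "sampling_params": {"max_new_tokens": max_new_tokens, "temperature": 0},
        },
        timeout=600,
    )
    resp.raise_for_status()
    elapsed = time.time() - start
    # Completion token count should be in meta_info; fall back to the requested budget.
    n_tokens = resp.json().get("meta_info", {}).get("completion_tokens", max_new_tokens)
    return n_tokens / elapsed

natural = "Explain, step by step, how speculative decoding speeds up LLM inference."
gibberish = "".join(random.choices(string.ascii_letters + " ", k=500))

print("natural prompt:", round(tok_per_sec(natural), 1), "tok/s")
print("random prompt :", round(tok_per_sec(gibberish), 1), "tok/s")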
Did you use the sglang benchmarking script? If yes, what command did you use?
I am trying
python3 -m sglang.bench_serving --backend sglang --dataset-name sharegpt --max-concurrency 25 --num-prompts 3000 --sharegpt-context-len 1000 --sharegpt-output-len 300 --model deepseek-ai/DeepSeek-R1
Ok, the numbers are much better now, but it isn't at 67t/s.
docker run --gpus all --shm-size 32g -p 30000:30000 \
-e HF_TOKEN=.... \
-v /teamspace/studios/this_studio/my-models:/model \
--ipc=host --network=host --privileged $1 \
python3 -m sglang.launch_server --model /model/deepseek-ai/DeepSeek-R1 \
--tp 8 \
--trust-remote-code \
--port 30000 \
--speculative-algo NEXTN \
--speculative-draft-model lmsys/DeepSeek-R1-NextN \
--speculative-num-steps 2 \
--speculative-eagle-topk 4 \
--speculative-num-draft-tokens 4
python3 -m sglang.bench_serving --backend sglang --dataset-name sharegpt --max-concurrency 1 --num-prompts 100 --sharegpt-context-len 1000 --sharegpt-output-len 300 --model deepseek-ai/DeepSeek-R1
...
============ Serving Benchmark Result ============
Backend: sglang
Traffic request rate: inf
Max request concurrency: 1
Successful requests: 100
Benchmark duration (s): 506.66
Total input tokens: 22009
Total generated tokens: 30000
Total generated tokens (retokenized): 29861
Request throughput (req/s): 0.20
Input token throughput (tok/s): 43.44
Output token throughput (tok/s): 59.21
Total token throughput (tok/s): 102.65
Concurrency: 1.00
Accept length: 2.52
----------------End-to-End Latency----------------
Mean E2E Latency (ms): 5066.03
Median E2E Latency (ms): 4933.66
---------------Time to First Token----------------
Mean TTFT (ms): 332.02
Median TTFT (ms): 218.70
P99 TTFT (ms): 2330.85
---------------Inter-Token Latency----------------
Mean ITL (ms): 15.87
Median ITL (ms): 13.70
P95 ITL (ms): 21.18
P99 ITL (ms): 39.95
Max ITL (ms): 44.12
==================================================
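For what it's worth, the numbers above look internally consistent to me; here is my own back-of-the-envelope check (my arithmetic and my reading of the speculative flags, not anything printed by sglang):

# Back-of-the-envelope check of the report above (my arithmetic, not sglang's output).
mean_itl_ms = 15.87          # mean inter-token latency from the report
mean_e2e_ms = 5066.03        # mean end-to-end latency per request
output_tokens_per_req = 300  # --sharegpt-output-len 300

decode_rate = 1000 / mean_itl_ms                             # ~63 tok/s once decoding starts
overall_rate = output_tokens_per_req / (mean_e2e_ms / 1000)  # ~59 tok/s including TTFT
print(f"decode-only: {decode_rate:.1f} tok/s, end-to-end: {overall_rate:.1f} tok/s")

# With --speculative-num-steps 2, each verify pass can accept at most 2 drafted
# tokens plus 1 token from the target model (assuming the reported accept length
# counts that bonus token), so 3 is the ceiling; 2.52 is already fairly close to it.
print("observed accept length: 2.52, ceiling for this config:", 2 + 1)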
I use this command to benchmark it:
python3 -m sglang.bench_one_batch_server --model None --base-url http://127.0.0.1:30000 --batch-size 1 --input-len 256 --output-len 256
Hey @ispobock, I'm getting this:
⚡ main ~/sglang python3 -m sglang.bench_one_batch_server --model None --base-url http://127.0.0.1:30000 --batch-size 1 --input-len 256 --output-len 256 --model deepseek-ai/DeepSeek-R1
Traceback (most recent call last):
File "/home/zeus/miniconda3/envs/cloudspace/lib/python3.10/runpy.py", line 196, in _run_module_as_main
return _run_code(code, main_globals, None,
File "/home/zeus/miniconda3/envs/cloudspace/lib/python3.10/runpy.py", line 86, in _run_code
exec(code, run_globals)
File "/teamspace/studios/this_studio/sglang/python/sglang/bench_one_batch_server.py", line 25, in <module>
from sglang.srt.entrypoints.http_server import launch_server
File "/teamspace/studios/this_studio/sglang/python/sglang/srt/entrypoints/http_server.py", line 44, in <module>
from sglang.srt.entrypoints.engine import _launch_subprocesses
File "/teamspace/studios/this_studio/sglang/python/sglang/srt/entrypoints/engine.py", line 36, in <module>
from sglang.srt.managers.data_parallel_controller import (
File "/teamspace/studios/this_studio/sglang/python/sglang/srt/managers/data_parallel_controller.py", line 27, in <module>
from sglang.srt.managers.io_struct import (
File "/teamspace/studios/this_studio/sglang/python/sglang/srt/managers/io_struct.py", line 25, in <module>
from sglang.srt.managers.schedule_batch import BaseFinishReason
File "/teamspace/studios/this_studio/sglang/python/sglang/srt/managers/schedule_batch.py", line 43, in <module>
from sglang.srt.configs.model_config import ModelConfig
File "/teamspace/studios/this_studio/sglang/python/sglang/srt/configs/__init__.py", line 4, in <module>
from sglang.srt.configs.qwen2_5_vl_config import (
File "/teamspace/studios/this_studio/sglang/python/sglang/srt/configs/qwen2_5_vl_config.py", line 1005, in <module>
AutoImageProcessor.register(Qwen2_5_VLConfig, None, Qwen2_5_VLImageProcessor, None)
File "/home/zeus/miniconda3/envs/cloudspace/lib/python3.10/site-packages/transformers/models/auto/image_processing_auto.py", line 628, in register
IMAGE_PROCESSOR_MAPPING.register(
File "/home/zeus/miniconda3/envs/cloudspace/lib/python3.10/site-packages/transformers/models/auto/auto_factory.py", line 833, in register
raise ValueError(f"'{key}' is already used by a Transformers model.")
ValueError: '<class 'sglang.srt.configs.qwen2_5_vl_config.Qwen2_5_VLConfig'>' is already used by a Transformers model.
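My guess (only a guess) is that the transformers version installed in that conda env already ships Qwen2.5-VL, so sglang's manual registration in qwen2_5_vl_config.py collides with the built-in one. A rough sketch of how that line could be guarded so it becomes a no-op in that case:

# Rough workaround sketch for line 1005 of sglang/srt/configs/qwen2_5_vl_config.py,
# assuming the collision comes from a transformers build that already registers Qwen2.5-VL.
try:
    AutoImageProcessor.register(Qwen2_5_VLConfig, None, Qwen2_5_VLImageProcessor, None)
except ValueError:
    # Already registered by transformers itself; keep the built-in mapping.
    pass

Pinning transformers in that env to whatever version the sglang commit expects would presumably avoid it as well.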
Also, interestingly, it works with --speculative-algo NEXTN but fails with EAGLE.