'DeepseekV3ForCausalLM' object has no attribute 'get_embed_and_head'

#1
by tchaton - opened

I am trying to run the shared command but get the following error.

[2025-03-05 10:36:11 TP4] Scheduler hit an exception: Traceback (most recent call last):
  File "/sgl-workspace/sglang/python/sglang/srt/managers/scheduler.py", line 1816, in run_scheduler_process
    scheduler = Scheduler(server_args, port_args, gpu_id, tp_rank, dp_rank)
  File "/sgl-workspace/sglang/python/sglang/srt/managers/scheduler.py", line 252, in __init__
    self.draft_worker = EAGLEWorker(
  File "/sgl-workspace/sglang/python/sglang/srt/speculative/eagle_worker.py", line 68, in __init__
    embed, head = self.target_worker.model_runner.model.get_embed_and_head()
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1931, in __getattr__
    raise AttributeError(
AttributeError: 'DeepseekV3ForCausalLM' object has no attribute 'get_embed_and_head'
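For context, the traceback shows the EAGLE worker calling `get_embed_and_head()` on the target model so the draft model can reuse its embedding and LM head. A minimal sketch of the hook it expects (hypothetical names and stand-in objects, not SGLang's actual implementation):

```python
# Hypothetical sketch of the interface EAGLEWorker relies on: the target
# model must expose its input-embedding and LM-head weights. The
# AttributeError above means the installed DeepseekV3ForCausalLM predates
# the commit that added this method.

class WeightHolder:
    """Stand-in for a real nn.Embedding / lm_head module."""
    def __init__(self, weight):
        self.weight = weight

class CausalLMWithEagleSupport:
    def __init__(self):
        self.embed_tokens = WeightHolder(weight="embed_weight")
        self.lm_head = WeightHolder(weight="head_weight")

    def get_embed_and_head(self):
        # This is the call that fails in the traceback when the method
        # is missing from the target model class.
        return self.embed_tokens.weight, self.lm_head.weight

model = CausalLMWithEagleSupport()
embed, head = model.get_embed_and_head()
```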
Large Model Systems Organization org

@tchaton This model can be used after this commit. Could you try the latest main branch of SGLang?

Hey @ispobock. I tried using the latest Docker image, but it seems to be two weeks old, so this fix wasn't included. Let me try again and let you know.

A few questions:

Hey @ispobock. I built the Docker image from master and it worked. However, the tokens-per-second rate is half of what it was on H200: I am getting 15 t/s instead of 37 t/s. Any ideas why that could be?

Also, would you mind sharing what the expected speed with speculative decoding should be?

One extra note: I benchmarked sglang master (the image I built) against the latest image from Docker Hub, with only the --tp 8 argument, on H200, and sglang master seems slightly slower (36.3 tok/s instead of 37.1 tok/s).

Large Model Systems Organization org

I am getting 15t/s instead of 37t/s. Any ideas why it could be ?

@tchaton Could you share your full commands? For DeepSeek-R1 with speculative decoding on H200, the expected TPS is about 67 t/s. Could you test it multiple times? It may need warm-up.

One extra note. I benchmarked sglang master (image I built) vs latest from dockerhub with only --tp 8 argument on H200 and it seems sglang got slightly slower (36.3 toks/second instead of 37.1 toks/second)

The results may fluctuate a bit. You can run the test a few more times.
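On the fluctuation point: averaging several runs is the usual way to tell noise from a real regression. A quick illustration (the run values other than the two reported above are made up for the example):

```python
# Made-up sample of repeated benchmark runs; only 36.3 and 37.1 tok/s are
# real numbers from this thread. The point: a ~0.8 tok/s gap between two
# single runs is within typical run-to-run variance.
from statistics import mean, stdev

runs = [36.3, 37.1, 36.8, 37.0, 36.5]
print(f"mean={mean(runs):.2f} tok/s, stdev={stdev(runs):.2f}")
```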

Here were the steps.

  1. Clone sglang master and build its Docker image: https://github.com/sgl-project/sglang/blob/main/docker/Dockerfile
  2. Download the full R1 weights from HF.
  3. Launch the server:

docker run --network host --gpus all --ipc=host -v /teamspace/studios/this_studio/DeepSeek-R1:/DeepSeek-R1 sglang-master --model-path /DeepSeek-R1 --tp 8 --speculative-algo EAGLE --speculative-draft-model lmsys/DeepSeek-R1-NextN --speculative-num-steps 2 --speculative-eagle-topk 4 --speculative-num-draft-tokens 4

This gave me 15 t/s when benchmarking it. And I did warm up, since I ran the benchmark multiple times.

@ispobock I just realised my benchmark script sends random text to the model to measure a lower bound for generation. That will most likely not play well with speculative decoding, so let me try with real text.
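That intuition can be made concrete: speculated tokens only count when the target model agrees with the draft, and on random input the agreement rate collapses. A toy model of the expected tokens per verification step (my own sketch and assumptions, not SGLang code — it assumes each drafted token is accepted independently and the chain stops at the first rejection):

```python
def avg_tokens_per_step(accept_prob, num_draft_tokens=4):
    """Expected tokens emitted per target-model verification step,
    assuming each speculated token is accepted independently with
    probability `accept_prob` and acceptance stops at the first miss."""
    expected = 0.0
    p = 1.0
    for _ in range(num_draft_tokens):
        p *= accept_prob
        expected += p
    return 1.0 + expected  # the target always emits at least one token

print(avg_tokens_per_step(0.8))  # natural text: ~3.36 tokens/step
print(avg_tokens_per_step(0.1))  # random text:  ~1.11 tokens/step
```

With random text the speedup evaporates, and you even pay the draft-model overhead on top, which is consistent with the low numbers seen earlier.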

Did you use the sglang benchmarking script? If so, what command did you use?

I am trying

python3 -m sglang.bench_serving --backend sglang --dataset-name sharegpt --max-concurrency 25 --num-prompts 3000 --sharegpt-context-len 1000 --sharegpt-output-len 300 --model deepseek-ai/DeepSeek-R1

OK, the numbers are much better now, but it isn't at 67 t/s.

docker run --gpus all --shm-size 32g -p 30000:30000 \
    -e HF_TOKEN=.... \
    -v /teamspace/studios/this_studio/my-models:/model \
    --ipc=host --network=host --privileged $1 \
    python3 -m sglang.launch_server --model /model/deepseek-ai/DeepSeek-R1 \
    --tp 8 \
    --trust-remote-code \
    --port 30000 \
    --speculative-algo NEXTN \
    --speculative-draft-model lmsys/DeepSeek-R1-NextN \
    --speculative-num-steps 2 \
    --speculative-eagle-topk 4 \
    --speculative-num-draft-tokens 4
python3 -m sglang.bench_serving --backend sglang --dataset-name sharegpt --max-concurrency 1 --num-prompts 100 --sharegpt-context-len 1000 --sharegpt-output-len 300 --model deepseek-ai/DeepSeek-R1
...
============ Serving Benchmark Result ============
Backend:                                 sglang    
Traffic request rate:                    inf       
Max reqeuest concurrency:                1         
Successful requests:                     100       
Benchmark duration (s):                  506.66    
Total input tokens:                      22009     
Total generated tokens:                  30000     
Total generated tokens (retokenized):    29861     
Request throughput (req/s):              0.20      
Input token throughput (tok/s):          43.44     
Output token throughput (tok/s):         59.21     
Total token throughput (tok/s):          102.65    
Concurrency:                             1.00      
Accept length:                           2.52      
----------------End-to-End Latency----------------
Mean E2E Latency (ms):                   5066.03   
Median E2E Latency (ms):                 4933.66   
---------------Time to First Token----------------
Mean TTFT (ms):                          332.02    
Median TTFT (ms):                        218.70    
P99 TTFT (ms):                           2330.85   
---------------Inter-Token Latency----------------
Mean ITL (ms):                           15.87     
Median ITL (ms):                         13.70     
P95 ITL (ms):                            21.18     
P99 ITL (ms):                            39.95     
Max ITL (ms):                            44.12     
==================================================
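A back-of-the-envelope reading of that result (my own reasoning, not an SGLang-documented formula): each target-model forward pass emits roughly `accept length` tokens, so a crude upper bound on speculative throughput is the non-speculative rate times the accept length, before subtracting draft-model overhead.

```python
# Rough upper-bound estimate from the numbers in this thread.
baseline_tps = 37.1    # non-speculative tok/s reported earlier
accept_length = 2.52   # "Accept length" from the benchmark result above

upper_bound = baseline_tps * accept_length
print(f"ideal upper bound: {upper_bound:.1f} tok/s")  # ~93.5 tok/s
# Observed: 59.21 tok/s -- the gap is plausibly draft-model overhead,
# so raising the accept length (e.g. with more natural prompts) matters.
```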
Large Model Systems Organization org

I use this command to benchmark it:

python3 -m sglang.bench_one_batch_server --model None --base-url http://127.0.0.1:30000 --batch-size 1 --input-len 256 --output-len 256

Hey @ispobock, I'm getting this:

⚡ main ~/sglang python3 -m sglang.bench_one_batch_server --model None --base-url http://127.0.0.1:30000 --batch-size 1 --input-len 256 --output-len 256 --model deepseek-ai/DeepSeek-R1
Traceback (most recent call last):
  File "/home/zeus/miniconda3/envs/cloudspace/lib/python3.10/runpy.py", line 196, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/home/zeus/miniconda3/envs/cloudspace/lib/python3.10/runpy.py", line 86, in _run_code
    exec(code, run_globals)
  File "/teamspace/studios/this_studio/sglang/python/sglang/bench_one_batch_server.py", line 25, in <module>
    from sglang.srt.entrypoints.http_server import launch_server
  File "/teamspace/studios/this_studio/sglang/python/sglang/srt/entrypoints/http_server.py", line 44, in <module>
    from sglang.srt.entrypoints.engine import _launch_subprocesses
  File "/teamspace/studios/this_studio/sglang/python/sglang/srt/entrypoints/engine.py", line 36, in <module>
    from sglang.srt.managers.data_parallel_controller import (
  File "/teamspace/studios/this_studio/sglang/python/sglang/srt/managers/data_parallel_controller.py", line 27, in <module>
    from sglang.srt.managers.io_struct import (
  File "/teamspace/studios/this_studio/sglang/python/sglang/srt/managers/io_struct.py", line 25, in <module>
    from sglang.srt.managers.schedule_batch import BaseFinishReason
  File "/teamspace/studios/this_studio/sglang/python/sglang/srt/managers/schedule_batch.py", line 43, in <module>
    from sglang.srt.configs.model_config import ModelConfig
  File "/teamspace/studios/this_studio/sglang/python/sglang/srt/configs/__init__.py", line 4, in <module>
    from sglang.srt.configs.qwen2_5_vl_config import (
  File "/teamspace/studios/this_studio/sglang/python/sglang/srt/configs/qwen2_5_vl_config.py", line 1005, in <module>
    AutoImageProcessor.register(Qwen2_5_VLConfig, None, Qwen2_5_VLImageProcessor, None)
  File "/home/zeus/miniconda3/envs/cloudspace/lib/python3.10/site-packages/transformers/models/auto/image_processing_auto.py", line 628, in register
    IMAGE_PROCESSOR_MAPPING.register(
  File "/home/zeus/miniconda3/envs/cloudspace/lib/python3.10/site-packages/transformers/models/auto/auto_factory.py", line 833, in register
    raise ValueError(f"'{key}' is already used by a Transformers model.")
ValueError: '<class 'sglang.srt.configs.qwen2_5_vl_config.Qwen2_5_VLConfig'>' is already used by a Transformers model.

Also, interestingly, it works with --speculative-algo NEXTN but fails with EAGLE.
