Open R1: Update #4

Welcome DeepSeek-V3 0324
This week, a new model from DeepSeek silently landed on the Hub. It’s an updated version of DeepSeek-V3, the base model underlying the R1 reasoning model. There isn’t much information shared yet on this new model, but we do know a few things!
What we know so far
The model has the same architecture as the original DeepSeek-V3 and now also comes with an MIT license, while the previous V3 model had a custom model license. The focus of this release was on improving instruction following as well as code and math capabilities. Let’s have a look!
How good is it?
The DeepSeek team has evaluated the model on a range of math and coding tasks and we can see the model’s strong capabilities compared to other frontier models:
Clearly, the model plays in the top league: often on par with GPT-4.5 and generally stronger than Claude-Sonnet-3.7.
To summarise, the model has seen significant improvements across benchmarks:
- MMLU-Pro: 75.9 → 81.2 (+5.3) (A good benchmark for overall understanding)
- GPQA: 59.1 → 68.4 (+9.3)
- AIME: 39.6 → 59.4 (+19.8) (proxy for MATH capabilities)
- LiveCodeBench: 39.2 → 49.2 (+10.0) (indicator of coding abilities)
Specifically, in the model card the DeepSeek team mentions targeted improvements in the following areas:
- Front-End Web Development
  - Improved executability of the code
  - More aesthetically pleasing web pages and game front-ends
- Chinese Writing Proficiency
  - Enhanced style and content quality
  - Aligned with the R1 writing style
  - Better quality in medium-to-long-form writing
- Feature Enhancements
  - Improved multi-turn interactive rewriting
  - Optimized translation quality and letter writing
  - Enhanced style and content quality
- Chinese Search Capabilities
  - Enhanced report analysis requests with more detailed outputs
- Function Calling Improvements
  - Increased accuracy in Function Calling, fixing issues in previous V3 versions
So the question might pop up: how did they actually do this? Let’s speculate a bit!
How did they do it?
Given the naming and architecture, it is fairly safe to assume that the new model is based on the previous V3 model and trained on top of it. There are two main areas where they could have improved the model:
- Continual pretraining: Starting with the V3 model, it’s possible to continue the pretraining process with a) newer, more up-to-date data and b) data that has been better curated and is thus of higher quality. This improves factuality on recent events and capabilities in general.
- Improved post-training: Especially for instruction following and style, post-training plays the most important role. Likely they improved the post-training data mix and maybe even the algorithm.
Until the team releases a technical report we won’t know for sure what they tweaked, but improvements to the post-training pipeline are quite likely, potentially combined with a bit of additional pretraining. So have a look at how to use the model next!
How to use the model
Inference Providers
You can use Hugging Face’s Inference Providers to quickly experiment with this model. It’s available through Fireworks, Hyperbolic, and Novita.
Here’s an example using the huggingface_hub library. You can also use the OpenAI client library like in this example.
from huggingface_hub import InferenceClient

client = InferenceClient(
    provider="fireworks-ai",
    # api_key="your hf or provider token"
)

messages = [
    {
        "role": "user",
        "content": "My first is second in line; I send shivers up your spine; not quite shining bright. I glitter in the light."
    }
]

completion = client.chat.completions.create(
    model="deepseek-ai/DeepSeek-V3-0324",
    messages=messages,
    temperature=0.3,
)

print(completion.choices[0].message["content"])
# ...**Final Answer: ice**
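If you prefer the OpenAI client library, a roughly equivalent call could look like the sketch below. The base_url points at Hugging Face’s Inference Providers router and is an assumption here, so double-check the Inference Providers docs for the exact endpoint; the HF token is passed as the API key.

from openai import OpenAI

# Point the OpenAI client at Hugging Face's Inference Providers router
# (base_url is an assumption -- check the Inference Providers docs).
client = OpenAI(
    base_url="https://router.huggingface.co/v1",
    api_key="your hf token",
)

completion = client.chat.completions.create(
    model="deepseek-ai/DeepSeek-V3-0324",
    messages=[{"role": "user", "content": "Write a haiku about open models."}],
    temperature=0.3,
)

print(completion.choices[0].message.content)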
Text Generation Inference
TGI supports running DeepSeek V3-0324 with its latest release as well. You can use it directly with the tagged docker image on a node of H100s:
docker run --gpus all --shm-size 1g -p 8080:80 -v $volume:/data \
    ghcr.io/huggingface/text-generation-inference:3.2.1 --model-id deepseek-ai/DeepSeek-V3-0324
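Once the container is running, TGI exposes an OpenAI-compatible Messages API on the mapped port, so a quick smoke test could look like the sketch below (localhost and port 8080 match the docker command above; the API key is a dummy value for a local server without authentication).

from openai import OpenAI

# Talk to the local TGI container started above via its OpenAI-compatible Messages API.
client = OpenAI(base_url="http://localhost:8080/v1", api_key="-")

completion = client.chat.completions.create(
    model="deepseek-ai/DeepSeek-V3-0324",
    messages=[{"role": "user", "content": "Solve: 12 * 17 = ?"}],
    max_tokens=128,
)

print(completion.choices[0].message.content)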
SGLang
SGLang supports running DeepSeek V3-0324 out of the box, along with the Multi-head Latent Attention and Data Parallelism optimisations. To use it, simply run the following on a node of H100s. For more information follow along here.
docker pull lmsysorg/sglang:latest
docker run --gpus all --shm-size 32g -p 30000:30000 -v ~/.cache/huggingface:/root/.cache/huggingface --ipc=host --network=host --privileged lmsysorg/sglang:latest \
python3 -m sglang.launch_server --model deepseek-ai/DeepSeek-V3-0324 --tp 8 --trust-remote-code --port 30000
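Once the server is up, you can send requests over plain HTTP. The sketch below uses SGLang’s native /generate route on port 30000 from the command above; treat the exact request and response fields as an assumption and check the SGLang docs if they have changed.

import requests

# Query the local SGLang server started above via its native /generate endpoint.
response = requests.post(
    "http://localhost:30000/generate",
    json={
        "text": "The capital of France is",
        "sampling_params": {"temperature": 0.3, "max_new_tokens": 32},
    },
)
print(response.json()["text"])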
Dynamic Quants from Unsloth and Llama.cpp
Running large LLMs like DeepSeek V3-0324 can be quite compute intensive and requires a large amount of GPU VRAM. This is where quantization comes in: it allows the end user to use the same model with much lower VRAM consumption, at a small trade-off in downstream performance.
Unsloth AI created dynamic quantisations that let you run DeepSeek V3 with half the compute of a node of H100s using llama.cpp, without much degradation on benchmarks. Read more about it here: https://huggingface.co./unsloth/DeepSeek-V3-0324-GGUF
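As a rough sketch, the GGUF quants can be loaded through the llama-cpp-python bindings. The filename pattern below is a placeholder, so check the Unsloth repo for the actual file names (larger quants are sharded across several files) and the hardware requirements before running.

from llama_cpp import Llama

# Download one of the Unsloth dynamic GGUF quants from the Hub and run it locally.
# The filename glob is a placeholder -- pick the quant that fits your hardware;
# for sharded quants, point at the first shard file.
llm = Llama.from_pretrained(
    repo_id="unsloth/DeepSeek-V3-0324-GGUF",
    filename="*Q2_K_XL*",
    n_gpu_layers=-1,  # offload as many layers as possible to the GPU
)

output = llm.create_chat_completion(
    messages=[{"role": "user", "content": "Explain KV caching in one sentence."}],
    temperature=0.3,
)
print(output["choices"][0]["message"]["content"])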
Is it safe?
Running language models safely has been at the center of attention ever since the first GPT models were released. With the immense popularity of the DeepSeek models and their origin, the question has found new interest. Let us run down what is safe to do and where some caution is a good idea. This is not DeepSeek specific but true for any open model!
First of all - is it safe to even download the model?
Downloading and running the model
Yes, downloading the model is safe. There are a few precautions on the Hub side that make sure it’s safe to download and run models:
- Safetensors: The safetensors format is used to store the DeepSeek model weights on the Hub, ensuring no hidden code execution is possible, which was a risk with the older PyTorch pickle format. Thus no malicious code can be hidden in the weights file. Read more in the Safetensors blog.
- Modeling code: To run the model, the modeling code also needs to be downloaded along with the weight files. There are three mechanisms in place to improve safety there: 1. the files are fully visible on the Hub, 2. the user needs to explicitly set trust_remote_code=True to execute any code associated with the model, 3. a security scanner runs over files on the Hub and flags any malicious code files. If you want to be extra careful you can pin the model version with the revision setting to make sure you download the version of the modeling code that has been reviewed, as shown in the sketch after this list.
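For example, a minimal sketch of pinning the revision with transformers could look like this; the commit hash is a placeholder for whichever revision you actually reviewed.

from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "deepseek-ai/DeepSeek-V3-0324"
revision = "<commit-hash-you-reviewed>"  # placeholder: pin the exact revision you audited

tokenizer = AutoTokenizer.from_pretrained(model_id, revision=revision)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    revision=revision,
    trust_remote_code=True,  # required to execute the model's custom modeling code
)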
So downloading the weights is safe, and upon code review so is executing the modeling code. This means you can run the DeepSeek model locally without the risk of backdoors or malicious code execution.
So what would be the main risks outside of downloading and running the model? It depends on what you do with the model outputs!
Model outputs
The advice that follows is not specific to any model and applies to both open and closed models, whether the risks stem from built-in secret behaviours in the model or from a model accidentally producing bad outputs.
We’ll cover risks in three areas: alignment, code generation and agents.
Alignment mismatch: Every model provider chooses how and to which values their models are aligned. What these values are and how they are chosen typically remains opaque, and they might also change over time (see this study). The advantage of open models is that the alignment can still be changed with custom fine-tuning at a later stage, as the example of Perplexity’s R1 1776 shows.


As a rule, users should be aware that any LLM is biased in one way or another and treat the model outputs accordingly.
Code generation: One of the most popular use-cases of LLMs is as coding assistants. However, this is also where indiscriminate usage of the model outputs can have the most negative effects. Models are trained on vast amounts of published code, new and old. This typically includes potentially malicious code or code that contains known vulnerabilities. So models might produce similar vulnerabilities when proposing code solutions.
So, how can you prevent security issues when using LLMs for code development? Run thorough code reviews of the proposed changes and scan the code with appropriate tools for vulnerabilities, as you would with any other code contribution.
Agents: In the past few months, agent applications have gained significant interest. Giving LLMs more autonomy and agency, however, also bears risks. It’s important to be careful about what kind of system access agents have and which information you provide them. Some good practices:
- Sandboxes: don’t run agents on your machine where they have access and control of your computer. This avoids leaking private information or accidentally deleting important files.
- Private information: don’t share private information such as logins with the LLM. If you need to give the model access to a system use dedicated access keys with strict access rules.
- Human-in-the-loop: for high stakes processes that you want to automate with agents make sure there is a human in the loop for final confirmation.
TL;DR: Is it safe to run the models? Yes, downloading and running the models is safe, but, as with any model, you should handle the models’ generations with appropriate care and safety measures.