System Prompt

#2
by Wanfq - opened

We have tested the system prompt with temperature of 0.7.

You are a helpful and harmless assistant. You should think step-by-step.

Here are the evaluation results.

Models AIME24 MATH500 GSM8K GPQA-Diamond ARC-Challenge MMLU-Pro MMLU LiveCodeBench
deepseek-ai/DeepSeek-R1-Distill-Qwen-32B 46.67 88.20 - 57.58 - - - -

More evaluation results can be found at https://huggingface.co./FuseAI/FuseO1-DeekSeekR1-QwQ-SkyT1-32B-Preview

The result seems biased.

I'm also confused to get this much lower results compared to their reported, especially on AIME24...

image.png

We have tested the system prompt with temperature of 0.7.

You are a helpful and harmless assistant. You should think step-by-step.

Here are the evaluation results.

Models AIME24 MATH500 GSM8K GPQA-Diamond ARC-Challenge MMLU-Pro MMLU LiveCodeBench
deepseek-ai/DeepSeek-R1-Distill-Qwen-32B 46.67 88.20 - 57.58 - - - -

More evaluation results can be found at https://huggingface.co./FuseAI/FuseO1-DeekSeekR1-QwQ-SkyT1-32B-Preview

The evaluation code is modified from SkyThought. In our evaluation, we set the temperature to 0.7 and the max_tokens to 32768. We provide the example to reproduce our results in evaluation.

The system prompt for evaluation is set to:

You are a helpful and harmless assistant. You should think step-by-step.

We are currently attempting to reproduce the results reported in the DeepSeek-R1 paper by experimenting with different system prompts. We will update our findings once we have acquired the original system prompt used in their study.

The updated evaluation results are presented here:

Models AIME24 MATH500 GSM8K GPQA-Diamond ARC-Challenge MMLU-Pro MMLU LiveCodeBench
deepseek-ai/DeepSeek-R1-Distill-Qwen-32B 46.67 88.20 93.71 57.58 95.90 68.70 82.17 59.69

More evaluation results can be found at https://huggingface.co./FuseAI/FuseO1-DeekSeekR1-QwQ-SkyT1-32B-Preview

image.png

DeepSeek org

Please kindly refer to the following link:
https://github.com/deepseek-ai/DeepSeek-R1#usage-recommendations

Покажи пример Telegram bot, python aiogram принимающий видео от юзера, обрабатывающий его и отправляющий обратно. База данных mongodb

Сделай

Can you confirm that this model can't answer the coding question which even standard qwen-7b-instruct answers?

Explain the bug in the following code:

from time import sleep
from multiprocessing.pool import ThreadPool
 
def task():
    sleep(1)
    return 'all done'

if __name__ == '__main__':
    with ThreadPool() as pool:
        result = pool.apply_async(task())
        value = result.get()
        print(value)

After long thinking it always answers that there is no bug. But the bug is in result = pool.apply_async(task). Almost all recent models of similar size answer it easily.

We find the evaluation results for math and code are not correct in our current version. To address this issue, we use the code from Qwen2.5-Math and Qwen2.5-Coder for math and code evaluation. With this approach, we have successfully reproduced the results reported in the DeepSeek-R1 paper.

We have finished all the evaluation and updated the results here:

fuseo1-preview-low.jpg

The reproduce details can be found in our blog: https://huggingface.co./blog/Wanfq/fuseo1-preview

We also provide the code in our github repo: https://github.com/fanqiwan/FuseAI/tree/main/FuseO1-Preview

Our models are in : https://huggingface.co./collections/FuseAI/fuseo1-preview-678eb56093649b2688bc9977

Have fun!

Sign up or log in to comment