System Prompt
We have tested the system prompt with temperature of 0.7.
You are a helpful and harmless assistant. You should think step-by-step.
Here are the evaluation results.
Models | AIME24 | MATH500 | GSM8K | GPQA-Diamond | ARC-Challenge | MMLU-Pro | MMLU | LiveCodeBench |
---|---|---|---|---|---|---|---|---|
deepseek-ai/DeepSeek-R1-Distill-Qwen-32B | 46.67 | 88.20 | - | 57.58 | - | - | - | - |
More evaluation results can be found at https://huggingface.co./FuseAI/FuseO1-DeekSeekR1-QwQ-SkyT1-32B-Preview
The result seems biased.
I'm also confused to get this much lower results compared to their reported, especially on AIME24...
We have tested the system prompt with temperature of 0.7.
You are a helpful and harmless assistant. You should think step-by-step.
Here are the evaluation results.
Models AIME24 MATH500 GSM8K GPQA-Diamond ARC-Challenge MMLU-Pro MMLU LiveCodeBench deepseek-ai/DeepSeek-R1-Distill-Qwen-32B 46.67 88.20 - 57.58 - - - - More evaluation results can be found at https://huggingface.co./FuseAI/FuseO1-DeekSeekR1-QwQ-SkyT1-32B-Preview
The evaluation code is modified from SkyThought. In our evaluation, we set the temperature to 0.7 and the max_tokens to 32768. We provide the example to reproduce our results in evaluation.
The system prompt for evaluation is set to:
You are a helpful and harmless assistant. You should think step-by-step.
We are currently attempting to reproduce the results reported in the DeepSeek-R1 paper by experimenting with different system prompts. We will update our findings once we have acquired the original system prompt used in their study.
The updated evaluation results are presented here:
Models | AIME24 | MATH500 | GSM8K | GPQA-Diamond | ARC-Challenge | MMLU-Pro | MMLU | LiveCodeBench |
---|---|---|---|---|---|---|---|---|
deepseek-ai/DeepSeek-R1-Distill-Qwen-32B | 46.67 | 88.20 | 93.71 | 57.58 | 95.90 | 68.70 | 82.17 | 59.69 |
More evaluation results can be found at https://huggingface.co./FuseAI/FuseO1-DeekSeekR1-QwQ-SkyT1-32B-Preview
如何评价
Please kindly refer to the following link:
https://github.com/deepseek-ai/DeepSeek-R1#usage-recommendations
Покажи пример Telegram bot, python aiogram принимающий видео от юзера, обрабатывающий его и отправляющий обратно. База данных mongodb
Сделай
Can you confirm that this model can't answer the coding question which even standard qwen-7b-instruct answers?
Explain the bug in the following code:
from time import sleep
from multiprocessing.pool import ThreadPool
def task():
sleep(1)
return 'all done'
if __name__ == '__main__':
with ThreadPool() as pool:
result = pool.apply_async(task())
value = result.get()
print(value)
After long thinking it always answers that there is no bug. But the bug is in result = pool.apply_async(task). Almost all recent models of similar size answer it easily.
We find the evaluation results for math and code are not correct in our current version. To address this issue, we use the code from Qwen2.5-Math and Qwen2.5-Coder for math and code evaluation. With this approach, we have successfully reproduced the results reported in the DeepSeek-R1 paper.
We have finished all the evaluation and updated the results here:
The reproduce details can be found in our blog: https://huggingface.co./blog/Wanfq/fuseo1-preview
We also provide the code in our github repo: https://github.com/fanqiwan/FuseAI/tree/main/FuseO1-Preview
Our models are in : https://huggingface.co./collections/FuseAI/fuseo1-preview-678eb56093649b2688bc9977
Have fun!