Update README.md

We apply tailored prompts for coding and math tasks:

```
{question} + "\n\nPresent the answer in LaTex format: \\boxed{Your answer}"
```
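
You can run inference with vLLM as in the snippet below: each question is wrapped with the math prompt above, formatted with the tokenizer's chat template, and then decoded greedily.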

```python
import os
from tqdm import tqdm
import torch
from transformers import AutoTokenizer
from vllm import LLM, SamplingParams

os.environ["NCCL_IGNORE_DISABLED_P2P"] = "1"
os.environ["TOKENIZERS_PARALLELISM"] = "true"


def generate(question_list, model_path):
    # Shard the model across all visible GPUs and decode greedily.
    llm = LLM(
        model=model_path,
        trust_remote_code=True,
        tensor_parallel_size=torch.cuda.device_count(),
        gpu_memory_utilization=0.90,
    )
    sampling_params = SamplingParams(max_tokens=8192,
                                     temperature=0.0,
                                     n=1)
    outputs = llm.generate(question_list, sampling_params, use_tqdm=True)
    completions = [[output.text for output in output_item.outputs] for output_item in outputs]
    return completions


def make_conv_hf(question, tokenizer):
    # For math problems:
    content = question + "\n\nPresent the answer in LaTex format: \\boxed{Your answer}"
    # For code problems:
    # content = question + "\n\nWrite Python code to solve the problem. Present the code in \n```python\nYour code\n```\nat the end."
    msg = [
        {"role": "user", "content": content}
    ]
    chat = tokenizer.apply_chat_template(msg, tokenize=False, add_generation_prompt=True)
    return chat


def run():
    model_path = "PRIME-RL/Eurus-2-7B-SFT"
    all_problems = [
        "which number is larger? 9.11 or 9.9?"
    ]
    tokenizer = AutoTokenizer.from_pretrained(model_path)
    completions = generate([make_conv_hf(problem_data, tokenizer) for problem_data in all_problems], model_path)
    print(completions)
    # Example output:
    # [['[ASSESS]\n\n# The task is to compare two decimal numbers, 9.11 and 9.9, to determine which one is larger.\n# The numbers are in a standard decimal format, making direct comparison possible.\n# No additional information or context is provided that could affect the comparison.\n\nNext action: [ADVANCE]\n\n[ADVANCE]\n\n# To compare the two numbers, I will examine their whole and decimal parts separately.\n# The whole part of both numbers is 9, so I will focus on the decimal parts.\n# The decimal part of 9.11 is 0.11, and the decimal part of 9.9 is 0.9.\n# Since 0.9 is greater than 0.11, I can conclude that 9.9 is larger than 9.11.\n\nNext action: [VERIFY]\n\n[VERIFY]\n\n# I will review my comparison of the decimal parts to ensure accuracy.\n# Upon re-examination, I confirm that 0.9 is indeed greater than 0.11.\n# I also consider the possibility of a mistake in my initial assessment, but the comparison seems straightforward.\n# I evaluate my process and conclude that it is sound, as I correctly identified the whole and decimal parts of the numbers and compared them accurately.\n# No potential errors or inconsistencies are found in my reasoning.\n\nNext action: [OUTPUT]\n\n[OUTPUT]\n\nTo determine which number is larger, 9.11 or 9.9, I compared their whole and decimal parts. Since the whole parts are equal, I focused on the decimal parts, finding that 0.9 is greater than 0.11. After verifying my comparison, I concluded that 9.9 is indeed larger than 9.11.\n\n\\boxed{9.9}\n\n']]


if __name__ == "__main__":
    run()
```
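
The script loads the model with `tensor_parallel_size=torch.cuda.device_count()`, so it uses every visible GPU, and samples with `temperature=0.0` and a `max_tokens` budget of 8192 so the model can finish its multi-step reasoning before emitting the final `\boxed{}` answer.
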
## Evaluation
After finetuning, the performance of our Eurus-2-7B-SFT is shown in the following figure.