bys0318 committed on
Commit b523198
1 Parent(s): f61d034

Update README.md

Files changed (1):
  1. README.md +31 -3
README.md CHANGED
@@ -19,6 +19,9 @@ license: llama3.1
 
 LongWriter-llama3.1-8b is trained based on [Meta-Llama-3.1-8B](https://huggingface.co/meta-llama/Meta-Llama-3.1-8B), and is capable of generating 10,000+ words at once.
 
+Environment: `transformers>=4.43.0`
+
+Please adhere to the prompt template (system prompt is optional): `<<SYS>>\n{system prompt}\n<</SYS>>\n\n[INST]{query1}[/INST]{response1}[INST]{query2}[/INST]{response2}...`
 
 A simple demo for deployment of the model:
 ```python
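
For illustration, here is a minimal sketch of assembling a multi-turn prompt from the template added above; the `build_prompt` helper and the example history are assumptions, not part of the repository:

```python
# Hypothetical helper: flattens a chat history into the LongWriter template
# `<<SYS>>\n{system}\n<</SYS>>\n\n[INST]{query}[/INST]{response}...`
def build_prompt(history, query, system=None):
    prompt = f"<<SYS>>\n{system}\n<</SYS>>\n\n" if system else ""
    for q, r in history:  # history: list of (query, response) pairs
        prompt += f"[INST]{q}[/INST]{r}"
    prompt += f"[INST]{query}[/INST]"  # current turn, left open for the model
    return prompt

print(build_prompt([("Hi!", "Hello! How can I help you?")], "Write a 10000-word China travel guide"))
```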
@@ -41,9 +44,34 @@ output = model.generate(
 response = tokenizer.decode(output[context_length:], skip_special_tokens=True)
 print(response)
 ```
-Please ahere to the prompt template (system prompt is optional): `<<SYS>>\n{system prompt}\n<</SYS>>\n\n[INST]{query1}[/INST]{response1}[INST]{query2}[/INST]{response2}...`
-
-Environment: `transformers==4.43.0`
+You can also deploy the model with [vllm](https://github.com/vllm-project/vllm), which can generate 10,000+ words within a minute. Here is an example:
+```python
+from vllm import LLM, SamplingParams
+
+model = LLM(
+    model="THUDM/LongWriter-llama3.1-8b",
+    dtype="auto",
+    trust_remote_code=True,
+    tensor_parallel_size=1,
+    max_model_len=32768,  # context window large enough for 10,000+ word outputs
+    gpu_memory_utilization=0.5,
+)
+tokenizer = model.get_tokenizer()
+generation_params = SamplingParams(
+    temperature=0.5,
+    top_p=0.8,
+    top_k=50,
+    max_tokens=32768,
+    repetition_penalty=1,  # 1 disables the repetition penalty
+)
+query = "Write a 10000-word China travel guide"
+prompt = f"[INST]{query}[/INST]"
+input_ids = tokenizer(prompt, truncation=False, return_tensors="pt").input_ids[0].tolist()
+outputs = model.generate(
+    sampling_params=generation_params,
+    prompt_token_ids=[input_ids],
+)
+output = outputs[0]
+print(output.outputs[0].text)
+```
 
 License: [Llama-3.1 License](https://huggingface.co/meta-llama/Meta-Llama-3.1-8B/blob/main/LICENSE)
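
Note: the diff elides most of the transformers demo. For reference, a minimal sketch of how the visible tail (`context_length`, `tokenizer.decode`) plausibly fits together, assuming standard `AutoModelForCausalLM` loading; the generation parameters are illustrative, not the README's actual values:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Assumed loading code; only the generate/decode tail appears in the diff.
tokenizer = AutoTokenizer.from_pretrained("THUDM/LongWriter-llama3.1-8b", trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    "THUDM/LongWriter-llama3.1-8b",
    torch_dtype=torch.bfloat16,
    trust_remote_code=True,
    device_map="auto",
).eval()

query = "Write a 10000-word China travel guide"
prompt = f"[INST]{query}[/INST]"  # prompt template from the README, no system prompt
inputs = tokenizer(prompt, truncation=False, return_tensors="pt").to(model.device)
context_length = inputs.input_ids.shape[-1]

output = model.generate(
    **inputs,
    max_new_tokens=32768,
    do_sample=True,
    temperature=0.5,
)[0]
# Strip the prompt tokens, keep only the newly generated text
response = tokenizer.decode(output[context_length:], skip_special_tokens=True)
print(response)
```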