bys0318 committed on
Commit b523198
1 Parent(s): f61d034

Update README.md

Files changed (1):
  1. README.md +31 -3
README.md CHANGED
@@ -19,6 +19,9 @@ license: llama3.1
 
 LongWriter-llama3.1-8b is trained based on [Meta-Llama-3.1-8B](https://huggingface.co/meta-llama/Meta-Llama-3.1-8B), and is capable of generating 10,000+ words at once.
 
+Environment: `transformers>=4.43.0`
+
+Please adhere to the prompt template (system prompt is optional): `<<SYS>>\n{system prompt}\n<</SYS>>\n\n[INST]{query1}[/INST]{response1}[INST]{query2}[/INST]{response2}...`
 
 A simple demo for deployment of the model:
 ```python
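
For illustration, here is a minimal sketch of assembling a multi-turn prompt from the template added above; the `build_prompt` helper and the example history are assumptions, not part of the repository:

```python
# Hypothetical helper: flattens a chat history into the LongWriter template
# `<<SYS>>\n{system}\n<</SYS>>\n\n[INST]{query}[/INST]{response}...`
def build_prompt(history, query, system=None):
    prompt = f"<<SYS>>\n{system}\n<</SYS>>\n\n" if system else ""
    for q, r in history:  # history: list of (query, response) pairs
        prompt += f"[INST]{q}[/INST]{r}"
    prompt += f"[INST]{query}[/INST]"  # current turn, left open for the model
    return prompt

print(build_prompt([("Hi!", "Hello! How can I help you?")], "Write a 10000-word China travel guide"))
```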
@@ -41,9 +44,34 @@ output = model.generate(
 response = tokenizer.decode(output[context_length:], skip_special_tokens=True)
 print(response)
 ```
-Please ahere to the prompt template (system prompt is optional): `<<SYS>>\n{system prompt}\n<</SYS>>\n\n[INST]{query1}[/INST]{response1}[INST]{query2}[/INST]{response2}...`
-
-Environment: `transformers==4.43.0`
+You can also deploy the model with [vllm](https://github.com/vllm-project/vllm), which can generate 10,000+ words within a minute. Here is an example:
+```python
+from vllm import LLM, SamplingParams
+
+model = LLM(
+    model="THUDM/LongWriter-llama3.1-8b",
+    dtype="auto",
+    trust_remote_code=True,
+    tensor_parallel_size=1,
+    max_model_len=32768,  # context window large enough for 10,000+ word outputs
+    gpu_memory_utilization=0.5,
+)
+tokenizer = model.get_tokenizer()
+generation_params = SamplingParams(
+    temperature=0.5,
+    top_p=0.8,
+    top_k=50,
+    max_tokens=32768,
+    repetition_penalty=1,  # 1 disables the repetition penalty
+)
+query = "Write a 10000-word China travel guide"
+prompt = f"[INST]{query}[/INST]"
+input_ids = tokenizer(prompt, truncation=False, return_tensors="pt").input_ids[0].tolist()
+outputs = model.generate(
+    sampling_params=generation_params,
+    prompt_token_ids=[input_ids],
+)
+output = outputs[0]
+print(output.outputs[0].text)
+```
 
 License: [Llama-3.1 License](https://huggingface.co/meta-llama/Meta-Llama-3.1-8B/blob/main/LICENSE)
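
Note: the diff elides most of the transformers demo. For reference, a minimal sketch of how the visible tail (`context_length`, `tokenizer.decode`) plausibly fits together, assuming standard `AutoModelForCausalLM` loading; the generation parameters are illustrative, not the README's actual values:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Assumed loading code; only the generate/decode tail appears in the diff.
tokenizer = AutoTokenizer.from_pretrained("THUDM/LongWriter-llama3.1-8b", trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    "THUDM/LongWriter-llama3.1-8b",
    torch_dtype=torch.bfloat16,
    trust_remote_code=True,
    device_map="auto",
).eval()

query = "Write a 10000-word China travel guide"
prompt = f"[INST]{query}[/INST]"  # prompt template from the README, no system prompt
inputs = tokenizer(prompt, truncation=False, return_tensors="pt").to(model.device)
context_length = inputs.input_ids.shape[-1]

output = model.generate(
    **inputs,
    max_new_tokens=32768,
    do_sample=True,
    temperature=0.5,
)[0]
# Strip the prompt tokens, keep only the newly generated text
response = tokenizer.decode(output[context_length:], skip_special_tokens=True)
print(response)
```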