Commit a7e67f6 (verified) · danielsteinigen · 1 parent: 0aaf3bb

add sample for usage with vLLM to Readme

Files changed (1):
  1. README.md +41 -0
README.md CHANGED
@@ -131,6 +131,47 @@ print(prediction_text)
 
 This example demonstrates how to load the model and tokenizer, prepare input, generate text, and print the result.
 
+ ### Usage with vLLM Server
+ Starting the vLLM Server:
+ ``` shell
+ vllm serve openGPT-X/Teuken-7B-instruct-research-v0.4 --trust-remote-code
+ ```
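If the server needs a quick check before sending chat requests, a minimal sketch (assuming the default port 8000 and vLLM's OpenAI-compatible `/v1/models` route) that lists the served models:
``` python
# Minimal smoke test: list the models exposed by the vLLM server started above.
# Assumes the server is reachable on the default port 8000.
from openai import OpenAI

client = OpenAI(api_key="EMPTY", base_url="http://localhost:8000/v1")
print([model.id for model in client.models.list()])
```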
+ Use the Chat API with vLLM and pass the language of the chat template as extra body:
+ ``` python
+ from openai import OpenAI
+
+ client = OpenAI(
+     api_key="EMPTY",
+     base_url="http://localhost:8000/v1",
+ )
+ completion = client.chat.completions.create(
+     model="openGPT-X/Teuken-7B-instruct-research-v0.4",
+     messages=[{"role": "User", "content": "Hallo"}],
+     extra_body={"chat_template": "DE"}
+ )
+ print(f"Assistant: {completion.choices[0].message.content}")
+ ```
+ The default language of the chat template can also be set when starting the vLLM server. To do this, create a new file named `lang` with the content `DE` and start the vLLM server as follows:
+ ``` shell
+ vllm serve openGPT-X/Teuken-7B-instruct-research-v0.4 --trust-remote-code --chat-template lang
+ ```
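With the default set this way, the chat request no longer needs the `extra_body` parameter; a minimal sketch of the same call relying on the server-side default template:
``` python
# Assumes the server was started with `--chat-template lang` as shown above,
# so the German chat template is applied by default and no extra_body is needed.
from openai import OpenAI

client = OpenAI(api_key="EMPTY", base_url="http://localhost:8000/v1")
completion = client.chat.completions.create(
    model="openGPT-X/Teuken-7B-instruct-research-v0.4",
    messages=[{"role": "User", "content": "Hallo"}],
)
print(f"Assistant: {completion.choices[0].message.content}")
```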
+
+ ### Usage with vLLM Offline Batched Inference
+ ``` python
+ from vllm import LLM, SamplingParams
+
+ sampling_params = SamplingParams(temperature=0.01, max_tokens=1024, stop=["</s>"])
+ llm = LLM(model="openGPT-X/Teuken-7B-instruct-research-v0.4", trust_remote_code=True, dtype="bfloat16")
+ outputs = llm.chat(
+     messages=[{"role": "User", "content": "Hallo"}],
+     sampling_params=sampling_params,
+     chat_template="DE"
+ )
+ print(f"Prompt: {outputs[0].prompt}")
+ print(f"Assistant: {outputs[0].outputs[0].text}")
+ ```
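The example above runs a single conversation; as a minimal sketch, assuming a vLLM version whose `LLM.chat` accepts a list of conversations, several prompts can be batched in one call:
``` python
# Assumes a vLLM release in which LLM.chat accepts a list of conversations,
# so all prompts are scheduled together in one batched generation call.
from vllm import LLM, SamplingParams

sampling_params = SamplingParams(temperature=0.01, max_tokens=1024, stop=["</s>"])
llm = LLM(model="openGPT-X/Teuken-7B-instruct-research-v0.4", trust_remote_code=True, dtype="bfloat16")

conversations = [
    [{"role": "User", "content": "Hallo"}],
    [{"role": "User", "content": "Was ist die Hauptstadt von Deutschland?"}],
]
outputs = llm.chat(
    messages=conversations,
    sampling_params=sampling_params,
    chat_template="DE",
)
for output in outputs:
    print(f"Prompt: {output.prompt}")
    print(f"Assistant: {output.outputs[0].text}")
```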
+
+
 ## Training Details
 
 ### Pre-Training Data