Will this work with vLLM?
#10, opened by nickandbro
I see this is based on Idefics3, so technically it should?
I'm trying to use this with vLLM chat completions, but it just keeps generating until the context window is full.
vllm serve HuggingFaceTB/SmolVLM-Instruct
curl -X 'POST' \
'http://localhost:8000/v1/chat/completions' \
-H 'Content-Type: application/json' \
-d '{
"model": "HuggingFaceTB/SmolVLM-Instruct",
"messages": [{"role": "user", "content": "Say this is a test!"}]
}'
And it just generates 16k tokens before stopping. It seems to work better with the standard /completions endpoint (it actually ends correctly), but I would prefer chat-style messages.
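Independently of the root cause, a request can be kept from running to the full context window by capping the response length with `max_tokens`, a standard field of the chat completions request body. A minimal sketch, assuming the same local server as in the curl call above; `build_chat_payload` and `send_chat` are hypothetical helper names, and the cap of 64 tokens is arbitrary. It only bounds the symptom and does not explain why generation fails to stop.

```python
import json
import urllib.request


def build_chat_payload(prompt, model="HuggingFaceTB/SmolVLM-Instruct", max_tokens=64):
    """Build a chat completions request body with a hard cap on generated tokens."""
    return {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        # Assumption: 64 is an arbitrary illustrative cap, not a recommended value.
        "max_tokens": max_tokens,
    }


def send_chat(payload, base_url="http://localhost:8000/v1"):
    """POST the payload to the chat completions endpoint and return the parsed JSON."""
    request = urllib.request.Request(
        f"{base_url}/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request) as response:
        return json.loads(response.read().decode("utf-8"))
```

Calling `send_chat(build_chat_payload("Say this is a test!"))` against a running server returns the usual response object; a `finish_reason` of `"length"` in `choices[0]` means the cap was hit, while `"stop"` means the model ended the turn on its own.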
Looking into it
Should be fixed. There was an issue with the chat_template: vLLM was parsing the wrong one.
Hey, I am trying to do something similar, but the caption generated through the vLLM OpenAI-compatible server is very short compared to the one I get when using SmolVLM directly.
import argparse
import base64

from openai import OpenAI


def encode_image(image_path):
    # Read a local image file and return its contents as a base64 string.
    # Note: this helper is defined but not called below; the script sends a URL.
    with open(image_path, "rb") as image_file:
        return base64.b64encode(image_file.read()).decode('utf-8')


def main():
    parser = argparse.ArgumentParser(description='Image captioning using OpenAI API')
    parser.add_argument('--image_url', help='URL of the image to caption')
    parser.add_argument('--api-key', default="abc", help='OpenAI API key')
    parser.add_argument('--api-base', default="http://localhost:8000/v1", help='OpenAI API base URL')
    args = parser.parse_args()

    # Point the OpenAI client at the local vLLM server.
    client = OpenAI(
        api_key=args.api_key,
        base_url=args.api_base,
    )

    detailed_prompt = """You are an expert image captioner. Please provide a detailed description of the image that covers:
1. Main subjects and their characteristics
2. Background elements and setting
3. Colors, lighting, and atmosphere
4. Notable details or unique features
5. Spatial relationships between elements
6. Overall mood or impression
Your description should be at least 100 words long and cover all these aspects comprehensively."""

    try:
        # Send the text prompt and the image URL in a single user message.
        chat_response = client.chat.completions.create(
            model="HuggingFaceTB/SmolVLM-Instruct",
            temperature=0.4,
            top_p=0.8,
            frequency_penalty=0.2,
            messages=[{
                "role": "user",
                "content": [
                    {"type": "text", "text": detailed_prompt},
                    {"type": "image_url", "image_url": {"url": args.image_url}},
                ],
            }],
        )
        print("Caption:", chat_response.choices[0].message.content)
    except Exception as e:
        print(f"Error type: {type(e)}")
        print(f"Error message: {str(e)}")


if __name__ == "__main__":
    main()
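The script defines a base64 helper but only ever sends a remote URL. For a local file, one common approach is to wrap the base64 payload in a data URL and pass that in the `image_url` field. A minimal sketch, assuming the server accepts `data:` URLs in `image_url` the way the OpenAI API does; `image_to_data_url` and `image_content_part` are hypothetical helper names, not part of the script above.

```python
import base64
import mimetypes


def image_to_data_url(image_path):
    """Return a data URL (data:<mime>;base64,<payload>) for a local image file."""
    mime_type, _ = mimetypes.guess_type(image_path)
    if mime_type is None:
        # Assumption: fall back to a generic type when the extension is unknown.
        mime_type = "application/octet-stream"
    with open(image_path, "rb") as image_file:
        encoded = base64.b64encode(image_file.read()).decode("utf-8")
    return f"data:{mime_type};base64,{encoded}"


def image_content_part(image_path):
    """Build the image entry of a chat message's content list from a local file."""
    return {"type": "image_url", "image_url": {"url": image_to_data_url(image_path)}}
```

The dictionary returned by `image_content_part("elf.png")` (a placeholder file name) can replace the `{"type": "image_url", ...}` entry in the `messages` list, leaving the rest of the request unchanged.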
For this image:
Caption generated via vLLM:
An elf woman holding a bow and arrow and an arrow nocked as she stands in a forest.
Caption when using the SmolVLM code directly:
The background of the image is a dense forest, with green trees and plants. The trees are tall and have thick trunks, with branches extending upwards and outwards. The leaves are lush and green, indicating that it is either early spring or late summer. The sunlight is filtering through the trees, creating a dappled pattern on the ground.
The elf is standing in a forest clearing, surrounded by trees. The ground is covered with a layer of leaves and moss, indicating that it is a well-maintained area. The forest is dense, with a variety of plants and trees, suggesting that it is a natural, untouched forest.
The elf is holding a bow and arrow, which are both made of wood. The bow is long and slender, with a curved shape. The arrow is made of wood and has a pointed tip. The elf is aiming at a target, which is not visible in the image.
The elf is standing in a relaxed pose, with her feet shoulder-width apart. Her posture is confident and poised, suggesting that she is ready for action. Her expression is serious and focused, indicating that she is concentrating on her task.
The image has a serene and natural atmosphere, with the forest providing a backdrop that is both beautiful and mysterious. The lighting is soft and diffused, creating a sense of calm and tranquility. The overall impression is one of peace and harmony, with the elf's presence adding a touch of magic and wonder to the scene.
In summary, the image depicts a female elf in a forest setting, dressed in a green dress and holding a bow and arrow. The elf is standing in a forest clearing, surrounded by trees and plants.