Usage

import requests
from PIL import Image

import torch
from transformers import AutoProcessor, LlavaForConditionalGeneration

model_id = "nota-ai/phiva-4b-hf"

prompt = "USER: <image>\nWhat are these?\nASSISTANT:"
image_file = "http://images.cocodataset.org/val2017/000000039769.jpg"

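# Load the weights in half precision and move the model to GPU 0
# (assumes a CUDA device is available).
model = LlavaForConditionalGeneration.from_pretrained(
    model_id,
    torch_dtype=torch.float16,
    low_cpu_mem_usage=True,
    attn_implementation="eager",
).to(0)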

processor = AutoProcessor.from_pretrained(model_id)


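# Download the example image and preprocess it together with the prompt.
raw_image = Image.open(requests.get(image_file, stream=True).raw)
# Keyword arguments avoid depending on the processor's positional order.
inputs = processor(text=prompt, images=raw_image, return_tensors='pt').to(0, torch.float16)

# Greedy decoding; print only the newly generated tokens, not the prompt.
output = model.generate(**inputs, max_new_tokens=200, do_sample=False)
print(processor.decode(output[0][inputs['input_ids'].shape[-1]:], skip_special_tokens=True))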
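
The prompt in the snippet above follows a single-turn "USER: <image>\n...\nASSISTANT:" layout. As a minimal sketch, assuming that same layout applies to other questions about one image, a small helper can build the string. The names build_prompt and IMAGE_TOKEN are hypothetical and are not part of transformers or this repository.

```python
# Hypothetical helper, not part of transformers or this repository.
# It reproduces the single-turn prompt layout used in the usage snippet.

IMAGE_TOKEN = "<image>"  # image placeholder copied from the prompt above


def build_prompt(question: str) -> str:
    """Format one question about one image in the USER/ASSISTANT layout."""
    question = question.strip()
    if not question:
        raise ValueError("question must not be empty")
    return f"USER: {IMAGE_TOKEN}\n{question}\nASSISTANT:"


if __name__ == "__main__":
    # Prints the same string as the `prompt` variable in the usage snippet.
    print(build_prompt("What are these?"))
```

The returned string can be passed to the processor in place of the hard-coded prompt. Multi-turn or multi-image prompts are not covered by this sketch.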

Terms of use

The vision-language model published in this repository was developed by combining several modules (e.g., vision encoder, language model). Commercial use of any modifications, additions, or newly trained parameters made to combine these modules is not allowed. However, commercial use of the unmodified modules is allowed under their respective licenses. If you wish to use the individual modules commercially, you may refer to their original repositories and licenses provided below.

Vision encoder (license) link: Model, License

Language model (license) link: Model, License

VLM framework (license) link: GitHub, License

Model size: 3.92B params (Safetensors, tensor type FP16)