Run omnivision on an NVIDIA Jetson Orin
I'm trying to run omnivision with the Nexa SDK through a Python script, but it seems like llama-cpp is not supported. Is there a way to run omnivision locally on an NVIDIA Jetson Orin? Also, I can see that the model is loaded on every call and unloaded from the GPU after the request completes, which takes time. Is there a way to keep the model in GPU memory so I can get better inference times? The reuse pattern I have in mind is sketched after the code below.
from nexa.gguf import NexaVLMInference

model_path = "omnivision"
inference = NexaVLMInference(
    model_path=model_path,
    local_path=None,
    stop_words=[],
    temperature=0.7,
    max_new_tokens=2048,
    top_k=50,
    top_p=1.0,
    profiling=True,
)
inference._chat(user_input="Describe this image in detail.", image_path="path/to/local/image")
This is the code I'm using.
Thanks! Omnivision belongs to the NexaOmniVlmInference class, and its Python interface is still in development.