Do you have the llava-34b.gguf file?

#4 opened by AiCreatornator

Your folder has mmproj-llava-34b-f16-q6_k.gguf but not the main file, llava-34b.gguf. Do you have it, and could you upload it?

Owner

I'm currently uploading a Q3_K quantized version of the 34B, but I don't have that great an upload rate, so getting the other 34B quants up might take a while.
Alternatively, you can use my PR (see the readme) together with the official llava-1.6 release model to convert the model yourself (a rough sketch of the commands follows these steps):

  1. llava-surgery-v2.py (with the -C flag)
  2. convert.py
  3. quantize the model as you'd quantize any LLM
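
A minimal sketch of what those three steps could look like, assuming a llama.cpp checkout with the llava-1.6 PR applied and the official llava-v1.6-34b release downloaded into ./llava-v1.6-34b. All paths, script locations, output file names and the quant type are placeholders; check the PR readme for the exact flags of your version:

    # 1. Split the vision encoder/projector out of the original checkpoint
    #    (-C cleans the vision tower; -m points at the downloaded model directory)
    python ./examples/llava/llava-surgery-v2.py -C -m ./llava-v1.6-34b

    # 2. Convert the remaining language model to GGUF
    #    (--skip-unknown ignores the leftover multimodal tensors;
    #     the default output name/location may differ in your version)
    python ./convert.py ./llava-v1.6-34b --skip-unknown

    # 3. Quantize the resulting f16 GGUF like any other LLM
    ./quantize ./llava-v1.6-34b/ggml-model-f16.gguf ./llava-34b-q3_k.gguf Q3_K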

Thank you very much for your work!

Off-topic stupid question: the "-i, --interactive  run in interactive mode" option does not seem to work with ./llava-cli. Is there another way to chat with the model? Right now the program just ends and unloads the models after the first question and answer.

Owner

Just making sure: look at the number of tokens that were used.
If the prompt processing is below 1200 tokens, you are not using the new code but the old code (which gives poor quality with llava-1.6).
When using the correct code with the correct mmproj file, you'll see 1200-3000 tokens used for the prompt.

-i is not supported by llava-cli, but the server app and the libraries of course support it; I don't know about your wrapper library.
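
For an interactive chat, one option is the llama.cpp server. A rough sketch, assuming your build of the server supports --mmproj and using placeholder file names based on the quants mentioned above (flag names and the web UI/API details can differ between versions):

    # start the HTTP server with the multimodal projector loaded and all layers offloaded
    ./server -m ./llava-34b-q3_k.gguf \
        --mmproj ./mmproj-llava-34b-f16-q6_k.gguf \
        -ngl 99 -c 4096 --port 8080
    # then chat via the built-in web UI at http://localhost:8080,
    # or POST prompts (with image_data) to the /completion endpoint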

Not sure if you mean these numbers:

encode_image_with_clip: image embedding created: 2884 tokens
llama_print_timings: total time = 370632.47 ms / 1729 tokens

Oh, I thought it supported -i, because "./llava-cli --help" prints:
usage: ./llava-cli [options]

options:
-h, --help show this help message and exit
--version show version and build info
-i, --interactive run in interactive mode

Owner

Yes, looks great except for the slow processing speed.
If you have accelerating hardware (a GPU), make sure to compile with offload support and offload the model.
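
For an NVIDIA GPU, the build could look roughly like this; the build flag shown was the one used around that time and has since changed names, so treat it as a sketch and check the current README:

    # build with CUDA offload support
    make clean && make LLAMA_CUBLAS=1 -j
    # then add -ngl 99 to your llava-cli command to offload all layers to the GPU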

Example on a 4090:

encode_image_with_clip: 5 segments encoded in 159.16 ms
  (the image was split into 5 features; llava-1.6 will always use 3-5)
encode_image_with_clip: image embedding created: 2884 tokens
  (the total number of feature tokens generated)

And finally:

llama_print_timings: prompt eval time = 1179.70 ms / 2930 tokens (0.40 ms per token, 2483.67 tokens per second)
  (2930 tokens in total were evaluated for the prompt: 2884 image tokens + text)

encode_image_with_clip: 5 segments encoded in 417.37 ms

My horrible "370632.47 ms / 1729 tokens" might be because I use WSL2 and load the weights from a different drive.

Owner

Make sure you have full offloading enabled (in llama.cpp that would be the case for all of the models when using "-ngl 99").
When using WSL with a non-Linux file system you need to manually disable mmap; when using a native file system it should be at least as fast as on native Windows.
Aside from the loading time, there should not be a performance hit when using full offloading: once the weights are loaded into the GPU VRAM (or RAM), the disk is not accessed anymore.
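
For the WSL2 case above, a hedged example of what that could look like (the model and image paths are placeholders; --no-mmap forces a full read of the weights instead of memory-mapping them from the Windows drive):

    ./llava-cli -m /mnt/d/models/llava-34b-q3_k.gguf \
        --mmproj /mnt/d/models/mmproj-llava-34b-f16-q6_k.gguf \
        --image ./test.jpg -p "Describe this image." \
        -ngl 99 --no-mmap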

cmp-nct changed discussion status to closed
