Request for exl2 quants.
Oobabooga recently released a feature allowing a 4-bit cache. I tested it and found that I can potentially squeeze somewhere between 4.5-4.6bpw into a 15 GB T4 GPU. I would like to request these quants if it's okay with you.
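Rough math behind that estimate (weights only, so it's a sketch rather than an exact figure; the KV cache and runtime overhead come on top):

```python
# Back-of-the-envelope weight-size estimate for a 20B model at a given bpw.
# Ignores the KV cache, activations and CUDA overhead, so real usage is higher.
params = 20e9
for bpw in (4.5, 4.55, 4.6):
    gib = params * bpw / 8 / 1024**3
    print(f"{bpw} bpw -> ~{gib:.1f} GiB of weights")
# ~10.5-10.7 GiB of weights, which leaves a few GiB of the T4's 15 GiB
# for the (4-bit) KV cache and overhead.
```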
No problem, I'll upload them tomorrow, they are already baking :) (4.5bpw h6 and 4.6bpw h6). What kind of notebook are you using for Colab, something like this one? https://github.com/DocShotgun/LLM-notebooks/blob/main/exllama-fast-inference.ipynb
Nah, I'm using the official oobabooga Colab; I just changed one line of code from cloning main to cloning the dev branch. It's quite amazing that I managed to run such a high quant on the free tier. Once flash attention gets supported on T4 GPUs I could go even higher. I once somehow managed to run a 4.65bpw of U-Amethyst 20B at 6k context using the 4-bit cache, though that was a fluke. When I tried other similar quants it didn't work out at 6k context, since the starting VRAM after a full load was 14.2 GB, and I can only afford a starting VRAM of 14.0 GB at best: once I start using the loaded model, roughly 0.7-1 GB of VRAM gets added and then freed.
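For reference, the change is just the clone line in the install cell (the exact cell contents vary by notebook version, so treat this as a sketch):

```python
# Colab install cell, with the branch switched from main to dev so the
# new 4-bit cache option is available (the real cell may pass more args):
!git clone -b dev https://github.com/oobabooga/text-generation-webui
%cd text-generation-webui
```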
Quants are ready: https://huggingface.co./collections/TeeZee/15-gb-colab-65ea6562a5ea41e870995dfc. I know that Kaggle gives 2xT4 for free, but I haven't tested it yet (for finetuning it's hit or miss with their PyTorch versions, unfortunately). Thanks for the tip about the ooba notebook, I'll try it.
Hmm, not sure about Kaggle. Their phone verification system for activating the free tier is garbage, and no one has made a Kaggle oobabooga notebook as far as I can tell. Also, I tested a bit and 4.6bpw at 6k does not fit. So far 4.5bpw at 6k works, and I estimate the optimum would be 4.55bpw. It's okay if you don't want to make that quant. At 4.5bpw I reached 14.5/15 GB VRAM usage.
I'll add 4.55 tomorrow. Is the model stable at 6k context? The models used for the merge rarely specify the context length in their READMEs, but they are all llama2, so only the default 4k context should be available.
Well, when I used the model at 6k context it was fine so far. Exl2 quants in general default to 4k context when loaded without a specified context value, unless the model was finetuned for higher context. For those defaults, I used the alpha_value flag to extend context: it increases the model's usable context length at the cost of a small quality loss, which is useful for going from 4k to at least 6-8k. I used the wiki's recommended alpha values for 6-8k context. The higher the alpha_value, the more quality loss. I tried a 13b 6bpw at a RoPE-scaled 16k context with an alpha_value of 4 and the gens were garbage. You can check the wiki for more info.
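For reference, this is roughly what those settings map to if you drive exllamav2 from Python directly. Just a sketch; the attribute names (scale_alpha_value, ExLlamaV2Cache_Q4) may differ between exllamav2 versions, and the alpha value shown is only a commonly quoted starting point:

```python
from exllamav2 import ExLlamaV2, ExLlamaV2Config, ExLlamaV2Cache_Q4, ExLlamaV2Tokenizer
from exllamav2.generator import ExLlamaV2BaseGenerator, ExLlamaV2Sampler

config = ExLlamaV2Config()
config.model_dir = "DarkForest-20B-v2.0-bpw4.5-h6-exl2"  # local path to the quant
config.prepare()
config.max_seq_len = 6144          # 1.5x the llama2 default of 4096
config.scale_alpha_value = 1.75    # NTK rope alpha; higher = more context,
                                   # more quality loss (see the ooba wiki)

model = ExLlamaV2(config)
cache = ExLlamaV2Cache_Q4(model, lazy=True)  # the new 4-bit KV cache
model.load_autosplit(cache)

tokenizer = ExLlamaV2Tokenizer(config)
generator = ExLlamaV2BaseGenerator(model, cache, tokenizer)
print(generator.generate_simple("Once upon a time", ExLlamaV2Sampler.Settings(), 64))
```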
OK, that's good news: the model works and there is no perceivable quality loss. I haven't played with bigger alpha_values yet, so I'll definitely test against that too in the next iterations of my models. The 4.55 quant is up, let me know if it's 'the sweet spot' for Colab.
Just tested it and yup, it's the sweet spot. Thanks for the quant.
Any chance for 3bpw and/or 3.5bpw quants as well? 4bpw works great with a 3060 12GB and the new Q4 cache, but only ~2k context fits comfortably. Checking another model, 3.5bpw seems to fit with 4k context, so I was hoping to get that at least.
Oh yeah, just letting you know that the Q4 cache finally made it into main, so no need to use the dev branch anymore.
@Annuvin, here you go: https://huggingface.co./TeeZee/DarkForest-20B-v2.0-bpw3.0-h6-exl2 and https://huggingface.co./TeeZee/DarkForest-20B-v2.0-bpw3.5-h6-exl2. Please let me know which one works best on a 3060 12GB.
Much obliged, I didn't even realize there was a 3-bit quant already. Using TabbyAPI (exllamav2 compiled from source, Q4 cache) as the backend and SillyTavern as the frontend, and accounting for browser and Windows processes, I can fit 8192 context with 3.0bpw (barely, but it works), 4096 with 3.5bpw, and 2048 with 4.0bpw. I'd say 3.5 is probably the sweet spot, leaving around 800 MB of VRAM to spare, but it depends on the use case.
Hello, I would like to request another exl2 quant if you have the time: a 4.25bpw quant. I'd like to try 8k context, and these are the VRAM estimates I'm extrapolating from (rough math below the list):
4bpw: 13 GB
4.125bpw: 13.5 GB
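Rough linear extrapolation from those two points (just a sketch; the real number depends on context length, cache mode, and overhead):

```python
# Two estimates above: 4.0 bpw -> 13.0 GB, 4.125 bpw -> 13.5 GB at 8k context.
slope = (13.5 - 13.0) / (4.125 - 4.0)      # ~4 GB per 1.0 bpw
est_4_25 = 13.5 + slope * (4.25 - 4.125)   # ~14.0 GB
print(f"4.25 bpw -> ~{est_4_25:.1f} GB")   # right at the ~14 GB ceiling I mentioned
```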
Sure, the quant is up: https://huggingface.co./TeeZee/DarkForest-20B-v2.0-bpw4.25-h6-exl2. Let me know if it works with 8k context on Colab.