VRAM usage
Can you guys tell me the VRAM usage of this model? I have a 3080 Ti laptop with 8 GB.
Thanks
8-9 GB of VRAM is required.
I see 8.7-8.9 GB used on my 16 GB laptop 3080 with the model loaded in oobabooga. It goes up to 12.2 GB when it's actually generating text.
8 GB cards can load it only with 50% of the layers offloaded to the CPU.
@Yuuru How can I try this?
I'm curious whether a 12 GB 3060 is able to run this model or not.
@cyx123 I have a 12 GB 3060 and it has no problem running any 13B model; they fluctuate between 9 and 11 GB of VRAM usage.
I got RuntimeError: CUDA error: out of memory with an NVIDIA 3070 8 GB.
I got it working with 8 GB of VRAM, but it is slow:
Output generated in 19.02 seconds (0.63 tokens/s, 12 tokens, context 132)
Output generated in 47.36 seconds (1.03 tokens/s, 49 tokens, context 230)
Output generated in 28.04 seconds (0.96 tokens/s, 27 tokens, context 363)
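As a quick sanity check on those numbers, here is a minimal Python sketch that parses the three log lines above and recomputes tokens per second from the token counts and timings. The helper names (`parse_line`, `summarize`) are made up for illustration and are not part of the web UI; the regular expression only assumes the log format shown in this post.

```python
import re

# The three log lines quoted above, copied verbatim.
LOG_LINES = [
    "Output generated in 19.02 seconds (0.63 tokens/s, 12 tokens, context 132)",
    "Output generated in 47.36 seconds (1.03 tokens/s, 49 tokens, context 230)",
    "Output generated in 28.04 seconds (0.96 tokens/s, 27 tokens, context 363)",
]

# Assumes exactly the format seen in this thread.
PATTERN = re.compile(
    r"Output generated in ([\d.]+) seconds "
    r"\(([\d.]+) tokens/s, (\d+) tokens, context (\d+)\)"
)


def parse_line(line):
    """Turn one log line into a dict of numbers (hypothetical helper)."""
    match = PATTERN.search(line)
    if match is None:
        raise ValueError(f"unrecognized log line: {line!r}")
    seconds, reported_rate, tokens, context = match.groups()
    return {
        "seconds": float(seconds),
        "reported_rate": float(reported_rate),
        "tokens": int(tokens),
        "context": int(context),
    }


def summarize(lines):
    """Return the parsed rows and the overall tokens/s across all runs."""
    rows = [parse_line(line) for line in lines]
    total_tokens = sum(row["tokens"] for row in rows)
    total_seconds = sum(row["seconds"] for row in rows)
    return rows, total_tokens / total_seconds


rows, overall_rate = summarize(LOG_LINES)
for row in rows:
    recomputed = row["tokens"] / row["seconds"]
    print(f"context {row['context']}: {recomputed:.2f} tokens/s "
          f"(log says {row['reported_rate']:.2f})")
print(f"overall: {overall_rate:.2f} tokens/s")
```

The recomputed per-run rates match the logged ones, and the three runs together average a bit under 1 token/s.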
From https://huggingface.co/anon8231489123/vicuna-13b-GPTQ-4bit-128g/discussions/14
Edit start-webui.bat and replace all the text with:
@echo off
@echo Starting the web UI...
cd /D "%~dp0"
set MAMBA_ROOT_PREFIX=%cd%\installer_files\mamba
set INSTALL_ENV_DIR=%cd%\installer_files\env
if not exist "%MAMBA_ROOT_PREFIX%\condabin\micromamba.bat" (
call "%MAMBA_ROOT_PREFIX%\micromamba.exe" shell hook >nul 2>&1
)
call "%MAMBA_ROOT_PREFIX%\condabin\micromamba.bat" activate "%INSTALL_ENV_DIR%" || ( echo MicroMamba hook not found. && goto end )
cd text-generation-webui
call python server.py --auto-devices --chat --threads 8 --wbits 4 --groupsize 128 --pre_layer 30
:end
pause
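To get a feel for what `--wbits 4 --groupsize 128 --pre_layer 30` means for memory, here is a rough back-of-the-envelope sketch in Python. It is only an estimate under loudly stated assumptions: a nominal 13e9 parameters, 40 transformer layers for a 13B LLaMA-style model, weights spread evenly across layers, and one 16-bit scale plus one 4-bit zero point stored per group of 128 weights. It counts weights only; the CUDA context, activations, and the context cache come on top, and some tensors may stay in 16-bit. The function names are hypothetical.

```python
def quantized_weight_bytes(n_params, wbits, groupsize, scale_bits=16):
    """Rough size in bytes of group-quantized weights.

    Assumption: every group of `groupsize` weights shares one scale
    (`scale_bits` bits) and one zero point (`wbits` bits).
    """
    overhead_bits_per_weight = (scale_bits + wbits) / groupsize
    return n_params * (wbits + overhead_bits_per_weight) / 8


def gpu_share_bytes(total_bytes, pre_layer, n_layers):
    """Bytes kept on the GPU if `pre_layer` of `n_layers` layers stay there.

    Assumption: weights are spread evenly across layers.
    """
    layers_on_gpu = min(pre_layer, n_layers)
    return total_bytes * layers_on_gpu / n_layers


N_PARAMS = 13e9  # nominal parameter count (assumption)
N_LAYERS = 40    # layer count assumed for a 13B LLaMA-style model

total = quantized_weight_bytes(N_PARAMS, wbits=4, groupsize=128)
on_gpu = gpu_share_bytes(total, pre_layer=30, n_layers=N_LAYERS)

print(f"all weights:         {total / 1e9:.2f} GB")
print(f"on GPU (30 layers):  {on_gpu / 1e9:.2f} GB")
print(f"on CPU (10 layers):  {(total - on_gpu) / 1e9:.2f} GB")
```

Under these assumptions the weights alone come to roughly 6.75 GB, of which about 5.07 GB stay on the GPU with `--pre_layer 30`. Lowering `--pre_layer` moves more layers to the CPU, which frees VRAM at the cost of speed.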
Yeah, I'm also getting 1 token per second when splitting on 8 GB of VRAM. The performance is bad; I was able to achieve the same using a ggml model + llama.cpp with DRAM.
Is it possible to somehow run it on 6 GB of VRAM? I have a laptop with an RTX 3060.
So far I'm getting a CUDA out of memory message.