Could this be further quantized to a GGUF file?

Opened by AiCreatornator

Thanks for this!

Could this be further quantized to a GGUF file? Have you tested it?

tldr: llama.cpp doesn't support it yet.

I'm just a random user, and I have no idea how LLMs work internally.
I downloaded another, much smaller repo of theirs with the suffix bnb-8bit-smashed (not bnb-4bit-smashed like this one), ran llama.cpp's convert.py on it, and got this:

  File "/home/arzeth/llama.cpp-cuda/./convert.py", line 940, in convert
    data_type = SAFETENSORS_DATA_TYPES[info['dtype']]
                ~~~~~~~~~~~~~~~~~~~~~~^^^^^^^^^^^^^^^
KeyError: 'I8'
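
That KeyError just means convert.py's SAFETENSORS_DATA_TYPES table has no entry for the 'I8' dtype that the bnb-8bit checkpoint stores. For reference, here's a rough way to check which dtypes a safetensors shard actually contains by reading its header directly (the file name is only a placeholder):

```python
# Rough sketch (not part of convert.py): read the safetensors header to see
# which dtypes a shard really stores. The first 8 bytes are a little-endian
# uint64 with the header length, followed by that many bytes of JSON.
import json
import struct
from collections import Counter

path = "model.safetensors"  # placeholder: whatever shard you downloaded

with open(path, "rb") as f:
    header_len = struct.unpack("<Q", f.read(8))[0]
    header = json.loads(f.read(header_len))

# Count dtypes across all tensors (skip the optional __metadata__ entry)
print(Counter(v["dtype"] for k, v in header.items() if k != "__metadata__"))
```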

So I patched convert.py like this (I'm not sure it's correct):

diff --git a/convert.py b/convert.py
index 24df0a4d..b9bb6bd7 100755
--- a/convert.py
+++ b/convert.py
@@ -69,6 +69,7 @@ class UnquantizedDataType(DataType):

 DT_F16  = UnquantizedDataType('F16',  dtype = np.dtype(np.float16), valid_conversions = ['F32', 'Q8_0'])
 DT_F32  = UnquantizedDataType('F32',  dtype = np.dtype(np.float32), valid_conversions = ['F16', 'Q8_0'])
+DT_I8   = UnquantizedDataType('I8',   dtype = np.dtype(np.int8),    valid_conversions = ['Q8_0'])
 DT_I32  = UnquantizedDataType('I32',  dtype = np.dtype(np.int16),   valid_conversions = [])
 DT_BF16 = UnquantizedDataType('BF16', dtype = np.dtype(np.uint16),  valid_conversions = ['F32', 'F16', 'Q8_0'])

@@ -113,7 +114,7 @@ DT_Q8_0 = Q8_0QuantizedDataType('Q8_0',

 # Quantized types skipped here because they may also map to np.float32
 NUMPY_TYPE_TO_DATA_TYPE: dict[np.dtype[Any], DataType] = {}
-for dt in (DT_BF16, DT_F16, DT_F32, DT_I32):
+for dt in (DT_BF16, DT_F16, DT_F32, DT_I8, DT_I32):
     if dt.dtype in NUMPY_TYPE_TO_DATA_TYPE:
         raise ValueError(f'Invalid duplicate data type {dt}')
     NUMPY_TYPE_TO_DATA_TYPE[dt.dtype] = dt
@@ -122,6 +123,7 @@ SAFETENSORS_DATA_TYPES: dict[str, DataType] = {
     'BF16': DT_BF16,
     'F16': DT_F16,
     'F32': DT_F32,
+    'I8':  DT_I8,
     'I32': DT_I32,
 }

@@ -1236,7 +1238,7 @@ def pick_output_type(model: LazyModel, output_type_str: str | None) -> GGMLFileT
         return GGMLFileType.AllF32
     if output_type_str == "f16" or (output_type_str is None and wq_type == DT_F16):
         return GGMLFileType.MostlyF16
-    if output_type_str == "q8_0":
+    if output_type_str == "q8_0" or (output_type_str is None and wq_type == DT_I8):
         return GGMLFileType.MostlyQ8_0

     name_to_type = {name: lazy_tensor.data_type for (name, lazy_tensor) in model.items()}

Then I got FileNotFoundError: Could not find a tokenizer matching any of ['spm', 'hfft'], so I added --vocab-type bpe.

But then I got ValueError: Unexpected tensor name: model.layers.0.mlp.down_proj.SCB. Use --skip-unknown to ignore it (e.g. LLaVA).

I added --skip-unknown and it did create a .gguf, but the model of course only wrote """""""" when I asked it 2+2=?, because the conversion printed

Unexpected tensor name: model.layers.21.mlp.down_proj.SCB - skipping
Unexpected tensor name: model.layers.21.mlp.gate_proj.SCB - skipping
Unexpected tensor name: model.layers.21.mlp.up_proj.SCB - skipping
Unexpected tensor name: model.layers.21.self_attn.k_proj.SCB - skipping
Unexpected tensor name: model.layers.21.self_attn.o_proj.SCB - skipping
Unexpected tensor name: model.layers.21.self_attn.q_proj.SCB - skipping
Unexpected tensor name: model.layers.21.self_attn.v_proj.SCB - skipping

for several layers.
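
As far as I understand (and this is just my guess), those *.SCB tensors are the per-row scale factors that bitsandbytes stores next to each int8 weight matrix, so skipping them leaves the raw int8 values in the GGUF with no way to rescale them, which is why the output is garbage. A converter would have to dequantize first, something roughly like this (untested sketch; the tensor names are only my assumption based on the log above):

```python
# Untested sketch of dequantizing a bnb LLM.int8() weight, assuming
# <name>.weight is the int8 matrix and <name>.SCB is the per-row absmax
# scale vector that bitsandbytes saves next to it.
import torch

def dequantize_bnb_int8(int8_weight: torch.Tensor, scb: torch.Tensor) -> torch.Tensor:
    # int8_weight: (out_features, in_features), dtype int8
    # scb:         (out_features,), per-row scales
    return (int8_weight.float() * scb.view(-1, 1) / 127.0).to(torch.float16)
```

So instead of --skip-unknown, convert.py would need to merge each weight with its matching .SCB tensor back to fp16 before writing the GGUF; until something like that lands in llama.cpp, I don't think these bnb-smashed repos can be converted directly.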
