Thanks.
#1 opened by dinerburger
Thank you for taking such care with this quant. I really appreciate your attention to detail.
Quick question, though: how did you specify 8 bpw for the first and last layers during quantization? I'd like to use this technique myself for some code-specific models, but I'm not seeing a command-line flag for it in the exllamav2 conversion script. Thanks again!
The exllamav2 quantization script doesn't have an option for this, so I modified it to use the quantization method with the highest measured accuracy for the first and last layers. The change is quite simple:
- I added a new flag to convert_exl2.py so the option can be switched on or off: first in the argument parser, then in the job dict:
parser.add_argument("-fl8", "--first_last_q8", action = "store_true", help = "Use Q8 for the first and last layers")

# ... and later, in the job dict ...
# The dict definition is quite long; it starts like this:
job = {"in_dir": args.in_dir,
       # ... existing entries ...
       # and this is added at the very end of the definition:
       "first_last_q8": args.first_last_q8}
- The more important part is in optimize.py, where I modified how the quantization method is chosen at the end of the script:
# This `if` is new; "first_last_q8" is the option added in the first step
if job["first_last_q8"]:
    last_layer_idx = num_layers - 1
    if layer_ == 0 or layer_ == last_layer_idx:
        # First or last layer: pick the candidate with the best accuracy,
        # which should usually be Q8
        p1 = max(params[layer_ * 2], key = lambda element: element["accuracy"])
        p2 = max(params[layer_ * 2 + 1], key = lambda element: element["accuracy"])
    else:
        # Every other layer: the original behavior
        p1 = params[layer_ * 2][solution_idx[layer_ * 2]]
        p2 = params[layer_ * 2 + 1][solution_idx[layer_ * 2 + 1]]
else:
    # The unmodified script has only these two lines
    p1 = params[layer_ * 2][solution_idx[layer_ * 2]]
    p2 = params[layer_ * 2 + 1][solution_idx[layer_ * 2 + 1]]
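For intuition, the selection is nothing more than max() over the per-layer candidate lists that optimize.py builds. A toy sketch of the idea (the "accuracy" key matches the code above; the "bpw" field and all values are invented for illustration):

# Hypothetical candidate list for one layer: one dict per quantization
# option, each carrying an estimated accuracy (values made up).
candidates = [
    {"bpw": 4.0, "accuracy": 0.91},
    {"bpw": 6.0, "accuracy": 0.95},
    {"bpw": 8.0, "accuracy": 0.99},  # the ~Q8 option
]

# Same selection as in the patch: take the highest-accuracy candidate,
# which in practice is the highest-bpw (usually Q8) one.
best = max(candidates, key = lambda element: element["accuracy"])
print(best["bpw"])  # -> 8.0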
Yeah, I wondered if that was the case. Perfect, thank you again for sharing the code!