1- The answer is still ~1.7GB. You only need the meta-data of a single nn.Linear at a time. There are 32x(4+3) = 224 quantized layers, so you need an additional (3GB - 1.7GB)/224 = 1.3GB/224 ~ 5.8MB per layer, which is negligible.
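For concreteness, a quick back-of-the-envelope check of that figure (the layer count and memory numbers are just the ones quoted above):

```python
# 32 blocks x (4 attention + 3 MLP) quantized nn.Linear layers, and
# ~1.3GB of total meta-data (3GB with everything on-device vs ~1.7GB without).
n_layers = 32 * (4 + 3)                     # 224 quantized layers
total_metadata_gb = 3.0 - 1.7               # ~1.3GB of scales/zeros overall
per_layer_mb = total_metadata_gb * 1000 / n_layers
print(f"{n_layers} layers -> ~{per_layer_mb:.1f}MB per layer")  # ~5.8MB
```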
2- As the table says, that's the forward pass (batch-size=1, context-size=1024). Forward pass means there's no caching and no decoding logic. The actual generation speed should be much faster with caching + decoding logic like speculative decoding, and with vLLM instead of HF. And even with all of that, a much larger model like Mixtral with the same group-size of 8, offloaded to the CPU, works quite well on a 4090.
You mean why it's faster than Quip# despite being fully on-device? Because dequantization with HQQ is a simple linear operation. It's not even using a fused kernel; only the dequantization part is done on CUDA.
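For intuition, here is a minimal sketch of what that linear dequantization step looks like (illustrative shapes and names, not the actual HQQ kernel):

```python
import torch

def dequantize(w_q: torch.Tensor, zero: torch.Tensor, scale: torch.Tensor) -> torch.Tensor:
    # W ~ (W_q - zero) * scale: one subtract and one multiply per weight,
    # broadcast per quantization group -- no lookup tables, no codebooks.
    return (w_q.float() - zero) * scale

w_q   = torch.randint(0, 4, (4096, 4096), dtype=torch.uint8)  # 2-bit codes (shown unpacked)
zero  = torch.rand(4096, 1)                                   # per-channel zero-points
scale = torch.rand(4096, 1)                                   # per-channel scales
w = dequantize(w_q, zero, scale)
```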
3- LoRA absorbs -zero x scale, not the scale; the scale is still there, as in the BitNet/1.58 work. As the paragraph explains, the math ignores reshaping to keep it simple and easy to read.
Let's say you have a 4096x4096 matrix with grouping done channel-wise (but no reshaping): the -zero x scale part is a rank-1 matrix (4096x1 .dot 1x4096) and the LoRA data is (4096xr .dot rx4096), so you can merge them exactly into (4096x(r+1) .dot (r+1)x4096).
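A small numeric check of that merge (a generic rank-1 term u.vT stands in for the -zero x scale correction; the names are illustrative, not taken from the HQQ+ code):

```python
import torch

d, r = 4096, 8
u = torch.randn(d, 1)                          # stands in for the -zero part (4096x1)
v = torch.randn(1, d)                          # stands in for the scale part (1x4096)
A, B = torch.randn(d, r), torch.randn(r, d)    # LoRA factors

# rank-1 correction + rank-r LoRA ...
full = u @ v + A @ B
# ... merges exactly into a single rank-(r+1) factorization:
A_merged = torch.cat([A, u], dim=1)            # (d, r+1)
B_merged = torch.cat([B, v], dim=0)            # (r+1, d)
print(torch.allclose(full, A_merged @ B_merged, atol=1e-3))   # True
```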
The point of that paragraph is to show two things:
- Compared to the BitNet formulation, the additional zero-point (which is necessary to get good quantization results on pre-trained models with minimal calibration) has a negligible overhead.
- More importantly, it explains how we even got the idea of adding low-rank adapters: it's not because LoRA is popular, it's because the zero-point alone results in a rank-1 error matrix, which is not enough to express the quantization error. As the rank tends to min(num_rows, num_cols), the error goes down, so if we increase the rank by r via low-rank adapters, we would expect better results.
Now, if we include the reshape with a group-size lower than num_rows, the -zero x scale part is a rank-n matrix (4096xn .dot nx4096). It's not possible to properly estimate the rank n, because that highly depends on the nature of the weight matrix, but in the end the LoRA part will be (4096x(n+r) .dot (n+r)x4096). We only use a LoRA rank of 8 for the MLPs, which are the larger matrices, so even if you double or even 4x that to, say, n+r=32, it's still just 32/4096 = 1/128 ~ 0.78% of the full rank of the original matrix.
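And a quick sanity check of that rank bound in the reshaped case (smaller sizes than 4096 just to keep it fast; the group layout is illustrative):

```python
import numpy as np

rows, cols, group = 512, 512, 64
n = cols // group                        # groups per row -> the rank bound n (here 8)

zero  = np.random.rand(rows, n)          # one zero-point per group
scale = np.random.rand(rows, n)          # one scale per group

# Broadcast the per-group "-zero * scale" values back to the full matrix shape;
# the result factors as (rows x n) . (n x cols), so its rank is at most n.
correction = np.repeat(-zero * scale, group, axis=1)    # (rows, cols)
print(np.linalg.matrix_rank(correction))                # 8
```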
Whether you merge -zero x scale with the low-rank adapters or not doesn't matter much; that would highly depend on which fused kernel implementation performs best.
Yes, correct, and that fetching operation is non-blocking; once we dequantize the weights, the fetched data is discarded before moving on to the next layer.
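A rough sketch of that pattern (assumptions: pinned CPU tensors for the meta-data, a CUDA device, unpacked 2-bit codes; this is not the actual HQQ offloading code):

```python
import torch

# Meta-data lives in pinned CPU memory so the copy can be asynchronous.
zero_cpu  = torch.rand(4096, 1).pin_memory()
scale_cpu = torch.rand(4096, 1).pin_memory()
w_q = torch.randint(0, 4, (4096, 4096), dtype=torch.uint8, device="cuda")
x   = torch.randn(1, 4096, device="cuda")

# Non-blocking host-to-device fetch of just this layer's meta-data.
zero  = zero_cpu.to("cuda", non_blocking=True)
scale = scale_cpu.to("cuda", non_blocking=True)

w   = (w_q.float() - zero) * scale    # dequantize on the fly
out = x @ w.T                         # use the weights for this layer ...
del w, zero, scale                    # ... then discard before the next layer
```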
Technically, you can do it for the weights as well, but that wouldn't work in many situations. For example, when training with FSDP, the quantized weights stay on the device but you can still offload the meta-data (https://www.answer.ai/posts/2024-03-06-fsdp-qlora.html).
I would like to reiterate that larger models, which would be more interesting to run at low bits, are much less sensitive to quantization than a 7B. So you could potentially use a larger group-size and just keep it on-device, like what is done with 4-bit and 3-bit now using a group-size of 64. We just started running some experiments with a 13B llama2 and it looks very good so far (outperforming some full-precision llama2-13B-based models); let's see how far we can push it. Ideally, getting rid of the reshaping altogether would be great.
The extreme quant buys you potentially 70x more efficient matmul via binary/ternary operations.
You still have a group-size of 64 in 4-bit, FYI. And even if you keep the meta-data on-device, provided that the quality is high (which is the case for 2-bit, outperforming fp16 on certain tasks), that is a much better option compared to 4-bit even if the VRAM usage is the same.
Again, and I keep repeating this but it seems to be ignored every time: this is experimental work and it's still in progress. This story of small group-sizes on large models should not be an issue.
> Again, and I keep repeating this but it seems to be ignored every time: this is experimental work and it's still in progress. This story of small group-sizes on large models should not be an issue.
Apologies if something I said (or I guess did not say...) offended you! It's a hypothetical, and one that IME is not so easy to achieve, but maybe you have different results. So I didn't want to comment on this; maybe it's possible (but LLMs don't scale up as easily in terms of quantization as other networks like image classifiers, in my experience).
> The extreme quant buys you potentially 70x more efficient matmul via binary/ternary operations.
To be clear, such hardware does not yet exist, and it's unclear if you really can have more efficient binary/ternary matmul if you need high-precision accumulators and more frequent broadcasting shifts. It's again a complicated hardware question whether the total latency of doing many high-precision accumulations and many scales/shifts will be smaller (or, chip-area-wise, even feasible to implement) compared to a 4-bit baseline.