> 99% of the code in this PR [for llama.cpp] is written by DeepSeek-R1
you're assuming the PR will land:
> Small thing to note here, for this q6_K_q8_K, it is very difficult to get the correct result. To make it work, I asked DeepSeek to invent a new approach without giving it prior examples. That's why the structure of this function is different from the rest.
This certainly wouldn't fly in my org (even with test coverage/passes).
>> Small thing to note here, for this q6_K_q8_K, it is very difficult to get the correct result. To make it work, I asked DeepSeek to invent a new approach without giving it prior examples. That's why the structure of this function is different from the rest.
> This certainly wouldn't fly in my org (even with test coverage/passes).
To be fair, this seems expected. A distilled model might struggle more with aggressive quantization (like q6) since you're stacking two forms of quality loss: the distillation loss and the quantization loss. I think the answer would be to just use the higher-cost, full-precision model.
To some extent, yes. I would not run production off of it, even if it can eke out performance gains on the hardware at hand. I'd suggest vLLM or TGI or something similar instead.
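For anyone unfamiliar with the ggml internals being discussed: q6_K_q8_K refers to the dot-product kernel between 6-bit K-quantized weight blocks and 8-bit quantized activation blocks, which is why getting it bit-correct is fiddly. Below is a heavily simplified sketch of what such a kernel computes; the block layouts and names are made up for illustration and are not ggml's actual structs or the PR's code (real q6_K packs 256 values per super-block across low/high bit planes with per-sub-block scales).

```c
/* Simplified sketch of a block-quantized dot product.
 * NOT ggml's block_q6_K/block_q8_K layouts and NOT the PR's kernel:
 * the struct names, block size, and packing here are invented just to
 * show the shape of the computation. */
#include <stdint.h>
#include <stdio.h>

#define BLOCK_SIZE 32  /* hypothetical block size for this sketch */

typedef struct {
    float  d;               /* per-block scale */
    int8_t q[BLOCK_SIZE];   /* quantized weights, stored here as plain int8 */
} block_wq;                 /* stand-in for a 6-bit weight block */

typedef struct {
    float  d;               /* per-block scale */
    int8_t q[BLOCK_SIZE];   /* 8-bit quantized activations */
} block_aq;                 /* stand-in for an 8-bit activation block */

/* dot product over n/BLOCK_SIZE blocks:
 * sum_b d_w[b] * d_a[b] * sum_i qw[b][i] * qa[b][i] */
static float vec_dot_blocks(int n, const block_wq *w, const block_aq *a) {
    float sum = 0.0f;
    for (int b = 0; b < n / BLOCK_SIZE; ++b) {
        int32_t isum = 0;  /* integer multiply-accumulate within the block */
        for (int i = 0; i < BLOCK_SIZE; ++i) {
            isum += (int32_t)w[b].q[i] * (int32_t)a[b].q[i];
        }
        sum += w[b].d * a[b].d * (float)isum;  /* rescale once per block */
    }
    return sum;
}

int main(void) {
    block_wq w = { .d = 0.05f };
    block_aq a = { .d = 0.10f };
    for (int i = 0; i < BLOCK_SIZE; ++i) {
        w.q[i] = (int8_t)(i - 16);
        a.q[i] = (int8_t)(i % 8);
    }
    printf("dot = %f\n", vec_dot_blocks(BLOCK_SIZE, &w, &a));
    return 0;
}
```

The point is just the shape of the computation: an integer multiply-accumulate inside each block, rescaled once per block by the product of the two float scales. The real kernels vectorize the inner loop, which is where the correctness headaches tend to come from.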