> 99% of the code in this PR [for llama.cpp] is written by DeepSeek-R1
you're assuming the PR will land:
> Small thing to note here, for this q6_K_q8_K, it is very difficult to get the correct result. To make it work, I asked DeepSeek to invent a new approach without giving it prior examples. That's why the structure of this function is different from the rest.
This certainly wouldn't fly in my org (even with test coverage/passes).
>> Small thing to note here, for this q6_K_q8_K, it is very difficult to get the correct result. To make it work, I asked DeepSeek to invent a new approach without giving it prior examples. That's why the structure of this function is different from the rest.
> This certainly wouldn't fly in my org (even with test coverage/passes).
To be fair, this seems expected. A distilled model might struggle more with aggressive quantization (like q6) since you're stacking two forms of quality loss: the distillation loss and the quantization loss. I think the answer would be to just use the higher-cost, full-precision model.
To some extent, yes. I would not run production off of it, even if it can eke out performance gains on the hardware at hand. I'd suggest vLLM or TGI or something similar instead.
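For anyone unfamiliar with the ggml internals being discussed: q6_K_q8_K refers to the dot-product kernel between 6-bit K-quantized weight blocks and 8-bit quantized activation blocks, which is why getting it bit-correct is fiddly. Below is a heavily simplified sketch of what such a kernel computes; the block layouts and names are made up for illustration and are not ggml's actual structs or the PR's code (real q6_K packs 256 values per super-block across low/high bit planes with per-sub-block scales).

```c
/* Simplified sketch of a block-quantized dot product.
 * NOT ggml's block_q6_K/block_q8_K layouts and NOT the PR's kernel:
 * the struct names, block size, and packing here are invented just to
 * show the shape of the computation. */
#include <stdint.h>
#include <stdio.h>

#define BLOCK_SIZE 32  /* hypothetical block size for this sketch */

typedef struct {
    float  d;               /* per-block scale */
    int8_t q[BLOCK_SIZE];   /* quantized weights, stored here as plain int8 */
} block_wq;                 /* stand-in for a 6-bit weight block */

typedef struct {
    float  d;               /* per-block scale */
    int8_t q[BLOCK_SIZE];   /* 8-bit quantized activations */
} block_aq;                 /* stand-in for an 8-bit activation block */

/* dot product over n/BLOCK_SIZE blocks:
 * sum_b d_w[b] * d_a[b] * sum_i qw[b][i] * qa[b][i] */
static float vec_dot_blocks(int n, const block_wq *w, const block_aq *a) {
    float sum = 0.0f;
    for (int b = 0; b < n / BLOCK_SIZE; ++b) {
        int32_t isum = 0;  /* integer multiply-accumulate within the block */
        for (int i = 0; i < BLOCK_SIZE; ++i) {
            isum += (int32_t)w[b].q[i] * (int32_t)a[b].q[i];
        }
        sum += w[b].d * a[b].d * (float)isum;  /* rescale once per block */
    }
    return sum;
}

int main(void) {
    block_wq w = { .d = 0.05f };
    block_aq a = { .d = 0.10f };
    for (int i = 0; i < BLOCK_SIZE; ++i) {
        w.q[i] = (int8_t)(i - 16);
        a.q[i] = (int8_t)(i % 8);
    }
    printf("dot = %f\n", vec_dot_blocks(BLOCK_SIZE, &w, &a));
    return 0;
}
```

The point is just the shape of the computation: an integer multiply-accumulate inside each block, rescaled once per block by the product of the two float scales. The real kernels vectorize the inner loop, which is where the correctness headaches tend to come from.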