ngxson's comments

ngxson · 2025-05-10T06:51:10 1746859870

We also support SmolVLM series which delivers light-speed response thanks to its mini size!

This is perfect for real-time home video surveillance system. That's one of the ideas for my next hobby project!

    llama-server -hf ggml-org/SmolVLM-Instruct-GGUF
    llama-server -hf ggml-org/SmolVLM-256M-Instruct-GGUF
    llama-server -hf ggml-org/SmolVLM-500M-Instruct-GGUF
    llama-server -hf ggml-org/SmolVLM2-2.2B-Instruct-GGUF
    llama-server -hf ggml-org/SmolVLM2-256M-Video-Instruct-GGUF
    llama-server -hf ggml-org/SmolVLM2-500M-Video-Instruct-GGUF

a_e_k · 2025-05-10T08:09:59 1746864599

I've been noticing your commits as I skim the latest git commit notes whenever I periodically pull and rebuild. Thank you for all your work on this (and llama.cpp in general)!

thatspartan · 2025-05-10T11:50:46 1746877846

Thanks for landing the mtmd functionality in the server. Like the other commenter I kept poring over commits in anticipation.

moffkalast · 2025-05-10T12:51:48 1746881508

Ok but what's the quality of the high speed response? Can the sub-2.2B ones output a coherent sentence?

ngxson · 2025-05-10T06:42:31 1746859351

And btw, -ngl is automatically set to max value now, you don't need to -ngl 99 anymore!

Edit: sorry this is only true on Metal. For CUDA or other GPU backends, you still need to manually specify -ngl

danielhanchen · 2025-05-10T06:44:42 1746859482

OH WHAT! So just -ngl? Oh also do you know if it's possible to auto do 1 GPU then the next (ie sequential) - I have to manually set --device CUDA0 for smallish models, and probs distributing it amongst say all GPUs causes communication overhead!

ngxson · 2025-05-10T06:47:41 1746859661

Ah no I mean we can omit the whole "-ngl N" argument for now, as it is internally set to -1 by default in CPP code (instead of being 0 traditionally), and -1 meaning offload everything to GPU

I have no idea how to specify custom layer specs with multi GPU, but that is interesting!

danielhanchen · 2025-05-10T06:57:03 1746860223

WAIT so GPU offloading is on by DEFAULT? Oh my fantastic! For now I have to "guess" via a Python script - ie I sum sum up all the .gguf split files in filesize, then detect CUDA memory usage, and specify approximately how many GPUs ie --device CUDA0,CUDA1 etc

ngxson · 2025-05-10T07:07:50 1746860870

Ahhh no sorry I forgot that the actual code controlling this is inside llama-model.cpp ; sorry for the misinfo, the -ngl only set to max by default if you're using Metal backend

(See the code in side llama_model_default_params())

danielhanchen · 2025-05-10T07:24:01 1746861841

Oh no worries! I re-edited my comment to account for it :)

ngxson · 2025-05-10T06:38:50 1746859130

For brew users, you can specify --HEAD when installing the package. This way, brew will automatically build the latest master branch.

Btw, the brew version will be updated in the next few hours, so after that you will be able to simply "brew upgrade llama.cpp" and you will be good to go!

ngxson · 2025-05-10T06:13:46 1746857626

Two things:

1. Because the support in llama.cpp is horizontal integrated within ggml ecosystem, we can optimize it to run even faster than ollama.

For example, pixtral/mistral small 3.1 model has some 2D-RoPE trick that use less memory than ollama's implementation. Same for flash attention (which will be added very soon), it will allow vision encoder to run faster while using less memory.

2. llama.cpp simply support more models than ollama. For example, ollama does not support either pixtral or smolvlm

nolist_policy · 2025-05-10T08:39:11 1746866351

On the other hand ollama supports iSWA for Gemma 3 while llama.cpp doesn't. iSWA reduces kv cache size to 1/6.

vlovich123 · 2025-05-10T09:18:24 1746868704

What’s iSWA? Can’t find any reference online

imtringued · 2025-05-10T10:19:32 1746872372

Gemma 3 has some layers with a context size of 1024 tokens and others having full length. You need to read the Gemma technical report.

nolist_policy · 2025-05-10T10:13:11 1746871991

interleaved sliding window attention

roger_ · 2025-05-10T07:26:52 1746862012

Won’t the changes eventually be added to ollama? I thought it was based on llama.cpp

diggan · 2025-05-10T11:28:46 1746876526

As far as I understand (not affiliated, just a user who peeked at the code), Ollama started out using llama.cpp as a runner for everything. But eventually they wrote their own runner in Golang, which is where they add support for new models. So most models you run via Ollama uses llama.cpp, but new stuff their own Golang runner.

danielhanchen · 2025-05-10T06:22:18 1746858138

By the way - fantastic work again on llama.cpp vision support - keep it up!!

ngxson · 2025-05-10T06:41:11 1746859271

Thanks Daniel! Kudos for your great work on quantization, I use the Mistral Small IQ2_M from unsloth during development and it works very well!!

danielhanchen · 2025-05-10T06:54:13 1746860053

:)) I did have to update the chat template for Mistral - I did see your PR in llama.cpp for it - confusingly the tokenizer_config.json file doesn't have a chat_template, and it's rather in chat_template.jinja - I had to move the chat template into tokenizer_config.json, but I guess now with your fix its fine :)

ngxson · 2025-05-10T07:01:43 1746860503

Ohhh nice to know! I was pretty sure that someone already tried to fix the chat template haha, but because we also allow users to freely create their quants via the GGUF-my-repo space, I have to fix the quants produces from that source

danielhanchen · 2025-05-10T07:23:45 1746861825

Glad it all works now!

ngxson · on Jan 28, 2025

Hi I'm Xuan-Son,

Small correct, I'm not just asking it to convert ARM NEON to SIMD, but for the function handling q6_K_q8_K, I asked it to reinvent a new approach (without giving it any prior examples). The reason I did that was because it failed writing this function 4 times so far.

And a bit of context here, I was doing this during my Sunday and the time budget is 2 days to finish.

I wanted to optimize wllama (wasm wrapper for llama.cpp that I maintain) to run deepseek distill 1.5B faster. Wllama is totally a weekend project and I can never spend more than 2 consecutive days on it.

Between 2 choices: (1) to take time to do it myself then maybe give up, or (2) try prompting LLM to do that and maybe give up (at worst, it just give me hallucinated answer), I choose the second option since I was quite sleepy.

So yeah, turns out it was a great success in the given context. Just does it job, saves my weekend.

Some of you may ask, why not trying ChatGPT or Claude in the first place? Well, short answer is: my input is too long, these platforms straight up refuse to give me the answer :)

amarcheschi · on Jan 28, 2025

Aistudio.google.com offers free long context chats (1/2mln tokens), just select the appropriate model, 1206 or 2.0 flash thinking

simonw · on Jan 28, 2025

Thanks very much for sharing your results so far.

ngxson · on Aug 18, 2023

This project aims to support U2F / FIDO2 using fingerprint reader on Linux (via libfprint). The goal is to have the same user experience with 2FA using Windows Hello.

This project is based on https://github.com/danstiner/rust-u2f with minor modification (see my fork: https://github.com/ngxson/rust-u2f-pkexec)

Link to the project: https://github.com/ngxson/softu2f-fprintd-docker