Hacker Newsnew | past | comments | ask | show | jobs | submit | ngxson's commentslogin

We also support SmolVLM series which delivers light-speed response thanks to its mini size!

This is perfect for real-time home video surveillance system. That's one of the ideas for my next hobby project!

    llama-server -hf ggml-org/SmolVLM-Instruct-GGUF
    llama-server -hf ggml-org/SmolVLM-256M-Instruct-GGUF
    llama-server -hf ggml-org/SmolVLM-500M-Instruct-GGUF
    llama-server -hf ggml-org/SmolVLM2-2.2B-Instruct-GGUF
    llama-server -hf ggml-org/SmolVLM2-256M-Video-Instruct-GGUF
    llama-server -hf ggml-org/SmolVLM2-500M-Video-Instruct-GGUF


I've been noticing your commits as I skim the latest git commit notes whenever I periodically pull and rebuild. Thank you for all your work on this (and llama.cpp in general)!


Thanks for landing the mtmd functionality in the server. Like the other commenter I kept poring over commits in anticipation.


Ok but what's the quality of the high speed response? Can the sub-2.2B ones output a coherent sentence?


And btw, -ngl is automatically set to max value now, you don't need to -ngl 99 anymore!

Edit: sorry this is only true on Metal. For CUDA or other GPU backends, you still need to manually specify -ngl


OH WHAT! So just -ngl? Oh also do you know if it's possible to auto do 1 GPU then the next (ie sequential) - I have to manually set --device CUDA0 for smallish models, and probs distributing it amongst say all GPUs causes communication overhead!


Ah no I mean we can omit the whole "-ngl N" argument for now, as it is internally set to -1 by default in CPP code (instead of being 0 traditionally), and -1 meaning offload everything to GPU

I have no idea how to specify custom layer specs with multi GPU, but that is interesting!


WAIT so GPU offloading is on by DEFAULT? Oh my fantastic! For now I have to "guess" via a Python script - ie I sum sum up all the .gguf split files in filesize, then detect CUDA memory usage, and specify approximately how many GPUs ie --device CUDA0,CUDA1 etc


Ahhh no sorry I forgot that the actual code controlling this is inside llama-model.cpp ; sorry for the misinfo, the -ngl only set to max by default if you're using Metal backend

(See the code in side llama_model_default_params())


Oh no worries! I re-edited my comment to account for it :)


For brew users, you can specify --HEAD when installing the package. This way, brew will automatically build the latest master branch.

Btw, the brew version will be updated in the next few hours, so after that you will be able to simply "brew upgrade llama.cpp" and you will be good to go!


Two things:

1. Because the support in llama.cpp is horizontal integrated within ggml ecosystem, we can optimize it to run even faster than ollama.

For example, pixtral/mistral small 3.1 model has some 2D-RoPE trick that use less memory than ollama's implementation. Same for flash attention (which will be added very soon), it will allow vision encoder to run faster while using less memory.

2. llama.cpp simply support more models than ollama. For example, ollama does not support either pixtral or smolvlm


On the other hand ollama supports iSWA for Gemma 3 while llama.cpp doesn't. iSWA reduces kv cache size to 1/6.


What’s iSWA? Can’t find any reference online


Gemma 3 has some layers with a context size of 1024 tokens and others having full length. You need to read the Gemma technical report.


interleaved sliding window attention


Won’t the changes eventually be added to ollama? I thought it was based on llama.cpp


As far as I understand (not affiliated, just a user who peeked at the code), Ollama started out using llama.cpp as a runner for everything. But eventually they wrote their own runner in Golang, which is where they add support for new models. So most models you run via Ollama uses llama.cpp, but new stuff their own Golang runner.


By the way - fantastic work again on llama.cpp vision support - keep it up!!


Thanks Daniel! Kudos for your great work on quantization, I use the Mistral Small IQ2_M from unsloth during development and it works very well!!


:)) I did have to update the chat template for Mistral - I did see your PR in llama.cpp for it - confusingly the tokenizer_config.json file doesn't have a chat_template, and it's rather in chat_template.jinja - I had to move the chat template into tokenizer_config.json, but I guess now with your fix its fine :)


Ohhh nice to know! I was pretty sure that someone already tried to fix the chat template haha, but because we also allow users to freely create their quants via the GGUF-my-repo space, I have to fix the quants produces from that source


Glad it all works now!


Hi I'm Xuan-Son,

Small correct, I'm not just asking it to convert ARM NEON to SIMD, but for the function handling q6_K_q8_K, I asked it to reinvent a new approach (without giving it any prior examples). The reason I did that was because it failed writing this function 4 times so far.

And a bit of context here, I was doing this during my Sunday and the time budget is 2 days to finish.

I wanted to optimize wllama (wasm wrapper for llama.cpp that I maintain) to run deepseek distill 1.5B faster. Wllama is totally a weekend project and I can never spend more than 2 consecutive days on it.

Between 2 choices: (1) to take time to do it myself then maybe give up, or (2) try prompting LLM to do that and maybe give up (at worst, it just give me hallucinated answer), I choose the second option since I was quite sleepy.

So yeah, turns out it was a great success in the given context. Just does it job, saves my weekend.

Some of you may ask, why not trying ChatGPT or Claude in the first place? Well, short answer is: my input is too long, these platforms straight up refuse to give me the answer :)


Aistudio.google.com offers free long context chats (1/2mln tokens), just select the appropriate model, 1206 or 2.0 flash thinking


Thanks very much for sharing your results so far.


This project aims to support U2F / FIDO2 using fingerprint reader on Linux (via libfprint). The goal is to have the same user experience with 2FA using Windows Hello.

This project is based on https://github.com/danstiner/rust-u2f with minor modification (see my fork: https://github.com/ngxson/rust-u2f-pkexec)

Link to the project: https://github.com/ngxson/softu2f-fprintd-docker


Guidelines | FAQ | Lists | API | Security | Legal | Apply to YC | Contact

Search: