
It's possible to run a voice AI similar to this entirely locally on a normal gaming PC with a good GPU using open source AI models. I have a standalone demo I've been working on that you can try if you have a 12GB+ Nvidia GPU: https://apps.microsoft.com/detail/9NC624PBFGB7

It's still very much a demo and not as good as GPT-4, but it responds much faster. It's fun to play with and it shows the promise. Open models have been improving very quickly. It remains to be seen just how good they can get but personally I believe that better-than-GPT-4 models are going to run on a $1k gaming PC in just a few years. You will be able to have a coherent spoken conversation with your GPU. It's a new type of computing experience.



I've seen a couple of these TTS + speech recognition + LLM projects popping up on GitHub as well:

https://github.com/modal-labs/quillman

I also built something similar a year back for my own use, using WebKit speech recognition (limited to Chromium), but it was hooked up to davinci-003.


What models/libraries is it based on?


Currently OpenHermes2-Mistral-7B (via exllama2), OpenAI Whisper (via faster_whisper), and StyleTTS2 (uses HF Transformers). All PyTorch-based.
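The plumbing between those three stages is just audio in, text through the LLM, audio out. A minimal sketch of one conversational turn (the `transcribe`, `generate`, and `synthesize` callables here are stand-ins for the real faster_whisper / exllama2 / StyleTTS2 calls, not their actual APIs):

```python
# Minimal sketch of one turn of a local voice-assistant loop.
# transcribe/generate/synthesize are injected stand-ins for the
# real ASR / LLM / TTS backends, so the pipeline itself stays
# library-agnostic and runs without a GPU.

def voice_turn(audio_chunk, transcribe, generate, synthesize, history):
    """Run one conversational turn: speech in, speech out."""
    user_text = transcribe(audio_chunk)      # ASR: audio -> text
    if not user_text.strip():
        return history, None                 # nothing said, skip the turn
    history.append({"role": "user", "content": user_text})
    reply = generate(history)                # LLM: chat context -> reply text
    history.append({"role": "assistant", "content": reply})
    return history, synthesize(reply)        # TTS: text -> audio


if __name__ == "__main__":
    # Stub backends so the sketch is runnable anywhere.
    hist, audio = voice_turn(
        b"...",
        transcribe=lambda a: "hello there",
        generate=lambda h: "hi! you said: " + h[-1]["content"],
        synthesize=lambda t: ("wav", t),
        history=[],
    )
    print(audio)
```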

I will probably update to the OpenHermes vision model when Nous Research releases it, so it'll be able to see with the webcam or even read your screen and chat about what you're working on! I also need to update to Whisper-v3 or Distil-Whisper, and I need to update to a newer StyleTTS2. I also plan to add a Mandarin TTS and Qwen-7B bilingual LLM for a bilingual chatbot. The amount of movement in open AI (not to be confused with OpenAI) is difficult to keep up with.

Of course I need to add better attribution for all this stuff, and a whole lot of other things, like a basic UI. Very much an MVP at the moment.


This is why I think the personal Jarvis for everyday use ultimately won't live in the cloud. It can already be done on local hardware, as you're demonstrating, and the cloud has big downsides around privacy, latency, and reliability.

Like you said it’s difficult to keep up with and to me it feels very much like open source stuff might win for inference.


And yet, of all things, word processing and spreadsheets are going to the cloud, and even coding.

I'm not sure the big players won't push heavily (as in, not release their best models) toward fat subscriptions and data gathering in the cloud, even though I'd much rather see local (as in, cloud-at-home) computing.


I understand what you mean. It makes me wonder if there is room for a solution for those who want to own their own hardware and data. Much like other appliances and equipment that initially cost too much for household ownership, maybe having a "brain" in your house will become a luxury appliance. As a Crestron system owner I would love to plug Jarvis into my smart home somehow.


I'm thinking I will open source mine for people who have GPUs but possibly try making a paid service for people who don't.


Maybe it could be a hardware-with-markup model, plus support and consulting. That way there could be many competitors across regions and countries, all using the same collection of open source tools. That would be pretty neat. Unlikely, I guess, but still worth thinking about how it could work.


Those things are going to the cloud for completely unrelated reasons.


Thanks for such a detailed answer!


Related to the subthread above, and something I've been thinking about - how do you detect when the user has stopped speaking so the bot can respond?


Great question! Right now both ChatGPT and my demo are doing very simple and basic stuff that definitely needs improvement.

ChatGPT is essentially push-to-talk with a little bit of automation to attempt to press the button automatically at certain times. Mine is continuously listening and can be interrupted while speaking, but isn't yet smart enough to delay responding if you pause in the middle of a sentence, or stop responding at the natural end of a conversation.
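For reference, the simplest version of this is an energy threshold plus a silence timeout ("hangover"): count a frame as speech if its RMS energy exceeds a threshold, and declare end-of-utterance after N consecutive silent frames. A rough sketch (the threshold and hangover values are made-up defaults; real systems use a trained VAD and smarter turn-taking logic, which is exactly the gap described above):

```python
import math

def rms(frame):
    """Root-mean-square energy of a frame of float samples."""
    return math.sqrt(sum(s * s for s in frame) / len(frame))

class EndpointDetector:
    """Naive end-of-speech detector: energy gate + silence hangover.

    A frame counts as speech if its RMS exceeds `threshold`; the
    utterance is considered finished after `hangover` consecutive
    silent frames once speech has started.
    """

    def __init__(self, threshold=0.02, hangover=25):  # ~0.5s at 20ms frames
        self.threshold = threshold
        self.hangover = hangover
        self.speaking = False
        self.silent_frames = 0

    def feed(self, frame):
        """Feed one audio frame; return True when the utterance just ended."""
        if rms(frame) > self.threshold:
            self.speaking = True
            self.silent_frames = 0
        elif self.speaking:
            self.silent_frames += 1
            if self.silent_frames >= self.hangover:
                self.speaking = False
                self.silent_frames = 0
                return True
        return False
```

The obvious failure mode is the one mentioned above: a mid-sentence pause longer than the hangover triggers a premature response, and a fixed threshold breaks in noisy rooms.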

I wrote up my detailed thoughts about it here: https://news.ycombinator.com/item?id=38339222


That is really interesting, thanks for the insight!


Would love to see a video of this ...


> if you have a 12GB+ Nvidia GPU

Doesn't that seem a bit excessive for Whisper and Coqui? Or does it also run an LLM for a full local stack?


It's a full local stack. The TTS isn't Coqui, it's StyleTTS2.
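A rough back-of-the-envelope for why a full local stack wants ~12 GB (every figure below is an approximate assumption, not a measurement from this demo):

```python
# Rough VRAM budget for a full local voice stack.
# All figures are approximate assumptions, not measured values.
GB = 1024 ** 3

llm_7b_4bit   = 7e9 * 0.5      # 7B params at ~4 bits/param: ~3.5 GB of weights
kv_cache      = 1.5 * GB       # LLM context cache; grows with context length
whisper_fp16  = 3.0 * GB       # Whisper large in fp16 via faster_whisper
tts           = 0.5 * GB       # StyleTTS2 is comparatively small
overhead      = 1.0 * GB       # CUDA context, activations, fragmentation

total = llm_7b_4bit + kv_cache + whisper_fp16 + tts + overhead
print(f"~{total / GB:.1f} GB")  # lands in the 9-10 GB range
```

That leaves only a couple of GB of headroom on a 12 GB card, which is why everything has to be quantized and why the stack wouldn't fit alongside a larger LLM.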


Ahh cool, makes sense then.

Haven't heard of that one before, I'll have to check it out.


It's quite new. Good quality and very fast.



