
It's possible to run a voice AI similar to this entirely locally on a normal gaming PC with a good GPU using open source AI models. I have a standalone demo I've been working on that you can try if you have a 12GB+ Nvidia GPU: https://apps.microsoft.com/detail/9NC624PBFGB7

It's still very much a demo and not as good as GPT-4, but it responds much faster. It's fun to play with and it shows the promise. Open models have been improving very quickly. It remains to be seen just how good they can get but personally I believe that better-than-GPT-4 models are going to run on a $1k gaming PC in just a few years. You will be able to have a coherent spoken conversation with your GPU. It's a new type of computing experience.



I've seen a couple of these TTS + speech recognition + LLM projects popping up on GitHub as well:

https://github.com/modal-labs/quillman

I also built something similar a year back for my own use, using WebKit speech recognition (limited to Chromium), but it was hooked up to davinci-003.


What models/libraries is it based on?


Currently OpenHermes2-Mistral-7B (via exllama2), OpenAI Whisper (via faster_whisper), and StyleTTS2 (uses HF Transformers). All PyTorch-based.
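The plumbing between those three stages is just audio in, text through the LLM, audio out. A minimal sketch of one conversational turn (the `transcribe`, `generate`, and `synthesize` callables here are stand-ins for the real faster_whisper / exllama2 / StyleTTS2 calls, not their actual APIs):

```python
# Minimal sketch of one turn of a local voice-assistant loop.
# transcribe/generate/synthesize are injected stand-ins for the
# real ASR / LLM / TTS backends, so the pipeline itself stays
# library-agnostic and runs without a GPU.

def voice_turn(audio_chunk, transcribe, generate, synthesize, history):
    """Run one conversational turn: speech in, speech out."""
    user_text = transcribe(audio_chunk)      # ASR: audio -> text
    if not user_text.strip():
        return history, None                 # nothing said, skip the turn
    history.append({"role": "user", "content": user_text})
    reply = generate(history)                # LLM: chat context -> reply text
    history.append({"role": "assistant", "content": reply})
    return history, synthesize(reply)        # TTS: text -> audio


if __name__ == "__main__":
    # Stub backends so the sketch is runnable anywhere.
    hist, audio = voice_turn(
        b"...",
        transcribe=lambda a: "hello there",
        generate=lambda h: "hi! you said: " + h[-1]["content"],
        synthesize=lambda t: ("wav", t),
        history=[],
    )
    print(audio)
```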

I will probably update to the OpenHermes vision model when Nous Research releases it, so it'll be able to see with the webcam or even read your screen and chat about what you're working on! I also need to update to Whisper-v3 or Distil-Whisper, and I need to update to a newer StyleTTS2. I also plan to add a Mandarin TTS and Qwen-7B bilingual LLM for a bilingual chatbot. The amount of movement in open AI (not to be confused with OpenAI) is difficult to keep up with.

Of course I need to add better attribution for all this stuff, and a whole lot of other things, like a basic UI. Very much an MVP at the moment.


This is why I think the personal Jarvis for everyday use ultimately won't live in the cloud. It can already be done on local hardware, as you're demonstrating, and the cloud has big downsides around privacy, latency, and reliability.

Like you said it’s difficult to keep up with and to me it feels very much like open source stuff might win for inference.


And yet, of all things, word processing and spreadsheets are going to the cloud, and even coding.

I'm not sure the big players won't push heavily (as in, not release their best models) toward fat subscriptions and data gathering in the cloud, even though I'd much rather see local (as in, cloud-at-home) computing.


I understand what you mean. It makes me wonder if there is room for a solution for those who want to own their own hardware and data. Much like other appliances and equipment that initially cost too much for household ownership, maybe having a "brain" in your house will become a luxury appliance. As a Crestron system owner I would love to plug Jarvis into my smart home somehow.


I'm thinking I will open source mine for people who have GPUs but possibly try making a paid service for people who don't.


Maybe it could be a hardware-with-markup model, plus support and consulting. That way there could be many competitors across regions and countries, all using the same collection of open source tools. That would be pretty neat. Unlikely, I guess, but still worth thinking about how it could work.


Those things are going to the cloud for completely unrelated reasons.


Thanks for such a detailed answer!


Related to the subthread above, and something I've been thinking about - how do you detect when the user has stopped speaking so the bot can respond?


Great question! Right now both ChatGPT and my demo are doing very simple and basic stuff that definitely needs improvement.

ChatGPT is essentially push-to-talk with a little bit of automation to attempt to press the button automatically at certain times. Mine is continuously listening and can be interrupted while speaking, but isn't yet smart enough to delay responding if you pause in the middle of a sentence, or stop responding at the natural end of a conversation.
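For reference, the simplest version of this is an energy threshold plus a silence timeout ("hangover"): count a frame as speech if its RMS energy exceeds a threshold, and declare end-of-utterance after N consecutive silent frames. A rough sketch (the threshold and hangover values are made-up defaults; real systems use a trained VAD and smarter turn-taking logic, which is exactly the gap described above):

```python
import math

def rms(frame):
    """Root-mean-square energy of a frame of float samples."""
    return math.sqrt(sum(s * s for s in frame) / len(frame))

class EndpointDetector:
    """Naive end-of-speech detector: energy gate + silence hangover.

    A frame counts as speech if its RMS exceeds `threshold`; the
    utterance is considered finished after `hangover` consecutive
    silent frames once speech has started.
    """

    def __init__(self, threshold=0.02, hangover=25):  # ~0.5s at 20ms frames
        self.threshold = threshold
        self.hangover = hangover
        self.speaking = False
        self.silent_frames = 0

    def feed(self, frame):
        """Feed one audio frame; return True when the utterance just ended."""
        if rms(frame) > self.threshold:
            self.speaking = True
            self.silent_frames = 0
        elif self.speaking:
            self.silent_frames += 1
            if self.silent_frames >= self.hangover:
                self.speaking = False
                self.silent_frames = 0
                return True
        return False
```

The obvious failure mode is the one mentioned above: a mid-sentence pause longer than the hangover triggers a premature response, and a fixed threshold breaks in noisy rooms.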

I wrote up my detailed thoughts about it here: https://news.ycombinator.com/item?id=38339222


That is really interesting, thanks for the insight!


Would love to see a video of this ...


> if you have a 12GB+ Nvidia GPU

Doesn't that seem a bit excessive for Whisper and Coqui? Or does it also run an LLM for a full local stack?


It's a full local stack. The TTS isn't Coqui, it's StyleTTS2.
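A rough back-of-the-envelope for why a full local stack wants ~12 GB (every figure below is an approximate assumption, not a measurement from this demo):

```python
# Rough VRAM budget for a full local voice stack.
# All figures are approximate assumptions, not measured values.
GB = 1024 ** 3

llm_7b_4bit   = 7e9 * 0.5      # 7B params at ~4 bits/param: ~3.5 GB of weights
kv_cache      = 1.5 * GB       # LLM context cache; grows with context length
whisper_fp16  = 3.0 * GB       # Whisper large in fp16 via faster_whisper
tts           = 0.5 * GB       # StyleTTS2 is comparatively small
overhead      = 1.0 * GB       # CUDA context, activations, fragmentation

total = llm_7b_4bit + kv_cache + whisper_fp16 + tts + overhead
print(f"~{total / GB:.1f} GB")  # lands in the 9-10 GB range
```

That leaves only a couple of GB of headroom on a 12 GB card, which is why everything has to be quantized and why the stack wouldn't fit alongside a larger LLM.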


Ahh cool, makes sense then.

Haven't heard of that one before, I'll have to check it out.


It's quite new. Good quality and very fast.



