gapeleon's comments

Bad timing. I just tried it, and the entire platform went offline within 20 minutes.

https://console.carolinacloud.io/ is unreachable, and my container is too.

https://downforeveryoneorjustme.com/console.carolinacloud.io...


I'm going to try training it on a codebook to see if such a small model would work for a TTS.


You guys need to fact check your AI-generated blog posts:

https://blog.mozilla.ai/introducing-any-llm-a-unified-api-to...

> One popular solution, LiteLLM, is highly valued for its wide support of different providers and modalities, making it a great choice for many developers. However, it re-implements provider interfaces rather than leveraging SDKs that are managed and released by the providers themselves. As a result, the approach can lead to compatibility issues and unexpected modifications in behavior, making it difficult to keep up with the changes happening among all the providers.

LiteLLM is rock-solid in practice. The underlying API providers announce breaking changes well in advance, and LiteLLM has never been caught out by this. LLMs will come up with hypothetical cons like this upon request.

> Lastly, proxy/gateway solutions like OpenRouter and Portkey require users to set up a hosted proxy server to act as an intermediary between their code and the LLM provider. Although this can effectively abstract away the complicated logic from the developer, it adds an extra layer of complexity and a dependency on external services, which might not be ideal for all use cases.

OpenRouter is a hosted service that provides the proxy/gateway infrastructure. Users don't "set up a hosted proxy server" themselves; they just make API calls to OpenRouter's endpoints. But older LLMs don't know what OpenRouter is and will assume it's a self-hosted proxy server.
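To make that concrete: talking to OpenRouter is a single HTTPS request to their hosted, OpenAI-compatible endpoint; there's nothing to deploy. A minimal sketch of building such a request (the model name and key are placeholders):

```python
import json

OPENROUTER_URL = "https://openrouter.ai/api/v1/chat/completions"

def build_openrouter_request(api_key, model, messages):
    """Build the URL, headers, and JSON body for a chat completion call.
    OpenRouter exposes an OpenAI-compatible API, so the payload is the
    familiar chat-completions shape; only the base URL and key differ."""
    headers = {
        "Authorization": f"Bearer {api_key}",
        "Content-Type": "application/json",
    }
    body = json.dumps({"model": model, "messages": messages})
    return OPENROUTER_URL, headers, body

# No proxy to self-host: this is a plain HTTPS request to a hosted service.
url, headers, body = build_openrouter_request(
    "sk-or-...", "openai/gpt-4o-mini",
    [{"role": "user", "content": "Hello"}],
)
```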

> Another option, AISuite, was created by Andrew NG and offers a clean and modular design. However, it is not actively maintained (its last release was in December of 2024) and lacks consistent Python-typed interfaces.

Okay, so you clicked the "releases" tab and saw December 2024. Next time check https://github.com/andrewyng/aisuite/commits/main/ instead. Small, fast-moving community projects like this and exllamav2 don't necessarily tag releases.

I've got nothing against using AI to write posts like this, but at least take the time to fact-check before dumping on other people's work.

If not for the Mozilla branding, I'd have assumed this was a scam/malware, especially since its name is so similar to Anything-LLM.


For English-only and non-commercial use, Parakeet has been almost flawless for me.

https://huggingface.co/nvidia/parakeet-tdt-0.6b-v2

I use it for real-time chat and generating subtitles. It can do a full TV episode in less than a minute on a 3090.

Whisper always hallucinated too much for me. It's more useful as a classifier.


You can run an OpenAI-compatible endpoint and point open-webui at it if you want this. I had to add a function to filter out markdown lists, code, etc., as the model was choking on them.
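The filter I used isn't shown above, but a minimal sketch of that kind of markdown pre-processing might look like this (the regexes are illustrative, not exhaustive):

```python
import re

def strip_markdown_for_tts(text):
    """Remove markdown constructs (fenced code, lists, headings) that
    tend to trip up TTS models, keeping the plain prose."""
    # Drop fenced code blocks entirely.
    text = re.sub(r"```.*?```", "", text, flags=re.DOTALL)
    # Drop inline-code backticks but keep their contents.
    text = re.sub(r"`([^`]*)`", r"\1", text)
    # Remove list bullets and heading markers at line starts.
    text = re.sub(r"^[ \t]*(?:[-*+]|\d+\.)[ \t]+", "", text, flags=re.MULTILINE)
    text = re.sub(r"^#{1,6}[ \t]+", "", text, flags=re.MULTILINE)
    # Collapse the blank lines left behind.
    return re.sub(r"\n{3,}", "\n\n", text).strip()
```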


This finetune seems pretty stable (1b llasa) https://huggingface.co/spaces/HKUST-Audio/Llasa-1B-multi-spe...

1B is actually huge for a TTS model. Here's an 82m model with probably the most stable/coherent output of all the open weights tts models I've tested: https://huggingface.co/spaces/hexgrad/Kokoro-TTS

But if you mean zero-shot cloning, yeah they all seem to have those slurred speech artefacts from time to time.


> Interesting that there isn't a mention of Orpheus as prior art either

Llasa-3b (https://huggingface.co/HKUSTAudio/Llasa-3B) came out before Orpheus (https://huggingface.co/canopylabs/orpheus-3b-0.1-ft).

> it's the exact same thing.

They're very similar, but they're not the exact same thing.

Llasa uses xcodec2, a much simpler, lossless 16khz wav codec. This makes it superior for one-shot voice cloning.

Orpheus' 24khz snac codec is lossy which makes it difficult to use for zero-shot cloning as the reference audio gets degraded during tokenization. You can test this here: https://huggingface.co/spaces/Gapeleon/snac_test

But when finetuned on 50+ audio samples, it produces much cleaner 24khz audio than Llasa, and the snac model is much easier to run on consumer hardware than xcodec2 (realtime speech needs 87 tokens/s, which an RTX 3080 can achieve, for example).
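The degradation from tokenizing reference audio is easy to get a feel for with a toy quantizer. This is not snac's architecture (that's a learned neural codec); a uniform quantizer just stands in for the lossy bottleneck here, showing that a coarser roundtrip means a noisier reconstruction:

```python
import math

def quantize_roundtrip(signal, levels):
    """Crude stand-in for a neural codec: uniformly quantize samples in
    [-1, 1] to `levels` steps and reconstruct. Real codecs are far
    smarter, but the encode/decode roundtrip is still lossy."""
    step = 2.0 / (levels - 1)
    return [round((s + 1.0) / step) * step - 1.0 for s in signal]

def snr_db(original, reconstructed):
    """Signal-to-noise ratio of the reconstruction, in dB."""
    sig = sum(s * s for s in original)
    err = sum((a - b) ** 2 for a, b in zip(original, reconstructed))
    return 10.0 * math.log10(sig / err) if err else float("inf")

# A 220 Hz tone at 16 kHz, 0.1 s long.
tone = [math.sin(2 * math.pi * 220 * t / 16000) for t in range(1600)]
coarse = snr_db(tone, quantize_roundtrip(tone, 16))    # few levels: noisy
fine = snr_db(tone, quantize_roundtrip(tone, 1024))    # more levels: cleaner
```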


Do you happen to know why Orpheus and Llasa use finetuning for voice cloning?

Zonos uses 128-float embeddings for voices, which seems so much nicer: you can just mix and match voices without changing the model.
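For illustration, mixing two such fixed-size embeddings is just interpolation. The 128-dim vectors and the mix_voices helper below are hypothetical, not Zonos's actual API:

```python
def mix_voices(emb_a, emb_b, alpha):
    """Linearly interpolate two fixed-size speaker embeddings.
    With an embedding-conditioned model, the blended vector can be fed
    in as a new 'voice' without touching the model weights."""
    assert len(emb_a) == len(emb_b)
    return [(1 - alpha) * a + alpha * b for a, b in zip(emb_a, emb_b)]

# Hypothetical 128-dim embeddings extracted from two reference speakers.
voice_a = [0.1] * 128
voice_b = [0.5] * 128
blended = mix_voices(voice_a, voice_b, 0.25)  # 75% voice A, 25% voice B
```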


No, you just condition it with text-voice token pairs, and then when conditioning further inference with text, the generated voice tokens tend to match the pairs earlier in the context.
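A sketch of what that context layout looks like for a decoder-only speech LM (the special-token names here are made up; real models like Llasa define their own delimiters):

```python
def build_cloning_prompt(ref_text_tokens, ref_audio_tokens, new_text_tokens):
    """In-context voice cloning: prepend a (text, audio) pair from the
    reference speaker, then the new text. The model continues with audio
    tokens that imitate the pairing it saw earlier in the context --
    no weight updates involved."""
    BOS, TEXT, AUDIO = ["<bos>"], ["<text>"], ["<audio>"]
    return (BOS + TEXT + ref_text_tokens + AUDIO + ref_audio_tokens
                + TEXT + new_text_tokens + AUDIO)

prompt = build_cloning_prompt(["hi"], ["a1", "a2"], ["bye"])
# The speech LM would now generate audio tokens after the final <audio>.
```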


Isn't xcodec2 also lossy? I thought it was also just another neural codec (50 tok/s, single codebook).

What are people using to upsample back to 44.1 or 48 kHz? Anything fancy?


They’re both lossy. They use a VAE-VQ type architecture trained with a combination of losses/discriminators. The differences are mainly the encoder/decoder architecture, the type of bottleneck quantization (RVQ, FSQ, etc.) and of course the training data.
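For intuition, here's a toy residual VQ with hand-made codebooks, sketching only the RVQ bottleneck mentioned above; the VAE encoder/decoder and discriminator training are omitted:

```python
def rvq_encode(x, codebooks):
    """Residual vector quantization on a toy example: each stage
    quantizes what the previous stage missed, so a few small codebooks
    approximate the input progressively. (FSQ, by contrast, rounds each
    dimension to a fixed grid.)"""
    residual, codes = list(x), []
    for cb in codebooks:
        # Pick the nearest codebook vector to the current residual.
        idx = min(range(len(cb)),
                  key=lambda i: sum((r - c) ** 2
                                    for r, c in zip(residual, cb[i])))
        codes.append(idx)
        residual = [r - c for r, c in zip(residual, cb[idx])]
    return codes, residual  # residual is the remaining (lossy) error

# Two tiny hand-made codebooks for 2-dim vectors.
cb1 = [[0.0, 0.0], [1.0, 1.0]]
cb2 = [[0.0, 0.0], [0.25, -0.25]]
codes, err = rvq_encode([1.2, 0.8], [cb1, cb2])
```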

