Hacker News
Voice Isolator: Strip background noise for film, podcast, interview production (elevenlabs.io)
163 points by davidbarker on July 3, 2024 | 131 comments


What is the current SOTA for voice->text?

I have a recording I've been sitting on for 2 years (a guest lecture which a friend recorded) that contains a very heavy amount of background noise; you can just barely make out what the lecturer is saying. I wonder if there is any hope I will ever be able to read a transcript of it.

I can figure out what the lecturer is saying (maybe only because I have some context about what he is talking about), but it is too painful to sit through 2 hours of it and try to transcribe it.

I tried uploading the audio file to this service, but basically got nothing useful back.


I don't know about SOTA, but 'Adobe Podcast Studio', a web app that I believe is still free / in beta, offers excellent sound cleanup. So much so that many podcast / radio producers I know no longer frequently use iZotope RX, one of the industry-standard tools. Adobe are obviously horrendous, but if it's for a one-time use I'd give it a go. The feature you want is the 'enhance speech' filter.

https://podcast.adobe.com/enhance


STT: Whisper

TTS: GPTSOVITS / StyleTTS2

VTV: RVCv2

Open source isn't really doing a great job at voice, music, or video. It's managing to keep up in LLM and image spaces, but it's falling far behind in the multimedia department.


This is really helpful, thanks! I have a bunch of audio that I need to clean up and this looks like it could fit the bill.

Do you know if there are any license issues with this? I don't see any license page--will they train/retain the recording?


I'm not sure - that's a really good question. I'd assume anything uploaded to a deep learning system is retained for future training, especially given the recent furore over Adobe's licensing terms, but I have no information on the licensing of this tool specifically.


Gemini 1.5 Pro multimodal with an audio file input plus a text prompt that asks for a transcription. You can add all the relevant context you know to the text prompt, and you can also chat with it before asking for the full transcription to clear up any confusion. There are no products for this yet; it's a new capability.
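
A rough sketch of that workflow with the google-generativeai Python SDK (the model name, file name and prompt here are placeholders; check the current docs for upload and quota details):

    import google.generativeai as genai

    genai.configure(api_key="YOUR_API_KEY")

    # Upload the audio, then ask for a transcription with whatever context you have.
    audio_file = genai.upload_file("lecture.mp3")
    model = genai.GenerativeModel("gemini-1.5-pro")

    response = model.generate_content([
        audio_file,
        "Transcribe this lecture verbatim. Context: it is a guest lecture about "
        "<topic>; expect names/terms like <...>. Use quotation marks when the "
        "speaker quotes someone.",
    ])
    print(response.text)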

I just started trying this today with my wife's hour-long interviews in Latvian and it is extremely good, far better than any transcription model. This is a huge SoTA LLM with audio tokens, so it just has vastly more capability than Whisper or whatever. In my case it nails all kinds of brand names, weird neologisms and loan words, and it writes inline quotation marks when the speaker is quoting someone, and so on.

This is what GPT-4o supposedly can do in the version OpenAI has postponed rolling out.

If you want, I can try to do it for you if you send the audio.


I've noticed it's necessary to cut long audio into sections, otherwise it starts getting confused and repeating itself. The output token limit only lets you transcribe a few pages at a time anyway.
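
A rough way to do the chunking with pydub (a sketch; the 10-minute chunk length is an arbitrary choice, pick whatever keeps each piece under the output limit):

    from pydub import AudioSegment  # pip install pydub (needs ffmpeg installed)

    audio = AudioSegment.from_file("lecture.mp3")
    chunk_ms = 10 * 60 * 1000  # 10-minute chunks, in milliseconds

    for i, start in enumerate(range(0, len(audio), chunk_ms)):
        audio[start:start + chunk_ms].export(f"lecture_part{i:02d}.mp3", format="mp3")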


fwiw, the commercial services that do this are called "audio forensics" (unsurprisingly, they're usually hired by cops and lawyers). You pay them to use their (often expensive) software tools to clean up audio and provide a transcription.

I get the appeal of automating this task but the SOTA is not to automate it at all.


I would try, for 10 minutes, the following solution: listen to the lecture and repeat every word the lecturer says, then use the recording of your voice for the voice-to-text step. If it works, do it for the whole talk.


I’ve had good results with Whisper. You can use the OpenAI API and it is also open source:

https://github.com/openai/whisper
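
For running it locally (no API), a minimal sketch with the openai-whisper package; model size is the main speed/accuracy trade-off:

    # pip install openai-whisper  (needs ffmpeg on the PATH)
    import whisper

    model = whisper.load_model("medium")  # "large-v3" is more accurate but slower
    result = model.transcribe("lecture.mp3")
    print(result["text"])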


And if you have a Mac, get MacWhisper. It's been a godsend for transcribing almost anything. Usually pretty good if the main voice is discernible at all—though in OP's case, if the main voice is almost indistinguishable it might not do amazing.


I've had good luck with Buzz:

https://github.com/chidiwilliams/buzz


Deepgram Nova 2 is among the best right now, more accurate than Whisper in my testing.


You can upload your file and try out Deepgram for free, just to see what the results look like for your audio. No harm in trying: https://deepgram.com/free-transcription

Disclosure: I work for Deepgram


We have built a dockerized open source stack for Whisper + Llama3 + MeloTTS. Whisper and MeloTTS work fine for our use cases for now. https://github.com/bolna-ai/bolna/tree/master/examples/whisp...


Recent ASR models are already robust to noise thanks to SpecAugment and large-scale training data. If you use these noise reduction services to remove noise, ASR models will have a harder time recognizing the denoised audio. The reason is that noise reduction creates distortions the ASR model didn't see during training.


Try giving Audacity a shot to clean up the audio; it has a built-in, configurable noise reduction feature. I've used it with varying degrees of success, but it works especially well on the same kinds of sounds ANC headphones are good at blocking.


You can pay people online trivial amounts of money for this; it will be far cheaper and quicker than waiting for the right AI.

By the way, even once we get a sufficiently good AI, how do you verify the output without listening to the whole thing anyway?

It's only 2 hours; if you're a fast typist it would take you at most one work day to transcribe it yourself, or <$200 to have a professional do it.


It's not an issue of typing speed, it's that puzzling out what was said takes a number of re-listens, at high gain which hurts my ears after a while since the voice is just barely above the noise floor.


Please think twice before sharing your personal voice samples with a random online website just because they offer a cool demo.


I suspect in the near future people are going to report randomly hearing themselves in advertisements.


Friends and loved ones seem like a better method.


Most people either wouldn't recognize their own voice or would hate hearing it.


Imagine if they were able to make a model that's your voice but the way that you hear it. That'd be so neat. You could hear how other people hear their own voice and have fun playing with it for an afternoon before moving on to the next shiny new toy.


"Voice Isolator costs 1000 characters for every minute of audio." - can someone expand on this currency for someone out of the loop?


Are there actual before/after samples? I’m sure as hell not sending samples of my voice to AI voice cloning company.


I mean, you don't have to?

Set up an audio source, for example your phone, playing a reasonable length of talking: a YouTube video, say, or a podcast on Spotify. Then record it from your computer or another recording device, and test with that.


I'd like to have something like this but for live calls: a process that takes two audio inputs and "subtracts" the noise in one input from the other. My use case would be two dynamic microphones, one pointed at the window and one that I'm using for a conference call. I'm assuming that having two inputs should make real-time (20 ms?) processing easier and might require less compute.

If such a process can output a clean signal, I could chain it with BlackHole and use the processed signal as the input for the call.


DeepFilterNet doesn't use a second microphone but it does do an absurdly good job of removing not-speech from inputs in realtime. Check out the demo video linked in the README. iirc they demonstrate removing guitar sounds and even a vacuum cleaner.

It does take some technical elbow grease to integrate, but I've used it in calls and while gaming on Linux via PipeWire, to great effect.

https://github.com/Rikorose/DeepFilterNet
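
For offline (non-realtime) use, the Python API is roughly this, if I'm remembering the README correctly; the realtime PipeWire/LADSPA setup is documented in the repo:

    # pip install deepfilternet
    from df.enhance import enhance, init_df, load_audio, save_audio

    model, df_state, _ = init_df()                        # load the default DeepFilterNet model
    audio, _ = load_audio("noisy.wav", sr=df_state.sr())  # resample to the model's rate
    enhanced = enhance(model, df_state, audio)
    save_audio("enhanced.wav", enhanced, df_state.sr())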


The demo looks very impressive, I'll try it out, thank you!


Assuming this is set up so that the same sound is coming through both microphones, just one with your voice on top, you could theoretically do this by feeding the "to be cancelled out" signal through something that inverts its polarity and overlaying the two signals. I'm sure it wouldn't be perfect, but you might be able to tune it to do it properly. This is how active noise cancelling works!
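
As a toy illustration of the idea, offline, and assuming the two mono recordings are already sample-aligned at the same rate (which, as the reply below notes, is the hard part in practice):

    import soundfile as sf  # pip install soundfile

    voice_plus_noise, sr = sf.read("call_mic.wav")    # mic pointed at you
    noise_only, sr2 = sf.read("window_mic.wav")       # mic pointed at the window
    assert sr == sr2

    n = min(len(voice_plus_noise), len(noise_only))
    # Inverting the noise reference and summing is the same as subtracting it.
    cleaned = voice_plus_noise[:n] - noise_only[:n]
    sf.write("cleaned.wav", cleaned, sr)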


Thank you for the idea. I've tried in the past to do something similar but couldn't get it right. I did try to rely on ideas from ANC, but my domain knowledge is very lacking. It's been over 2 years since, so I might give it another chance and see if any off-the-shelf library/product has been released since then.


That doesn’t work though, the requirement for the timing to align waveform by waveform is too high, and the speed of sound is too slow. Also the frequency response isn’t going to match exactly.

To do it right you really want digital analysis.


Krisp does a pretty great job of this currently without the two mics.


Thanks, haven't tried it yet. Are you using it with a dynamic mic? Did you get the same crisp detailed sound as you get when using the mic as a raw input?


I've only used it for conferencing and never really noticed a difference except in really noisy environments like a data center row.


I tried Krisp a bit; it works okay, but it takes 1.5+ GB of RAM even when not actively used, so maybe I'll check it again in the future to see if it does better on that.


Don't most smartphones these days have a second noise cancelling microphone?


I think they do. I've also had huge problems with Zoom when talking to my parents, because for some reason they are aggressively muted for several seconds while nobody else on the call is canceled like that. If anyone else on the call so much as clears his throat, my parents are muted and we all have to sit silently waiting for them to be able to talk. Annoying as shit.

I suspect that this is noise cancellation that's failing because they keep their phone far away from themselves, to fit two people in the shot; and audio is bouncing off the walls or otherwise suffering enough delay to mess it up.


For this problem specifically, I have found that turning off auto mic volume helps eliminate the "muted for the first x seconds of speaking" behavior.


Thanks for the info.


Yes but I was looking for a way to run a process that does that on your desktop because I'd like to do ANC using two dynamic mics.


I've been looking at https://github.com/xanguera/BeamformIt but haven't had time to give it a go yet


Nvidia Broadcast does an amazing job of removing background noise from audio. Only for Nvidia GPUs, though, and I'm not sure if it's still RTX-only.


You'd think every DAW would have something like this: Subtract everything that's stereo (AKA keep only the sound that's present in both channels).

I have old mono records that I wanted to clean up. In that case, any stereo content is obviously scratches and surface noise, so removing it would be most of the job. But nope... not one DAW offered this filter, despite offering the opposite (removing mono content and keeping the stereo).

And yes I did try removing the mono content and then subtracting the result from the full source, but this didn't work; I don't remember (or know) why.


That's pretty interesting. I don't suppose you could do it with some manual physics/electrical engineering wizardry like Dave Rat uses in this video for canceling out audio for a centre speaker?

https://youtu.be/AxZOv0baN2Y?si=fc51MQHRItT6nYKI


Every DAW does. You want M(id)/S(ide), a.k.a. sum and difference: L+R / L-R, etc. If they don't go by those names, you mix the channels yourself; usually you can apply a negative gain to subtract, or there's a separate invert switch. M = L + R, S = L - R.
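
In code form, the same sum/difference math outside a DAW (a sketch with the soundfile library; assumes a 2-channel file):

    import soundfile as sf  # pip install soundfile

    stereo, sr = sf.read("record.wav")          # shape: (samples, 2)
    left, right = stereo[:, 0], stereo[:, 1]

    mid = (left + right) / 2    # what's common to both channels
    side = (left - right) / 2   # what differs between channels (clicks, surface noise)

    sf.write("mid_only.wav", mid, sr)    # keep only the mono/common content
    sf.write("side_only.wav", side, sr)  # the stereo-only residue, for comparison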


Voxengo MSED is a free VST that can set gain/level on mid and side independently. https://www.voxengo.com/product/msed/


Cool, thanks. I'll check it out.


Maybe you just didn't know how to do it. You can do this easily in FL Studio (even the free version) using Stereo Shaper.


Never heard of "FL studio."


Really? This is (probably) the most popular DAW nowadays! It has a free trial, and the only limitation is that you can't open previously saved projects, but rendering works as normal. Give it a try. If you still need to isolate the mono signal, put Stereo Shaper on the mixer channel and choose (right-click the two arrows at the top of the plugin) one of the presets called "mid" or something, I don't remember now. It's that easy.


Thanks. I'll check it out. I don't know who it's "the most popular" with, but nobody I know uses it. Logic, Pro Tools, Ableton, Reaper, Audition, even Twisted Wave for simple stuff... but never that one.


I searched for "most popular DAW"

Second on the list - https://www.musicradar.com/news/the-best-daws-the-best-music...

Fifth on the list - https://producerhive.com/buyer-guides/daw/best-daws/

Second on the list - https://mixingmonster.com/best-daws/

It's pretty popular and I'm sure it has the most tutorials on youtube


They also just announced licensed celebrity voices in their Reader app this past week.

Judy Garland, Burt Reynolds, Laurence Olivier, and James Dean are the first ones.


All four of whom are deceased. I guess they licensed from their estates?


In California, such rights last 70 years, so Dean loses protection next year, Garland in 2039, and the others much later as they died fairly recently.


Yikes, let the dead rest in peace


Tell that to their descendants' bank accounts.


How is this different from Auphonic?

https://auphonic.com/features


Why does it have to be different?


... because this is by the market leader in AI speech synthesis?


I prefer https://product.supertone.ai/clear which is a one-time payment rather than subscription-based.


Ultimate Vocal Remover 5 is free, though.


The one thing I hate about this: there are so-called "first amendment auditors" who professionally annoy people on the street, trying to provoke a reaction. They monetize the resulting video on YouTube.

You used to be able to pull out your phone and play Disney soundtracks or Taylor Swift music which would result in the video being non-monetizable. But improvements in audio isolation techniques have now defeated this countermeasure. Being a professional annoyance is once again a career choice.

Edit: this is one instance I've personally seen: https://www.instagram.com/p/C7IEFxQSJQw/?hl=en&img_index=1


I know the sorts of “creators” you’re talking about, but I’ve never heard of this as a response before.

Are there really that many people who 1) are aware that this could be effective, and 2) are quick witted enough to pull their phone out and play music in response to being harassed?



A single police car did*


While 1st amendment auditors are cringe, they only annoy police. This comment is definitely pro-LE coded, and the irony of its criticism of people exercising their 1st amendment rights (regardless of how annoying their method) does not escape me.

Perhaps we should demonetize every form of journalism and media that annoys this guy!


I'm not sure what videos you're watching, but the majority of the ones that pop up on my feeds are them annoying non-LE government workers and regular people trying to use government services like the post office, passport office, etc. Yes, the police show up eventually, but only after they've harassed people just trying to do their jobs and live their life.


Never seen anything but cops, but I'll take your word for it. I also find them annoying and therefore rarely watch them, as I choose not to watch things that annoy me. Not sure why y'all do.


The ones I've seen in person do not "only annoy police"; they may have been trying to provoke a police response, but they were just harassing normies on the street as well. Linking to the ones I've seen: https://www.instagram.com/p/C7IEFxQSJQw/?hl=en&img_index=1


You mean like "Billy on the Street"? Or something closer to the "comedy gang" from Viral Hit?

Can't really imagine why you would both give a response to an interview-style question, while being recorded, and simultaneously not want that response to be public. Or are they doing it secretly?


I added this to the original post, but here's the incident that I saw that made me aware of the whole scene: https://www.instagram.com/p/C7IEFxQSJQw/?hl=en&img_index=1


This technique relies on bizarrely over-powerful intellectual property infringement counter-measures built into youtube, the platform. Relying on it gives me serious XKCD spacebar-heater vibes. It wasn't designed for that.

Yes, people who are assholes in public are annoying. Shoplifting and bank robbing are probably also career choices. Don't rely on a side effect of "big copyright" systems to save us.


>You used to be able to pull out your phone and play Disney soundtracks or Taylor Swift music which would result in the video being non-monetizable. But improvements in audio isolation techniques have now defeated this countermeasure.

In my opinion, this is a bug, not a feature. If you pull out your phone and play Taylor Swift, you are in fact making a public performance without permission. Even if you had permission (as some cops allegedly do to use certain bands' music for this purpose), this is not the correct method to deal with professional annoyances.

As a police officer, your job is to be the adult in the room. Society is trusting you with a tremendous amount of power. If you can't handle some annoying whiny YouTubers professionally without using "countermeasures", you should hang up your badge and get another job.


Cops on HN!? Just a heads up, you have a choice not to watch and inadvertently reward people who annoy you on the internet.


Same with police and citizen videographers. Being a citizen reporter is once again an option.


> You used to be able to pull out your phone and play Disney soundtracks or Taylor Swift music which would result in the video being non-monetizable.

This is actually illegal for you to do.


Why not just sue the hell out of them? Would also break their business model really fast.


Because there's no legal basis to sue them.


Maybe there is, maybe there isn't. But they'd be forced to pay exorbitant fees to lawyers regardless.


If a judge finds that the lawsuit was frivolous then their exorbitant lawyer fees are now yours to pay.


>professionally annoy people on street

>Provoke a response

They mostly do it to cops and people in authority. It's their right to do so, they should be able to. They expose so many cops and authoritarians who blatantly do not respect citizens' civil rights. Good for them.

The fact that "oh no you're annoying me I'm going to arrest you because you're annoying" is even a talking point from you is baffling.


> They mostly do it to cops and people in authority.

In Santa Barbara there is a group that targets random businesses; random shops and restaurants with outdoor eating.

It sucks for the business, it sucks for their clients, it sucks for random people walking by on the street.

I'm all for limiting the unchecked authority we give police, we need to end qualified immunity, etc. But we should take the problem on directly. And I'm all for filming cops who abuse their privilege. But the reality I've seen in person is this is sucky.

> The fact that "oh no you're annoying me I'm going to arrest you because you're annoying" is even a talking point from you is baffling.

Who are you replying to? What did I say that's even close to this. Talk about baffling.


>Who are you replying to? What did I say that's even close to this. Talk about baffling.

Good point, I may have misread part of what you said.


Tried it with several files.

It didn't seem to do much better than audio filters for ffmpeg that have been tuned for removing background noise and enhancing voice. Maybe I'm missing something or using the wrong source data.
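
For reference, the sort of filter chain I mean, wrapped in a small Python call: highpass/lowpass to band-limit to speech plus the afftdn FFT denoiser. The parameter values are just a starting point to tune, not anything authoritative:

    import subprocess

    # Band-limit to roughly the speech range, then apply ffmpeg's FFT denoiser.
    # nf is the noise floor in dB; adjust it per recording.
    subprocess.run([
        "ffmpeg", "-y", "-i", "input.wav",
        "-af", "highpass=f=80,lowpass=f=8000,afftdn=nf=-25",
        "output.wav",
    ], check=True)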


I have used ai|coustics previously and I think their output quality is way better than Eleven Labs or Auphonic. They really do a good job there.


The video is impressive, but for the files I uploaded it hallucinated or changed the voices quite a bit. IMO ai-coustics.com or Auphonic is a better alternative. ai-coustics even offers video upload and lets you choose the enhancement level.


Sorry, a noob here: is there something that can strip the voice/talking out of a YouTube video but leave everything else 100% untouched? I don't know what this is called, or is it exactly the same thing I was looking for?


Several tools can do this, but I would suggest the audio program Audacity to start with. YMMV, as it's always a trade-off between not removing enough of the voice and erasing too much of the background at the same time. Other programs often call it 'karaoke mode'.


Thanks for your wiki reply.


Or I could just download Virtual DJ and run it for free on a computer, doing this locally, right now, with zero fancy hardware and arguably some of the best stems algorithms on the market.


I had very loud background music playing, and while it could completely eliminate that (impressive!), the voice was much more garbled than when there wasn't any background noise playing.


My test sample, me talking with my baby babbling in the background, returned a silent audio track. I guess neither I nor the baby is considered signal ~_~


I’m sorry you had to find out this way, Deckard (Rachel?).


Are there any open source STT solutions that also handle speaker diarization? MacWhisper has promised this for a long time without delivering.


Elevenlabs has some pretty cool stuff, but I really despise how it's all cloud-based. I wish there were an audio AI company following a path similar to what Topaz has been doing for video/photo AI with desktop software. Open source has been lagging more than I expected in this area too.


GPTSOVITS, StyleTTS2, and RVCv2 are still the open source SOTA for TTS and voice conversion. These models are unfortunately really far behind Elevenlabs' offerings. We're not much further along than the Tacotron2 (2018) days.

Elevenlabs is the only model company I can think of that is ahead of everyone else in their category. Video and LLMs are hyper competitive, but voice is a one-company game. Elevenlabs hired up everyone in the space and utterly dominates.

I'm hoping this changes. They've been in pole position for over a year and a half now with nobody even coming close.

There's probably a reason why they're so research-oriented. The minute an open source model is released that rivals Elevenlabs in quality, they're in big trouble. There's absolutely zero moat for their current products and there are fifty companies nipping at their heels that want to be in the same spot. Elevenlabs' current margins are juicy.


What's it going to take to do this locally?


I think I'll stick to Nvidia Broadcast for this.


From the "How much does it cost?" entry in the FAQ: "Voice Isolator costs 1000 characters for every minute of audio."

Since when are characters a currency?


Their pricing page is full of weird uses of "character" as a unit too. Things like "monthly character limit", "additional character pricing", and so on.

https://elevenlabs.io/pricing

Nowhere is this novel usage of "character" defined. I know about text characters, and story characters. But this seems to be different. It's hard to imagine why they didn't define what they mean by "character" or just make the pricing model more straight-forward.


I know this is a common kneejerk reaction nowadays, but it's hard not to wonder if this is due to using AI to generate the text on these pages.


I assure you they didn't use an LLM to invent their pricing strategy. Elevenlabs subscription levels are designed around text-to-speech as the primary use case. They charge by the character when converting text to speech, so characters are effectively the currency on their site. 1000 characters per minute makes sense in that context, and I find it surprisingly expensive compared to generation.


I still don't think I understand the pricing model based on your additional info. If characters are currency that you buy with real money, what does "characters per minute" mean? I guess if each character is $0.01 and then you want 20 minutes of audio, you can say that's $10.00 per minute for 20 minutes, but that means that the number of characters in the actual text wouldn't affect it at all, so...why even make up a different currency then?


Not to mention: What if the vocal portion of an audio clip doesn't translate to characters? What if it's all "oooh" and "ahhh," as in a choral segment?


"Audio generation consumes characters. 1k characters approximate 1 minute of audio. Character counts reset each billing cycle without rollover."


Do you not know of character meaning, for example, a single letter?


I know that meaning, as well as the meaning from stories. What I don't know is what this has to do with removing background noise from audio.


Ah I see. Have a look at the pricing page then: https://elevenlabs.io/pricing

Everything else they bill in characters, so that's the "currency" customers have. It works out I think as costing roughly the same per minute as generating audio.


“Costs 1000 characters for every minute of audio” suggests this is not about the text characters in the sample. It makes it sound like a form of digital credit.


Definitely a form of credit. Of course it seems to be rather obtuse but it's still possible to make some sense of it given the information on their pricing page.

The definitely-most-popular Creator price point is 100,000 "characters" for $22, meaning, according to the FAQ, 100 minutes of audio listening costs $22. Not sure why they can't just say "100 listening minutes" or whatever.
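
The arithmetic, for what it's worth, using the FAQ's 1000 characters per minute and the Creator numbers above:

    price_usd = 22            # Creator tier, per month
    characters = 100_000      # included characters per month
    chars_per_minute = 1_000  # from the Voice Isolator FAQ

    minutes = characters / chars_per_minute
    print(minutes, price_usd / minutes)  # 100.0 minutes, 0.22 dollars per minute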

Though I just noticed they also claim that the 100k characters is ~120 minutes of audio but 30k is 30 minutes of audio. I'm not sure where they're getting their numbers but it looks like they're either being dishonest about the former or underselling the latter.


> The definitely-most-popular Creator price point is 100,000 "characters" for $22, meaning, according to the FAQ, 100 minutes of audio listening costs $22. Not sure why they can't just say "100 listening minutes" or whatever.

Because this part of the product is per minute, but the other (and earlier) one is charged per character of text.

> Though I just noticed they also claim that the 100k characters is ~120 minutes of audio but 30k is 30 minutes of audio. I'm not sure where they're getting their numbers but it looks like they're either being dishonest about the former or underselling the latter.

It's just a very rough estimate, the comparison points are "about 10 minutes, about half an hour, about 2 hours". I think that would be clearer if it said ~2 hours.


I believe that's exactly what he means by "text characters", as in a character (or letter) of some text.


> text characters


This confused me as well, and made me lose interest. Intentional non-answers like this are rather grating.


The company started out doing text-to-speech and created different pricing tiers based on number of characters in the input text. Now they're branching out into other things, but want to keep the same pricing plans, so the unit is still characters.


That's lame, akin to selling cars by using pears as a unit of exchange.


It's like premium currency in games: they don't want you to realize how much real money you are spending, and the currency packs are always sized so that you overspend because you can't buy the exact amount you need.


Knowing how elevenlabs works, that's also pretty crazily expensive. Imagine being someone who has a 4hr podcast they want to feed through this. Oh ok just need 240,000 characters!


This comment gets me 3.06 seconds of noise removal.


I agree with your annoyance here.

I did end up clicking thru to get the full story. On their pricing page (https://elevenlabs.io/pricing) they are up front, with several monthly tiers; their "most popular" $11/month tier says "100k Characters/mo (~120 mins audio)"


This is par for the course for any text-to-speech or speech-to-text service these days, check out ElevenLabs' other service pricing and it is similar - they have monthly pricing but the character usage is capped at each level.

Actually other major cloud providers, including AWS, Azure and GCP, have similar character / token / word count based pricing as well.


Text-to-speech and speech-to-text, both involve text, which has characters. Notably, this does not.


I'm not trying to argue with you but just want to point out, this is why they made an equivalence between time and characters. We know 1min of audio == 1000 characters so now we know how to translate characters to time.


If that's really it, then why the obtuse intermediate conversion unit? Just bill by the minute.


They charge in characters per time, not money per character


Looks good to me! Is this for video only or can you also upload m4a and mp3?


You can upload any audio or video file format.


Just tried it and it's not that great.

Many people are complaining that they have to spend quite a bit of credit to get the desired effect.

So likely this is just another "pay-to-fine-tune" scheme, not unlike "pay-to-play" schemes in online games: the hook is to get you to buy credits, which you then burn chasing the desired quality.

Besides, there are local TTS models now that rival Elevenlabs. Their pricing is ridiculous; $200/1M characters is way too expensive.



