I don’t care about the voices themselves but the speech recognition is borderline unusable sometimes. It interjects when it shouldn’t and will frequently hear things incorrectly.
At one point it misinterpreted me mentioning “tai chi” as “I can’t breathe” and responded with advice about medical emergencies.
Do you mean Siri's voice recognition? If so, 100% agreed. My iOS shortcut uses OpenAI's Whisper API for voice recognition, and Siri (English United Kingdom - Siri Voice 1) for text to speech.
I really like dictating things sometimes, and Whisper is perfect for that (automatic paragraphs inside the model itself would be nice but not a big deal).
If anyone is interested - the "Whisper speech recognition in iOS" part is based on this shortcut I found, which you can easily use yourself on both iOS and macOS (free except for the OpenAI API usage fees, obviously): https://giacomomelzi.com/transcribe-audio-messages-iphone-ai...
There are several distilled versions of Whisper that can run locally, so I don’t see what making API calls gets you other than increased latency and decreased reliability and data security.
That's really interesting. Whisper is generally considered the current state of the art in STT, and I've personally never experienced errors like the ones you describe. I've actually never had an error from Whisper.
First question, is there another STT you have used which works better for you?
Second question, is there any reason your voice might be considered unusual, like having a strong Welsh, Irish, or Indian accent, or being Deaf or Hard of Hearing?
Yeah, Whisper is pretty good out of the box in my experience, but the vast majority of the time I’m using it in my car. So the conditions aren’t ideal, or are out of distribution for Whisper. However, CarPlay is detectable and common enough from what I’ve heard.
Second, even if the transcription is correct, it cuts me off at inappropriate times. It’s hard to talk naturally without pauses.
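To illustrate why a pause can get you cut off: a voice assistant has to decide when you have finished speaking, and a common way to do that is to wait for a run of silence. Below is a minimal sketch, assuming a simple energy-based detector. I don't know what any particular assistant actually uses, and the function name, threshold, and frame counts are all made up for illustration.

```python
from typing import List, Sequence, Tuple


def segment_utterances(
    frame_energies: Sequence[float],
    speech_threshold: float = 0.1,   # hypothetical energy level counted as speech
    max_silence_frames: int = 5,     # hypothetical trailing silence that ends a turn
) -> List[Tuple[int, int]]:
    """Split per-frame energies into (first_frame, last_frame) utterances.

    An utterance is closed once `max_silence_frames` consecutive frames
    fall below `speech_threshold`.
    """
    utterances: List[Tuple[int, int]] = []
    start = None        # index of the first speech frame in the open utterance
    last_speech = None  # index of the most recent speech frame
    silence = 0         # consecutive silent frames since last_speech

    for i, energy in enumerate(frame_energies):
        if energy >= speech_threshold:
            if start is None:
                start = i
            last_speech = i
            silence = 0
        elif start is not None:
            silence += 1
            if silence >= max_silence_frames:
                # Enough silence: treat the speaker as finished.
                utterances.append((start, last_speech))
                start = None
                silence = 0

    # Close an utterance that was still open when the audio ended.
    if start is not None:
        utterances.append((start, last_speech))
    return utterances


# Four frames of speech, a three-frame pause, then four more frames of speech.
energies = [0.5] * 4 + [0.0] * 3 + [0.5] * 4

impatient = segment_utterances(energies, max_silence_frames=2)
patient = segment_utterances(energies, max_silence_frames=5)
```

With the short timeout the pause splits one sentence into two turns, which is the "it interrupted me" experience; with the longer timeout it stays one turn, at the cost of a slower response after you really are done.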
Oh that's really interesting. Probably an acoustic environment it's not used to, like you said, but also people talk differently when they're driving. Like the cadence of our speech is significantly different because of the way our mental focus changes. I have to imagine that changes some things.