See https://news.ycombinator.com/item?id=32929029 re accuracy, I'm working on a wider comparison. My models are generally more robust than open-source models such as Vosk and Silero, but I'm definitely interested in how my stuff compares to Whisper on difficult held-out data.
> Brute forcing the model with just traditional CPU instructions is fine, but… obviously going to be pretty slow.
It's not that simple. Many mobile ML accelerators are targeted more at conv-net image workloads, and current-gen Intel and Apple CPUs have dedicated hardware for accelerating matrix math (which helps quite a bit here, and those instructions were in use in my tests).
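As a rough sketch (not my actual benchmark code): large float32 matmuls in numpy dispatch to the platform BLAS (e.g. Accelerate on Apple silicon, MKL/OpenBLAS elsewhere), which is where any dedicated matrix hardware gets exercised, so a quick throughput check looks something like:

```python
import time
import numpy as np

# Large matmuls go through the platform BLAS, which is where
# dedicated matrix hardware (if present) would be used.
n = 1024
a = np.random.rand(n, n).astype(np.float32)
b = np.random.rand(n, n).astype(np.float32)

iters = 10
start = time.perf_counter()
for _ in range(iters):
    a @ b
elapsed = time.perf_counter() - start

# A matmul of two n x n matrices is ~2*n^3 floating-point ops.
gflops = 2 * n**3 * iters / elapsed / 1e9
print(f"~{gflops:.0f} GFLOP/s")
```

Comparing the number you get against the chip's nominal peak gives a feel for how close "just CPU instructions" already gets you.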
Also, I'm not sure which model they were running at 17x realtime on the 3090. (If it was one of the smaller models, that bodes even worse for non-3090 performance.) The 3090 is one of the fastest ML inference chips in the world, so it doesn't set realistic expectations for typical hardware.
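For concreteness, "Nx realtime" just means N seconds of audio processed per second of wall-clock time, so the 17x figure works out like this (hypothetical helper name, trivial arithmetic):

```python
def realtime_factor(audio_seconds: float, processing_seconds: float) -> float:
    """'Nx realtime' = seconds of audio transcribed per second of compute."""
    return audio_seconds / processing_seconds

# At 17x realtime, one hour of audio takes about 3.5 minutes on the 3090;
# a chip 10x slower would need over half an hour for the same audio.
one_hour = 3600.0
processing = one_hour / 17
print(round(processing / 60, 1))  # minutes of compute per hour of audio
```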
There are also plenty of optimizations that haven't been applied to the code we're testing, but I think it's fairly safe to say the Large model is likely to be slow on anything short of a desktop-GPU-class accelerator, just due to its sheer parameter count.