
Hold on, it doesn't do just speech recognition, but also language translation, in the same model?

What an interesting approach. What benefits does this have over having two dedicated models, one for speech-to-text, and another for translation?

It just seems so odd, given that the problems of speech-to-text and Spanish-to-English translation seem so different from one another (in terms of problem domain). Seems so unusual to have both handled by one model!

Does knowledge of speech-to-text carry over into knowledge of translation? Does knowledge of translation carry over into knowledge of speech-to-text? So weird.
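For what it's worth, the released model exposes both behaviours through a single entry point. Here's a minimal sketch, assuming the openai-whisper Python package (the audio file name is made up); the task option is the only thing that changes between the two calls:

    # Same weights for both tasks; only the decoding task token changes.
    import whisper

    model = whisper.load_model("small")

    # Speech-to-text in the spoken language (Spanish in, Spanish text out).
    transcription = model.transcribe("spanish_audio.mp3", task="transcribe")

    # Speech translation: Spanish audio in, English text out, same model.
    translation = model.transcribe("spanish_audio.mp3", task="translate")

    print(transcription["text"])
    print(translation["text"])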



It seems these days that language-oriented models are commonly multilingual by default. There are a lot of common threads in sentence construction across different languages: French and English have different rules, but they still both have things like nouns, adjectives, subjects, prepositions, etc. By training models on many languages you get a more robust understanding of language, and it saves you the trouble of making separate localized models for every language. I also believe the other languages help the models construct sentences in languages with very small training sets: if a model has a few examples in a rare language, plus good translations into a better-known language, it can provide decent support for the rare language.

We also see in image generation that multi-modal networks are more powerful than single-purpose networks. As we move towards more advanced AI systems, I suspect we will see more and more generalizable networks with distinct advantages over separate networks that get plugged together.


Would a multilingual model perhaps also be better at understanding non-native speech?


Good question but I don’t know the answer.


My understanding is that multi-modal models are the primary focus of OpenAI right now, given their stated goal of achieving AGI. This product is probably better thought of as an offshoot of their work to create a fully generalizable model, rather than as a specific attempt to provide translation/transcription services.


Judging from the chart in their GitHub README, Whisper performs much better at parsing Spanish audio than audio in any other language, and that in particular blows my mind. I would have expected English to be at the top of any such model, it being such an IT lingua franca.

Now I wonder if it works equally well with Spanish from Spain (and its different regions) and Spanish from the New World (in all its myriad flavours).
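If anyone wants to test that, here's a quick sketch, again assuming the openai-whisper package; you can pin the decoder to Spanish instead of relying on language auto-detection, and the clip names here are hypothetical:

    # Force Spanish decoding, then compare output across regional recordings
    # (the file names are made up placeholders).
    import whisper

    model = whisper.load_model("small")
    for clip in ["madrid.mp3", "buenos_aires.mp3", "mexico_city.mp3"]:
        result = model.transcribe(clip, language="es")
        print(clip, "->", result["text"])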


It sounds useful to me because you can use tone information to help with the translation, which text-to-text translation can't do. But I'm not sure if that's how this model actually works.



