FWIW, CUDA worked fine with the large model on my 2080Ti. The speedup is ridiculous, as expected: my Ryzen 3800X took almost an hour to transcribe a minute of speech, while the 2080Ti does it in 10-20 seconds.
I'm on Windows. According to Task Manager, dedicated GPU memory went from 1GB before the run to about 9.8GB for most of the run, peaking at 10.2GB. So it gets pretty close to the 11GB limit of my 2080Ti, it seems.
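
For anyone who wants to sanity-check their own card against these numbers, here is a rough sketch of the arithmetic in plain Python. The function name `memory_summary` and its parameters are made up for illustration; the inputs are just the Task Manager readings quoted above, not anything measured by the script itself.

```python
def memory_summary(baseline_gb, steady_gb, peak_gb, capacity_gb):
    """Summarise GPU memory readings taken by hand (all values in GB).

    Hypothetical helper: it only does arithmetic on numbers you read
    from a monitoring tool such as Task Manager.
    """
    return {
        # Memory added by the run during its steady state.
        "steady_delta_gb": round(steady_gb - baseline_gb, 1),
        # Memory added by the run at its highest point.
        "peak_delta_gb": round(peak_gb - baseline_gb, 1),
        # What is left on the card at the peak.
        "headroom_gb": round(capacity_gb - peak_gb, 1),
        # Share of the card in use at the peak, in percent.
        "peak_utilization_pct": round(peak_gb / capacity_gb * 100, 1),
    }


# Readings from the post: 1GB before, ~9.8GB during, 10.2GB peak, 11GB card.
summary = memory_summary(baseline_gb=1.0, steady_gb=9.8,
                         peak_gb=10.2, capacity_gb=11.0)
for name, value in summary.items():
    print(f"{name}: {value}")
```

Note that the 1GB baseline was already in use by other things before the run, so the headroom figure depends on what else is running on the card at the time.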