The issue I'm facing with this newer batch of larger models is trying to make longer contexts work. Is there a way to do so on sub-48GB GPUs without having to fall back to CPU BLAS? If mistral-123B is already restricted to 60K context on a 24GB GPU (with zero layers GPUfied and all other apps closed), and llama-405B's KV cache is somewhere around 2-3x that size, even an A100 wouldn't be enough to fit 128K tokens of KV.
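For reference, here's the back-of-the-envelope math behind those numbers, as a rough Python sketch. The per-model values (layers / KV heads / head dim) are what I believe the published configs say, so treat them as assumptions and check each model's config.json; the formula is just the standard GQA KV-cache size at fp16, which is the default cache type in llama.cpp-based backends.

```
# Back-of-the-envelope KV cache sizing. Assumptions (double-check against
# each model's config.json): fp16 cache; Mistral-Large-123B with 88 layers,
# 8 KV heads, head_dim 128; Llama-3.1-405B with 126 layers, 8 KV heads,
# head_dim 128.

def kv_bytes_per_token(n_layers, n_kv_heads, head_dim, bytes_per_elem=2):
    # One K and one V vector per KV head, per layer, per token.
    return 2 * n_layers * n_kv_heads * head_dim * bytes_per_elem

models = {
    "mistral-large-123b": (88, 8, 128),   # assumed config
    "llama-3.1-405b": (126, 8, 128),      # assumed config
}

for name, cfg in models.items():
    per_tok = kv_bytes_per_token(*cfg)
    for ctx in (60_000, 128_000):
        gib = per_tok * ctx / 2**30
        print(f"{name}: {ctx:>7,} tokens -> {gib:5.1f} GiB KV (fp16)")
```

With those assumed configs it works out to roughly 20 GiB of fp16 KV for 60K tokens on the 123B (which lines up with what I'm seeing on 24GB) and roughly 60 GiB for 128K on the 405B, before counting compute buffers or any offloaded layers, so even an 80GB card gets tight.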
I thought before that, when using koboldCPP, GPU VRAM shouldn't matter too much if the card is only being used to accelerate prompt processing, but it's turning out to be a real problem, with no affordable card being usable at all.
It's the difference between processing 50K tokens in 30 minutes and taking 24 hours or more to get a single response, between 'barely usable' and 'utterly unusable'.
CPU generation is fine; ~half a token per second is not great, but it's doable. Though I feel more and more like cutting off responses and finishing them myself whenever a good idea pops up in one.