Hacker News | new | past | comments | ask | show | jobs | submit | aphid_yc's comments | login

The issue I'm facing with this newer batch of larger models is trying to make longer contexts work. Is there a way to do so with sub-48GB GPUs without having to do CPU BLAS? If Mistral-123B is already restricted to 60K context on a 24GB GPU (with zero layers offloaded to the GPU and all other apps closed), and Llama-405B's KV cache is somewhere around 2-3x that size, even an A100 wouldn't be enough to fit 128K tokens of KV.

I thought before that, with koboldCPP, GPU VRAM shouldn't matter too much when only using the GPU to accelerate prompt processing, but it's turning out to be a real problem, with no affordable card being usable at all.

It's the difference between processing 50K tokens in 30 minutes and taking 24 hours or more to get a single response: the difference between 'barely usable' and 'utterly unusable'.

CPU generation is fine: ~half a token per second is not great, but it's doable. Though I sometimes feel more and more like cutting off responses and finishing them myself if a good idea pops up in one.
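For a rough sense of why long contexts blow up, KV cache size grows linearly in layer count, KV-head count, head dimension, and context length. A minimal back-of-the-envelope sketch, where the layer/head/dimension numbers are illustrative assumptions rather than published specs for any particular model:

```python
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, context_len, bytes_per_elem=2):
    """Bytes to store keys and values for `context_len` tokens.

    The factor of 2 covers keys + values; bytes_per_elem=2 assumes fp16.
    """
    return 2 * n_layers * n_kv_heads * head_dim * bytes_per_elem * context_len

# Hypothetical grouped-query-attention shapes for a ~123B and a ~405B model:
small = kv_cache_bytes(n_layers=88, n_kv_heads=8, head_dim=128, context_len=60_000)
large = kv_cache_bytes(n_layers=126, n_kv_heads=8, head_dim=128, context_len=128_000)

print(f"~123B @ 60K tokens:  {small / 2**30:.1f} GiB")
print(f"~405B @ 128K tokens: {large / 2**30:.1f} GiB")
```

Under these assumed shapes the per-token cost for the larger model is roughly double the smaller one's, and the 128K-token cache alone runs into tens of GiB before any weights are loaded, which matches the "even an A100 isn't enough" experience above for 40GB cards.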


It is not a cargo cult if you use methods that are more difficult. Can an LLM figure this one out?

abc 132 pyrogenics dndex vufwd bocjz pogl

How about this one?

password vectorization collins 2019 64k little, clotured aerobrakings audiologically cumins ashpans amphibian acciaccatura alligated denunciates burnouts babbles briskier cimbaloms brahmanist adiposes bridgeboards

Obfuscation can be as obscure as you want it to be. If you invent your own, no spammer will take the trouble to figure it out. Then again... not many readers will either.
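To illustrate the "invent your own" point, here is a toy homemade scheme: bury the real payload one character at a time among decoy words at a secret stride. Everything here (the scheme, the function names, the sample address) is made up for illustration; the argument is only that a one-off scheme forces anyone, human or scraper, to reverse-engineer it first.

```python
import random

def obfuscate(payload, decoys, stride=3, seed=42):
    """Insert one real character after every `stride` decoy words."""
    rng = random.Random(seed)
    out = []
    for ch in payload:
        out.extend(rng.choice(decoys) for _ in range(stride))
        out.append(ch)
    return " ".join(out)

def deobfuscate(text, stride=3):
    tokens = text.split(" ")
    # Every (stride + 1)-th token, starting at index `stride`, is real.
    return "".join(tokens[stride::stride + 1])

decoys = ["cumins", "ashpans", "babbles", "briskier", "adiposes"]
hidden = obfuscate("me@example.org", decoys)
print(hidden)
print(deobfuscate(hidden))
```

Anyone who knows the stride recovers the address in one line; anyone who doesn't sees word salad, which is exactly the readers-versus-spammers trade-off described above.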


Your examples are useless because humans would not understand them either.

