
I'm surprised no GPU cards are available with like a TB of older/cheaper RAM.


Not surprising at all: Nvidia doesn't want to compete with their own datacenter cards.


AMD could arguably do it. But they have to stay focused just to keep their head above water at all, and "put 128GB or more of DDR5 RAM on a previous-gen GPU" is probably not part of that focus. Given the state of their software, it's not even certain the community could pick up the slack and turn such a card into a popular solution.


Their next generation of APUs will have a lot more memory bandwidth, and there will probably be plenty of AMD APU laptops with 64GB+ of RAM that can use HW acceleration without being artificially segmented the way Nvidia segments its products with soldered VRAM.


Nvidia's upcoming 'mini PC' has shared RAM up to 128GB for around $3k. Not a direct competitor, but pretty good for enthusiasts.

Hopefully it's at least quad-channel.


Because memory bandwidth is the #1 bottleneck for inference, even more than capacity.

What good is 1TB of RAM if the bandwidth feeding it is a straw? Models would run very slowly.

You can see this effect on 128GB MacBook Pros. Yes, the model will fit, but it's slow. 500GB/s of memory bandwidth can read through 128GB of RAM at most 3.9 times per second, so if your model is 128GB, your ceiling is about 3.9 tokens/s. In the real world it's more like 2-3 tokens/s after overhead and compute. That's too slow to use comfortably.
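To make that arithmetic concrete, here's a minimal sketch of the ceiling calculation; the 500 GB/s and 128 GB figures are just the example numbers above, not measurements:

    # Back-of-the-envelope decode speed ceiling for a bandwidth-bound model.
    # Every generated token has to stream (roughly) the whole set of weights
    # through the memory bus once, so tokens/s <= bandwidth / model size.

    def max_tokens_per_second(bandwidth_gb_s: float, model_size_gb: float) -> float:
        """Upper bound on single-stream decode speed, ignoring compute and overhead."""
        return bandwidth_gb_s / model_size_gb

    # Numbers from the example above (128GB MacBook Pro class machine).
    print(max_tokens_per_second(500, 128))   # ~3.9 tokens/s ceiling
    # Real-world throughput lands lower (2-3 tokens/s) once compute,
    # KV-cache reads and framework overhead are added.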

You're probably wondering why not just increase memory bandwidth too. Well, you need faster memory chips such as HBM and/or more memory channels, and both mean drastically more power consumption and bigger memory controllers. Fine, you'll pay for those. Now you're bottlenecked by compute. Just add more compute? Congratulations, you've just recreated the Nvidia H100 GPU. That'll be $20k, please.

Some people have tried to use AMD Epyc CPUs with 8-channel memory for inference, but those are also painfully slow in most cases.
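For a sense of why, here's a rough sketch of the theoretical bandwidth an 8-channel setup gives you and the decode ceiling that implies; the DDR4-3200 per-channel numbers are an illustrative assumption, not the spec of any particular Epyc SKU:

    # Rough theoretical bandwidth of a multi-channel CPU memory setup, and the
    # decode ceiling it implies for a large model.
    CHANNELS = 8
    MT_PER_S = 3200            # DDR4-3200: 3200 megatransfers/s (assumed figure)
    BYTES_PER_TRANSFER = 8     # 64-bit channel

    bandwidth_gb_s = CHANNELS * MT_PER_S * BYTES_PER_TRANSFER / 1000   # ~204.8 GB/s
    model_size_gb = 128

    print(f"theoretical bandwidth: {bandwidth_gb_s:.1f} GB/s")
    print(f"tokens/s ceiling for a {model_size_gb}GB model: {bandwidth_gb_s / model_size_gb:.1f}")
    # ~1.6 tokens/s, before accounting for NUMA effects, compute, and the fact
    # that sustained bandwidth is well below the theoretical peak.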


> Because memory bandwidth is the #1 bottleneck for inference, even more than capacity.

But there are a ton of models I can't run at all locally due to VRAM limitations. I'd take being able to run those models slower. I know there are some ways to get these running on CPU orders of magnitude slower, but ideally there's some sort of middle ground.


You can load giant models into normal RAM, such as on an Epyc system, but they're still mostly bottlenecked by low memory bandwidth.


You can offload tensors to CPU memory. It will make your model run much slower, but it will work.
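For example, llama.cpp-style layer offload does exactly this; here's a minimal sketch assuming llama-cpp-python and a local GGUF file (the path and layer count are placeholders):

    # Partial offload: keep as many layers as fit in VRAM on the GPU,
    # let the remaining layers run from system RAM on the CPU.
    from llama_cpp import Llama

    llm = Llama(
        model_path="models/big-model-q4_k_m.gguf",  # placeholder GGUF file
        n_gpu_layers=20,   # layers that fit in VRAM; the rest stay on the CPU
        n_ctx=4096,
    )

    out = llm("Explain memory bandwidth in one sentence.", max_tokens=64)
    print(out["choices"][0]["text"])
    # The CPU-resident layers are bottlenecked by system RAM bandwidth, so
    # generation gets slower the more layers you keep off the GPU.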



