
I'm surprised no GPU cards are available with like a TB of older/cheaper RAM.


Not surprising at all: Nvidia doesn't want to compete with their own datacenter cards.


AMD could arguably do it. But they have to stay focused just to keep their head above water at all, and "put 128GB or more of DDR5 RAM on a previous-gen GPU" is probably not part of that focus. Given the state of their software, it's not even certain the community could pick up the slack and turn such a card into a popular solution.


Their next generation of APUs will have a lot more memory bandwidth, and there will probably be plenty of AMD APU laptops with 64GB+ of RAM that can use HW acceleration without being artificially segmented the way Nvidia segments its products with soldered VRAM.


Nvidia's upcoming 'mini PC' has shared RAM up to 128GB for around $3k. Not a direct competitor, but pretty good for enthusiasts.

Hopefully it's at least quad-channel.


Because memory bandwidth is the #1 bottleneck for inference, even more than capacity.

What good is 1TB of RAM if the bandwidth feeding it is a straw? Models would run very slowly.

You can see this effect on 128GB MacBook Pros. Yes, the model will fit, but it's slow. 500GB/s of memory bandwidth can read through 128GB of RAM at most 3.9 times per second, so if your model is 128GB, your ceiling is about 3.9 tokens/s. In the real world it's more like 2-3 tokens/s after overhead and compute. That's too slow to use comfortably.
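To make that arithmetic concrete, here's a minimal sketch of the ceiling calculation; the 500 GB/s and 128 GB figures are just the example numbers above, not measurements:

    # Back-of-the-envelope decode speed ceiling for a bandwidth-bound model.
    # Every generated token has to stream (roughly) the whole set of weights
    # through the memory bus once, so tokens/s <= bandwidth / model size.

    def max_tokens_per_second(bandwidth_gb_s: float, model_size_gb: float) -> float:
        """Upper bound on single-stream decode speed, ignoring compute and overhead."""
        return bandwidth_gb_s / model_size_gb

    # Numbers from the example above (128GB MacBook Pro class machine).
    print(max_tokens_per_second(500, 128))   # ~3.9 tokens/s ceiling
    # Real-world throughput lands lower (2-3 tokens/s) once compute,
    # KV-cache reads and framework overhead are added.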

You're probably wondering why not just increase memory bandwidth too. Well, you need faster memory chips such as HBM and/or more memory channels, and both mean drastically more power consumption and bigger memory controllers. Fine, you'll pay for those. Now you're bottlenecked by compute. Just add more compute? Congratulations, you've just recreated the Nvidia H100 GPU. That'll be $20k, please.

Some people have tried to use AMD Epyc CPUs with 8-channel memory for inference, but those are also painfully slow in most cases.
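For a sense of why, here's a rough sketch of the theoretical bandwidth an 8-channel setup gives you and the decode ceiling that implies; the DDR4-3200 per-channel numbers are an illustrative assumption, not the spec of any particular Epyc SKU:

    # Rough theoretical bandwidth of a multi-channel CPU memory setup, and the
    # decode ceiling it implies for a large model.
    CHANNELS = 8
    MT_PER_S = 3200            # DDR4-3200: 3200 megatransfers/s (assumed figure)
    BYTES_PER_TRANSFER = 8     # 64-bit channel

    bandwidth_gb_s = CHANNELS * MT_PER_S * BYTES_PER_TRANSFER / 1000   # ~204.8 GB/s
    model_size_gb = 128

    print(f"theoretical bandwidth: {bandwidth_gb_s:.1f} GB/s")
    print(f"tokens/s ceiling for a {model_size_gb}GB model: {bandwidth_gb_s / model_size_gb:.1f}")
    # ~1.6 tokens/s, before accounting for NUMA effects, compute, and the fact
    # that sustained bandwidth is well below the theoretical peak.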


> Because memory bandwidth is the #1 bottleneck for inference, even more than capacity.

But there are a ton of models I can't run at all locally due to VRAM limitations. I'd take being able to run those models slower. I know there are some ways to get these running on CPU orders of magnitude slower, but ideally there's some sort of middle ground.


You can load giant models into normal RAM, such as on an Epyc system, but they're still mostly bottlenecked by low memory bandwidth.


You can offload tensors to CPU memory. It will make your model run much slower, but it will work.
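For example, llama.cpp-style layer offload does exactly this; here's a minimal sketch assuming llama-cpp-python and a local GGUF file (the path and layer count are placeholders):

    # Partial offload: keep as many layers as fit in VRAM on the GPU,
    # let the remaining layers run from system RAM on the CPU.
    from llama_cpp import Llama

    llm = Llama(
        model_path="models/big-model-q4_k_m.gguf",  # placeholder GGUF file
        n_gpu_layers=20,   # layers that fit in VRAM; the rest stay on the CPU
        n_ctx=4096,
    )

    out = llm("Explain memory bandwidth in one sentence.", max_tokens=64)
    print(out["choices"][0]["text"])
    # The CPU-resident layers are bottlenecked by system RAM bandwidth, so
    # generation gets slower the more layers you keep off the GPU.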



