Also, I think even an M3 Ultra is more cost-effective at running LLMs than a 4090 or 5090, mostly because it's more energy-efficient, and it's less fragile than running a gamer PC build.
It can run larger models, albeit quite slowly, but it lacks the matmul acceleration (included in the M5) that helps a lot with prompt processing and long-context performance at inference time. I will probably blow my budget on an M5 Max with 256 GB (maybe even 512 GB) of memory; the price will be upsetting, but I guess that is life!
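For anyone curious where that matmul acceleration would actually show up, here's a minimal sketch using mlx-lm on Apple Silicon (the model name is just an example of a quantized mlx-community checkpoint; install with pip install mlx-lm). With verbose=True it reports prefill and generation tokens-per-second separately, and prefill is the compute-bound, matmul-heavy part that the M5's units should speed up:

```python
# Minimal sketch: observing prefill vs. generation speed on Apple Silicon
# with mlx-lm (pip install mlx-lm). The model name is illustrative; any
# mlx-community quantized model should work.
from mlx_lm import load, generate

# Downloads the weights from Hugging Face on first run.
model, tokenizer = load("mlx-community/Mistral-7B-Instruct-v0.3-4bit")

# A long prompt stresses prefill, which is compute-bound (matmul-heavy);
# token generation is mostly bound by memory bandwidth instead.
prompt = "Summarize the trade-offs of unified memory for LLM inference. " * 50

# verbose=True prints prompt (prefill) and generation tokens/sec separately.
text = generate(model, tokenizer, prompt=prompt, max_tokens=256, verbose=True)
```

On an M3-generation chip you'd expect the prompt tokens-per-second figure to lag far behind the raw model size the machine can hold, which is exactly the gap dedicated matmul hardware is meant to close.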
Yes! I think smaller models on the M3 Ultra are interesting enough, but with matmul/tensor units on an M5 Ultra or Max, paired with decent unified memory, it will be a game changer.
I can easily imagine companies running Mac Studios in prod. Apple should release another Xserve.