I don't know the exact breakdown, but there was code for training the LLaMA models on a DGX (640 GB of VRAM; the repo is gone now), and it could only train the 7b model without resorting to DeepSpeed ZeRO-3 (offloading).
The ML engineers in my group chat say "1 DGX to train 7b at full speed, 2 for 13b, 4 for 30b, 8 for 65b"
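For a rough sense of why 7b is about the ceiling for a single DGX without ZeRO-3 offload, here's a back-of-the-envelope sketch. The ~16 bytes/param figure is my assumption (fp16 weights + fp16 grads + fp32 master weights + Adam moments, as in the ZeRO paper), and it ignores activations entirely:

```python
# Rough model-state footprint for mixed-precision Adam training.
# Assumption (mine, not from the comment above): ~16 bytes/param,
# i.e. fp16 weights (2) + fp16 grads (2) + fp32 master copy (4) + Adam m/v (8).
# Activations, gradient buckets, and fragmentation all come on top of this.
BYTES_PER_PARAM = 16
A100_GB = 80          # one GPU in a DGX A100
DGX_GB = 8 * A100_GB  # 640 GB aggregate

for name, params in [("7b", 6.7e9), ("13b", 13.0e9), ("30b", 32.5e9), ("65b", 65.2e9)]:
    state_gb = params * BYTES_PER_PARAM / 1e9
    print(f"{name}: ~{state_gb:,.0f} GB model state | "
          f"{state_gb / A100_GB:.1f}x one A100 | "
          f"{state_gb / DGX_GB:.2f}x one DGX")
```

Even the 7b model's optimizer state (~107 GB) is bigger than a single 80 GB card, so you're sharding across the 8 GPUs with ZeRO-1/2 either way; the extra DGXes for 13b/30b/65b presumably buy headroom for activations and batch size before you'd have to fall back to ZeRO-3 offloading, which is what tanks throughput.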