It actually seems worse. gpt-20b is only 11 GB because it is prequantized in mxfp4. GLM-4.7-Flash is 62 GB. In that sense GLM is closer to, and actually slightly larger than, gpt-120b, which is 59 GB.
Also, according to the gpt-oss model card, 20b scores 60.7 on SWE-Bench Verified (GLM claims they got 34 for that model) and 120b scores 62.7, versus the 59.7 GLM reports.
No, models significantly improved at the same cost. Last year's Claude 3.7 has since been beaten by GPT-OSS 120B, which you can run locally and which is much cheaper to train.
They justified it with the paper that states what you say, but that's exactly the problem. The paper's statement is significantly weaker than the claim that there's no progress without an exponential increase in compute.
The paper's statement, that SotA models require ever-increasing compute, does not support "be careful when assuming that model capabilities will continue to grow", because it only speaks of ever-growing models; capabilities at the same compute cost continue growing too.
No, the comment is about "will", not "is". Of course there's no definitive proof of what will happen. But the writing is on the wall, and the letters are so large now that denying that AI will take over coding, if not all intellectual endeavors, resembles the movie "Don't Look Up".
Provided the FLOPs are not prohibitive. Output quality per model byte might be better. In general, people run the largest model they can.
I certainly think trading speed for quality at the same size is worth looking at. Especially if it uses methods that can benefit from the efforts of others to improve speed in general.
That said, the performance difference at 30M parameters may not be representative of the performance difference at 30B.
There are probably a lot of really good ideas out there waiting for someone to drop a few million in training to reveal how good they are on large sizes.
Comparisons will be run when the quality of generation is on par with other available models. It is useless to have performance if the quality is not at least on par.
The paper runs a benchmark (code and benchmark are in the paper) to compare performance with a causal-attention GPT-2 model (nanoGPT) at inference (20% faster) and at training (equivalent for T and D above a threshold).
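For context, this is roughly how that kind of inference comparison is usually set up; a minimal sketch with placeholder model handles, not the paper's actual benchmark code:

```python
# Hypothetical timing harness (not the paper's benchmark): measure greedy-decode
# throughput for two models exposing the same forward(ids) -> logits interface.
import time
import torch

@torch.no_grad()
def tokens_per_second(model, prompt_ids, n_new_tokens=256):
    """Greedy-decode n_new_tokens from prompt_ids and return tokens/sec."""
    model.eval()
    ids = prompt_ids.clone()
    start = time.perf_counter()
    for _ in range(n_new_tokens):
        logits = model(ids)                        # assumed shape: [B, T, vocab]
        next_id = logits[:, -1, :].argmax(dim=-1, keepdim=True)
        ids = torch.cat([ids, next_id], dim=1)
    return n_new_tokens / (time.perf_counter() - start)

# usage (placeholders): compare a nanoGPT-style causal baseline against the
# paper's model on the same prompt
# print(tokens_per_second(baseline, prompt), tokens_per_second(candidate, prompt))
```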
The big bet with this technique is in having a fixed (non-learned) matrix which converts the tokens' latent space to the linear attention space. So you can kinda cheat and say your model is small because a bunch of the smarts are in this fixed big graph Laplacian matrix L.
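Roughly how I read that part; this is my own sketch under that interpretation (fixed anchor set, Laplacian eigenvectors as a frozen projection), not the paper's code:

```python
# Sketch of a fixed (non-learned) Laplacian-based projection; the names and the
# eigenvector construction here are my assumptions, not the paper's method.
import numpy as np
from sklearn.neighbors import kneighbors_graph
from scipy.sparse.csgraph import laplacian
from scipy.sparse.linalg import eigsh

def fixed_laplacian_projection(anchors, k=10, dim=64):
    """anchors: (N, D) fixed embedding set. Returns a frozen (D, dim) map."""
    A = kneighbors_graph(anchors, k, mode="connectivity", include_self=False)
    A = 0.5 * (A + A.T)                  # symmetrize the kNN adjacency
    L = laplacian(A, normed=True)        # sparse normalized graph Laplacian
    _, U = eigsh(L, k=dim, which="SM")   # low-frequency eigenvectors, (N, dim)
    return anchors.T @ U                 # (D, dim) projection, never trained

# In a linear-attention layer this would play the role of the feature map,
# phi(x) = x @ P with P frozen, which is exactly where the "the smarts are
# baked into L" objection comes from.
```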
So how do you scale this up from a toy problem? Well, that L would have to get bigger. And it's hard to imagine it being useful if L is not trained. Then it starts to look a lot more like a conventional transformer, but probably harder to train, with the benefit of smaller KV caches. (Half the size - not a massive win.)
So overall it doesn't seem to me like it's gonna amount to anything.
Also: precomputing a sparse Laplacian for N vectors of dimension D (NxD) is far cheaper (if using `arrowspace`, my previous paper) than computing distances on the same full dense vectors billions of times.
There are published tests that compute a Laplacian on a 300Kx384 space in 500 seconds on a laptop CPU.
So it is a trade-off: potentially a few minutes of pretraining versus hours of dot products on dense matrices.
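To make the trade-off concrete, a back-of-the-envelope sketch with generic scipy/sklearn tooling (placeholder code and my own framing, not `arrowspace` itself):

```python
# One-off sparse Laplacian build vs. a dense scan over all N vectors repeated
# per query. Dimensions mirror the 300Kx384 case mentioned above; shrink N to
# try it quickly on a laptop.
import numpy as np
from sklearn.neighbors import kneighbors_graph
from scipy.sparse.csgraph import laplacian

N, D, k = 300_000, 384, 16
X = np.random.randn(N, D).astype(np.float32)

# One-off cost (the "few minutes of pretraining" side): sparse kNN Laplacian.
A = kneighbors_graph(X, k, mode="distance")     # ~N*k nonzeros
A = 0.5 * (A + A.T)                             # symmetrize
L = laplacian(A, normed=True)

# Recurring cost (the "hours of dot products" side, once repeated per query).
def dense_scan(q):
    return X @ q                                # O(N*D) work, every single time

print(f"{L.nnz} Laplacian nonzeros vs {N * D} dense multiply-adds per query")
```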
The idea is to have a lot of "narrow" models working with RAG instead of one model for all knowledge domains, or also to distil the metadata that currently lives in enterprise Knowledge Graphs.
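Schematically, something like this; placeholder names, not a real API:

```python
# Hypothetical routing layer for "many narrow models + RAG": each knowledge
# domain gets its own small model and its own retrieval index.
from dataclasses import dataclass
from typing import Callable

@dataclass
class NarrowModel:
    name: str
    search: Callable       # per-domain retriever, e.g. over a Laplacian-built index
    generate: Callable     # small domain-tuned LLM

DOMAIN_MODELS: dict[str, NarrowModel] = {}   # e.g. {"legal": ..., "finance": ...}

def answer(query: str, classify: Callable[[str], str]) -> str:
    domain = classify(query)                 # cheap classifier or KG metadata lookup
    m = DOMAIN_MODELS[domain]
    docs = m.search(query)                   # RAG restricted to one narrow domain
    context = "\n".join(docs)
    return m.generate(f"{context}\n\nQ: {query}\nA:")
```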
Bro is clueless. The opposite of what he believes is true: process barely changes outcomes compared to the models. No matter your process, you can't make Claude 3.5 do what 4.5 is capable of. There are tasks you'd have to reroll 3.5 on until the end of the universe that 4.5 would one-shot.
Exactly. There is a big difference in code quality with state-of-the-art models versus 6 months ago. I'm strongly resisting the urge to run Claude Code in dangerous mode, but it's getting so good I may eventually cave.
But chatbots are sentient within a single context session!