It actually seems worse. gpt-20b is only 11 GB because it is prequantized in mxfp4. GLM-4.7-Flash is 62 GB. In that sense GLM is closer to, and actually slightly larger than, gpt-120b, which is 59 GB.
Also, according to the gpt-oss model card, 20b scores 60.7 on SWE-Bench Verified (GLM claims they got 34 for that model) and 120b scores 62.7, versus the 59.7 GLM reports.
No, models significantly improved at the same cost. Last year's Claude 3.7 has since been beaten by GPT-OSS 120B, which you can run locally and which is much cheaper to train.
They justified it with the paper that states what you say, but that's exactly the problem. The paper's statement is significantly weaker than the claim that there's no progress without an exponential increase in compute.
The paper's statement, that SotA models require ever-increasing compute, does not support "be careful when assuming that model capabilities will continue to grow", because it only speaks of ever-growing models; capabilities at the same compute cost continue growing too.
No, the comment is about "will", not "is". Of course there's no definitive proof of what will happen. But the writing is on the wall, and the letters are so large now that denying that AI will take over coding, if not all intellectual endeavors, resembles the movie "Don't Look Up".
Provided the FLOPs are not prohibitive. Output quality per model byte might be better. In general, people run the largest model they can.
I certainly think trading speed for quality at the same size is worth looking at. Especially if it uses methods that can benefit from the efforts of others to improve speed in general.
That said, the performance difference at 30M parameters may not be representative of the performance difference at 30B.
There are probably a lot of really good ideas out there waiting for someone to drop a few million in training to reveal how good they are on large sizes.
Comparisons will be run when the quality of generation is on par with other available models. It is useless to have performance if the quality is not at least on par.
The paper runs a benchmark (code and benchmark are in the paper) to compare performance with a causal-attention GPT-2 model (nanoGPT) at inference (20% faster) and at training (equivalent for T and D above a threshold).
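For context, this is roughly how that kind of inference comparison is usually set up; a minimal sketch with placeholder model handles, not the paper's actual benchmark code:

```python
# Hypothetical timing harness (not the paper's benchmark): measure greedy-decode
# throughput for two models exposing the same forward(ids) -> logits interface.
import time
import torch

@torch.no_grad()
def tokens_per_second(model, prompt_ids, n_new_tokens=256):
    """Greedy-decode n_new_tokens from prompt_ids and return tokens/sec."""
    model.eval()
    ids = prompt_ids.clone()
    start = time.perf_counter()
    for _ in range(n_new_tokens):
        logits = model(ids)                        # assumed shape: [B, T, vocab]
        next_id = logits[:, -1, :].argmax(dim=-1, keepdim=True)
        ids = torch.cat([ids, next_id], dim=1)
    return n_new_tokens / (time.perf_counter() - start)

# usage (placeholders): compare a nanoGPT-style causal baseline against the
# paper's model on the same prompt
# print(tokens_per_second(baseline, prompt), tokens_per_second(candidate, prompt))
```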
The big bet with this technique is in having a fixed (non-learned) matrix which converts the tokens' latent space to the linear attention space. So you can kinda cheat and say your model is small because a bunch of the smarts are in this fixed big graph Laplacian matrix L.
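Roughly how I read that part; this is my own sketch under that interpretation (fixed anchor set, Laplacian eigenvectors as a frozen projection), not the paper's code:

```python
# Sketch of a fixed (non-learned) Laplacian-based projection; the names and the
# eigenvector construction here are my assumptions, not the paper's method.
import numpy as np
from sklearn.neighbors import kneighbors_graph
from scipy.sparse.csgraph import laplacian
from scipy.sparse.linalg import eigsh

def fixed_laplacian_projection(anchors, k=10, dim=64):
    """anchors: (N, D) fixed embedding set. Returns a frozen (D, dim) map."""
    A = kneighbors_graph(anchors, k, mode="connectivity", include_self=False)
    A = 0.5 * (A + A.T)                  # symmetrize the kNN adjacency
    L = laplacian(A, normed=True)        # sparse normalized graph Laplacian
    _, U = eigsh(L, k=dim, which="SM")   # low-frequency eigenvectors, (N, dim)
    return anchors.T @ U                 # (D, dim) projection, never trained

# In a linear-attention layer this would play the role of the feature map,
# phi(x) = x @ P with P frozen, which is exactly where the "the smarts are
# baked into L" objection comes from.
```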
So how do you scale this up from a toy problem? Well, that L would have to get bigger. And it's hard to imagine it being useful if L is not trained. Then it starts to look a lot more like a conventional transformer, but probably harder to train, with the benefit of smaller KV caches. (Half the size - not a massive win.)
So overall it doesn't seem to me like it's gonna amount to anything.
Also: precomputing a sparse Laplacian for N vectors of dimension D (NxD) is far cheaper (if using `arrowspace`, my previous paper) than computing distances on the same full dense vectors billions of times.
There are published tests that compute a Laplacian on a 300Kx384 space in 500 seconds on a laptop CPU.
So it is a trade-off: potentially a few minutes of pretraining versus hours of dot products on dense matrices.
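To make the trade-off concrete, a back-of-the-envelope sketch with generic scipy/sklearn tooling (placeholder code and my own framing, not `arrowspace` itself):

```python
# One-off sparse Laplacian build vs. a dense scan over all N vectors repeated
# per query. Dimensions mirror the 300Kx384 case mentioned above; shrink N to
# try it quickly on a laptop.
import numpy as np
from sklearn.neighbors import kneighbors_graph
from scipy.sparse.csgraph import laplacian

N, D, k = 300_000, 384, 16
X = np.random.randn(N, D).astype(np.float32)

# One-off cost (the "few minutes of pretraining" side): sparse kNN Laplacian.
A = kneighbors_graph(X, k, mode="distance")     # ~N*k nonzeros
A = 0.5 * (A + A.T)                             # symmetrize
L = laplacian(A, normed=True)

# Recurring cost (the "hours of dot products" side, once repeated per query).
def dense_scan(q):
    return X @ q                                # O(N*D) work, every single time

print(f"{L.nnz} Laplacian nonzeros vs {N * D} dense multiply-adds per query")
```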
The idea is to have a lot of "narrow" models working with RAG instead of one model for all knowledge domains, or also to distil the metadata that currently lives in enterprise Knowledge Graphs.
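Schematically, something like this; placeholder names, not a real API:

```python
# Hypothetical routing layer for "many narrow models + RAG": each knowledge
# domain gets its own small model and its own retrieval index.
from dataclasses import dataclass
from typing import Callable

@dataclass
class NarrowModel:
    name: str
    search: Callable       # per-domain retriever, e.g. over a Laplacian-built index
    generate: Callable     # small domain-tuned LLM

DOMAIN_MODELS: dict[str, NarrowModel] = {}   # e.g. {"legal": ..., "finance": ...}

def answer(query: str, classify: Callable[[str], str]) -> str:
    domain = classify(query)                 # cheap classifier or KG metadata lookup
    m = DOMAIN_MODELS[domain]
    docs = m.search(query)                   # RAG restricted to one narrow domain
    context = "\n".join(docs)
    return m.generate(f"{context}\n\nQ: {query}\nA:")
```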
Bro is clueless. The opposite of what he believes is true: process barely changes outcomes compared to the models. No matter your process, you can't make Claude 3.5 do what 4.5 is capable of. There are tasks you'd have to reroll 3.5 on until the end of the universe that 4.5 would one-shot.
Exactly. There is a big difference in code quality with state-of-the-art models versus 6 months ago. I'm strongly resisting the urge to run Claude Code in dangerous mode, but it's getting so good I may eventually cave.
But chatbots are sentient within a single context session!