Because they are getting better. They're still far from perfect/AGI/ASI, but when was the last time you saw the word "delve"? The models are clearly changing; the question is why the data doesn't show that they're actually better.
Thing is, everyone knows the benchmarks are being gamed. Exactly how is beside the point. In practice, anecdotally, Opus 4.5 is noticeably better than 4, and GPT 5.2 has also noticeably improved. So maybe the real question is why you believe this data when it seems at odds with observations by humans in the field?
> Jeff Bezos: When the data and the anecdotes disagree, the anecdotes are usually right.

https://articles.data.blog/2024/03/30/jeff-bezos-when-the-da...
"They don't use delve anymore" is not really a testament that they became better.
Most of what I can do with them now I could do six months to a year ago. And all the mistakes and failure loops are still there, across all models.

What has changed is the number of magical incantations we throw at these models, in the form of "skills" and "plugins" and "tools", hoping that one of them will solve the issue at hand before the context window overflows.
"They dont say X as often anymore" is just a distraction, it has nothing to do with actual capability of the model.
Unfortunately, I think the overlap between actual model improvements and what people perceive as "better" is quite small. Combine this with the fact that most people desperately want to hold a strong opinion even when the factual basis is very weak: "But I can SEE it is X now".
The type of person who outsources their thinking to their social media feed and isn't intellectually curious enough to explore the models deeply enough for them to display their increased strength isn't going to be able to tell the difference themselves.
I would think this also correlates with the type of person who hasn't done enough data analysis themselves to understand all the lies and misleading half-truths "data" often tells. The reverse also holds: experience with data inoculates one to some degree against a bullshitting LLM, so it is probably easier for such a person to get value from the model.
I would imagine there are all kinds of factors like this that multiply, so some people are really having vastly different experiences with the models than others.