
> What we need is an open and independent way of testing LLMs

I mean, that's part of the problem: as far as I know, no claim of "this model has gotten worse since release!" has ever been validated by benchmarks. Obviously benchmarking models is an extremely hard problem, and you can try to make the case that the regressions somehow aren't being captured by the benchmarks, but until we have a repeatable benchmark that shows the regression, none of these companies are going to give you a refund based on your vibes.
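To be concrete about what "repeatable" would even mean here, a minimal sketch: a pinned model version, temperature 0, a fixed prompt set, and exact-match scoring against a recorded release-day baseline. The query_model callable is a stand-in for whatever client the provider actually exposes, and the baseline and tolerance numbers are assumptions, not anything measured.

```python
# Minimal repeatable regression check for an LLM endpoint.
# query_model is whatever pinned-version, temperature-0 call you can make
# against the API under test; the stub below just makes the sketch runnable.

from typing import Callable

Case = tuple[str, str]  # (prompt, expected answer)

def run_suite(query_model: Callable[[str], str], cases: list[Case]) -> float:
    """Exact-match accuracy over a fixed prompt set."""
    correct = sum(query_model(p).strip() == expected.strip() for p, expected in cases)
    return correct / len(cases)

if __name__ == "__main__":
    # Toy demo so the sketch runs standalone; swap in a real client call
    # and a real prompt set, then store the release-day score as the baseline.
    cases = [("What is 2 + 2?", "4"), ("Capital of France?", "Paris")]
    stub = lambda prompt: "4" if "2 + 2" in prompt else "Paris"
    score = run_suite(stub, cases)
    baseline = 1.0  # assumed: whatever the suite scored when the model shipped
    print(f"accuracy {score:.2f} vs baseline {baseline:.2f}")
    if score < baseline - 0.05:  # tolerance for sampling noise, also assumed
        raise SystemExit("regression beyond the noise threshold")
```

The point is just that the whole thing is versioned and re-runnable, so a score drop is an artifact you can point at rather than a vibe.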

Except for the time it got bad enough that Anthropic had to acknowledge it? Which also revealed they don't have monitoring for this?

https://www.anthropic.com/engineering/a-postmortem-of-three-...


How hard is benchmarking models actually?

We've got a lot of benchmarks available, and modifying at least some of them doesn't seem particularly difficult: https://arc.markbarney.net/re-arc

To reduce cost & maintain credibility, we could have the benchmarks run through a public CI system.

What am I missing here?
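Concretely, the CI gate could be as small as a script that appends each scheduled run's score to a history file tracked in the repo and fails the job when the latest score drops well below the recent median. A rough sketch; the file name, window, and threshold here are arbitrary assumptions, not any established convention:

```python
# CI gate sketch: record today's benchmark score, fail on a clear drop.

import json
import statistics
import sys
from datetime import date
from pathlib import Path

HISTORY = Path("benchmark_history.json")  # assumed file name, committed to the repo
WINDOW = 10        # compare against the last N runs
TOLERANCE = 0.03   # allowed drop before the job fails; tune for observed noise

def record_and_check(score: float) -> None:
    history = json.loads(HISTORY.read_text()) if HISTORY.exists() else []
    recent = [run["score"] for run in history[-WINDOW:]]
    history.append({"date": date.today().isoformat(), "score": score})
    HISTORY.write_text(json.dumps(history, indent=2))
    if recent and score < statistics.median(recent) - TOLERANCE:
        # Non-zero exit fails the CI job, which is the whole "gate".
        sys.exit(f"score {score:.3f} regressed vs recent median "
                 f"{statistics.median(recent):.3f}")

if __name__ == "__main__":
    # In CI the score would come from the benchmark run itself
    # (e.g. the run_suite() harness sketched upthread).
    record_and_check(float(sys.argv[1]))
```

With the prompt set, the harness, and the score history all in a public repo, anyone can re-run the job and check the numbers, which is most of what "open and independent" would need to mean.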



