
> What we need is an open and independent way of testing LLMs

I mean, that's part of the problem: as far as I know, no claim of "this model has gotten worse since release!" has ever been validated by benchmarks. Obviously benchmarking models is an extremely hard problem, and you can try to make the case that the regressions somehow aren't being captured by the benchmarks, but until we have a repeatable benchmark that shows the regression, none of these companies are going to give you a refund based on your vibes.
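To be concrete about what "repeatable" would even mean here, a minimal sketch: a pinned model version, temperature 0, a fixed prompt set, and exact-match scoring against a recorded release-day baseline. The query_model callable is a stand-in for whatever client the provider actually exposes, and the baseline and tolerance numbers are assumptions, not anything measured.

```python
# Minimal repeatable regression check for an LLM endpoint.
# query_model is whatever pinned-version, temperature-0 call you can make
# against the API under test; the stub below just makes the sketch runnable.

from typing import Callable

Case = tuple[str, str]  # (prompt, expected answer)

def run_suite(query_model: Callable[[str], str], cases: list[Case]) -> float:
    """Exact-match accuracy over a fixed prompt set."""
    correct = sum(query_model(p).strip() == expected.strip() for p, expected in cases)
    return correct / len(cases)

if __name__ == "__main__":
    # Toy demo so the sketch runs standalone; swap in a real client call
    # and a real prompt set, then store the release-day score as the baseline.
    cases = [("What is 2 + 2?", "4"), ("Capital of France?", "Paris")]
    stub = lambda prompt: "4" if "2 + 2" in prompt else "Paris"
    score = run_suite(stub, cases)
    baseline = 1.0  # assumed: whatever the suite scored when the model shipped
    print(f"accuracy {score:.2f} vs baseline {baseline:.2f}")
    if score < baseline - 0.05:  # tolerance for sampling noise, also assumed
        raise SystemExit("regression beyond the noise threshold")
```

The point is just that the whole thing is versioned and re-runnable, so a score drop is an artifact you can point at rather than a vibe.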

Except for the time it got bad enough that Anthropic had to acknowledge it? Which also revealed they don't have monitoring for this?

https://www.anthropic.com/engineering/a-postmortem-of-three-...


How hard is benchmarking models actually?

We've got a lot of benchmarks available, and modifying at least some of them doesn't seem particularly difficult: https://arc.markbarney.net/re-arc

To reduce cost & maintain credibility, we could have the benchmarks run through a public CI system.

What am I missing here?
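Concretely, the CI gate could be as small as a script that appends each scheduled run's score to a history file tracked in the repo and fails the job when the latest score drops well below the recent median. A rough sketch; the file name, window, and threshold here are arbitrary assumptions, not any established convention:

```python
# CI gate sketch: record today's benchmark score, fail on a clear drop.

import json
import statistics
import sys
from datetime import date
from pathlib import Path

HISTORY = Path("benchmark_history.json")  # assumed file name, committed to the repo
WINDOW = 10        # compare against the last N runs
TOLERANCE = 0.03   # allowed drop before the job fails; tune for observed noise

def record_and_check(score: float) -> None:
    history = json.loads(HISTORY.read_text()) if HISTORY.exists() else []
    recent = [run["score"] for run in history[-WINDOW:]]
    history.append({"date": date.today().isoformat(), "score": score})
    HISTORY.write_text(json.dumps(history, indent=2))
    if recent and score < statistics.median(recent) - TOLERANCE:
        # Non-zero exit fails the CI job, which is the whole "gate".
        sys.exit(f"score {score:.3f} regressed vs recent median "
                 f"{statistics.median(recent):.3f}")

if __name__ == "__main__":
    # In CI the score would come from the benchmark run itself
    # (e.g. the run_suite() harness sketched upthread).
    record_and_check(float(sys.argv[1]))
```

With the prompt set, the harness, and the score history all in a public repo, anyone can re-run the job and check the numbers, which is most of what "open and independent" would need to mean.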



