
I've seen a few suspect benchmarks in recent LLM release announcements. I'm sure they made an attempt at an honest benchmark, but until there's an independent assessment (preferably several), you have to assume there's bias in anyone's self-published benchmarks like this.

I'm guessing it's a fine-tuning of some existing LLM or API, but this largely seems to be an "agent" and UI that includes some SWE-like coding workflow to handle more complex requests than an LLM alone could.


