Well it's good to see they are showcasing examples where the model really fails ...

zahlman · 2025-09-12T08:10:21 1757664621

> - Case 16 labels the tricuspid in the wrong place and I have no idea what a "mittic" is

> - Case 27 shows the usual "models can't do text" though I'm not holding that against it too much

16 makes it seem like it can "do text" — almost, if we don't care what it says. But it looks very crisp until you notice the "Pul??nary Artereys".

I'd say the bigger problem with 27 is that asking to add a watermark also took the scroll out of the woman's hands.

(While I'm looking, 28 has a lot of things wrong with it on closer inspection. I said 26 originally because I randomly woke up in the middle of the night for this and apparently I don't know which way I'm scrolling.)

voidUpdate · 2025-09-12T08:20:58 1757665258

EDIT: Yeah, on closer inspection, 28 is definitely a bit screwy. I wasn't clicking on the images themselves to view the enlarged ones, and from the preview I didn't see anything that immediately jumped out at me. I have no idea what that line at the bottom is meant to represent!

Also, you're right: I didn't notice the scroll had gone. On another inspection, it has also removed the original prompter's watermark.

iyk · 2025-09-12T10:42:35 1757673755

In Case 16 (diagram of the heart), every single label (aside from the superior vena cava) is incorrect.

muzani · 2025-09-12T10:49:23 1757674163

Yeah, I appreciate this kind of benchmarking too. That other Gen AI Showdown in the comments also does a good job with this: it mentions that each result was the best of 8 attempts, and so on.

lm28469 · 2025-09-12T08:19:13 1757665153

47 is also very questionable.

48 is impossible to do in a way that is accurate and meaningful.