I'll be impressed when you can reliably give them a random four-word phrase for this test and get the right count. Because I don't think anyone is going to try to teach them all those facts; even if they're trained to know letter counts for every English word (as the other comment cites as a possibility), they'd then have to actually count and add, rather than present a known answer plus a rationalization that looks like counting and adding (and is easy to come up with once an answer has already been decided).
(Yes, I'm sure an agentic + "reasoning" model can already deduce the strategy of writing and executing a .count() call in Python or whatever. That's missing the point.)
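(For concreteness, that workaround is nothing more than the model emitting and running a few lines like the sketch below; the phrase and letter are arbitrary stand-ins of mine, not something any model actually produced.)

    # Tool-use shortcut: hand the counting to the interpreter instead of
    # "counting" in generated text.
    phrase = "quiet violet harbor lantern"  # arbitrary example phrase
    letter = "r"                            # arbitrary target letter
    total = phrase.count(letter)            # str.count returns exact occurrences
    print(f"'{letter}' appears {total} time(s) in '{phrase}'")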
I don't think the salience of this problem is that it's a supposedly unfixable blind spot. It's an illustrative failure in that it breaks the illusory intuition that something that can speak and write to us (sometimes very impressively!) also thinks like us.
Nobody who could give answers as good as ChatGPT often does would struggle so much with this task. The fact that an LLM works differently from a whole-ass human brain isn't actually surprising when we consider it intellectually, but that habit of always intuiting a mind behind language whenever we see language is subconscious and reflexive. Examples of LLM failures which challenge that intuition naturally stand out.
You can already do it with arbitrary strings that aren't in the dictionary. But I wonder if the pattern matching will break once strings are much longer than any word in the dictionary, even if there's plenty of room left in context and all that.