I think you have to make a distinction between transformers and neural networks in general, maybe also between training and inference.

Many, perhaps most, types of neural network, such as CNNs, are well understood because information flows through them in a simple, fixed way. In a CNN, for example, you've got a hierarchy of feature detectors (convolutional layers) with a few linear classifier layers on top. The feature detectors are just learning decision surfaces to isolate features that are useful to higher layers, and at inference time the CNN is just detecting these hierarchical features and then classifying the image based on combinations of them. Simple.
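To make the "simple flow" point concrete, here's a rough NumPy sketch of that pipeline: feature detectors, then a linear classifier on top. Everything in it is an assumption for illustration: the two 2x2 edge filters are set by hand rather than learnt, there is only one convolutional layer rather than a hierarchy, and the names (`conv2d`, `features`, `classify`) are made up for this example.

```python
import numpy as np

def conv2d(image, kernel):
    """Valid 2-D cross-correlation, the core operation of a convolutional layer."""
    kh, kw = kernel.shape
    h, w = image.shape
    out = np.zeros((h - kh + 1, w - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(image[i:i + kh, j:j + kw] * kernel)
    return out

def relu(x):
    return np.maximum(x, 0.0)

# Hand-set feature detectors (a trained CNN would learn these).
VERTICAL_EDGE = np.array([[-1.0, 1.0], [-1.0, 1.0]])
HORIZONTAL_EDGE = np.array([[-1.0, -1.0], [1.0, 1.0]])

def features(image):
    """One feature per detector: the strongest response anywhere (global max-pool)."""
    return np.array([
        relu(conv2d(image, VERTICAL_EDGE)).max(),
        relu(conv2d(image, HORIZONTAL_EDGE)).max(),
    ])

def classify(image):
    """Linear classifier on top of the features (identity weights, for illustration)."""
    weights = np.eye(2)
    logits = weights @ features(image)
    return ["vertical", "horizontal"][int(np.argmax(logits))]

# A 4x4 image whose right half is bright, and its transpose (bottom half bright).
vertical_image = np.tile(np.array([0.0, 0.0, 1.0, 1.0]), (4, 1))
horizontal_image = vertical_image.T

label_v = classify(vertical_image)
label_h = classify(horizontal_image)
```

The thing to notice is that the path the data takes is the same for every input: convolve, rectify, pool, classify. Only the numbers change, never the routing.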

Transformers seem qualitatively different in terms of complexity of operation, not least because we still don't know exactly what they are learning. Sure, they are learning to predict the next word, but just as a CNN's output classification is based on features learnt by earlier layers, the words predicted by a transformer are based on some sort of world model or set of derived rules learnt by its earlier layers, which we don't fully understand.

Not only do we not know exactly what transformers are learning internally (although recent interpretability work gives us a glimpse of the sorts of things they learn), but the way data moves through them is also partially learnt rather than prescribed by the architecture. Attention heads use learnt lookup keys to find data at arbitrary positions in the context, and can then copy portions of that data to other positions. Heads also learn to coordinate in ways not specified by the architecture, such as the "induction heads" (pairs of attention heads in successive layers) identified by Anthropic, which seem to be one of the workhorses of how transformers operate and copy data around.
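As a rough illustration of the "lookup keys" and copying described above, here's a toy single-head causal attention in NumPy, wired by hand to behave like an induction head on the sequence A B C A. The assumptions are loud ones: the queries, keys and values are set by hand rather than learnt, the "previous token" information that an earlier head would normally supply is just built directly by shifting the sequence, and the `scale` factor is an arbitrary choice to make the attention sharp. It's a sketch of the idea, not how any real model is implemented.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attention(q, k, v):
    """Single-head scaled dot-product attention with a causal mask."""
    n, d = q.shape
    scores = q @ k.T / np.sqrt(d)
    future = np.triu(np.ones((n, n), dtype=bool), k=1)  # positions after the query
    scores = np.where(future, -np.inf, scores)
    weights = softmax(scores, axis=-1)
    return weights @ v, weights

vocab = ["A", "B", "C"]
tokens = [0, 1, 2, 0]                  # the sequence A B C A
x = np.eye(len(vocab))[tokens]         # one-hot token embeddings, shape (4, 3)

# Key at position i describes the token at position i-1 (stand-in for what a
# "previous token" head in an earlier layer would write there).
prev_token = np.vstack([np.zeros((1, len(vocab))), x[:-1]])

scale = 20.0                           # arbitrary sharpening factor (assumption)
q = scale * x                          # query: "where did my current token appear before?"
k = prev_token                         # key:   "the token just before me was ..."
v = x                                  # value: the token at this position, to be copied

out, weights = attention(q, k, v)
attended_position = int(weights[-1].argmax())
predicted = vocab[int(out[-1].argmax())]
```

At the final A, the query matches the key at position 1 (the position whose previous token was A), so the head copies that position's value, B, forward as its guess for what comes next. Which position gets read depends entirely on the contents of the context, which is the data-dependent routing a CNN doesn't have.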

Additionally, a transformer learns multiple types of data, from declarative knowledge ("facts"), which seems to be learnt mostly by the feed-forward (linear) layers, to the language/thought rules learnt by the attention mechanism, which then affect the flow of data through the model, as discussed above.

So, it's not that we don't know how neural networks work (and of course at one level they all work the same way - by minimizing errors), but more specifically that we don't fully know how transformer-based LLMs work, since their operation is a lot more dynamic and data-dependent than that of most other architectures, and the complexity of what they are learning is far higher.