> The obvious alternative to gradient descent here would be Bayes Formula
If you know a little about the math behind gradient descent, you can see that an embedding layer followed by a softmax layer gives you exactly the best Bayes estimate. If you want a bit of structure, such as every word depending on the previous n words, you get a convolutional RNN, which is also well studied. These ideas are natural and elegant, but it may be better to first get a grasp of the research that has already been done, to avoid diving into too many dead ends.
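(Not the original commenter's code, just a minimal sketch of the claim, assuming PyTorch and a made-up toy corpus of token ids.) On a bigram scale: an embedding table whose rows are next-token logits, followed by a softmax and trained with plain gradient descent, ends up matching the count-based conditional frequencies, which is roughly the maximum-likelihood / Bayes estimate referred to above.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
vocab_size = 5

# Toy corpus of token ids; consecutive pairs (t, t+1) are the training examples.
corpus = torch.randint(0, vocab_size, (10_000,))
inputs, targets = corpus[:-1], corpus[1:]

# Embedding dim == vocab size, so each looked-up row is a vector of next-token logits.
logits_table = nn.Embedding(vocab_size, vocab_size)
opt = torch.optim.SGD(logits_table.parameters(), lr=1.0)
loss_fn = nn.CrossEntropyLoss()  # softmax + negative log-likelihood

for _ in range(500):
    opt.zero_grad()
    loss = loss_fn(logits_table(inputs), targets)
    loss.backward()
    opt.step()

# Compare the learned conditionals with the count-based (empirical) conditionals.
learned = torch.softmax(logits_table.weight, dim=1)
counts = torch.zeros(vocab_size, vocab_size)
for prev, nxt in zip(inputs.tolist(), targets.tolist()):
    counts[prev, nxt] += 1
empirical = counts / counts.sum(dim=1, keepdim=True)
print((learned - empirical).abs().max())  # should be small after enough steps
```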
No, I don't "want a bit of structure" ... I want a predictive architecture that supports online learning. So far the only one I'm aware of is the cortex.
Not sure which approaches you're considering dead ends, but RNNs still have their place (e.g. Mamba), depending on what you're trying to achieve.