I'm not so sure that view is very widespread amongst people familiar with how LLMs work. Certainly they become more capable with more parameters and data, but there are fundamental limitations that can't be overcome by a basic model, and I don't think anyone is seriously arguing otherwise.
For instance, LLMs are pretty much stateless apart from their context window. If you treat the raw generated output as the first and final result, there is very little scope for any advanced consideration of anything.
If you give it a nice long context, give it the ability to edit that context or even access to a key-value function interface, and then treat everything it says as internal monologue except for whatever appears in <aloud></aloud> tags (which is what the user gets to see), you have something quite different. There are plenty of people who see AGI somewhere along that path, but once you take a step down that path it's no longer "just an LLM": the LLM is a component in a greater system.
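To make that concrete, here is a minimal sketch of such a wrapper. The generate() stub, the tag names, and the <set>/<get> key-value syntax are all invented for illustration; they just show the shape of "LLM as a component in a greater system".

    import re

    store = {}  # key-value memory that persists between turns

    def generate(context: str) -> str:
        """Stand-in for the actual model call (API or local); replace as needed."""
        raise NotImplementedError

    def run_turn(context: str) -> tuple[str, str]:
        raw = generate(context)

        # Handle simple key-value "function calls" embedded in the monologue,
        # e.g. <set key="plan">draft an outline first</set> and <get key="plan"/>.
        for key, value in re.findall(r'<set key="([^"]+)">(.*?)</set>', raw, re.S):
            store[key] = value
        raw = re.sub(r'<get key="([^"]+)"\s*/>',
                     lambda m: store.get(m.group(1), ""), raw)

        # Everything is internal monologue except <aloud>...</aloud>.
        visible = "\n".join(re.findall(r"<aloud>(.*?)</aloud>", raw, re.S))

        # The full monologue goes back into the context for the next turn;
        # only `visible` is shown to the user.
        return context + raw, visible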
The problem with <aloud></aloud> is that you need the internal monologue to not be subject to training loss, otherwise the internal monologue is restricted to the training distribution.
Something people don't seem to grasp is that the training data mostly doesn't contain any reasoning. Nobody has published brain activity recordings on the internet, only text written in human language.
People see information, process it internally in their own head, which is not subject to any outside authority, and then serialize the answer into human language, which is subject to outside authorities.
Think of the inverse. What if school teachers could read their students' thoughts and punish any student who thinks the wrong thoughts? You would expect the intelligence of the class to decline rapidly.
That does sound invasive, but on the other hand, math teachers do tell kids to “show their work” for good reasons. And the consent issues don't apply to LLM training.
I wonder if the trend towards using synthetic, AI-generated training data will make it easier to train models that use <aloud> effectively. AIs could be trained to use reasoning and show their work more than people normally do when posting on the Internet. It's not going to create information out of nothing, but it will better model the distribution that the researchers want the LLM to have, rather than taking distributions found on the Internet as given.
It's not a natural distribution anyway. I believe it's already the case that people train models on weighted distributions, sampling more from Wikipedia, for example.
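Roughly along these lines, as a toy illustration (the source names and weights here are invented): pick which corpus to draw from by weight, then pick a document within it.

    import random

    # Invented sources and weights; real mixtures are far larger and tuned carefully.
    sources = {
        "wikipedia": {"weight": 3.0, "docs": ["wiki article 1", "wiki article 2"]},
        "web_crawl": {"weight": 1.0, "docs": ["page 1", "page 2", "page 3"]},
    }

    names = list(sources)
    weights = [sources[n]["weight"] for n in names]

    def sample_document() -> str:
        """Pick a source by weight, then a document uniformly within it."""
        name = random.choices(names, weights=weights, k=1)[0]
        return random.choice(sources[name]["docs"])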
My guess is that the quest for the best training data has only just begun.
I think you are looking at too narrow an avenue for achieving this effect.
There are multiple avenues for training a model to do this. The simplest is a finetune on training examples where the internal monologue precedes the <aloud> tag and provides additional reasoning before the output.
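A single made-up example of that shape (purely illustrative, not from any real dataset): the completion reasons first, and only the <aloud> span is what the user-facing layer would display.

    # Hypothetical finetune example: reasoning precedes the <aloud> span.
    example = {
        "prompt": "User: What is 17 * 24?\nAssistant:",
        "completion": (
            "17 * 24 = 17 * 20 + 17 * 4 = 340 + 68 = 408. "
            "Check: 24 * 17 = 240 + 168 = 408. "
            "<aloud>17 * 24 = 408.</aloud>"
        ),
    }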
I think there is also scope for pretraining with a mask so the model doesn't attempt to predict certain things in the stream (or, equivalently, the loss on them is ignored). For example, time codes could be included in the data stream. The model could then have an awareness of the passing of time but would not generate time codes as predictions.
Time codes could then be injected into the context at inference time and it would be able to use that data.
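A sketch of that loss masking in a PyTorch-style setup, assuming you already know which positions hold the injected time codes; the masking scheme here is my assumption of how it could work, not a published recipe.

    import torch
    import torch.nn.functional as F

    IGNORE_INDEX = -100  # cross_entropy skips positions with this label

    def masked_next_token_loss(logits, input_ids, timecode_mask):
        """
        logits:        (batch, seq, vocab) model outputs
        input_ids:     (batch, seq) token ids, reused as next-token targets
        timecode_mask: (batch, seq) bool, True where the token is a time code
        """
        labels = input_ids.clone()
        labels[timecode_mask] = IGNORE_INDEX  # never train to emit time codes
        # standard next-token shift: position t predicts token t+1
        shift_logits = logits[:, :-1, :].contiguous()
        shift_labels = labels[:, 1:].contiguous()
        return F.cross_entropy(
            shift_logits.view(-1, shift_logits.size(-1)),
            shift_labels.view(-1),
            ignore_index=IGNORE_INDEX,
        )

The model still attends to the time codes on the input side; it just never gets a gradient pushing it to produce them.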
I noticed some examples from Anthropic's Golden Gate Claude paper had responses starting with <scratchpad> for the inverse effect. Suppressing the output up to the end of the paragraph would be an easy post-processing operation.
It's probably better to have implicitly closed tags rather than requiring a close tag. It would be quite easy for an LLM to miss a close tag and be off in dreamland.
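That suppression is simple enough to do in post-processing, along the lines of the sketch below. The tag name follows the comments above, and treating a missing close tag as ending at the next blank line is my reading of what "implicitly closed" would mean.

    import re

    def strip_scratchpad(text: str) -> str:
        # Explicitly closed scratchpads.
        text = re.sub(r"<scratchpad>.*?</scratchpad>", "", text, flags=re.S)
        # Unclosed scratchpads: suppress only to the end of the paragraph
        # (the next blank line) rather than swallowing the rest of the output.
        text = re.sub(r"<scratchpad>.*?(?=\n\n|\Z)", "", text, flags=re.S)
        return text.strip()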
Possibly addressing comments to the user or to itself might allow for considering multiple streams of thought simultaneously. IRC logs would be decent training data for figuring out multi-voice conversations (maybe).