I am highly skeptical of LLMs as a mechanism to achieve AGI, but I also find this paper fairly unconvincing, bordering on tautological. I feel similarly about this as to what I've read of Chalmers - I agree with pretty much all of the conclusions, but I don't feel like the text would convince me of those conclusions if I disagreed; it's more like it's showing me ways of explaining or illustrating what I already believed.
On embodiment - yes, LLMs do not have corporeal experience. But it's not obvious that this means that they cannot, a priori, have an "internal" concept of reality, or that it's impossible to gain such an understanding from text. The argument feels circular: LLMs are similar to a fake "video game" world because they aren't real people - therefore, it's wrong to think that they could be real people? And the other half of the argument is that because LLMs can only see text, they're missing out on the wider world of non-textual communication; but then, does that mean that human writing is not "real" language? This argument feels especially weak in the face of multi-modal models that are in fact able to "see" and "hear".
The other flavor of argument here is that LLM behavior is empirically non-human - e.g., the argument about not asking for clarification. But that only means that they aren't currently matching humans, not that they couldn't.
Basically all of these arguments feel like they fall down to the strongest counterargument I see proposed by LLM-believers, which is that sufficiently advanced mimicry is not only indistinguishable from the real thing, but at the limit in fact is the real thing. If we say that it's impossible to have true language skills without implicitly having a representation of self and environment, and then we see an entity with what appears to be true language skills, we should conclude that that entity must contain within it a representation of self and environment. That argument doesn't rely on any assumptions about the mechanism of representation other than a reliance on physicalism. Looking at it from the other direction, if you assume that all that it means to "be human" is encapsulated in the entropy of a human body, then that concept is necessarily describable with finite entropy. Therefore, by extension, there must be some number of parameters and some model architecture that completely encode that entropy. Questions like whether LLMs are the perfect architecture or whether the number of parameters required is a number that can be practically stored on human-manufacturable media are engineering questions, not philosophical ones: finite problems admit finite solutions, full stop.
Again, that conclusion feels wrong to me... but if I'm being honest with myself, I can't point to why, other than to point at some form of dualism or spirituality as the escape hatch.
> sufficiently advanced mimicry is not only indistinguishable from the real thing, but at the limit in fact is the real thing
I am continually surprised at how relevant and pervasive one of Kurt Vonnegut’s major insights is: “we are what we pretend to be, so we must be very careful about what we pretend to be”
Everyone in the "life imitates art, not the other way around" camp (and also neo-platonists/gnostics i.e. https://en.wikipedia.org/wiki/Demiurge ) is getting massively validated by the modern advances in AI right now.
Isn't any formal "proof" or "reasoning" that shows that something cannot be AGI inherently flawed, given that we have a hard time formally describing what AGI is in the first place?
Like your argument: embodiment is missing in LLMs, but is it needed for AGI? Nobody knows.
I feel we first have to do a better job of defining the basics of intelligence; then we can define what it means to be an AGI, and only then can we prove that something is, or is not, AGI.
It seems that we skipped step 1 because it's too hard, and jumped straight to step 3.
Yep, this is a big part of it. Intelligence and consciousness are barely understood beyond "I'll know it when I see it", which doesn't work for things you can't see - and in the case of consciousness, most definitions are explicitly based on concepts that are not only invisible but ineffable. And then we have no solid idea whether these things we can't really define, detect, or explain are intrinsically linked to each other or have a causal relationship in either direction. Almost any definition you pick is going to lead to some unsatisfying conclusions vis a vis non-human animals or "obviously not intelligent" forms of machine learning.
To me LLMs seem to most closely resemble the regions of the brain used for converting speech to abstract thought and vice-versa, because LLMs are very good at generating natural language and knowing the flow of speech. An LLM is similar to if you took Wernicke's and Broca's areas and stuck a regression between them. The problem is that the regression in the middle is just a brute force of the entire world's knowledge instead of a real thought.
I think the major lessons from the success of LLMs are two: 1) the astonishing power of a largely trivial association engine based only on the semantic categories inferred by word2vec, and 2) that so much of the communication abilities of the human mind require so little rational thought (since LLMs demonstrate essentially none of the skills in Kahneman's and Tversky's System 2 thinking: logic, circumspection, self-correction, reflection, etc.).
I guess this also disproves Minsky's 'Society of Mind' conjecture - a large part of human cognition (System 1) does not require the complex interaction of heterogeneous mental components.
> that so much of the communication abilities of the human mind require so little rational thought
Beyond the fact that human minds created them, I doubt that LLMs can tell us anything about the abilities of the human mind or what language in humans requires.
The most we can learn from LLMs about ourselves will be in how we react to them, or more broadly, what the datasets LLMs use show about who we are.
What makes this tough is that LLMs can show logical thinking and self-correction when specifically prompted (e.g. "think step by step", "double-check and then correct your work"). It seems unlikely that they can truthfully self-reflect, but I don't think it's strictly impossible.
> LLMs can show logical thinking and self-correction
The same way they "show" sadness or contrition or excitement?
We need to be careful with our phrasing here: LLMs can be prompted to provide you associated phrases that usually seem to fit with the rest of the word-soup, but whether the model is actually demonstrating "logical thinking" or "self-correction" is a Chinese Room problem [0]. (Or else a "No, it doesn't, I can tell because I checked the code.")
>that sufficiently advanced mimicry is not only indistinguishable from the real thing, but at the limit in fact is the real thing.
While "sufficiently" does a lot of the heavy lifting here, the indistinguishable criterion implicitly means there must be no way to tell that it is not the real thing. The belief that it is the real thing comes from the intuition that anything that can do everything a person can do must be a person; the alternative requires positing some fundamental essence of being a person. I don't think people could really conceive of such an alternative without resorting to prejudice, which they could equally apply to machines or to people.
I take the arguments such as in this paper to be instead making the claim that because X cannot be Y you will never be able to make X indistinguishable from Y. It is more a prediction of future failure than a judgment on an existing thing.
I end up looking at some of these complaints from the point of view of my sometimes profession of Game Developer. When I show someone a game in development to playtest they will find a bunch of issues. The vast majority of those issues, not only am I already aware of, but I have a much more detailed perspective of what the problem is and how it might be fixed. I have been seeing the problem, over and over, every day as I work. The problem persists because there are other things to do before fixing the issue, some of which might render the issue redundant anyway.
I feel like a lot of the criticisms of AI are like this: they are like the playtesters pointing out issues in the current state, where those working on the problems are generally well aware of the particular issues and have a variety of solutions in mind that might help.
Clear statements of deficiencies in ability are helpful as a guide to measure future success.
I'm also in the camp that LLMs cannot be an AGI on their own; on the other hand, I do think the architecture might be extended to become one. There is an easy out for any criticism, to say, "Well, it's not an LLM anymore".
In a way that ends up with a lot of people saying:
1. The current models cannot do the things we know the current models cannot do.
2. Future models will not be able to do those things if they are the same as the current ones.
3. Therefore the things that will be able to do those things will be different.
> Future models will not be able to do those things if they are the same as the current ones
I think a lot of people disagree with this. People think if we just keep adding parameters and data, magic will happen. That’s kind of what happened with ChatGPT after all.
I'm not so sure that view is very widespread amongst people familiar with how LLMs work. Certainly they become more capable with parameters and data, but there are fundamental things that can't be overcome with a basic model and I don't think anyone is seriously arguing otherwise.
For instance LLMs are pretty much stateless without their context window. If you treat the raw generated output as the first and final result then there is very little scope for any advanced consideration of anything.
If you give it a nice long context, the ability to edit that context, or even access to a key-value store, and then treat everything it says as internal monologue except for anything in <aloud></aloud> tags (which is what the user gets to see), you get something much more capable. There are plenty of people who see AGI somewhere along that path, but once you take a step down that path it's no longer "just an LLM": the LLM is a component in a greater system, as in the sketch below.
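For concreteness, a minimal sketch of that kind of greater system: an agent loop with a simple key-value memory, where everything the model emits is treated as internal monologue unless it's wrapped in <aloud> tags. The `generate` callable and the `REMEMBER key = value` convention are placeholders I'm assuming for illustration, not any real API.

```
# A minimal sketch (assumptions: `generate` is your own model-calling function,
# and the model has been prompted to use <aloud> tags and "REMEMBER key = value"
# lines). Only text inside <aloud> ever reaches the user.
import re

def run_agent(generate, user_message, max_steps=8):
    memory = {}                                   # editable key-value store
    transcript = f"User: {user_message}\n"
    for _ in range(max_steps):
        output = generate(context=transcript, memory=memory)
        transcript += output + "\n"               # the model sees and builds on its own monologue
        for key, value in re.findall(r"REMEMBER (\w+) = (.+)", output):
            memory[key] = value.strip()           # persist notes across steps
        spoken = re.findall(r"<aloud>(.*?)</aloud>", output, flags=re.DOTALL)
        if spoken:
            return "\n".join(s.strip() for s in spoken)
    return ""                                     # never said anything out loud
```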
The problem with <aloud></aloud> is that you need the internal monologue to not be subject to training loss, otherwise the internal monologue is restricted to the training distribution.
Something people don't seem to grasp is that the training data mostly doesn't contain any reasoning. Nobody has published brain activity recordings on the internet, only text written in human language.
People see information, process it internally in their own head which is not subject to any outside authority and then serialize the answer to human language, which is subject to outside authorities.
Think of the inverse. What if school teachers could read the thoughts of their students and punish any student who thinks the wrong thoughts? You would expect the intelligence of the class to rapidly decline.
That does sound invasive, but on the other hand, math teachers do tell the kids to "show their work" for good reasons. And the consent issues don't apply to LLM training.
I wonder if the trend towards using synthetic, AI-generated training data will make it easier to train models that use <aloud> effectively? AI’s could be trained to use reasoning and show their work more than people normally do when posting on the Internet. It’s not going to create information out of nothing, but it will better model the distribution that the researchers want the LLM to have, rather than taking distributions found on the Internet as given.
It’s not a natural distribution anyway. For example, I believe it’s already the case that people train AI with weighted distributions - training more on Wikipedia, for example.
My guess is that the quest for the best training data has only just begun.
I think you are looking at too narrowly defined an avenue to achieve this effect.
There are multiple avenues to train a model to do this. The simplest is a finetune on training examples where the internal monologue is constructed so that it precedes the <aloud> tag and provides additional reasoning before the output.
I think there is also scope for pretraining with a mask so the model doesn't attempt to predict (or, equivalently, so the loss ignores) certain things in the stream, for example time codes injected into the data stream. The trained model could then have an awareness of the passing of time but would not generate time codes as a prediction.
Time codes could then be injected into the context at inference time and it would be able to use that data.
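As a rough illustration of the masking idea (my sketch, not the commenter's recipe): compute the usual next-token loss, but point the targets at an ignore index wherever an injected time code sits, so the model conditions on the time codes without ever being trained to emit them. The tensors here are random placeholders.

```
# Sketch of masking injected tokens (e.g. time codes) out of the next-token loss.
# All tensors below are random placeholders standing in for a real batch.
import torch
import torch.nn.functional as F

vocab, seq_len = 1000, 16
logits = torch.randn(1, seq_len, vocab)           # model outputs
tokens = torch.randint(0, vocab, (1, seq_len))    # token ids, including injected time codes
is_timecode = torch.zeros(1, seq_len, dtype=torch.bool)
is_timecode[0, 5] = True                          # pretend position 5 is a time code

targets = tokens[:, 1:].clone()                   # standard shifted next-token targets
targets[is_timecode[:, 1:]] = -100                # -100 is ignored by cross_entropy
loss = F.cross_entropy(logits[:, :-1].reshape(-1, vocab),
                       targets.reshape(-1),
                       ignore_index=-100)
# The time codes still condition the model (they appear in the input),
# but no gradient pushes the model to predict them.
```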
I noticed some examples from Anthropic's Golden Gate Claude paper had responses starting with <scratchpad>, for the inverse effect. Suppressing the output from that tag to the end of the paragraph would be an easy post-processing operation.
It's probably better to have implicitly closed tags rather than requiring a close tag. It would be quite easy for an LLM to miss a close tag and be off in a dreamland.
Possibly addressing comments to the user or to itself might allow for considering multiple streams of thought simultaneously. IRC logs would be decent training data for it to figure out multi-voice conversations (maybe).
One of the issues here is that future-focused discussions often lead to wild speculation because we don’t know the future. Also, there’s often too much confidence in people’s preferred predictions (skeptical or optimistic) and it would be less heated if we admitted that we don’t know how things will look even a couple of years out, and alternative scenarios are reasonable.
So I think you’re right, it’s not enlightening. Criticism of overconfident predictions won’t be enlightening if you already believe that they’re overconfident and the future is uncertain. Conversations might be more interesting if not so focused on bad arguments of the other side.
But perhaps such criticism is still useful. How else do you deflate excessive hype or skepticism?
> LLMs do not have corporeal experience. But it's not obvious that this means that they cannot, a priori, have an "internal" concept of reality, or that it's impossible to gain such an understanding from text.
I would argue it is (obviously) impossible the way the current implementation of models work.
How could a system which produces a single next word based upon a likelihood and a parameter called a "temperature" have a conceptual model underpinning it? Even theoretically?
Humans and animals have an obvious conceptual understanding of the world. Before we "emit" a word or a sentence, we have an idea of what we're going to say. This is obvious when talking to children, who know something and have a hard time saying it. Clearly, language is not the medium in which they think or develop thoughts, merely an imperfect (and often humorous) expression of it.
Not so with LLMs!! Generative LLMs do not have a prior concept available before they start emitting text. That the "temperature" can chaotically change the output as the tokens proceed just goes to show there is no pre-existing concept to reference. It looks right, and often is right, but generative systems are basically always hallucinating: they do not have any concepts at all. That they are "right" as often as they are is a testament to the power of curve fitting and compression of basis functions in high dimensionality spaces. But JPEGs do the same thing, and I don't believe they have a conceptual understanding of pictures.
Transformer models have been shown to spontaneously form internal, predictive models of their input spaces. This is one of the most pervasive misunderstandings about LLMs (and other transformers) around. It is of course also true that the quality of these internal models depends a lot on the kind of task it is trained on. A GPT must be able to reproduce a huge swathe of human output, so the internal models it picks out would be those that are the most useful for that task, and might not include models of common mathematical tasks, for instance, unless they are common in the training set.
Have a look at the OthelloGPT papers (can provide links if you're interested). This is one of the reasons people are so interested in them!
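For anyone curious what "probing for an internal model" looks like mechanically, here is a hedged sketch of the methodology (not the papers' actual code): fit a simple classifier that tries to read the board state out of the network's hidden activations. The arrays below are random placeholders; in the real experiments they come from a GPT trained on Othello move sequences.

```
# Sketch of OthelloGPT-style probing. Random placeholder data; in the real
# experiments `hidden_states` comes from the transformer's residual stream and
# `board_labels` from the true game state at each move.
import numpy as np
from sklearn.linear_model import LogisticRegression

n_positions, d_model, n_squares = 2000, 512, 64
hidden_states = np.random.randn(n_positions, d_model)
board_labels = np.random.randint(0, 3, (n_positions, n_squares))  # 0=empty, 1=mine, 2=theirs

train, test = slice(0, 1500), slice(1500, None)
accuracies = []
for sq in range(n_squares):                       # one probe per board square
    probe = LogisticRegression(max_iter=1000)
    probe.fit(hidden_states[train], board_labels[train, sq])
    accuracies.append(probe.score(hidden_states[test], board_labels[test, sq]))

# On random data this hovers around chance; the published result is that on real
# activations the probes recover the board state with high accuracy.
print("mean held-out probe accuracy:", float(np.mean(accuracies)))
```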
> How could a system which produces a single next word based upon a likelihood and a parameter called a "temperature" have a conceptual model underpinning it? Even theoretically?
Could a creature that simply evolved to survive and reproduce possibly have a conceptual model underpinning it? Model training and evolution are very different processes, but they are both ways of optimizing a physical system. It may be the case that evolution can give rise to intelligence and model training can’t, but we need some argument to prove that.
> Could a creature that simply evolved to survive and reproduce possibly have a conceptual model underpinning it?
Yes. Obviously. I can create plans and think on them. I can think without the need for an internal monologue or talking to myself. This has nothing to do with modalities either: I do not think through text. I use text when I do as a way of conveying the thoughts I already have.
Anyone who claims we don't have the ability to form concepts in our head distinct from the medium in which they're transmitted is saying we're effectively ears, eyes, and skin. That the modality is what is important for intelligence.
This is clearly false, and academic silliness aside, yes—emphatically—humans and intelligent agents have internal concepts and models of the world.
> generative systems are basically always hallucinating: they do not have any concepts at all. That they are "right" as often as they are is a testament to the power of curve fitting and compression of basis functions in high dimensionality spaces
It's refreshing to read someone who "got it". Sad that before my upvote the comment was grayed out.
Any proponent of conceptual or other wishful/magical thinking should come with proofs, since that is the hypothesis that diverges from the definition of an LLM.
The argument would be that that conceptual model is encoded in the intermediate-layer parameters of the model, in a different but analogous way to how it's encoded in the graph and chemical structure of your neurons.
I agree that's an argument. I would contend that argument is obviously false. If it were true, LLMs could multiply scalar numbers together trivially. It should be the easiest thing in the world for them. The network required to do that well is extremely small, the parameter sizes of these models are gigantic, and the textual expression is highly regular: multiplication is the simplest concept imaginable.
That they cannot do that basic task implies to me that they have almost no conceptual understanding unless the fit is almost memorizable or the space is highly regular. That LLMs can't multiply numbers properly isn't surprising if they don't really understand concepts prior to emitting text. Where they do logical tasks, that can be done with minimal or no understanding, because syllogisms and logical formalisms are highly structured in text arguments.
Multiplication requires O(n^2) work with the usual algorithm humans use, while LLMs have a constant amount of computation available and are not really efficient machines for math evaluation. They can definitely evaluate unseen expressions, and you can train a neural network to learn how to do sums and multiplications; I have trained models on sums and they are able to do sums never seen during training. The model learns the algorithm just from being given inputs and outputs.
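To make that concrete, here is a minimal toy version of the kind of experiment the commenter describes (my own sketch, not their code): train a small MLP on binary-encoded sums of 8-bit numbers with some pairs held out entirely, then check the held-out sums. How well it generalizes depends on width and training budget.

```
# Toy version of "train a network on sums it has never seen".
# 8-bit operands, binary in/out encodings, an explicit held-out split of pairs.
import itertools, random
import torch
import torch.nn as nn

BITS = 8
def bits(n, width):                               # little-endian binary encoding
    return [(n >> i) & 1 for i in range(width)]

pairs = list(itertools.product(range(2**BITS), repeat=2))
random.seed(0); random.shuffle(pairs)
split = int(0.9 * len(pairs))                     # 10% of pairs never seen in training

def tensors(ps):
    x = torch.tensor([bits(a, BITS) + bits(b, BITS) for a, b in ps], dtype=torch.float)
    y = torch.tensor([bits(a + b, BITS + 1) for a, b in ps], dtype=torch.float)
    return x, y

x_tr, y_tr = tensors(pairs[:split])
x_te, y_te = tensors(pairs[split:])

model = nn.Sequential(nn.Linear(2 * BITS, 256), nn.ReLU(),
                      nn.Linear(256, 256), nn.ReLU(),
                      nn.Linear(256, BITS + 1))
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
loss_fn = nn.BCEWithLogitsLoss()

for epoch in range(50):
    perm = torch.randperm(len(x_tr))
    for i in range(0, len(x_tr), 512):
        idx = perm[i:i + 512]
        loss = loss_fn(model(x_tr[idx]), y_tr[idx])
        opt.zero_grad(); loss.backward(); opt.step()

with torch.no_grad():
    pred = (model(x_te) > 0).float()
    print("exact sums on held-out pairs:", (pred == y_te).all(dim=1).float().mean().item())
```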
LLMs do contain conceptual representations and LLMs are capable of abstract reasoning. This is trivially provable by asking them to reason about something that is a) purely abstract and b) not in the training data, e.g. "All floots are gronks. Some gronks are klorps. Are any floots klorps?" Any of the leading LLMs will correctly answer questions of this type much more often than chance.
This does not indicate abstract reasoning. I said:
> Where they do logical tasks, that can be done with minimal or no understanding, because syllogisms and logical formalisms are highly structured in text arguments.
There is an enormous amount of text in the training set that is structured in the way you said such that syntactic replacement would be effective. That is also unsurprising and does not represent abstract reasoning any more than "King - Man + Woman = Queen" in word2vec. It's showing that there's high degrees of structure in syllogisms, and that it need know nothing about what a gronk, floot, or klorp is at all because the structure of the syllogism is repeated all over the internet.
"All floots are gronks. Some gronks are klorps. Are any floots klorps?"
------
To determine if any floots are klorps, let's analyze the given statements:
1. All floots are gronks. This means every floot falls into the category of gronks.
2. Some gronks are klorps. This means there is an overlap between the set of gronks and the set of klorps.
Since all floots are included in the set of gronks and some gronks are klorps, it is possible that some floots are klorps. However, we cannot conclusively say that any floots are klorps without additional information. It is only certain that if there is any overlap between floots and klorps, it is possible, but not guaranteed, that some floots are klorps.
Nope, ChatGPT was right, the answer is indeterminable. The klorps that are gronks could be a wholly distinct subset to the klorps that are floots. It also correctly evaluates "All gronks are floots. Some gronks are klorps. Are any floots klorps?", to which the answer is definitively yes.
> The klorps that are gronks could be a wholly distinct subset to the klorps that are floots.
So? It's still the case that "if there is any overlap between floots and klorps," it is "guaranteed, that some floots are klorps." It's tautological.
Unless there's a way to read "overlap" so that it doesn't mean "some of one category are also in the other category, and vice versa"?
Oh, when I said "it's necessarily true" I was referring to the last sentence of the output, not the question posed in the input. Hence we are at cross purposes, I think.
That is not an example of an LLM being capable of abstract reasoning. Changing the question from "What is the capital of the United States?", which is easily answerable, to something completely abstract and "not in the training data" doesn't change that LLMs are just very advanced text prediction, and always will be. The nature of their design means they are incapable of AGI.
> LLMs are just very advanced text prediction, and always will be
How do you predict the next word in answering an abstract logic question without being capable of abstract reasoning, though?
In some sense it probably is possible, but this is a gaping flaw in your argument. A sufficiently advanced text prediction process has to encompass the process of abstract reasoning. The text prediction problem is necessarily a superset of the abstract reasoning problem. Ie, in the limit text prediction is fundamentally harder than abstract reasoning.
The question I gave is a literal textbook example of abstract reasoning. LLMs are just very advanced text prediction, but they are also provably capable of abstract reasoning. If you think that those statements are contradictory, I would encourage you to read up on the Bayesian hypotheses in cognitive science - it is highly plausible that our brains are also just very advanced prediction models.
You're quite right that LLMs can seemingly do some abstract reasoning problems, but I would not say they aren't in the training data.
Sure, the exact form using the made up word gronk might not be in the training data, but the general form of that reasoning problem definitely exists, quite frequently in fact.
```
You will be given a name of an object (such as Car, Chair, Elephant) and a letter in the alphabet. Your
goal is to first produce a 1-line description of how that object can be combined with the letter in an
image (for example, for an elephant and the letter J, the trunk of the elephant can have a J shape, and
for the letter A and a house, the house can have an A shape with the upper triangle of the A being the
roof). Following the short description, please create SVG code to produce this (in the SVG use shapes
like ellipses, triangles etc and polygons but try to defer from using quadratic curves).
```
```
Round 5: A car and the letter E.
Description: The car has an E shape on its front bumper, with the horizontal lines
of the E being lights and the vertical line being the license plate.
```
How does it "just" predict the letter E could be used in such a way to draw a car? How does it just text predict working SVG code that draws the car made out of basic shapes and the letter E?
I don't know how anyone could suggest there are no conceptual models embedded in there.
Yes, but the general form of the problem tells you nothing about the answer to any specific case. To perform any better than chance, the model has to actually reason through the problem.
Pleasure and pain, along with subtler emotions that regulate our behavior, aren't things that arise from word prediction, or even from understanding the world, I don't think. So to say human brains are just prediction models seems like a mischaracterization.
This isn't something I should convince you of. Just open up ChatGPT or Claude and try it for yourself. Think up a batch of your own questions and see how a modern LLM fares. I assure you that it'll do much better than chance. If you're so inclined, you can run enough tests to achieve statistical significance in the course of your lunch break.
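In that spirit, here is a hedged sketch of what such a lunch-break experiment could look like; `query_llm` is whatever model-calling function you pass in (a placeholder, not a real library call), and the syllogism templates are my own.

```
# Lunch-break test harness: syllogisms over made-up words, scored against ground
# truth. Pass in your own `query_llm(prompt) -> str` function for the model call.
import random

NONSENSE = ["floot", "gronk", "klorp", "zib", "wub", "trell", "quang", "mip"]

def make_case():
    a, b, c = random.sample(NONSENSE, 3)
    if random.random() < 0.5:
        # All A are B. All B are C. => necessarily, all A are C.
        return (f"All {a}s are {b}s. All {b}s are {c}s. "
                f"Must all {a}s be {c}s? Answer yes or no."), "yes"
    # All A are B. Some B are C. => it does NOT follow that some A are C.
    return (f"All {a}s are {b}s. Some {b}s are {c}s. "
            f"Does it follow that some {a}s are {c}s? Answer yes or no."), "no"

def run_trials(query_llm, n=50):
    correct = 0
    for _ in range(n):
        prompt, answer = make_case()
        reply = query_llm(prompt).strip().lower()
        correct += reply.startswith(answer)       # crude check; adjust for chattier models
    print(f"{correct}/{n} correct (guessing blindly would average ~{n // 2})")
```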
It depresses me that we seem to be spending more time arguing and hypothesising about LLMs than empirically testing them. The question of whether LLMs can think is completely settled, as their performance at zero-shot problems is simply impossible through pure memorisation or pattern-matching. The question that remains is far more interesting - how do they think?
Given their training set, our hypothesis so far should be that they're just tweaking things they've already seen by applying a series of simple rules. They're still not doing what human beings do. We have introspection, creativity operating outside what we've seen, modeling of others' thoughts, planning in new domains, and so on. We also operate without hallucination most of the time. I've yet to see an A.I. do all of this reliably and consistently, much less do it without training input similar to the output.
So, they don't just pattern match or purely memorize. They do more than that. They do way less than humans. Unlike humans, they also try to do everything with one or a few components vs our (100-200?) brain components. Crossing that gap might be achievable. It will not be done by current architectures, though.
Using Occam's razor, that is less probable than the model picking up on statistical regularities in human language, especially since that's what they are trained to do.
That's hard to conclude from Occam's razor here. Or, "statistical regularities" may have less explanatory power than you think, especially if the simplest statistical regularity is itself a fully predictive understanding of the concept of temperature.
> I would argue it is (obviously) impossible the way the current implementation of models work.
> How could a system which produces a single next word based upon a likelihood and a parameter called a "temperature" have a conceptual model underpinning it? Even theoretically?
Any probability distribution over strings can theoretically be factored into a product of such a “probability that next token is x given that the text so far is y”. Now, whether a probability distribution over strings can be efficiently computed in this form is another question. But, if we are being so theoretical that we don’t care about the computational cost (as long as it is finite), then “it is next token prediction” can’t preclude anything which “it produces a probability distribution over strings” doesn’t already preclude.
As for the temperature, given any probability distribution over a discrete set, we can modify it by adding a temperature parameter. Just take the log of the probabilities according to the original probability distribution, scale them all by a factor (the inverse of the temperature, I think. Either that or the temperature, but I think it is the inverse of the temperature.), then exponentiate each of these, and then normalize to produce a probability distribution.
So, the fact that they work by next token prediction, and have a temperature parameter, cannot imply any theoretical limitation that wouldn’t apply to any other way of expressing a probability distribution over strings, as far as discussing probability distributions in the abstract, over strings, rather than talking about computational processes that implement such probability distributions over strings.
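For concreteness, the temperature transformation described above looks like this (an illustrative sketch, with the scale factor taken to be the inverse temperature):

```
# Temperature rescaling of a discrete distribution: divide the log-probabilities
# by the temperature, exponentiate, and renormalize.
import numpy as np

def apply_temperature(probs, temperature):
    logits = np.log(np.asarray(probs, dtype=float))
    scaled = logits / temperature          # i.e. multiply by the inverse temperature
    scaled -= scaled.max()                 # numerical stability only
    out = np.exp(scaled)
    return out / out.sum()

p = [0.7, 0.2, 0.1]
print(apply_temperature(p, 1.0))   # unchanged
print(apply_temperature(p, 0.5))   # sharper: mass piles onto the likeliest token
print(apply_temperature(p, 2.0))   # flatter: closer to uniform
```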
But also like,
going between P(next token is x | initial string so far is y) and P(the string begins with z) , isn’t that computationally costly?
Well, in one direction anyway.
Because like, P(next token is x|string so far is y) = P(string begins with yx) / P(string begins with y) .
Though, one might object to P(string starts with y) over P(string is y) ?
> Any probability distribution over strings can theoretically be factored into a product of such a “probability that next token is x given that the text so far is y”.
And such a probability distribution would not generally understand concepts, efficient or otherwise. The P(next_token) is based upon the syntactical structure built via the model and some basic semantic distance that LLMs provide. They don't have enough conceptual power to reliably generate new facts and know that they are facts consistent with the model. That would be an internal representation system.
The academic exercise here is similar to monads: "yes, any computed function f(x) can be expressed as a sufficiently pre-computed large lookup table." With LLMs we're dealing with approximate lookups due to lossy compression, but that's still what these prior probabilities are: lookup tables. Lookup tables are not smart, do not understand concepts, and they have little to no capacity to generate new results not sufficiently represented in the training set.
My main concern here is the theoretical point, and so I’m not addressing the “this is what current (e.g. transformer based) models do” parts.
> The P(next_token) is based upon the syntactical structure built via the model and some basic semantic distance that LLMs provide.
Regardless of whether this is true for existing transformer-based models, this is not true for all computable conditional probability distributions over text.
Any computable task can be framed as sampling from some conditional probability distribution. (If the task is deterministic, that just means that the conditional probability distribution to sample from is one which has probability 1 for some string, when conditioned on the thing it is to be conditioned on.)
Whether transformer based models are lookup tables or not, not all computable probability distributions over text are. (As, of course, not all computable tasks can be expressed as a simple finite lookup table.)
I don’t know exactly what you mean by “generally understand concepts”, though I suppose
> They don't have enough conceptual power to reliably generate new facts and know that they are facts consistent with the model. That would be an internal representation system.
is describing that somewhat.
And, in that case, if there is any computational process which counts as having “enough conceptual power to generate new facts and know that they are facts consistent with the model”, then, a computable conditional probability distribution over strings conditioned on their prefixes, and therefore also a computable probability distribution over next tokens given all-tokens-so-far , is also (theoretically) capable of that.
And so, it would follow that “it only predicts the next token” doesn’t (in principle/theory) preclude it having such an understanding of concepts, unless no computational process ever can.
> “it only predicts the next token” doesn’t (in principle/theory) preclude it having such an understanding of concepts, unless no computational process ever can.
In my opinion, this is highly reductive and academic. Whether these models are transformers or not, lookup likelihood is not indicative of understanding of concepts in any reasonable way.
If the response to an algebraic equation was based upon the probability of tokens in a corpus, and not an actual deterministic application of the rules of algebra, would that response know concepts? Would it be intelligent?
With math, specifically given the unbounded size of the tokens compared to language, it's clear that token prediction is not a useful methodology.
Let's say we're just trying to multiply two integers. Even if a model had Rain Man powers of memorization, and it memorized phone book after phone book of multiplication tables, the probabilistic likelihood model would fail for the very obvious reason that we cannot enumerate (and train on) all the possible outcomes of math and calculate their frequencies. We can however understand and use the concepts of math, which is distinct from their symbolic representation.
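A back-of-the-envelope version of that enumeration argument (the corpus size is an illustrative assumption, not a measured figure):

```
# Count of distinct multiplication problems vs. a generously sized training corpus.
digits = 15
problems = float((10 ** digits) ** 2)      # ordered pairs of 15-digit integers
corpus_tokens = 1e13                       # ~10 trillion tokens (illustrative assumption)
print(f"{problems:.1e} possible problems vs ~{corpus_tokens:.0e} training tokens")
```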
> lookup likelihood is not indicative of understanding of concepts in any reasonable way.
Where did I ever say that the thing was doing lookup? I only said it was producing a probability distribution.
Is your claim that all programs are just doing lookup?
> If the response to an algebraic equation was based upon the probability of tokens in a corpus...
Ah, I see the confusion. When I say “probability distribution” I do not mean “for each option, an empirical fraction out of all the options, that this particular option appeared in the corpus”. Rather, by “probability distribution”, I mean (in the discrete case) “an assignment of a number which is at least zero and at most one, to each of the options, and such that the sum of the assigned values add up to 1”. I am allowing that this assignment of values is computed (from what is being conditioned on) in any way whatsoever .
If the correct answer is a number, it may compute the entire correct number through some standard means, and then look at however many correct tokens from the number are already present, and assign a probability of 1 to the correct next one, and 0 to all other tokens. If conditioning on a partial answer that has parts wrong, it may use an arbitrary distribution.
It's only because you can essentially put the LLMs in a simulation that you can have this argument. We can imagine the human brain also in a simulation which we can replay over and over again, adjusting various parameters of the physical brain to change the temperature. These sorts of arguments can never distinguish between LLMs and humans.
On that point, I would dispute the premise that "it's impossible to have true language skills without implicitly having a representation of self and environment". I don't see any contradiction between the following two ideas:
1. LLMs inherently lack any form of consciousness, subjective experience, emotions, or will
2. A sufficiently advanced LLM with sufficient compute resources would perform on par with human intelligence at any given task, insofar as the task is applicable to LLMs
> How could a system which produces a single next word based upon a likelihood and a parameter called a "temperature" have a conceptual model underpinning it? Even theoretically?
You're limiting your view of their capabilities on the output format.
> Not so with LLMs!! Generative LLMs do not have a prior concept available before they start emitting text.
How do you establish that? What do you think of othellogpt? That seems to form an internal world model.
> That the "temperature" can chaotically change the output as the tokens proceed
Changing the temperature forcibly makes the model pick words it thinks fit worse. Of course it changes the output. It's like an improv game with someone shouting "CHANGE!".
Let's make two tiny changes.
One, let's tell a model to use the format
<innerthought>askjdhas</innerthought> as the voice in their head, and <speak>blah</speak> for the output.
Second, let's take temperature out of the picture by keeping it at 0, so we're not playing a game where we force them to choose different words.
> You're limiting your view of their capabilities on the output format.
The "generation" of strings is related to this output format. It's critical to how they work. Legerdemain has been performed to argue that that's irrelevant, and the real intelligence or concepts are sitting inside the network architecture of the trained model prior to generation. But if that were the case, generation could be done based upon the conceptual representation, and not a syntactical representation token by token. This is not currently the case with LLMs.
I'd turn this question around: if my question is irrelevant, how would one go about building an effective real-world LLM that understands concepts and doesn't use likelihood lookups on a token-per-token basis, but instead generates directly from the conceptual basis? Such an argument, if it exists, would make me very happy.
Please note, I understand that there are prior systems which do this. Generative zero-shot transformer models didn't adopt this approach because it's elegant, but because it is efficient to compute with large data sets and has useful efficacy in generating strings. Some are creative. Some are more "accurate." The temperature parameter can affect which of those cases it selects, if any.
People have short memories, but the people who are both appreciative of LLMs as an engineering feat and critical of their claims of intelligence have been saying this for years. They've said that their token likelihood model is effective for seeing things well covered in the data set. They've been saying that due to the sparsity and structure of human language, large scale approximate compression ("curve fits") would be highly effective and efficient. They've been saying that due to the fact these are large scale fits of a data set, the models would eventually converge to something looking like the known knowledge they're trained on, and not exponentially accelerate in knowledge. All of these predictions have proved to be correct or looking highly likely at this point.
Transformer-based LLMs are a neat algorithmic approach to curve fits. But they are curve fits. Things like cosine transforms in JPEGs, wavelet or Fourier reconstruction in CAT scans, audio signal reconstruction from basis functions are also approximate reconstruction models that function along these lines, albeit in a nice Euclidean space without the generative parts of a transformer. But it was precisely knowledge of how systems like that worked which allowed scientists to understand and predict the limitations of these systems a long time ago. Lots of money and fresh eyes have created a useful computation technique, but these insights have been forgotten. I hope, truly, that progress happens in this space. But the critiques stand and there would be lots to gain by a less-hyped acknowledgement of where we are with these models and the tradeoffs baked into them as a compromise for them to be useful.
> On embodiment - yes, LLMs do not have corporeal experience.
My own thought on this (as someone who believes embodiment is essential) is to consider the rebuttals to Searle's Chinese Room thought experiment.
For now (and the foreseeable future) humans are the embodiment of LLMs. In some sense, we could be seen as playing the role of a centralized AI's nervous system.
Rebuttals of Chinese rooms are also rebuttals of embodiment as a requirement! To say the system of person+books speaks Chinese is to say that good enough emulation of a process has all the qualities of the emulated process, and can substitute for it. Embodiment then cannot be essential, because we could emulate it instead.
The crux of the video game analogy seems to be that when you go close to an object, the resolution starts blurring and the illusion gets broken, and there is a similar thing that happens with LLMs (as of today) as well. This is, so far, reasonable based on daily experience with these models.
The extension of that argument being made in the paper is that a model trained on language tokens spewed by humans is incapable of actually reaching the limit where the illusion never breaks down in resolution. That also seems reasonable to me. They use the word "languaging" in verb form, as opposed to "language" as a noun, to express this.
Why are LLMs incapable of reaching that limit? It's very easy to imagine video games getting to that point. We have all the data to see objects right down to the atomic level, which is plenty more than you'd need for a game. It's mostly a matter of compute. Why then should LLMs break down if they can at least mimic the smartest humans? We don't need "resolution" beyond that.
That depends on whether you believe natural language alone is sufficient to fully model reality. Probably not; it can approximate it to a high degree, but there is a reason we resort to formal, constructed languages in math or CS to express our ideas.
LLMs aren't trained solely on natural language. They also ingest formal notation from every domain and at every level (from preschool to PhD); they see code and markup in every language even remotely popular. They see various encodings, binary dumps, and nowadays also diagrams. The training data has all that's needed to teach them great many formal languages and how to use them.
If you're talking about machine-learnability of languages then there are two frameworks that are relevant: Language Identification in the Limit and PAC-Learning.
Language Identification in the Limit in short tells us that if human language corresponds to an automaton class at least as rich as the regular languages, it cannot be identified ("learned") in the limit from positive examples alone, no matter how many; identification also requires negative examples (an informant). Chomsky based his "Poverty of the Stimulus" argument about linguistic nativism (the built-in "language faculty" of humans) on this result, known as Gold's Result after E. Mark Gold, who proved it in the setting of Inductive Inference in 1967. Gold's result is not controversial, but Chomsky's use of it has seen no end of criticism, much of it from the computational linguistics community (including people in it that have been great teachers to me, without having ever met me, like Charniak, Manning and Schutze, and Jurafsky and Martin) [1].
Those critics generally argue that human language can be learned like everything and anything else: with enough data drawn from a distribution assumed identical to the true distribution of the data in the concept to be learned, and allowing a finite amount of error with a given probability, i.e. under Probably Approximately Correct (PAC) learning assumptions, the learning setting introduced by Leslie Valiant in 1984, which replaced Inductive Inference and which serves as the theoretical basis of modern statistical machine learning, in the rare cases where someone goes looking for one. Around the same time that Valiant was describing PAC-Learning, Vapnik and Chervonenkis were developing their statistical learning theory behind the Iron Curtain, and if you take a machine learning course in school you'll learn about the VC dimension and wonder what that's got to do with AI and LLMs.
The big question is how relevant is all this to a) human language and b) learning human language with an LLM. Is there an automaton that is equivalent to human language? Is human language PAC-learnable (i.e. from a polynomial number of examples)? There must be some literature on this in the linguistics community, possibly the cognitive science community. I don't see these questions asked or answered in machine learning.
Rather, in machine learning people seem to assume that if we throw enough data and compute at a problem it must eventually go away, just like generals of old believed that if they sacrificed enough men in a desperate assault they would eventually take Elevation No. 4975 [2]. That's of course ignoring all the cases in the past where throwing a lot of data and compute at a problem either failed completely -which we usually don't hear anything about because nobody publishes negative results, ever- or gave decidedly mixed results, or hit diminishing returns; as a big example, see DeepMind's championing of Deep Reinforcement Learning as an approach to real-world autonomous behaviour, based on the success of the approach in virtual environments. To be clear, that hasn't worked out, and DeepMind (and everyone else) has so far failed to follow the glory of AlphaGo and kin with a real-world agent.
So in short, yeah, there's a lot to say that we may never have enough data and compute to achieve a good enough approximation of human linguistic ability with a large language model, or something even larger, bigger, stronger, deeper, etc.
There are many finite problems that absolutely do not admit finite solutions. Full stop.
I think the deeper point of the paper is that you simply cannot generate an intelligent entity by just looking at recorded language. You can create a dictionary, and a map - but one must not mistake this map for the territory.
The human brain is a finite solution, so we already have an existence proof. That means a lot for our confidence in the solvability of this kind of problem.
It is also not universally impossible to reconstruct a function of finite complexity from only samples of its inputs and outputs. It is sometimes possible to draw a map that is an exact replica of the territory.
Trying to recreate a "human brain" is an absolutely terrible idea - and is not something we should even attempt. The consequences of success are terrible.
They're not really trying to create a human brain, so far as I can tell. They're trying to create an oracle, by feeding it all existing human utterances. This is certainly not going to succeed, since the truth is not measurable post-facto from these utterances.
The claim regarding reconstructing a function from samples of its inputs and outputs is false. It's false both mathematically, where "finite complexity" doesn't really even have a rigorous definition, and metaphorically too.
Sometimes maps are the territory, especially when the territory that is being mapped is itself a map. An accurate map of a map can be a copy of the map that it maps. The human brain's concept of reality is not reality, it's a map of reality. A function trained to predict human outputs can itself contain a map which is arbitrarily similar to the map that a human carries in their own head.
(Finite complexity is rigorously definable, it's just that the definition is domain-specific).
> I feel similarly about this as to what I've read of Chalmers - I agree with pretty much all of the conclusions, but I don't feel like the text would convince me of those conclusions if I disagreed;
My limited experience of reading Chalmers is that he doesn't actually present evidence - he goes on a meandering rant and then claims to have proved things that he didn't even cover. It was the most infuriating read of my life; I heavily annotated two chapters and then finally gave up and donated the book.
I haven't read any Chalmers so I can't comment on his writing style. I have seen him in several videos on discussion panels and on podcasts.
One thing I appreciate is he often states his premises, or what modern philosophers seem to call "commitments". I wouldn't go so far as to say he uses air-tight logic to reason from these premises/commitments to conclusions - but at the least his reasoning doesn't seem to stray too far from those commitments.
I think it would be fair to argue that not all of his commitments are backed by physical evidence (and perhaps some of them could be argued to go against some physical evidence). And so you are free to reject his commitments and therefore reject his conclusions.
In fact, I think the value of philosophers like Chalmers is less in their specific commitments and conclusions and more in their framing of questions. It can be useful to list out his commitments and find out where you stand on each of them, and then to do your own reasoning using logic to see what conclusions your own set of commitments forces you into.
Yeah, while reading the book he would keep saying things that are factually wrong, or just state that things are impossible; basically he builds the conclusion into the premises and then discovers the conclusion as if he had just defended it.
>> Again, that conclusion feels wrong to me... but if I'm being honest with myself, I can't point to why, other than to point at some form of dualism or spirituality as the escape hatch.
I like how Chomsky, who doesn't have any spirituality at all, the big degenerate materialist, deals with it:
As far as I can see all of this [he's speaking about the Loebner Prize and the Turing test in general] is entirely pointless. It's like asking how we can determine empirically whether an aeroplane can fly, the answer being: if it can fool someone into thinking that it's an eagle under some conditions.
He's right, you know. It should be possible to tell whether something is intelligent just as easily as it is to say that something is flying. If there are endless arguments about it, then it's probably not intelligent (yet). Conversely, if everyone can agree it is intelligent then it probably is.
Because it's not easy to tell whether something is flying. Definitions like that fall apart every time we encounter something out of the ordinary. If you take the criterion of "there's no discussion about it", then you're limiting the definition to that which is familiar, not that which is interesting.
Is an ekranoplan flying? Is an orbiting spaceship flying? Is a hovercraft flying? Is a chicken flapping its wings over a fence flying?
Your criterion would suggest the answer of "no" to any of those cases, even though those cover much of the same use cases as flying, and possibly some new, more interesting ones.
And I don't think an AGI must be limited to the familiar notion of intelligence to be considered an AGI, or, at the very least, to open up avenues that were closed before.
> Your criterion would suggest the answer of "no" to any of those cases, even though those cover much of the same use cases as flying, and possibly some new, more interesting ones.
Is it a problem though? Their existence is unrelated to how we categorize them.
That matters only in communication. “If everybody agrees” lowers or removes the risk of miscommunication.
If “hovercraft is flying” for you, but not for 50% of the world, it makes it somewhat more difficult to communicate.
(Easily solved with some qualifications, but that requires admitting the questionability of “hovercraft is flying”.)
> you're limiting the definition to that which is familiar, not that which is interesting.
You made an interesting point - good food for thought.
Counterpoint: it seems natural and useful that only similar things get to use the same word.
> And I don't think an AGI must be limited …
Could you expand on why it matters and what would be impacted by such a lenient (or strict) classification?
I think it matters merely by the way we set our expectations relative to what is going to come - and what has come already. I'm feeling an undercurrent of thought that is implying: this is not X (intelligence, understanding, whatever), so there's no need to consider it seriously.
> I'm feeling an undercurrent of thought that is implying: this is not X, so there's no need to consider it seriously.
True. I doubt that field experts are directly affected by the naming, but an indirect effect might come via less knowledgeable (AI-wise) financial decision makers.
I see a risk that those decision makers (and society) would be misled if they were promised AGI (based on their “strict” understanding, what’s in the movies), but received AGI (based on the “relaxed” meaning). Informed consent is usually good.
Though surely that can be resolved with more public discourse; maybe “relaxed” version will become the default expectation.
There are going to be gray areas of course, but the point I'm making is that if it's hard to argue something isn't flying (respectively, intelligent) then it's probably flying (resp. intelligent). If it's hard to tell then it's probably not. I'm suggesting that intelligence, like flying, should be very immediately obvious.
For example, you can't miss the fact that a five-year old child is intelligent and you can't miss the fact that a stone is not. There may be all sorts of things in between for which we can't be sure, or whose intelligence depends on definition, or point of view, etc. but when something is intelligent then it should leave us no doubt that it is. Or, if you want to see it this way: if something is as intelligent as a five-year old child then it should leave us no doubt that it is.
I'm basically arguing for placing the bar high enough that when it is passed, we can be fairly certain we're not mistaken.
>> I can't disagree more. Or maybe I actually agree.
Except for "Cartesian blindness" (an interesting term) the situations where you say we wouldn't recognise intelligence are so far fantastical in the sense that they require a kind of intelligence that we can only imagine might exist outside the realm of our experience.
But why should current debate have anything to do with all those situations? By analogy, suppose there exists a race of alien birds on a faraway planet that can fly in a way that we wouldn't recognise as flying. Perhaps they have an antigravity gland or poop exotic matter that sticks to their butts and lets them move around without touching the ground. Does that affect our ability to recognise flight when we see it on Earth? I don't think so. When you see, e.g., an eagle fly, you know it's flying, regardless of whether anything else is, or might be conceived as, flying, or not.
Equally, does the possibility of an alien intelligence like no intelligence we have ever seen before make any difference to how we recognise intelligence here on Earth and right now? The debate is about the intelligence of computers. Should we consider the possibility that computers are about to develop an alien kind of intelligence like no intelligence we have ever seen? I know there's such a current of thought in various circles but it doesn't seem to be based on anything but wishful thinking (or dreadful thinking?).
As to "Cartesian blindness" and racism etc - science is self-correcting. Even if we spend some time confused about what is and isn't intelligence, we get there in the end. The question is how we identify intelligence in the here and now, with what we know and understand about the world in the present.
I think it does matter that we admit other kinds of flying or intelligence because then we allow the possibility of having a blind spot.
We already have alien intelligences on this planet that are not often invoked in discussions of how unfamiliar or different computer intelligence might be. Take the octopus's distributed thinking or the hive's collective intelligence. They could shine a light on how computers do or don't work, yet how many people are holding the torch?
I think the self-correction of science can be nice, but the acknowledgement of racism operates on cultural scales of decades and centuries. The corrected discussions are likely to appear long after computer thinking breakthroughs which happen on the scale of years to decades. So you're right, but that doesn't matter today. But I admit I'm not sure if that's what you meant here.
>> They could shine a light on how computers do or don't work, yet how many people are holding the torch?
Scant few, unfortunately. I think in the AI community there is a general agreement that bees and ants have some kind of intelligence but I don't see anyone doing much about it, like using it as a model for AI (although note e.g. Ant Colony Optimisation and other biologically inspired algorithms). For example, I have seriously considered applying for funding for a project titled "Arthropod-Like Intelligence" that would seek to create an artificial agent (a robot of some sort) with autonomous capabilities at the level of a spider. Unfortunately, every time I start to write up the proposal I immediately imagine the ridicule it -and I- would be subjected to by any funding committee and fellow researchers in AI and I give up. So yes, the current agreement about what counts as intelligence is limiting to the advancement of science. But there is some progress, slow as it is- thanks to people bolder than myself.
With octopi I think it's a different matter. People generally don't get to see how octopi behave. Once they're shown examples, like in videos etc, I think most are convinced. So it's more an element of surprise, rather than a real resistance to the idea. I think the debate is more on different aspects of let's say broader cognition, like self-awareness, the ability to feel pain, etc. Again I think the safe bet here is to adopt the default position that all animals are intelligent, self-aware and can feel pain, since there's no reason that humans should be special in that respect. But, really, it's not my field so my opinion doesn't matter in the grand scheme of things.
>> The corrected discussions are likely to appear long after computer thinking breakthroughs which happen on the scale of years to decades.
Yeah, unfortunately science takes a very long time. But it works in the end and there's no better way we know. I don't guess I have to argue about that though.
>but when something is intelligent then it should leave us no doubt that it is.
Not that long ago, a whole lot of humans (the majority on some continents) asserted that a group of other humans were not intelligent, so strongly that they purchased them as property and treated them worse than even farm animals, so I think you can basically throw this one out the window.
Who said that slaves weren't intelligent? I know they were probably treated as subhuman, but as not having intelligence? Like a rock or a piece of wood? I don't believe that was the case.
What probably happened, and still happens, is that some people underestimate the intelligence of other people and think they're not as intelligent as themselves, not that they don't have intelligence.
I'm not sure why a rock or piece of wood is the bar here. The fact is nearly no-one in the Americas would call or describe Sub-Saharan Africans as Intelligent.
Thomas Jefferson, who generally seemed to oppose slavery, says this in his book, "Notes on the State of Virginia" (1785):
"In general, their existence appears to participate more of sensation than reflection."
"Comparing them by their faculties of memory, reason, and imagination, it appears to me that in memory they are equal to the whites; in reason much inferior, as I think one could scarcely be found capable of tracing and comprehending the investigations of Euclid; and that in imagination they are dull, tasteless, and anomalous."
Doesn't even sound like he's talking about the same species. We can scarcely agree on the intelligence of other humans. Let's not even get into the topic of animals.
So the idea that intelligence will manifest and we'll all just see it and agree...Yeah No.
>> "Comparing them by their faculties of memory, reason, and imagination, it appears to me that in memory they are equal to the whites; in reason much inferior, as I think one could scarcely be found capable of tracing and comprehending the investigations of Euclid; and that in imagination they are dull, tasteless, and anomalous."
He's saying they're of inferior "reasoning" (but equal in memory). That's hardly saying they're not intelligent or recognising them as not intelligent.
I don't think you will find anyone saying what you think someone's saying. "Not intelligent" is indeed a rock or a piece of wood. "Inferior" is a different concept.
"Inferior", "sub-human", "feeble-minded", sure, people keep saying things like that for other humans. But, "not intelligent" in the sense of "can't use language", "can't use tools", "can't tie own shoelaces", that would be very hard to maintain in the face of very easy to make observations. Which, you know, is exactly my point.
Are you using "intelligent" to mean "very smart" or something similar? That's not what I'm saying. I'm really just pointing to the difference between 0 intelligence, like a rock, and undeniable intelligence, like a 5-year old child.
Well, you say that if we're arguing about whether it's intelligent or not, it probably isn't.
OK, I guess that's the argument then... So what are these obvious signs of Intelligence that the machines in question are not displaying?
From the few you gave as examples:
"can't use language" - Obviously not a problem
"can't use tools" - Digital tools, Some physical tools is certainly possible today
"can't tie own shoelaces" - Don't know any animal that can do this and animals pass your bar of intelligence.
Sure, we're having endless arguments about the Intelligence of SOTA LLMs today, but almost none of it has anything to do with observable, testable capabilities.
In other words, the problem with these debates isn't that people disagree about whether both the bird and the plane have willful sustained airtime (aka flying). It's that they've seen both and decided, for entirely arbitrary reasons, to label one "real flying" and the other "fake flying".
And if you then ask the reasonable question, "Which properties separate the so called 'real flight' from 'fake flight' and how do we test for it?", then nobody seems to have any clue.
Above you're arguing for one side of the debate. I'm arguing that as long as there is a debate the safest bet is to adopt the default, null position: it's not flying; it's not intelligent.
To summarise my argument again: if we can all agree that something is intelligent, then it probably is. If we can't, then it probably isn't, especially, I would add, if there is substantial disagreement.
One motivation for this is to avoid an unending quest for the right definition. Once we can all agree what (artificial) intelligence is, it will be much easier to agree to a definition. But while we're all looking at the same thing and can't agree on what it is, how can we agree on a definition, and then use it to support one or the other side of the debate?
Take human intelligence. I don't think anyone seriously doubts that humans are intelligent, except perhaps for the purpose of being contrary, or being a philosopher[1]. We may not know what "intelligent" means, but we can agree that whatever it is, it's something that humans have. Yes, that's an arbitrary decision, but as long as we can agree on it, it doesn't matter that it's arbitrary: it matters that there is consensus, based on common experience, and we all know what we're talking about. We can't get to a definition of a phenomenon before we all agree that it is there: we all agreed that fire is fire, water is water and ice is ice, a long time before we had any sort of commonly agreed definitions of them.
"I can't put my finger on it but I know it when I see it"- that's a very simple way to avoid discussions that lead nowhere.
And there are way too many discussions that lead nowhere in AI.
Edit: btw, in research the discussions are much more focused. E.g. "LLMs can plan": that's a concrete, testable claim. No reason to define intelligence to resolve it. Much more progress is made this way than by endlessly chasing our tails about what is or isn't intelligent.
___________
[1] The purpose of science is to pose questions and answer them. The purpose of philosophy is to pose questions and question them.
How much of a consensus is there though, when as soon as you start digging at the edges, the consensus dissolves? And the edges are the boundaries of the pool of experiences we know. What about the experiences we don't know? What if we start considering not only agents with a physical body, but also add a dimension of agents without one (considering that the way we recognize intelligence in animals is by the way they move)? Then our entire discussion jumps to the border, and the consensus is nowhere to be found, making the heuristic - again - biased towards the familiar.
No, I think we need a better heuristic than consensus. The consequences of a bad heuristic are, at best, wasting time, but at worst, pulling the trigger on something heinous because it was insufficiently understood.
I don't think that hideous things, like eugenics or Nazi racism, happen because of a lack of understanding. They happen because people have their heads up their butts with megalomaniac ideas about the world and their place in it. The "Master Race" was not a misunderstanding, it was some people wanting to be better than everyone else and making up, out of whole cloth -and no empirical evidence at all- that they were. There certainly wasn't universal consensus about it btw, just between Nazis.
I agree that there are dangers in trusting consensus as a mechanism for advancement of knowledge, and I see your point about risking losing focus on the border. But I think it is also very important to have a common body of knowledge that we can all agree on. Take flying again: nobody disputes the fact that planes are flying (again, nobody who isn't just being contrary). We can work out the gray areas in time and they don't necessarily make progress impossible. E.g. maybe we can't agree on whether hovercrafts fly, but we can still make them and use them.
So there is a limiting effect, like I point out in my other comment, but it is not a hard lock on progress. And I think that avoiding endless philosophical discussions is a big advantage.
If there is a better heuristic then I'm happy to embrace it btw. But, is there?
Hideous things can be done by evil people, but also accidentally by well-meaning people when they don't realize what the stakes are. How many people stop being racist only because they leave the environment where racism was prevalent and they get exposed to a different heuristic on human worth?
> avoid acting heinously by the standards of any reasonable ethical principle that draws a significant proportion of well-informed, thoughtful theorists
Thanks for the link. I generally agree with the premises, though not necessarily the utilitarian treatment. For me it goes without saying that if we create AIs (that we can all agree are) as intelligent as humans, we will have to treat them just as we treat humans. Which kind of defeats their purpose: most people who want to create AI think of them as superhuman machine slaves, as far as I've seen.
Everyone seems to want to discuss whether there’s some fundamental qualia preventing my toaster from being an AGI, but no one is interested in acknowledging that my toaster isn’t an AGI. Maybe a larger toaster would be an AGI? Or one with more precise toastiness controls? One with more wattage?
The only thing this paper proves is that folks at Trinity College in Dublin are poor, envious, anthropocentric drunkards, ready to throw every argument they can at defending their crown of creation, without actually understanding the linguistic concepts they use to make their argument.
Not much new here. The basic criticism is that LLMs are not embodied; they have no interaction with the real world. The same criticism can be applied to most office work.
Useful insight: "We (humans) are always doing more than one thing." This is in the sense of language output having goals for the speaker, not just delivering information. This is related to the problem of LLMs losing the thread of a conversation. Probably the only reasonably new concept in this paper.
Standard rant: "Humans are not brains that exist in a vat..."
"LLMs ... have nothing at stake." Arguable, in that some LLMs are trained using punishment. Which seems to have strong side effects. The undesirable behavior is suppressed, but so is much other behavior. That's rather human-like.
"LLMs Don’t Algospeak". The author means using word choices to get past dumb censorship algorithms. That's probably do-able, if anybody cares.
The optimization process adjusts the weights of a computational graph until the numeric outputs align with some baseline statistics of a large data set. There is no "punishment" or "reward"; gradient descent isn't even necessary, as there are methods for modifying the weights in other ways, and the optimization still converges to a desired distribution, which people then claim is "intelligent".
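For concreteness, here is a minimal toy sketch (my own illustration, not anyone's actual training pipeline) of what "adjusting the weights until the numeric outputs align with the statistics of the data" looks like in the simplest possible case: a softmax over four tokens fitted by plain gradient descent.

    import numpy as np

    rng = np.random.default_rng(0)
    data = rng.choice(4, size=10_000, p=[0.1, 0.2, 0.3, 0.4])   # toy "corpus" of four tokens
    target = np.bincount(data, minlength=4) / len(data)          # baseline statistics of the data

    logits = np.zeros(4)                                         # the "weights" of the graph
    for step in range(2_000):
        probs = np.exp(logits) / np.exp(logits).sum()            # model's output distribution
        logits -= 0.5 * (probs - target)                         # gradient of cross-entropy w.r.t. logits

    print(np.round(probs, 3))   # ends up at roughly the empirical frequencies, ~[0.1, 0.2, 0.3, 0.4]

Swap gradient descent for an evolutionary or other weight-update rule and the outcome is the same: the weights end up encoding the statistics of the data, nothing more mystical than that.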
The converse is that people are "just" statistical distributions of the signals produced by them but I don't know if there are people who claim they are nothing more than statistical distributions.
I think people are confused because they do not really understand how software and computers work. I'd say they should learn some computability theory to gain some clarity but I doubt they'd listen.
If you really want to phrase it that way, organisms like us are "just" distributions of genes that have been pushed this way and that by natural selection until they converged to something we consider intelligent (humans).
It's pretty clear that these optimisation processes lead to emergent behaviour, both in ML and in the natural sciences. Computability theory isn't really relevant here.
I don't even know where to begin to address your confusion. Without computability theory there are no computers, no operating systems, no networks, no compilers, and no high level frameworks for "AI".
Well, if you want to address my "confusion" then pick something and start there =)
That is patently false - most of those things are firmly in the realm of engineering, especially these days. Mathematics is good for grounding intuition though. But why is this relevant to the OP?
There is no reason to do any of that because according to your own logic AI can do all of it. You really should sit down and ponder what exactly you get out of equating Turing machines with human intelligence.
Sorry, I edited my reply because I decided going down that rabbit hole wasn't worth it. Didn't expect you to reply immediately.
I'm not equating anything here, just pointing out that the fact that AI runs in software isn't a knockdown argument against anything. And computability theory certainly has nothing useful to say in that regard.
Well, you know, elaborate and we can have a productive discussion. The way you keep appealing to computability theory as a black box makes me think you haven't actually studied that much of it.
Good summary of some of the main "theoretical" criticism of LLMs but I feel that it's a bit dated and ignores the recent trend of iterative post-training, especially with human feedback. Major chatbots are no doubt being iteratively refined on the feedback from users i.e. interaction feedback, RLHF, RLAIF. So ChatGPT could fall within the sort of "enactive" perspective on language and definitely goes beyond the issues of static datasets and data completeness.
Sidenote: the authors make a mistake when citing Wittgenstein to find similarity between humans and LLMs. Language modelling on a static dataset is mostly not a language game (see Bender and Koller's section on distributional semantics and caveats on learning meaning from "control codes")
It does. That's what the "direct preference" part of DPO means: you avoid training an explicit reward model on the preference data, as in RLHF, and instead directly optimize for the log probability of preferred vs. dispreferred responses.
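For reference, the per-pair objective being described is roughly the following (a toy sketch of the DPO loss from Rafailov et al., 2023; the inputs are the summed token log-probabilities of each full response under the policy being trained and under a frozen reference model):

    import math

    def dpo_loss(logp_w, logp_l, ref_logp_w, ref_logp_l, beta=0.1):
        # implicit "reward" of each response: scaled log-ratio of policy vs frozen reference
        reward_w = beta * (logp_w - ref_logp_w)   # preferred response y_w
        reward_l = beta * (logp_l - ref_logp_l)   # dispreferred response y_l
        margin = reward_w - reward_l
        return -math.log(1.0 / (1.0 + math.exp(-margin)))   # -log sigmoid(margin)

    # Loss shrinks as the policy raises the preferred response relative to the reference:
    print(dpo_loss(logp_w=-10.0, logp_l=-12.0, ref_logp_w=-11.0, ref_logp_l=-11.5))

Minimizing that over a dataset of (preferred, dispreferred) pairs plays the same role as the reward model plus RL step in RLHF, just collapsed into one supervised-looking objective.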
What is it called when humans interact with a model through lengthy exchanges (mostly humans correcting the model’s responses to a question posed to the model, mostly through chat, labeling each statement by the model as correct or not), and then all of that text (possibly with some editing) is fed to another model to train that higher model?
I don’t think that process has a specific name. It’s just how training these models works.
Conversations you have with, say, ChatGPT are likely stored, sorted through somehow, then added to an ever-growing dataset of conversations that would be used to train entirely new models.
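Something like the sketch below is roughly the shape of that pipeline, though every field name here is invented for illustration, and whether any given lab does exactly this isn't public:

    # Hypothetical: logged chats where a human labelled each assistant reply as
    # correct or not; keep the approved (prompt, completion) pairs as training
    # examples for the next model.
    def build_finetune_set(conversations):
        examples = []
        for convo in conversations:
            prompt = None
            for turn in convo:
                if turn["role"] == "user":
                    prompt = turn["text"]            # remember the latest user message
                elif turn["role"] == "assistant" and turn.get("label") == "correct":
                    examples.append({"prompt": prompt, "completion": turn["text"]})
        return examples

    chats = [[
        {"role": "user", "text": "Name three Stoic philosophers."},
        {"role": "assistant", "text": "Seneca, Epictetus, Marcus Aurelius.", "label": "correct"},
    ]]
    print(build_finetune_set(chats))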
The authors of this paper are just another instance of the AI hype being used by people who have no connection to it, to attract some kind of attention.
"Here is what we think about this current hot topic; please read our stuff and cite generously ..."
> Language completeness assumes that a distinct and complete thing such as `a natural language' exists, the essential characteristics of which can be effectively and comprehensively modelled by an LLM
Replace "LLM" by "linguistics". Same thing.
> The assumption of data completeness relies on the belief that a language can be quantified and wholly captured by data.
That's all that a baby has, who becomes a native speaker of their surrounding language. Language acquisition does not imply totality of data. Not every native speaker recognizes exactly the same vocabulary and exactly the same set of grammar rules.
Babies have feedback and interaction with someone speaking to them. Would they learn to speak if you just dumped them in front of a TV and never spoke to them? I'm not sure.
But anyway I agree with you. This is just a confused HN comment in paper form.
I personally don’t get much value out of the paper, but it is orders of magnitude more substantive and thoughtful than a median “confused Hacker News comment”.
> Babies have feedback and interaction with someone speaking to them. Would they learn to speak if you just dumped them in front of a TV and never spoke to them? I'm not sure.
Feedback and interaction are not vital for second-language acquisition, at least according to one theory.
And if that’s good enough for adults it might be good enough for sponge-brain babies.
They are two researchers/assistant professors working with cognitive science, psychology, and trustworthy AI. The paper is peer reviewed and has been accepted for publication in the Journal of Language Sciences.
You should publish your critique of their research in that same journal.
P.s. if you find any grave mistakes, you can contact the editor in chief, who happens to be a linguist.
Their critique is written here, in plain English. Any fault with it you can just mention. The "I won't read your comment unless you get X journal to publish it" stance seems really counterproductive. Presumably even the great Journal of Language Sciences is not above making mistakes or publishing things that are not perfect.
The "efficient journal hypothesis" -- if something is written in a paper in a journal, then it's impossible for anyone to know any better, since if they knew better, they would already have published the correction in a journal.
The parent comment I responded to is speculative and does not argue on the merits. We can do better here.
Are there people who ride the hype wave of AI? Sure.
But how can you tell from where you sit? How do you come to such a judgment? Are you being thoughtful and rational?
Have you considered an alternative explanation? I think the odds are much greater that the authors’ academic roots/training is at odds with what you think is productive. (This is what I think, BTW. I found the paper to be a waste of my time. Perhaps others can get value from it?)
But I don’t pretend to know the authors’ motivations, nor will I cast aspersions on them.
When one casts shade on a person like the comment above did, one invites and deserves this level of criticism.
That's a lot of thinking they've done about LLMs, but how much did they actually try LLMs? I have long threads where ChatGPT refines solutions to coding problems. Their example of losing the thread after printing a tiny list of 10 philosophers seems really outdated. Also, it seems LLMs utilize nested contexts as well, for example when they can break their own rules while telling a story or speaking hypothetically.
For a paper submitted on July 11, 2024, and with several references to other 2024 publications, it is indeed strange that it gives ChatGPT output from April 2023 to demonstrate that “LLMs lose the thread of a conversation with inhuman ease, as outputs are generated in response to prompts rather than a consistent, shared dialogue” (Figure 1). I have had many consistent, shared dialogues with recent versions of ChatGPT and Claude without any loss of conversation thread even after many back-and-forths.
Most LLM critics (and singularity-is-near influencers) don't actually use the systems enough to have relevant opinions about them. The only really good sources of truth are the chatbot-arena from lmsys and the comment section of r/localllama (I'm quoting Karpathy); both are "wisdom of the crowd", and often the crowd on r/localllama is getting that wisdom by spending hours with one hand on the keyboard and another under their clothes.
There is a lot of frustration here over what appears to be essentially this claim:
> ...we argue that it is possible to offer generous interpretations of some aspects of LLM engineering to find parallels with human language learning. However, in the majority of key aspects of language learning and use, most specifically in the various kinds of linguistic agency exhibited by human beings, these small apparent comparisons do little to balance what are much more deep-rooted contrasts.
Now, why is this so hard to stomach? This is the argument of this paper. To feel like this extremely general claim is something you have to argue against means you believe in a fundamental similarity between our linguistic agency and the model's. But is embodied human agency something that you really need the LLMs to have right now? Why? What are the stakes here? The ones actually related to the argument at hand?
This is ultimately not that strong a claim! To the point that it's almost vacuous... Of course the LLM will never learn the stove is "hot" like you did when you were a curious child. How can this still be too much to admit for someone? What is lost?
It makes me feel a little crazy here that people constantly jump over the text at hand whenever something gets a little too philosophical, and the arguments become long pseudo-theories that aren't relevant to the argument.
“Enactivism”, really? I wonder if these complaints will continue as LLMs see wider adoption; the old "first they ignore you, then they ridicule you, then they fight you…" trope is halfway accurate. Any field that focuses on building theories on top of theories is in for a bad time.
Where I work, there's a somewhat haphazardly divided org structure, where my team has some responsibility to answer the executives' demands to "use AI to help our core business". So we applied off-the-shelf models to extract structured context from mostly unstructured text - effectively a data engineering job - and thereby support analytics and create more dashboards for the execs to mull over.
Another team, with a similar role in a different part of the org, has jumped (feet first) into optimizing large language models to turn them into agents, without consulting the business about whether they need such things. RAG, LoRA and all this optimization is well and good, but this engineering focus has found no actual application, except wasting several million bucks hiring staff to do something nobody wants.
How would the authors consider a paralyzed individual who can only move their eyes since birth? That person can learn the same concepts as other humans and communicate as richly (using only their eyes) as other humans. Clearly, the paper is viewing the problem very narrowly.
I didn’t want to Google it for you because it always makes me sad but things like spina bifida and moebius syndrome exist. Not everyone gets to begin life healthy.
I'm more or less a layperson when it comes to LLMs and this nascent concept of AI, but there's one argument that I keep seeing that I feel like I understand, even without a thorough fluency with the underlying technology. I know that neural nets, and the mechanisms LLMs employ to train and form relational connections, can plausibly be compared to how synapses form signal paths between neurons. I can see how that makes intuitive sense.
I'm struggling to articulate my cognitive dissonance here, but is there any empirical evidence that LLMs, or their underlying machine learning technology, share anything at all with biological consciousness beyond a convenient metaphor for describing "neural networks" using terms borrowed from neuroscience? I don't know that it follows that just because something was inspired by, or is somehow mimicking, the structure of the brain and its basic elements, it should necessarily relate to its modeled reality in any literal way, let alone provide a sufficient basis for instantiating a phenomenon we frankly know very little about. Not for nothing, but our models naturally cannot replicate any biological functions we do not fully understand. We haven't managed to reproduce biological tissues that are exponentially less complex than the brain; are we really claiming that we're just jumping straight past lab-grown t-bones to intelligent minds?
I'm sure most of the people reading this will have seen Matt Parker's videos where they "teach" matchboxes to win a game against humans. Is anyone suggesting those matchboxes, given infinite time and repetition, would eventually spark emergent consciousness?
> The argument would be that that conceptual model is encoded in the intermediate-layer parameters of the model, in a different but analogous way to how it's encoded in the graph and chemical structure of your neurons.
Sorry if I have misinterpreted anyone. I honestly thought all the "neuron" and "synapse" references were handy metaphors to explain otherwise complex computations that resemble this conceptual idea of how our brains work. But it reads a lot like some of the folks in this thread believe it's much more than metaphors, but rather a literal analog.
I don't think anyone in research actually believes this. Note that the whole idea behind claiming "scaling laws" will infinitely improve these models is a funding strategy rather than a research one. None of these folks think human-like consciousness will "rise" from this effort, even though they veil it to continue the hype-cycle. I guarantee all these firms are desperately looking for architectural breakthroughs, even while they wax poetic about scaling laws, they know there is a bottleneck ahead.
Notice how LeCun is the only researcher being honest about this in a public fashion. Meta is committed to AI already and will at least match the spend of competitors anyway, so he doesn't have as much pressure to try and convince investors that this rabbit hole is deeper.
Don't get me wrong, LLMs are a tremendous improvement on knowledge compression and distillation, but it's still unreliable enough that old school search is likely a superior method nonetheless.
Put aside consciousness or hype or investment. Look at the results: LLMs are well beyond old-school search in many ways. Sure, they are flawed in some ways. Previous paradigms for search were also flawed in their own ways.
Look at the arc of NLP. Large language models fit the pattern. One could even say that their development (next token prediction with a powerful function approximator) is obvious in hindsight.
Honestly I don't disagree, I just think that humans tend to anthropomorphize to such a high extent that there is a fair bit of hyperbole promoting LLMs as more than they are. It's my opinion that the big flaws LLMs currently present aren't going to be overcome by scaling alone.
Scaling existing architectures (inference I mean) will probably help a lot. Combine that with better training and hybrid architectures, and I personally expect to see continued improvement.
However, given the hype cycle, combined with broad levels of ignorance of how LLMs work, it is an open question if even amazing progress will impress people anymore.
I'm less concerned with people's perception and strictly concerned with value. If we were to define value as the number of things that can be automated or severely improved by the technology or its future versions, there is a misalignment between value and perceived value.
The value is lower than perceived because there is an assumption that what's preventing higher value delivery from the investment is for the models to get better at generating responses from prompts. But there are two issues with this position.
1. LLMs still require plenty of assistance where they are writing production-ready code and making function calls, especially if the original API wasn't designed with LLMs in mind. Unless there is a leap in architecture that makes all the tools we have, including non-API ones, easily accessible for the models to interact with, the amount of glue code required to make it all work increases the potential for features that users can use, but does not necessarily deliver higher value, in that a layperson still probably can't develop software with an LLM copilot in tow. So yes, no one is going to be impressed even if we see models improve further on benchmarking.
2. Long-Horizon goals. Long-term research or even project management requires interdisciplinary understanding of how all the goals around success relate to each other and most importantly how to assess if an outcome is leading to a goal accomplishment or not. There isn't an architectural foundation for the models to be grounded in a reality that presupposes these abilities.
What I fail to see, is how improving next token prediction will materially move the needle on these other aspects of intelligence that aren't necessarily related to generating an output or a series of outputs orchestrated over an evolving set of requirements.
Honestly, I think that the LLM portion of the human brain has likely been surpassed by existing models.
There isn't really any reason biological neurons should relate to their modelled reality; what does a single cell care about poetry, or even simple things like a chair?
I find discussions of consciousness even more taxing than religion, free will, or politics.
With very careful discussion, there are some really interesting concepts in play. This paper however does not strike me as worth most people’s time. Especially not regarding consciousness.
Oh, what a kettle of worms here... Now the mind must consider "repetitive speech under pressure and in formal situations" in contrast and comparison to "limited mechanical ability to produce grammatical sequences of well-known words"... where is the boundary there?
I am a fan of this paper, warts and all! (And the paper summary paragraph contained some atrocious grammar, btw.)
Why assume you "know" what language is? Like there is a study backed insight on the ultimate definition of language?
it's the same as saying "oh, it's not 'a,b,c' its 'x,y,z'", which makes you as dogmatic as the one you critique.
This is absurd.