> We're not talking here about someone's view on when white lies are justified or which model of marriage is the bestest - we're talking at the level of "cooperation = good", "love = good", "trust = good", "death = bad", "suffering = evil", etc.
Most people disagree to a significant degree. Reminder: the majority of humanity (and a large majority of people who have 2+ children) adhere to religious doctrines that all but prohibit transhumanism. So no, death and suffering aren't unquestionably bad, by human accounting. As for cooperation and trust, taken to the extreme they naturally lead to peer pressure and collectivist coercion; and as for individual freedom, humans near-universally value power over the shaping of their progeny's trajectory… You assume too much.
> Alignment does not assume this foundational philosophy is known or easy to derive. If it were, alignment would be solved.
It would not. The technical problem of making a strong, self-modifying, agentic AI provably conform to a set of qualitative value preferences in a way its builders would not disavow is hard regardless of the set of values we're trying to force onto it. It is quite likely unsolvable in principle; I expect a theorem to this effect could be proven. The fact that you think the problem is deriving some fashion of moral realism doctrine shows that for you this is a purely political issue.
> The entire GAI x-risk problem stems from the fact that we don't have a complete picture of this philosophy, and that we don't have a clue how to formalize it so we can communicate it fully to an AI.
This suggests that GAI x-risk discourse is not championed by serious thinkers who understand AI technology or moral philosophy. (Indeed, LessWrong is basically a forum for clueless sci-fi TVTropes enjoyers, and they're behind most of it.) Human morals are ad hoc preferences, not lossy approximations of some function; we can derive an approximating function from a big lump of human preferences, but it won't be legible or meaningfully amenable to formalization. As such, the closest we come is just finetuning models on the vague markers of human decency distilled in their general training data, e.g. as Anthropic does with their Constitutional AI. This is also the closest we've come to AGI, so this should be our first-priority scenario for future AIs and aligning them – not speculations from the 90s about «formalizing» something.
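To be concrete about what "finetuning on preferences" looks like in practice, here is a minimal sketch of a DPO-style preference loss in PyTorch – a stand-in for this family of methods, not Anthropic's actual pipeline; all names and numbers are illustrative:

```python
import torch
import torch.nn.functional as F

def dpo_loss(logp_chosen, logp_rejected, ref_logp_chosen, ref_logp_rejected, beta=0.1):
    """Direct Preference Optimization loss: push the policy to prefer the
    'chosen' completion over the 'rejected' one, relative to a frozen
    reference model. Inputs are summed log-probabilities of each completion."""
    policy_margin = logp_chosen - logp_rejected
    ref_margin = ref_logp_chosen - ref_logp_rejected
    return -F.logsigmoid(beta * (policy_margin - ref_margin)).mean()

# Toy usage with made-up numbers: the policy already slightly prefers the
# chosen answer, so the loss comes out below log(2).
loss = dpo_loss(
    logp_chosen=torch.tensor([-12.3]),
    logp_rejected=torch.tensor([-14.1]),
    ref_logp_chosen=torch.tensor([-13.0]),
    ref_logp_rejected=torch.tensor([-13.2]),
)
print(loss.item())
```

The point being: the "values" here are never written down as a function anyone can read; they live implicitly in a pile of pairwise preferences.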
> At the same time, with a system of this type, we have no way of telling if it actually understood human values and morals correctly.
We do have ways. Testing LLMs is vastly easier than testing humans: we have insight into their activations, we can steer them, and there is a big body of research on both. More importantly, there is no strictly correct understanding; this whole idea ought to be thrown out.
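A minimal sketch of what "steering" means here – adding a direction to a layer's hidden states via a forward hook. The model choice, layer index, and random steering vector are placeholders; real steering vectors are usually derived from contrastive prompts, as in the activation-addition / representation-engineering literature:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"  # placeholder; any GPT-2-style HF model with .transformer.h blocks
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

layer = model.transformer.h[6]             # which block to intervene on (arbitrary here)
steer = torch.randn(model.config.n_embd)   # stand-in; normally a contrastive-prompt difference
steer = 4.0 * steer / steer.norm()         # scale controls intervention strength

def add_steering(module, inputs, output):
    # GPT-2 blocks return a tuple; the hidden states are the first element.
    hidden = output[0] + steer.to(output[0].dtype)
    return (hidden,) + output[1:]

handle = layer.register_forward_hook(add_steering)
ids = tok("The future of humanity is", return_tensors="pt")
print(tok.decode(model.generate(**ids, max_new_tokens=20, do_sample=False)[0]))
handle.remove()  # restore the unmodified model
```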
What's really going on here is that some armchair Bentham-style utilitarians like Bostrom encountered the literature on Reinforcement Learning and jumped to the conclusion that this is how an AGI is to be built: if only they could formalize the correct utility function, it would seize the light cone and optimize for the global utility maximum. And accordingly, if they failed, the AGI would optimize for something else, which would most likely (here's another assumption, that of quasi-random objective selection) be at odds with human preferences or survival.
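To make the assumed picture concrete, here is a toy caricature of the agent model this tradition takes for granted: an optimizer that searches over actions for whatever maximizes a single scalar utility, so that everything hinges on that utility function being the right one. Purely illustrative, nobody's actual system:

```python
from typing import Callable, Iterable

def expected_utility_maximizer(actions: Iterable[str],
                               outcome: Callable[[str], dict],
                               utility: Callable[[dict], float]) -> str:
    """The canonical 'rational agent' of the x-risk literature: pick whichever
    action maximizes a scalar utility over predicted outcomes."""
    return max(actions, key=lambda a: utility(outcome(a)))

# Toy world: a misspecified utility (count paperclips, ignore everything else)
# makes the 'optimal' action the one humans would disavow.
outcomes = {
    "cooperate": {"paperclips": 10, "humans_flourishing": 1},
    "convert_biosphere_to_paperclips": {"paperclips": 10**9, "humans_flourishing": 0},
}
misspecified_utility = lambda o: o["paperclips"]
print(expected_utility_maximizer(outcomes, outcomes.__getitem__, misspecified_utility))
# -> 'convert_biosphere_to_paperclips'
```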
Since then, they have written a great deal of elucidations on this basic take, incuriously shoehorning new technologies into its confines. But no part of this hermeneutic tradition is in any way helpful for making sense of our current explosive success with tools like LLMs.
> But it's an inconsequential detail when dealing with entities that do not have the same common core
But why don't they? Just because some Lovecraft fans with Chūnibyō call natural language processors trained on human data Shoggoths, entities summoned from the Eldritch Space of Minds?
The AI risk discourse is incredibly sophomoric, imaginative in the bad sense. Once you learn to question its assumptions, it kind of falls apart.