I'm doing this because one common way of trying to solve the "friendliness content" problem in Friendly AI theory is to analyze (via thought experiment and via cognitive science) our concept of "good" or "ought" or "right" so that we can figure out what an FAI "ought" to do, or what it would be "good" for an FAI to do, or what it would be "right" for an FAI to do.

That's what Eliezer does in The Meaning of Right, that's what many other LWers do, and that's what most mainstream metaethicists do.

With my recent posts on the cognitive science of concepts, I'm trying to show that cognitive science presents a number of difficult problems for this approach.

Let me illustrate with a concrete example. Math prodigy Will Sawin once proposed to me (over the phone) that our concept of "ought" might be realized by way of something like a dedicated cognitive module. In an earlier comment, I tried to paraphrase his idea:

Imagine a species of artificial agents. These agents have a list of belief statements that relate physical phenomena to normative properties (let's call them 'moral primitives'):

'Liking' reward signals in human brains are good.

Causing physical pain in human infants is forbidden.

etc.

These agents also have a list of belief statements about physical phenomena in general:

Sweet tastes on the tongue produce reward signals in human brains.

Cutting the fingers of infants produces physical pain in infants.

Things are made of atoms.

etc.

These agents also have an 'ought' function that includes a series of logical statements that relate normative concepts to each other, such as:

A thing can't be both permissible and forbidden.

A thing can't be both obligatory and non-obligatory.

etc.

Finally, these robots have actuators that are activated by a series of rules like:

When the agent observes an opportunity to perform an action that is 'obligatory', then it will take that action.

An agent will avoid any action that is labeled as 'forbidden.'

Some of these rules might include utility functions that encode ordinal or cardinal value for varying combinations of normative properties.

These agents can't see their own source code. The combination of the moral primitives, the ought function, the non-ought belief statements, and the set of behavioral rules produces their actions and their verbal statements about what ought to be done.
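The agent architecture paraphrased above can be sketched as a toy program. All the specific names, phenomena, and rules below are illustrative assumptions of mine, not part of Will's proposal; the point is only to make the separation between moral primitives, physical beliefs, the purely logical ought function, and the actuator rules concrete:

```python
# Moral primitives: beliefs relating physical phenomena to normative properties.
moral_primitives = {
    "produce_liking_reward": "good",
    "cause_infant_pain": "forbidden",
}

# Non-ought beliefs: which physical phenomena each action produces.
physical_beliefs = {
    "give_sweet_taste": "produce_liking_reward",
    "cut_infant_finger": "cause_infant_pain",
}

def ought_function(status):
    """Purely logical transformations between normative concepts,
    e.g. nothing forbidden is permissible or obligatory."""
    if status == "forbidden":
        return {"permissible": False, "obligatory": False}
    if status == "good":
        return {"permissible": True, "obligatory": False}
    # Default: actions with no normative status are merely permissible.
    return {"permissible": True, "obligatory": False}

def choose(actions):
    """Actuator rule: act only on actions the ought function permits."""
    chosen = []
    for action in actions:
        phenomenon = physical_beliefs.get(action)
        status = moral_primitives.get(phenomenon)
        if ought_function(status)["permissible"]:
            chosen.append(action)
    return chosen

print(choose(["give_sweet_taste", "cut_infant_finger"]))
# -> ['give_sweet_taste']
```

Note that `ought_function` never mentions atoms, tongues, or infants; all physical content lives in the belief tables, which is what lets Will say the ought function itself doesn't reduce to physics.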

From their behavior and verbal ought statements, these robots can infer to some degree how their ought function works. But they can't fully describe it: they haven't run enough tests, the ought function may simply be too complicated, and the problem is compounded by the fact that they also can't see their moral primitives.

The ought function doesn't reduce to physics because it's a set of purely logical statements. The 'meaning' of ought in this sense is determined by the role that the ought function plays in producing intentional behavior by the robots.

Of course, the robots could speak in ought language in stipulated ways, such that 'ought' means 'that which produces pleasure in human brains' or something like that, and this could be a useful way to communicate efficiently, but it wouldn't capture what the ought function is doing or how it is contributing to the production of behavior by these agents.

What Will is saying is that it's convenient to use 'ought' language to refer to this ought function only, and not also to a combination of the ought function and statements about physics, as happens when we stipulatively use 'ought' to talk about 'that which produces well-being in conscious creatures' (for example).

I'm saying that's fine, but it can also be convenient (and intuitive) for people to use 'ought' language in ways that reduce to logical-physical statements, and not only in ways that express a logical function that contains only transformations between normative properties. So we don't have substantive disagreement on this point; we merely have different intuitions about the pragmatic value of particular uses for 'ought' language.

We also drew up a simplified model of the production of human action in which there is a cognitive module that processes the 'ought' function (made of purely logical statements, like the robots' ought function), a cognitive module that processes habits, a cognitive module that processes reflexes, and so on. Each of these produces an output, and another module runs argmax on these action options to determine which action 'wins' and actually occurs.
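This selection step can be sketched in a few lines. The module names, situations, and numeric strengths here are purely hypothetical placeholders; the model only claims that each module proposes an action with some strength and a selector takes the maximum:

```python
# Each module maps a situation to a (proposed_action, strength) pair.
# The internals are stubbed; strengths are arbitrary illustrative values.

def ought_module(situation):
    # Would run the purely logical ought function over moral primitives.
    if situation == "person_in_need":
        return ("help_person", 0.9)
    return ("do_nothing", 0.1)

def habit_module(situation):
    return ("check_phone", 0.4)

def reflex_module(situation):
    return ("flinch", 0.2)

def select_action(situation):
    """Run every module, then argmax over proposal strengths."""
    proposals = [m(situation) for m in (ought_module, habit_module, reflex_module)]
    return max(proposals, key=lambda p: p[1])[0]

print(select_action("person_in_need"))  # -> help_person
```

One design consequence worth noticing: on this model the ought module doesn't always win. In a situation where its output is weak, a habit or reflex proposal can take the argmax, which matches the ordinary observation that people don't always do what they judge they ought to do.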

Of course, the human 'ought' function is probably spread across multiple modules, as is the 'habit' function.

Will likes to think of the 'meaning' of 'ought' as being captured by the algorithm of this 'ought' function in the human brain. This ought function doesn't contain physical beliefs, but rather processes primitive normative/moral beliefs (from outside the ought function) and outputs particular normative/moral judgments, which contribute to the production of human behavior (including spoken moral judgments). In this sense, 'ought' in Will's sense of the term doesn't reduce to physical facts, but to a logical function...

Will also thinks that the 'ought' function (in his sense) inside human brains is probably very similar between humans - ones that aren't brain damaged or neurologically deranged... [And] if the 'ought' function is the same in all healthy humans, then there needn't be a separate 'meaning' of ought (in Will's sense) for each speaker, but instead there could be a shared 'meaning' of ought (in Will's sense) that is captured by the algorithms of the 'ought' cognitive module that is shared by healthy human brains.

Would the existence of endorsers of moral error theory be evidence against humans having an ought-function? Most of the ones I am familiar with seem both roughly neurotypical and honest in their writings.

What experiences should we anticipate if humans have this hypothetical module? And if humans do not?

Most of the discussions at LW about Friendly AI seem to concern defining clear moral rules to program into AIs, but there's another concern about the plausibility of enduring Friendly AI. It's not a stretch to assume AIs will be capable of some form of self-modification - if not to themselves, then to copies of themselves they make - and even if it's not their "intention" to do so, copying is never perfect, so some analogy to evolution will produce versions of AI that are longer-lived or faster-reproducing if they've mutated away their friendliness-constraints. In other words, it's very difficult to see how we can force the very structure of future AI by its nature to be dependent on being nice to humans - friendliness would seem to be at best an irrelevant property to the success of AIs, so eventually we'd expect non-friendly ones to appear. (Cancerous AIs?) And since this will be occurring post-Singularity, we will have little hope of anticipating or understanding such developments.

Most of the discussions at LW about Friendly AI seem to concern defining clear moral rules to program into AIs

I don't think this is true at all. In my understanding the general consensus is that it would not be efficacious to try imposing rules on something that is vastly smarter than you and capable of self-modification. You want it to want to be "friendly" so that it will not (intentionally) change itself away from "friendliness".

It's not a stretch to assume AIs will be capable of some form of self-modification

I agree. Isn't that the basis for AI-based singularitarianism?

and even if it's not their "intention" to do so, copying is never perfect,

I'm pretty sure Eliezer has put a lot of thought into the importance of goal-preserving modification and reproduction.