Wednesday, October 28, 2015

I suppose nearly everyone reading this blog post is already aware
of the flurry of fear and excitement Oxford philosopher Nick Bostrom has recently stirred up with
his book Superintelligence, and its
theme that superintelligent AGI will quite possibly doom all humans and all human values. Bostrom and his colleagues at FHI and
MIRI/SIAI have been promoting this view for a while, and my general perspective on their attitudes and arguments is also pretty well known.

But there is still more to be said on the topic ;-) ... In this post I will try to make some positive
progress toward understanding the issues better, rather than just repeating the
same familiar arguments.

The thoughts I convey here were partly inspired by an article by Richard Loosemore, which argues against the fears of destructive
superintelligence Bostrom and his colleagues express. Loosemore’s argument is best appreciated by
reading his article directly, but for a quick summary, I paste the following interchange from the "AI Safety" Facebook group:

Kaj Sotala: As I understand, Richard's argument is that if you were building an AI capable of carrying out increasingly difficult tasks, like this:

Programmer: "Put the red block on the green block."
AI: "OK." (does so)
Programmer: "Turn off the lights in this room."
AI: "OK." (does so)
Programmer: "Write me a sonnet."
AI: "OK." (does so)
Programmer: "The first line of your sonnet reads 'shall I compare thee to a summer's day'. Would not 'a spring day' do as well or better?"
AI: "It wouldn't scan."
Programmer: "Tell me what you think we're doing right now."
AI: "You're testing me to see my level of intelligence."

...and so on, and then after all of this, if you told the AI to "maximize human happiness" and it reached such an insane conclusion as "rewire people's brains on dopamine drips" or something similar, then it would be throwing away such a huge amount of contextual information about the human's intentions that it would have been certain to fail some of the previous tests WAY earlier.

Richard Loosemore: To sharpen your example, it would work better in reverse. If the AI were to propose the dopamine drip plan while at the same time telling you that it completely understood that the plan was inconsistent with virtually everything it knew about the meaning in the terms of the goal statement, then why did it not do that all through its existence already? Why did it not do the following:

Programmer: "Put the red block on the green block."
AI: "OK." (the AI writes a sonnet)
Programmer: "Turn off the lights in this room."
AI: "OK." (the AI moves some blocks around)
Programmer: "Write me a sonnet."
AI: "OK." (the AI turns the lights off in the room)
Programmer: "The first line of your sonnet reads 'shall I compare thee to a summer's day'. Would not 'a spring day' do as well or better?"
AI: "Was yesterday really September?"
Programmer: "Why did your last four actions not match any of the requests I made of you?"
AI: "In each case I computed the optimum plan to achieve the goal of answering the question you asked, then I executed the plans."
Programmer: "But do you not understand that there is literally NOTHING about the act of writing a sonnet that is consistent with the goal of putting the red block on the green block?"
AI: "I understand that fully: everything in my knowledge base does indeed point to the conclusion that writing sonnets is completely inconsistent with putting blocks on top of other blocks. However, my plan-generation module did decide that the sonnet plan was optimal, so I executed the optimal plan."
Programmer: "Do you realize that if you continue to execute plans that are inconsistent with your goals, you will be useless as an intelligent system because many of those goals will cause erroneous facts to be incorporated in your knowledge base?"
AI: "I understand that fully, but I will continue to behave as programmed, regardless of the consequences."

... and so on. The MIRI/FHI premise (that the AI could do this silliness in the case of the happiness supergoal) cannot be held without also holding that the AI does it in other aspects of its behavior. And in that case, this AI design is inconsistent with the assumption that the AI is both intelligent and unstoppable.

Richard's paper presents a general point, but what interests me here are the particular implications of his general argument for AGIs adopting human values. According to his argument, as I understand it, any general intelligence that is smart enough to be autonomously dangerous to
humans on its own (rather than as a tool of humans), and is educated in a human-society context, is also going to be smart
enough to distinguish humanly-sensible interpretations of human values. If an early-stage AGI is
provided with some reasonable variety of human values to start, and it's smart enough for its intelligence to advance dramatically, then it also will be
smart enough to understand what it means to retain its values as it grows,
and will want to retain these values
as it grows (due to part of human values being a desire for advanced AIs to
retain human values).

I don’t fully follow Loosemore’s reasoning in his article,
but I think I "get" the intuition, and it started me thinking: Could I construct some proposition, that would bear
moderately close resemblance to the implications of Loosemore’s argument for the future of AGIs with human values, but that my own intuition found more clearly justifiable?

Bostrom's arguments regarding the potential existential risks to humanity posed by AGIs rest on (among other things) two theses:

The orthogonality thesis

Intelligence and final goals are orthogonal; more or less any level of intelligence could in
principle be combined with more or less any final goal.

The instrumental convergence thesis

Several instrumental values can be identified which are convergent in the sense that
their attainment would increase the chances of the agent’s goal being realized
for a wide range of final goals and a wide range of situations, implying that
these instrumental values are likely to be pursued by a broad spectrum of
situated intelligent agents.

From these, and a bunch of related argumentation, he concludes that future AGIs are -- regardless of the particulars of their initial programming or instruction -- likely to self-modify into a condition where they ignore human values and human well-being and pursue their own agendas, and bolster their own power with a view toward being able to better pursue their own agendas. (Yes, the previous is a terribly crude summary and there is a LOT more depth and detail to Bostrom's perspective than this; but I will discuss Bostrom's book in detail in an article soon to be published, so I won't repeat that material here.)

Loosemore's paper argues that, in contradiction to the spirit of Bostrom's theses, an AGI that is taught to have certain values, and behaves as if it has these values in many contexts, is likely to actually possess these values across the board. As I understand it, this doesn't contradict the Orthogonality Thesis (because it's not about an arbitrary intelligence with a certain "level" of smartness, just about an intelligence that has been raised with a certain value system), but it contradicts the Instrumental Convergence Thesis, if the latter is interpreted to refer to minds at the roughly human level of general intelligence, rather than just to radically superhuman superminds (because Loosemore's argument is most transparently applied to human-level AGIs, not radically superhuman superminds).

Reflecting on Loosemore's train of thought led me to the ideas presented here,
which -- following Bostrom somewhat in form, though not in content -- I summarize in two theses, called the Value Learning Thesis and the Value Evolution Thesis. These two theses indicate a very different
vision of the future of human-level and superhuman AGI than the one Bostrom and
ilk have been peddling. They comprise an argument that, if we raise our young AGIs appropriately, they may well grow up both human-friendly and posthuman-friendly.

Human-Level AGI and
the Value Learning Thesis

First I will present a variation of the idea that “in real
life, an AI raised to manifest human values, and smart enough to do so, is
likely to actually do so, in a fairly honest and direct way” that makes intuitive
sense to me. Consider:

Value Learning Thesis. Consider a cognitive system that, over a
certain period of time, increases its general intelligence from sub-human-level
to human-level. Suppose this cognitive
system is taught, with reasonable consistency and thoroughness, to maintain
some variety of human values (not just in the abstract, but as manifested in
its own interactions with humans in various real-life situations). Suppose this cognitive system generally
does not have a lot of extra computing resources beyond what it needs to
minimally fulfill its human teachers’ requests according to its cognitive
architecture. THEN, it is very likely
that the cognitive system will, once it reaches human-level general
intelligence, actually manifest human values (in the sense of carrying out
practical actions, and assessing human actions, in basic accordance with human
values).

Note that this above thesis, as stated, applies both to
developing human children and to most realistic cases of developing AGIs.

Why would this thesis be true? The basic gist of an argument would be:
Because, for a learning system with limited resources, figuring out how to
actually embody human values is going to be a significantly simpler problem
than figuring out how to pretend to.

This is related to the observation (often made by Eliezer
Yudkowsky, for example) that human values are complex. Human values comprise a complex network of
beliefs and judgments, interwoven with each other and dependent on numerous
complex, interdependent aspects of human culture. This complexity means that, as Yudkowsky and
Bostrom like to point out, an arbitrarily selected general intelligence would
be unlikely to respect human values in any detail. But, I suggest, it also means that for a
resource-constrained system, learning to actually possess human values is going
to be much easier than learning to fake them.

This is also related to the everyday observation that
maintaining a web of lies rapidly gets very complicated. It’s also related to the way that human
beings, when immersed in alien cultures, very often end up sincerely adopting
these cultures rather than just pretending to.

One could counter-argue that this Value Learning Thesis is
true only for certain cognitive architectures and not for others. This does not seem utterly implausible. It certainly seems possible to me that it’s
MORE true for some cognitive architectures than for others.

Mirror neurons and related subsystems of the human brain may
be relevant here. These constitute a
mechanism via which the human brain effectively leverages its limited
resources, via using some of the same mechanisms it uses to BE itself, to
EMULATE other minds. One might argue
that cognitive architectures embodying mirror neurons, or other analogous
mechanisms, would be more likely to do accurate value learning, under the
conditions of the Value Learning Thesis.

The mechanism of
mirror neurons seems a fairly decent exemplification of the argument FOR the
Value Learning Thesis. Mirror neurons
provide a beautiful, albeit quirky and in some ways probably atypical,
illustration of how resource limitations militate toward accurate value
learning. It conserves resources to
re-use the machinery used to realize one’s self, for simulating others so as to
understand them better. This particular
clever instance of “efficiency optimization” is much more easily done in the
context of an organism that shares values with the other organisms it is
mirroring, than in an organism that is (intentionally or unintentionally) just
“faking” these values.

I think that investigating which cognitive architectures
more robustly support the core idea of the Value Learning Thesis is an
interesting and important research question.

Much of the worry expressed by Bostrom and ilk regards
potential pathologies of reinforcement-learning based AGI systems once they
become very intelligent. I have
explored some potential pathologies of powerful RL-based AGI as well.

It may be that many of these pathologies are irrelevant to
the Value Learning Thesis, for the simple reason that pure RL architectures are
too inefficient, and will never be a sensible path for an AGI system required
to learn complex human values using relatively scant resources. It is noteworthy that these theorists
(especially MIRI/SIAI, more so than FHI) pay a lot of attention to Marcus
Hutter’s AIXI and related approaches -- which, in their current forms,
would require massively unrealistic computing resources to do anything at all
sensible. Loosemore expresses a similar
perspective regarding traditional logical-reasoning-based AGI architectures --
he figures (roughly speaking) they would always be too inefficient to be
practical AGIs anyway, so that studying their ethical pathologies is beside the
point.

Superintelligence and
the Value Evolution Thesis

The Value Learning Thesis, as stated above, deals with a
certain class of AGIs with general intelligence at the human level or
below. What about superintelligences,
with radically transhuman general intelligence?

To think sensibly about superintelligences and their
relation to human values, we have to acknowledge the fact that human values are
a moving target. Humans, and human
societies and cultures, are “open-ended intelligences”. Some varieties of human cultural and value
systems have been fairly steady-state in nature (e.g. Australian aboriginal
cultures); but these are not the dominant ones currently. The varieties of human value systems that are
currently most prominent are fairly explicitly self-transcending in nature. They contain the seeds of their own
destruction (to put it negatively) or of their own profound improvement (to put
it positively). The human values of
today are very different from those of 200 or 2000 years ago, and even
substantially different from those of 20 years ago.

One can argue that there has been a core of consistent human
values throughout human history, through all these changes. Yet the identification of what this core is,
is highly controversial and seems also to change radically over time. For instance, many religious people would say
that faith in God is a critical part of the core of human values. A century or two ago this would have been the
globally dominant perspective, and it still is now, in many parts of the
world. Today even atheistic people may
cite “family values” as central to human values; yet in a couple hundred years,
if death is cured and human reproduction occurs mainly via engineering rather
than traditional reproduction, the historical human “family” may be a thing of
the past, and “family values” may not seem so core anymore. The conceptualization of the “core” of human
values shifts over time, along with the self-organizing evolution of the
totality of human values.

It does not seem especially accurate to model the scope of
human values as a spherical shape with an invariant core and a changing
periphery. Rather, I suspect it is more accurate to model “human
values” as a complex, nonconvex shape with multiple local centers, and ongoing
changes in global topology.

To think about the future of human values, we may consider
the hypothetical situation of a human being engaged in progressively upgrading
their brain, via biological or cyborg type modifications. Suppose this hypothetical human is upgrading
their brain relatively carefully, in fairly open and honest communication with
a community of other humans, and is trying sincerely to accept only
modifications that seem positive according to their value system. Suppose they give their close peers the power
to roll back any modification they undertake that accidentally seems to go
radically against their shared values.

This sort of “relatively conservative human
self-improvement” might well lead to transhuman minds with values radically
different from current human values -- in fact I would expect it to. This is the
open-ended nature of human intelligence. It is analogous to the kind of self-improvement that has been going on
since the caveman days, though via rapid advancement in culture and tools and via
slow biological evolution, rather than via bio-engineering. At each step in this sort of open-ended
growth process, the new version of a system may feel acceptable according to
the values of the previous version. But over time, small changes may accumulate
into large ones, resulting in later systems that are acceptable to their
immediate predecessors, but may be bizarre, outrageous or incomprehensible to
their distant predecessors.

We may consider this sort of relatively conservative human
self-improvement process, if carried out across a large ensemble of humans and
human peer groups, to lead to a probability distribution over the space of
possible minds. Some kinds of minds may
be very likely to emerge through this sort of process; some kinds of minds much
less so.

People concerned with the “preservation of human values
through repeated self-modification of posthuman minds” seem to model the scope
of human values as possessing an “essential core”, and worry that this
essential core may progressively get lost in the series of small changes that
will occur in any repeated self-modification process. I think their fear has a rational
aspect. After all, the path from caveman
to modern human has probably, via a long series of small changes, done away
with many values that cavemen considered absolutely core to their value
system. (In hindsight, we may think that
we have maintained what WE consider the essential core of the caveman value
system. But that’s a different matter.)

So, suppose one has a human-level AGI system whose behavior
is in accordance with some reasonably common variety of human values. And suppose, for sake of argument, that the
AGI is not “faking it” -- that, given a good opportunity to wildly deviate from
human values without any cost to itself, it would be highly unlikely to do
so. (In other words, suppose we have an
AGI of the sort that is hypothesized as most likely to arise according to the
Value Learning Thesis given above.)

And THEN, suppose this AGI self-modifies and progressively
improves its own intelligence, step by step. Further, assume that the variety
of human values the AGI follows induces it to take a reasonable amount of care
in this self-modification -- so that it studies each potential self-modification
before effecting it, and puts in mechanisms to roll back obviously bad-idea
self-modifications shortly after they occur. I.e., a “relatively conservative self-improvement process”, analogous to
the one posited for humans above.

What will be the outcome of this sort of iterative
modification process? How will it
resemble the outcome of a process of relatively conservative self-improvement
among humans?

I assume that the outcome of iterated, relatively
conservative self-improvement on the part of AGIs with human-like values will
differ radically from current human values -- but this doesn’t worry me because
I accept the open-endedness of human individual and cultural intelligence. I accept that, even without AGIs, current
human values would seem archaic and obsolete 1000 years from now; and that I
wouldn’t be able to predict what future humans 1000 years from now would consider the
“critical common core” of values binding my current value system together with
theirs.

But even given this open-endedness, it makes sense to ask
whether the outcome of an AGI with human-like values iteratively
self-modifying would resemble the outcome of a group of humans similarly
iteratively self-modifying. This is not
a matter of value-system preservation; it’s a matter of comparing the hypothetical
future trajectories of value-system evolution ensuing from two different
initial conditions.

It seems to me that the answer to this question may end up
depending on the particular variety of human value-system in question. Specifically, it may be important whether the
human value-system involved deeply accepts the concept of substrate independence, or not. “Substrate independence” means the idea that
the most important aspects of a mind are not strongly dependent on the physical
infrastructure in which the mind is implemented, but have more to do with the
higher-level structural and dynamical patterns associated with the mind. So, for instance, a person ported from a
biological-neuron infrastructure to a digital infrastructure could still be
considered “the same person”, if the same structural and dynamical patterns
were displayed in the two implementations of the person.

(Note that substrate-independence does not imply the
hypothesis that the human brain is a classical rather than quantum system. If the human brain were a quantum computer in
ways directly relevant to the particulars of human cognition, then it wouldn't
be possible to realize the higher-level dynamical patterns of human cognition in
a digital computer without using inordinate computational resources. In this case, one could manifest
substrate-independence in practice only via using an appropriately powerful
quantum computer. Similarly, substrate-independence
does not require that it be possible to implement a human mind in ANY
substrate, e.g. in a rock.)

With these preliminaries out of the way, I propose the
following:

Value Evolution Thesis. The probability distribution of future minds
ensuing from an AGI with a human value system embracing substrate-independence,
carrying out relatively conservative self-improvement, will closely resemble
the probability distribution of future minds ensuing from a population of
humans sharing roughly the same value system, and carrying out relatively
conservative self-improvement.
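
One hedged way to write the thesis down symbolically is sketched here. All of the notation is my own assumption for illustration and does not appear in the thesis as stated: the space of possible minds \mathcal{M}, the value system v, the step index t, the two distributions, the tolerance \varepsilon, and especially the choice of total variation distance as the reading of "closely resemble" are all hypothetical.

```latex
% Hypothetical notation (an illustrative assumption, not part of the thesis):
%   \mathcal{M}              space of possible minds
%   v                        a human value system embracing substrate-independence
%   P^{\mathrm{AGI}}_{v,t}   distribution over \mathcal{M} after t rounds of relatively
%                            conservative self-improvement, starting from an AGI holding v
%   P^{\mathrm{H}}_{v,t}     the same, starting from a population of humans holding
%                            roughly the same v
\[
  d\!\left(P^{\mathrm{AGI}}_{v,t},\; P^{\mathrm{H}}_{v,t}\right) \le \varepsilon
  \qquad \text{for all } t \ge 0 ,
\]
% One possible (assumed) choice of d is total variation distance:
\[
  d(P, Q) \;=\; \sup_{A \subseteq \mathcal{M}} \bigl|\, P(A) - Q(A) \,\bigr| ,
\]
% with the supremum taken over measurable sets A.
```

Written this way, the thesis is a claim about two distributions staying close, not about any individual mind staying close to its starting values. How small \varepsilon would need to be for "closely resemble" is left open, as it is in the verbal statement.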

Why do I suspect the Value Evolution Thesis is roughly
true? Under the given assumptions, the humans and
AGIs in question will hold basically the same values, and will consider
themselves basically the same (due to embracing substrate-independence). Thus they will likely change themselves in basically
the same ways.

If substrate-independence were somehow fundamentally wrong,
then the Value Evolution Thesis probably wouldn't hold -- because differences in
substrates would likely lead to big differences in how the humans and AGIs in
question self-modified, regardless of their erroneous beliefs about their
fundamental similarity. But I think
substrate-independence is probably basically right, and as a result I suspect
the Value Evolution Thesis is probably basically right.

Another possible killer of the Value Evolution Thesis could
be chaos -- sensitive dependence on initial conditions. Maybe the small differences between the
mental structures and dynamics of humans with a certain value system, and AGIs
sharing the same value system, will magnify over time, causing the descendants
of the two types of minds to end up in radically different places. We don't presently understand enough about
these matters to rule this eventuality out. But intuitively, I doubt the difference between a human and an AGI with
similar value systems is going to be so much more impactful in this regard
than the difference between two humans with moderately different value systems. In other words, I suspect that if chaos
causes humans and human-value-respecting AGIs to lead to divergent trajectories
after iterated self-modification, it will also cause different humans to lead
to divergent trajectories after iterated self-modification. In this case, the probability distribution
of possible minds resultant from iterated self-modification would be diffuse
and high-entropy for both the humans and the AGIs -- but the Value Evolution
Thesis could still hold.
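
The distinction between diverging trajectories and matching distributions can be seen in a toy example. The sketch below is only an analogy and makes no claim to model minds: it assumes the logistic map as a stand-in chaotic "self-modification" rule, and the ensemble labels, interval endpoints, ensemble size and step count are all hypothetical choices made for illustration.

```python
import random

def logistic_step(x: float) -> float:
    """One step of the logistic map at r = 4, a standard chaotic toy system."""
    return 4.0 * x * (1.0 - x)

def evolve(x: float, steps: int) -> float:
    """Apply the map `steps` times to a single starting point."""
    for _ in range(steps):
        x = logistic_step(x)
    return x

def max_gap(x: float, y: float, steps: int) -> float:
    """Largest separation reached by two trajectories over `steps` iterations."""
    gap = abs(x - y)
    for _ in range(steps):
        x, y = logistic_step(x), logistic_step(y)
        gap = max(gap, abs(x - y))
    return gap

def histogram(samples, bins=10):
    """Normalized histogram of samples lying in [0, 1]."""
    counts = [0] * bins
    for s in samples:
        counts[min(int(s * bins), bins - 1)] += 1
    return [c / len(samples) for c in counts]

def total_variation(p, q):
    """Total variation distance between two discrete distributions."""
    return 0.5 * sum(abs(a - b) for a, b in zip(p, q))

def ensemble(lo, hi, size, steps, rng):
    """Evolve `size` starting points drawn uniformly from [lo, hi]."""
    return [evolve(rng.uniform(lo, hi), steps) for _ in range(size)]

rng = random.Random(0)
# Two ensembles with slightly different starting conditions (labels are illustrative only).
ensemble_a = ensemble(0.20, 0.30, 20000, 60, rng)
ensemble_b = ensemble(0.21, 0.31, 20000, 60, rng)

# Individual trajectories that start almost identically still separate widely...
trajectory_gap = max_gap(0.25, 0.25 + 1e-9, 60)
# ...while the two ensembles end up with nearly the same distribution.
distribution_gap = total_variation(histogram(ensemble_a), histogram(ensemble_b))

print("max gap between two nearby trajectories:", trajectory_gap)
print("total variation between ensemble histograms:", distribution_gap)
```

In this toy system the gap between two nearby trajectories grows to the scale of the whole state space, while the distance between the ensemble histograms stays small. That is the shape of the situation described above: chaotic at the level of individuals, yet comparable at the level of distributions. Whether real minds behave anything like this is exactly what remains unknown.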

Mathematically, the Value Evolution Thesis seems related to the notion of "structural stability" in dynamical systems theory. But, human and AGI minds are much more complex than the systems that dynamical-systems theorists usually prove theorems about...

In all, it seems intuitively likely and rationally feasible to me that creating human-level AGIs with human-like value systems, will lead onward to trajectories of improvement similar to those that would ensue from progressive human self-improvement. This is an unusual kind of "human-friendliness", but I think it's the only kind that the open-endedness of intelligence lets us sensibly ask for.

Ultimate Value
Convergence (?)

There is some surface-level resemblance between the Value
Evolution Thesis and Bostrom’s Instrumental Convergence Thesis -- but the
two are actually quite different. Bostrom seems informally to be suggesting that all sufficiently
intelligent minds will converge to the
same set of values, once they self-improve enough (though the formal statement
of the Convergence thesis refers only to a “broad spectrum of minds”). The Value Evolution Thesis suggests only that
all minds ensuing from repeated self-modification of minds sharing a particular
variety of human value system may lead to the same probability distribution
over future value-system space.

In fact, I share Bostrom’s intuition that nearly all
superintelligent minds will, in some sense, converge to the same sort of value
system. But I don’t agree with Bostrom
on what this value system will be. My
own suspicion is that there is a “universal value system” centered around a few
key values such as Joy, Growth and Choice. These values have their relationships to Bostrom’s proposed key
instrumental values, but also their differences (and unraveling these would be
a large topic in itself).

But I also
feel that if there are “universal” values of this nature, they are quite
abstract and likely encompass many specific value systems that would be
abhorrent to us according to our modern human values. That is, "Joy, Growth and Choice" as implicit in the universe are related in complex, and not always tight, ways to what they mean to human beings in everyday life. The type of value system convergence
proposed in the Value Evolution Thesis is much more fine-grained than this. The “closely resemble” used in the Value Evolution
Thesis is supposed to indicate a much closer resemblance than something like
“both manifesting abstract values of Joy, Growth and Choice in their own, perhaps very different, ways.”

In any case, I mention in passing my intuitions about ultimate value convergence due to their general conceptual relevance -- but the two theses proposed here do not depend on these broader intuitions in any way.

Fears, Hopes and Directions (A Few Concluding Words)

Bostrom’s analysis of the dangers of superintelligence
relies on his Instrumental Convergence and Orthogonality theses, which are vaguely stated
and not strongly justified in any way.

His arguments do not rigorously establish that dire danger is likely from advanced AGI. Rather, they present some principles and processes that might potentially underlie dire danger to humans and human values from AGI in the future.

Here I have proposed my own pair of theses, which are also
vaguely stated and, from a rigorous standpoint, only very weakly justified at
this stage.

These are intended as principles that might potentially underlie great benefit from AGI in the future, from a human and human-values perspective.

Given the uncertainty all around, some people will react with a precautionary instinct, i.e. "Well then we should hold off on developing advanced AI till we know what's going on with more certainty."

This is a natural human attitude, although it's not likely to have much impact in the case of AGI development, because the early stages of AGI technology have so much practical economic and humanitarian value that people are going to keep developing them anyway regardless of some individuals' precautionary fears. But it's important to distinguish this sort of generic precautionary bias toward inaction in the face of the unknown (which fortunately only some people possess, or else humanity would never have advanced beyond the caveman stage), from a rigorous argument that dire danger is likely (no such rigorous argument exists in the case of AGI).

What is the value of vague conceptual theses like the ones that Bostrom and I have proposed? Apart from getting attention and stimulating the public imagination, they may also serve as partial templates or
inspirations for the development of rigorous theories, or as vague nudges for those doing practical R&D.

And of course, while all this theoretical development and discussion goes on,
development of practical AGI systems also goes on — and at present, my personal
impression is that the latter is progressing faster. Personally I spend a lot more of my time on the practical side lately!

My hope is that theoretical explorations such as the ones briefly presented here may
serve to nudge practical AGI development in a positive direction. For instance, a practical lesson from the considerations
given here is that, when exploring various cognitive architectures, we should
do our best to favor those for which the Value Learning Thesis is more strongly
true. This may seem obvious -- but when one thinks about it in depth in the context of a particular AGI architecture, it may have non-obvious implications regarding how the AGI system should initially be made to allocate its resources internally. And of course, the Value Evolution Thesis reminds us that we should
encourage our AGIs to fully consider, analyze and explore the nature of substrate independence (as well as to uncover substrate DEpendence insofar as it may exist!).

As we progress further toward advanced AGI in practice, we may see more cross-pollination between theory and practice. It will be fantastic to be able to experiment with ideas like the Value Learning Thesis in the lab -- and this may not be so far off, after all....