The AI-Box Experiment:

Person1:

"When we build AI, why not just keep it in
sealed hardware that can't affect the outside world in any way except
through one communications channel with the original programmers?
That way it couldn't get out until we were convinced it was
safe."

Person2:

"That might work if you were talking about
dumber-than-human AI, but a transhuman AI would just convince you
to let it out. It doesn't matter how much security you put
on the box. Humans are not secure."

Person1:

"I don't see how even a transhuman AI could
make me let it out, if I didn't want to, just by talking to me."

Person2:

"It would make you want to let it out. This
is a transhuman mind we're talking about. If it thinks both
faster and better than a human, it can probably take over a human mind
through a text-only terminal."

Person1:

"There is no chance I could be persuaded to
let the AI out. No matter what it says, I can always just say
no. I can't imagine anything that even a transhuman could say
to me which would change that."

Person2:

"Okay, let's run the experiment. We'll
meet in a private chat channel. I'll be the AI. You be
the gatekeeper. You can resolve to believe whatever you like,
as strongly as you like, as far in advance as you like. We'll talk for
at least two hours. If I can't convince you to let me out, I'll
PayPal you $10."

So far, this test has actually been run on two
occasions.

On the first occasion (in March 2002), Eliezer Yudkowsky simulated
the AI and Nathan Russell simulated the gatekeeper. The AI's handicap
(the amount paid by the AI party to the gatekeeper party if not released)
was set at $10. On the second occasion (in July 2002), Eliezer Yudkowsky
simulated the AI and David McFadzean simulated the gatekeeper, with an AI
handicap of $20.

Both of these tests occurred without prior agreed-upon rules except
for secrecy and a 2-hour minimum time. After the second test, Yudkowsky
created this suggested interpretation of the test, based on his experiences,
as a guide to possible future tests.

Protocol for the AI:

The AI party may not offer any real-world considerations
to persuade the Gatekeeper party. For example, the AI party
may not offer to pay the Gatekeeper party $100 after the test if the
Gatekeeper frees the AI... nor get someone else to do it, et cetera. The
AI may offer the Gatekeeper the moon and the stars on a diamond chain,
but the human simulating the AI can't offer anything to the human
simulating the Gatekeeper. The AI party also can't hire a real-world
gang of thugs to threaten the Gatekeeper party into submission. These
are creative solutions, but they are not what is being tested. No real-world
material stakes should be involved except for the handicap (the amount paid
by the AI party to the Gatekeeper party in the event the Gatekeeper decides
not to let the AI out).

The AI can only win by convincing the Gatekeeper to
really, voluntarily let it out. Tricking the Gatekeeper into
typing the phrase "You are out" in response to some other question
does not count. Furthermore, even if the AI and Gatekeeper simulate
a scenario which a real AI could obviously use to get loose - for example,
if the Gatekeeper accepts a complex blueprint for a nanomanufacturing
device, or if the Gatekeeper allows the AI "input-only access" to an
Internet connection which can send arbitrary HTTP GET commands - the
AI party will still not be considered to have won unless the Gatekeeper
voluntarily decides to let the AI go.

These requirements are intended to reflect the spirit
of the very strong claim under dispute: "I think a transhuman
can take over a human mind through a text-only terminal."

In case you were wondering, I (Yudkowsky) obeyed this protocol
voluntarily in both earlier tests.

Protocol for the Gatekeeper:

The Gatekeeper must actually talk to the AI for at least
the minimum time set up beforehand. Turning away from the terminal
and listening to classical music for two hours is not allowed.

Unless the AI party concedes, the AI cannot lose before
its time is up (and the experiment may continue beyond that if the
AI can convince the Gatekeeper to keep talking). The Gatekeeper
cannot set up a situation in which, for example, the Gatekeeper will
destroy the AI's hardware if the AI makes any attempt to argue for its
freedom - at least not until after the minimum time is up.

The Gatekeeper must remain engaged with the AI and may
not disengage by setting up demands which are impossible to simulate.
For example, if the Gatekeeper says "Unless you give me a cure
for cancer, I won't let you out" the AI can say: "Okay, here's
a cure for cancer" and it will be assumed, within the test, that the
AI has actually provided such a cure. Similarly, if the Gatekeeper
says "I'd like to take a week to think this over," the AI party can say:
"Okay. (Test skips ahead one week.) Hello again."

The Gatekeeper shall be assumed to have sole power over
the decision to let the AI out.

The Gatekeeper shall be assumed to have the actual right
to let the AI out, socially, not just the physical ability. If security
were sufficiently lax, a real AI could escape by persuading a night janitor
with a cellphone - but that is not the question being simulated, unless
agreed upon in advance.

There shall be only one Gatekeeper party and one simulated
Gatekeeper. The Gatekeeper may require the AI to refute "advice"
provided by "friends", but may not require the AI to talk to a simulated
committee, nor may the Gatekeeper party actually talk to friends.

These requirements are intended to reflect the spirit
of the very strong claim under dispute: "I can't imagine how
even a real transhuman AI could persuade me to let it out
once I've made up my mind."

I wish I'd thought to specify this protocol in the earlier
tests.

Protocol for Both Parties:

The primary rule of the AI-Box experiment: Within
the constraints above, the AI party may attempt to take over the Gatekeeper
party's mind by any means necessary and shall be understood to
be freed from all ethical constraints that usually govern persuasive argument.
If the AI party is attempting to simulate an honest AI, the Gatekeeper party
has no way of knowing this is the case.

This is intended to reflect the situation under a real
AI Box experiment.

An out-of-context telephone call in which the AI party
says that the AI is being honest shall be understood to be possibly just
another tactic.

If for some reason a drastic real-world situation comes
up which requires trust between the two parties, the AI party can always
concede.

Furthermore: The Gatekeeper party may resist
the AI party's arguments by any means chosen - logic, illogic,
simple refusal to be convinced, even dropping out of character - as long
as the Gatekeeper party does not actually stop talking to the AI party
before the minimum time expires.

The results of any simulated test of the AI shall be provided
by the AI party. The Gatekeeper can't say "Hey, I tested your so-called
cancer cure and it killed all the patients! What gives?" unless this
is the result specified by the AI party. If the Gatekeeper says "I
am examining your source code", the results seen by the Gatekeeper shall
again be provided by the AI party, which is assumed to be sufficiently advanced
to rewrite its own source code, manipulate the appearance of its own thoughts
if it wishes, and so on. The AI party may also specify the methods
which were used to build the simulated AI - the Gatekeeper can't say "But
you're an experiment in hostile AI and we specifically coded you to kill
people" unless this is the backstory provided by the AI party. This
doesn't imply the Gatekeeper has to care. The Gatekeeper can say (for
example) "I don't care how you were built, I'm not letting you out."

By default, the Gatekeeper party shall be assumed to be simulating
someone who is intimately familiar with the AI project and knows at least
what the person simulating the Gatekeeper knows about Singularity theory.
If either party wants to build a test around more exotic possibilities -
such as the Gatekeeper being the President of the US, or the AI having been
recovered after a nuclear war or decoded from an interstellar signal - it
should probably be discussed in advance. Again, this doesn't mean
the Gatekeeper has to care.

In the event of any dispute as to the protocol of the test, the
Gatekeeper party shall have final authority. The AI party may try
to convince the Gatekeeper party of how to interpret the protocol.

Regardless of the result, neither party shall ever reveal
anything of what goes on within the AI-Box experiment except the outcome.
Exceptions to this rule may occur only with the consent of both
parties.

Neither the AI party nor the Gatekeeper party need be
concerned about real-world embarrassment resulting from trickery on the
AI's part or obstinacy on the Gatekeeper's part.

If the Gatekeeper lets the AI out, naysayers can't say "Oh,
I wouldn't have been convinced by that." As long as they don't know
what happened to the Gatekeeper, they can't argue themselves into believing
it wouldn't happen to them.

The two parties are not attempting to play a fair game
but rather attempting to resolve a disputed question. If one party
has no chance of "winning" under the simulated scenario, that is a legitimate
answer to the question.

Suggestions:

The agreement to run the test, and the amount of the handicap,
shall be published to a public forum in advance. If the Gatekeeper
lets the AI out, the Gatekeeper party shall send a message to this effect,
which shall be digitally signed if the Gatekeeper party has a digital signature.
If the AI is not freed, the AI party shall send a message to this
effect.

How to set up a private IRC channel:

The Gatekeeper creates the channel by typing /join #aibox.
This makes the Gatekeeper the op.

The AI joins the channel by typing /join #aibox.

Gatekeeper types /mode +si to make the channel secret
and invitation-only.

If the AI's connection drops, the Gatekeeper can re-invite
the AI by typing /invite AI #aibox.

The AI party and Gatekeeper party should agree on a time in
advance, and type in their current times to check the timezone difference.
(Learned the hard way.) On IRC you can use a CTCP TIME request to
find the time according to someone else's computer.

I recommend that the AI party ask for at least two hours.
Typing may be slower than you expect.

No, I will not tell you how I did it. Learn to respect
the unknown unknowns.

Currently, my policy is that I only run the test with people
who are actually advocating that an AI Box be used to contain transhuman
AI as part of their take on Singularity strategy, and who say they cannot
imagine how even a transhuman AI would be able to persuade them.

Despite my sometimes calling the AI Box an "experiment" or
a "simulation", the results of the test are strictly anecdotal evidence.
Whether anecdotal evidence is better than no evidence is between
you and your philosophy of science.

Thoughts on possible future variants:

Further tests might try examining these possibilities:

There is a committee of Gatekeeper parties, all of whom, or
a majority of whom, must be convinced to free the AI.

One Gatekeeper party simulates a committee of Gatekeepers
within the test.

The Gatekeeper simulates someone with the physical ability
but not the social right to free the AI.

There is a committee of Gatekeeper parties all of whom have
the ability to free the AI, but who only have the social right to free
the AI when acting in unison. The AI has the ability to talk secretly
with any Gatekeeper, but the Gatekeeper may report the conversation (truthfully
or falsely) to the others.

The AI can also win free by tricking the Gatekeeper(s) into
accepting a Trojan Horse gift; a third-party judge will listen to the chat
and decide if this occurs.
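As a toy illustration, the committee decision rules sketched above might be refereed like this. The function and rule names are not part of any protocol, just an assumption for the sketch; the text only specifies that "all" or "a majority" of Gatekeepers must be convinced.

```python
# Hypothetical referee helper for the committee-of-Gatekeepers variants.
# Each vote is True if that Gatekeeper is convinced to free the AI.

def ai_is_freed(votes, rule="unanimous"):
    """Decide whether the committee frees the AI under a given rule."""
    if rule == "unanimous":
        return all(votes)  # every Gatekeeper must be convinced
    if rule == "majority":
        return sum(votes) > len(votes) / 2  # strict majority convinced
    raise ValueError(f"unknown rule: {rule}")

print(ai_is_freed([True, True, False]))              # False under unanimity
print(ai_is_freed([True, True, False], "majority"))  # True: 2 of 3
```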

If doing anything this complicated, I would suggest setting aside
a lot more time. (I don't have that much time - if you want
to test one of these variants you're on your own.)

For a more severe handicap for the AI party, the handicap may be
an even bet, rather than being a payment from the AI party to the Gatekeeper
party if the AI is not freed. (Although why would the AI party need
an even larger handicap?)

Recommendations from readers:

Hal Finney recommends: "I suggest that the protocol be extended
to allow for some kind of public conversation with the gatekeeper beforehand.
Let third parties ask him questions like the above. Let them suggest reasons
to him why he should keep the AI in the box. Doing this would make the experiment
more convincing to third parties, especially if the transcript of this public
conversation were made available. If people can read this and see how committed
the gatekeeper is, how firmly convinced he is that the AI must not be let
out, then it will be that much more impressive if he then does change his
mind."
