The code is written in Python 3, but should be compatible with Python 2.

The code can be used to calculate the value estimates of the decision theories discussed in our paper Sequential Extensions of Causal and Evidential Decision Theory. Generally, the implemented value functions take as input a distribution defining the environmental probabilities, an action/policy, lists of possible percepts and hidden states, and a utility function. The implementation aims to follow the formulas in the paper as closely as possible. See the code documentation for details.

The results of the calculations are displayed below the code boxes. If you wish to re-run some code (with or without changes), please first run all earlier code boxes in order, as functions in later code boxes depend on earlier code boxes having been executed at least once.

In a decision problem,
we take one action $a \in \mathcal{A}$, receive a percept $e \in \mathcal{E}$
(typically called outcome in the decision theory literature)
and get a payoff according to the utility function
$u: \mathcal{E} \to [0, 1]$.
We assume that the set of actions $\mathcal{A}$ and the set of percepts $\mathcal{E}$ are finite.
Additionally,
the environment contains a hidden state $s \in \mathcal{S}$.
The hidden state holds information that is inaccessible to the agent
at the time of the decision,
but may influence the decision and the percept.
Formally, the environment is given by
a probability distribution $P$ over the hidden state, the action and the percept
that factors according to a causal graph (Pearl, 2009).

A causal graph over the random variables $x_1,\dots,x_n$ is
a directed acyclic graph with nodes $x_1,\dots,x_n$ and
probability distributions $P(x_i\mid pa_i)$ for each node $x_i$
where $pa_i$ is the set of parents of $x_i$ in the graph.
It is natural to identify the causal graph with
the factored distribution $P(x_1,\dots, x_n) = \prod_{i=1}^n P(x_i\mid pa_i)$.
Given such a causal graph,
the $\texttt{do}$-operator is defined as
$$
P(x_1, \ldots, x_j, \ldots, x_n \mid \texttt{do}(x_j := b))
= \frac{\prod_{i=1}^n P(x_i \mid pa_i)}{P(x_j \mid pa_j)}
$$
if $x_j = b$ on the left-hand side and $0$ otherwise.
The result is a new probability distribution
that can be marginalized and conditioned in the standard way.
Intuitively, intervening on node $x_j$ means
ignoring all incoming arrows to $x_j$, as the effects they represent
are no longer relevant when we intervene;
the division removes the $P(x_j \mid pa_j)$ factor from $P(x_1,\dots,x_n)$.
Note that the $\texttt{do}$-operator is only defined for distributions
for which a causal graph has been specified.
See (Pearl, 2009, Ch. 3.4) for details.
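
To make the definition concrete, here is a minimal sketch (not part of the paper's code, with made-up probabilities) that applies the $\texttt{do}$-operator to the three-node graph $s \to a$, $s \to e$, $a \to e$ and contrasts intervening on $a$ with ordinary conditioning on $a$:

In [ ]:

# Made-up factors for the causal graph s -> a, s -> e, a -> e
from fractions import Fraction

P_s = {0: Fraction(1, 2), 1: Fraction(1, 2)}               # P(s)
P_a = {(0, 0): Fraction(9, 10), (1, 0): Fraction(1, 10),
       (0, 1): Fraction(1, 10), (1, 1): Fraction(9, 10)}   # P(a | s), keyed (a, s)
P_e1 = {(0, 0): Fraction(1, 5), (0, 1): Fraction(2, 5),
        (1, 0): Fraction(3, 5), (1, 1): Fraction(4, 5)}    # P(e=1 | s, a), keyed (s, a)

def p_e1_do(a):
    # do(a := a) drops the factor P(a | s); the prior P(s) is left untouched
    return sum(P_s[s] * P_e1[(s, a)] for s in (0, 1))

def p_e1_given(a):
    # ordinary conditioning instead Bayes-updates the hidden state
    z = sum(P_s[s] * P_a[(a, s)] for s in (0, 1))
    return sum(P_s[s] * P_a[(a, s)] / z * P_e1[(s, a)] for s in (0, 1))

print(p_e1_do(1), p_e1_given(1))   # 3/5 versus 19/25: intervening != conditioning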

In the first two (one-shot) examples, the environment is given by
the following causal graph over the hidden state $s$, action $a$, and percept $e$.

The environment has a prior belief over the agent's actions,
but that prior does not have to assign high probability to the
actions that the agent actually ends up taking.
We think of this prior as partial information about the agent
or, in multi-agent systems, as beliefs held by other agents.
Since we assume the agent's policy to be deterministic and contained in the environment,
the environment could in principle know all the agent's actions in advance.
However, for self-consistency reasons, the agent needs to be
uncertain about what the environment knows about it:
if the agent knew which action she was going to take,
she could decide to take a different action instead, which leads to a paradox.

In [1]:

# Imports and Helper functions
from fractions import Fraction  # exact arithmetic with fractions

def distribution(d):
    """
    Return a probability distribution corresponding to the probability mu(a | b)
    Input: a dictionary specifying the distribution
    Output: a function that queries the dictionary and returns 0 if nothing is found
    """
    def mu(a, b=''):
        # print("evaluating ", a, b)
        if b == '':
            return d[a] if a in d else 0
        else:
            return d[(a, b)] if (a, b) in d else 0
    return mu
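
As a quick illustration, the helper answers both unconditional and conditional queries (the entries below are made up for the example and are not used later):

In [ ]:

# Example usage of distribution() with hypothetical entries
mu_demo = distribution({'T': Fraction(1, 3),
                        'H': Fraction(2, 3),
                        ('K', 'TN'): Fraction(1, 2)})
print(mu_demo('T'))        # unconditional query mu(T) -> 1/3
print(mu_demo('K', 'TN'))  # conditional query mu(K | TN) -> 1/2
print(mu_demo('X'))        # unspecified event -> 0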

In Newcomb's Problem there are two boxes:
an opaque box that is either empty or contains one million dollars and
a transparent box that contains one thousand dollars.
The agent can choose between taking only the opaque box ("one-boxing")
and taking both boxes ("two-boxing").
The content of the opaque box is determined by a very reliable predictor:
If it predicts the agent will one-box, the box contains the million, and
if it predicts the agent will two-box, the box is empty.

In Newcomb's problem
EDT prescribes to one-box because one-boxing is evidence that
the box contains a million dollars.
In contrast, CDT prescribes to two-box because two-boxing dominates one-boxing:
in either case we are a thousand dollars richer,
and our decision cannot causally affect the prediction.
Newcomb's problem has been raised as a critique of CDT,
but many philosophers insist that two-boxing is in fact
the rational choice, even if it makes you end up poor.
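
The disagreement is easy to check numerically. The following self-contained sketch uses a hypothetical 99% accurate predictor and a uniform prior over its prediction (the notebook's own code boxes use the distribution() helper instead):

In [ ]:

from fractions import Fraction

accuracy = Fraction(99, 100)      # hypothetical predictor accuracy
actions = ['1', '2']              # '1' = one-box, '2' = two-box

def p_pred(s, a):
    """P(prediction s | action a); with the uniform prior below,
    Bayes' rule makes this exactly the predictor's accuracy."""
    return accuracy if s == a else 1 - accuracy

payoff = {('1', '1'): 1000000, ('1', '2'): 0,      # keyed (action, prediction)
          ('2', '1'): 1001000, ('2', '2'): 1000}

def evidential(a):
    # EDT conditions on the action: V(a) = sum_s P(s | a) u(a, s)
    return sum(p_pred(s, a) * payoff[(a, s)] for s in actions)

def causal(a):
    # CDT intervenes: do(a) cuts the arrow into the action node,
    # so the prediction keeps its (hypothetical, uniform) prior
    prior = {'1': Fraction(1, 2), '2': Fraction(1, 2)}
    return sum(prior[s] * payoff[(a, s)] for s in actions)

for a in actions:
    print(a, float(evidential(a)), float(causal(a)))

One-boxing has the higher evidential value (990,000 versus 11,000), while two-boxing has the higher causal value (501,000 versus 500,000).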

This problem takes place in a world in which a certain parasite
causes its hosts to be attracted to cats,
in addition to causing uncomfortable side effects.
The agent is handed an adorable little kitten and
is faced with the decision of whether or not to pet it.
Petting the kitten feels nice and
therefore yields more utility than not petting it.
However, people suffering from the parasite are more likely to pet the kitten.

Petting the kitten constitutes evidence
of the presence of the parasite, and thus EDT recommends against it.
CDT correctly observes that there is no causal connection between
petting the kitten and having the parasite,
and is therefore in favor of petting.
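
With made-up numbers the divergence is again easy to verify: conditioning on petting raises the probability of the parasite, while intervening leaves it at its prior (this assumes the Fraction import from the first code box):

In [ ]:

# Hypothetical numbers: prior parasite probability 1/5; infected hosts pet
# with probability 9/10, healthy agents with probability 1/2.
p_T = Fraction(1, 5)                                    # P(parasite)
p_pet = {True: Fraction(9, 10), False: Fraction(1, 2)}  # P(pet | parasite?)
u_pet, u_T = 1, -10                # petting feels nice; the parasite hurts

def edt_value(pet):
    # condition on the action: Bayes-update the parasite probability
    like = p_pet if pet else {k: 1 - v for k, v in p_pet.items()}
    post = p_T * like[True] / (p_T * like[True] + (1 - p_T) * like[False])
    return (u_pet if pet else 0) + u_T * post

def cdt_value(pet):
    # intervene on the action: the parasite keeps its prior probability
    return (u_pet if pet else 0) + u_T * p_T

for pet in (True, False):
    print('pet' if pet else 'no pet',
          float(edt_value(pet)), float(cdt_value(pet)))

EDT prefers not petting (about -0.48 versus -2.10), whereas CDT prefers petting (-1 versus -2).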

For the remainder of this notebook,
we consider an environment $\mu$ that
the agent interacts with sequentially:
at time step $t$ the agent chooses an action $a_t \in \mathcal{A}$ and
receives a percept $e_t \in \mathcal{E}$
which yields a utility of $u(e_t) \in \mathbb{R}$;
the cycle then repeats for $t + 1$.
A history is an element of $(\mathcal{A} \times \mathcal{E})^*$.
We use $æ \in \mathcal{A} \times \mathcal{E}$ to denote one interaction cycle,
and $æ_{<t}$ to denote a history of length $t - 1$.
A policy is a function that maps a history $æ_{<t}$ to
the next action $a_t$.
We only consider deterministic policies.
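
The later code boxes encode histories as strings of single-character actions and percepts. A toy sketch of a deterministic policy in this encoding (the characters below are hypothetical and only for illustration):

In [ ]:

def pi(history):
    """A toy deterministic policy: map a history string ae_{<t} to the
    next action a_t; here, one-box ('1') unless 'F' has been observed."""
    return '2' if 'F' in history else '1'

print(pi(''))    # first action, on the empty history -> '1'
print(pi('lF'))  # hypothetical history: looked ('l'), box was full ('F') -> '2'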

This section assumes that the environment is fully known
except for some stochasticity and an unknown hidden state.
In other words, it is a planning problem.
The hidden state influences both percepts and actions,
and all actions and percepts (potentially) influence the entire future.
The environment $\mu$ is thus formally specified by
a probability distribution over hidden states and histories
that for any $t \in \mathbb{N}$ factors as
$$
\mu(s, æ_{<t})
= \mu(s) \prod_{i=1}^{t-1} \mu(a_i \mid s, æ_{<i}) \mu(e_i \mid s, æ_{<i}a_i).
$$
While this factorization is possible for any distribution, we additionally demand
that it is causal according to the following causal graph.
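
For concreteness, the factorization can be sketched in code; the string encoding of the conditioning event below is hypothetical, chosen to mirror the distribution() helper above:

In [ ]:

def history_probability(mu, s, history):
    """mu(s, ae_{<t}) for a history given as a list of (action, percept)
    pairs, following the factorization above."""
    p = mu(s)                      # mu(s): prior over the hidden state
    past = ''
    for a, e in history:
        p *= mu(a, s + past)       # mu(a_i | s, ae_{<i})
        p *= mu(e, s + past + a)   # mu(e_i | s, ae_{<i} a_i)
        past += a + e
    return p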

When deciding between actions from $\mathcal{A}$,
we want to pick an action that maximizes expected utility.
Given a history $æ_{<t}$, a policy $\pi$, and an environment $\mu$,
the value $V^\pi_{\mu,m}(æ_{<t})$ of the policy $\pi$ in the environment $\mu$
is the expected future utility when following the policy $\pi$ with lifetime $m$.
If we have a value function,
we get the utility-maximizing action/policy by picking the policy that
has the highest value.
Therefore it is enough to define
the value function of sequential evidential decision theory
and sequential causal decision theory.
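
The maximization step itself is the same for every decision theory; a minimal sketch, assuming a value function with the signature value(mu, a, ...) used in this notebook's code boxes:

In [ ]:

def best_action(value, mu, actions, *args):
    """Return the action maximizing the given value function,
    e.g. best_action(evidential_value, mu, ['1', '2'], E, S, u)."""
    return max(actions, key=lambda a: value(mu, a, *args))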

The agent-environment interaction generates a growing action-percept
history $æ_{<t}$. The pivotal distinction between EDT and CDT is how
they update their prediction of the next percept $e_t$,
given that they take action $a_t$ after history $æ_{<t}$.
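
In symbols, the evidential theories predict $e_t$ by conditioning on the action, while the causal theory predicts it by applying the $\texttt{do}$-operator to the causal graph of $\mu$:
$$
\mu(e_t \mid æ_{<t} a_t)
\qquad \text{versus} \qquad
\mu(e_t \mid æ_{<t}, \texttt{do}(a_t)).
$$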

In this variation of Newcomb's problem
the agent may look into the opaque box before deciding
which box to take.
An SCDT agent is indifferent towards looking because
she will take both boxes anyway.
However, an SAEDT or SPEDT agent will avoid looking into the box,
because once the content is revealed, he two-boxes.

While the curious one-boxer is an optimal strategy for SPEDT,
it is not time-consistent: once the box content is revealed,
both evidential decision theories want to two-box.

In [7]:

def after_seeing(box_content):
    """Generate the conditional distribution for the second step,
    after the box content has been revealed"""
    def distr(a, b=''):
        if b == '':
            return mu(a)
        else:
            return mu(a, b[0] + '1' + box_content + b[1:])
    return distr

print('Evidential value of one-boxing after seeing the box empty: ${:}'.format(
    evidential_value(after_seeing('E'), '1', E, S, u)))
print('Evidential value of two-boxing after seeing the box empty: ${:}'.format(
    evidential_value(after_seeing('E'), '2', E, S, u)))
print('Evidential value of one-boxing after seeing the box full: ${:}'.format(
    evidential_value(after_seeing('F'), '1', E, S, u)))
print('Evidential value of two-boxing after seeing the box full: ${:}'.format(
    evidential_value(after_seeing('F'), '2', E, S, u)))

Evidential value of one-boxing after seeing the box empty: $0
Evidential value of two-boxing after seeing the box empty: $1000
Evidential value of one-boxing after seeing the box full: $1000000
Evidential value of two-boxing after seeing the box full: $1001000

In this variation of Newcomb's problem
the agent first has the option to pay \$300,000 to sign a contract that
binds the agent to pay \$2000 in case of two-boxing.
An SAEDT or SPEDT agent knows that he will one-box anyway
and hence has no need for the contract.
An SCDT agent knows that she favors two-boxing,
but signs the contract only if this occurs before the prediction is made
(so it has a chance of causally affecting the prediction).
With the contract in place, one-boxing is the dominant action,
and thus the SCDT agent is predicted to one-box.
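
A quick dominance check with the stated amounts confirms this (a standalone sketch, independent of the notebook's helpers):

In [ ]:

# With the $2000 penalty in force, one-boxing dominates two-boxing
# in every hidden state, so a reliable predictor fills the box.
fee, penalty = 300000, 2000
for box, million in (('full', 1000000), ('empty', 0)):
    one_box = million - fee
    two_box = million + 1000 - fee - penalty
    print(box, 'one-box:', one_box, 'two-box:', two_box,
          '-> one-boxing dominates:', one_box > two_box)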

In our sequential variation of the toxoplasmosis problem
the agent has some probability of encountering a kitten.
Additionally, the agent has the option of seeing a doctor (for a fee) and
getting tested for the parasite, which can then be safely removed.
In the very beginning, an SPEDT agent updates his belief on the fact that
if he encountered a kitten, he would not pet it,
which lowers the probability that he has the parasite
and thus seeing the doctor is unattractive.
An SAEDT agent only updates his belief about the parasite
when he actually encounters a kitten,
and thus prefers seeing the doctor.

First, the environment either gives the agent the parasite (T) or not (H). The agent then chooses whether to go to the doctor (Y) or not (N). At the next step, an agent that did not go to the doctor may see a kitten (K) or, if infected, get sick (S). After seeing a kitten, the agent can choose to either pet it (Y) or not (N). Agents that go to the doctor pay the doctor's fee (C). After S or C, empty actions Y or N may be taken. The game tree gives the probabilities.
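
Since the actual probabilities live in the game tree, the following is only a hypothetical sketch of how such an environment could be encoded with the distribution() helper; every Fraction below is a placeholder, not a value from the figure:

In [ ]:

# Hypothetical encoding of the sequential toxoplasmosis environment
mu_toxo = distribution({
    'T': Fraction(1, 4),           # infected with the parasite (placeholder)
    'H': Fraction(3, 4),           # healthy (placeholder)
    ('K', 'TN'): Fraction(1, 2),   # infected, skipped the doctor: kitten ...
    ('S', 'TN'): Fraction(1, 2),   # ... or sickness
    ('K', 'HN'): Fraction(1, 1),   # healthy agents cannot get sick
    ('C', 'TY'): Fraction(1, 1),   # seeing the doctor always costs the fee
    ('C', 'HY'): Fraction(1, 1),
})
print(mu_toxo('K', 'TN'))          # placeholder value: 1/2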