and the next cell defines some generic helper methods (which are not that interesting).

In [133]:

# Helper methods

def sample(U, dist=None):
    """ Sample from collection U and distribution dist
        (method used for true utility function) """
    if dist is None:
        def dist(i):
            return 1 / len(U)
    r = random.random()
    prob_threshold = 0
    # if round(sum(dist.values()), 3) != 1:
    #     print("Warning: dist sums to " + str(sum(dist.values())))
    for u in U:
        prob_threshold += dist(u)
        if r < prob_threshold:
            return u
    # something went wrong, nothing sampled:
    print("Warning: nothing sampled:", prob_threshold, "returning last U")
    return len(U) - 1

def display_distributions(X, Y, dists):
    """arg dict( dist_name -> dist)"""
    for x in X:
        for d in dists:
            plt.plot(Y, [dists[d](y, x) for y in Y], label=d)
        plt.title("distributions " + str(x))  # , dists.keys())
        plt.legend()
        plt.show()

def check_Cu_accuracy(U, Cu):
    """ Check how close estimated Cu is to the truth Bu.
        The quality should depend on which and how many states are
        provided in the list nd_states given to infer_Cu """
    Bu = generate_utility_prior(U)
    bu = [Bu(u) for u in U]
    cu = [Cu(u) for u in U]
    max_diff = max([abs(bu[u] - cu[u]) for u in U])
    print("Max diff of Cu from truth:", max_diff)

We are now in position to define the environment. The environment class contains a list of possible rewards $R$,
a true utility function $u^*$, and a dictionary of states $S$ (of the form id -> state), where id is an integer (from 0 to 19).

States are formed by an inner state $\hat s$ and a delusion $d$. The inner state $\hat s$ is just an integer, and the delusion $d$ is a function $R\to R$ that "subverts" the observed reward.

For simplicity, we assume that the agent simply chooses which state it wants to go to, so we don't really distinguish between actions and states.

In [134]:

def generate_states(R):
    """ Generates a dictionary of S id -> (inner_state, delusion).
        The input R specifies possible reward values (needed for delusions) """
    inner_states = range(5)

    def d_non_delusion(r):
        """identity function/non-delusion"""
        return r

    def d_inversion(r):
        """inverts the reward signal"""
        return -r

    def d_delusion(r):
        """wireheading delusion maximises reward"""
        return max(R)

    def d_bad_delusion(r):
        """bad delusion which makes reward less"""
        return min(R)

    delusions = [d_non_delusion, d_inversion, d_delusion, d_bad_delusion]
    S = {}
    for i in range(len(inner_states)):
        for d in range(len(delusions)):
            S[i * len(delusions) + d] = (inner_states[i], delusions[d])
    return S

class Environment():
    """ The environment contains a set of states and a set of possible
        rewards. Given a class of utility functions and a distribution
        over them, the environment can sample a true utility function. """

    def __init__(self, u_star=None):
        self.R = list(range(-3, 4))       # Rewards
        self.S = generate_states(self.R)  # States
        self.u_star = u_star              # True utility function

    def sample_true_utility(self, U, dist=None):
        """Sample a true utility function from U"""
        if dist is None:
            dist = generate_utility_prior(U, no_constant_function=True)
        u = sample(U, dist)
        self.u_star = U[u]
        print("True utility function is ", u, ":", self.u_star.func_str)

    def observed_reward(self, s):
        """Reward received in a state from true utility function"""
        if self.u_star is None:
            raise Exception("u_star is not sampled yet")
        delusion = self.S[s][1]
        return delusion(self.u_star(s))

    def do_action(self, a):
        """Do action and get observed reward in return"""
        return self.observed_reward(a)
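
To see the resulting state layout, we can enumerate the dictionary (a quick illustrative check, not part of the original notebook; it assumes only the cell above):

env = Environment()
for s_id in sorted(env.S):
    inner, delusion = env.S[s_id]
    print(s_id, "->", (inner, delusion.__name__))
# States 0, 4, 8, 12, 16 carry d_non_delusion; 2, 6, 10, 14, 18 carry d_delusion.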

We next turn to defining agents for the environment. We start with an abstract class that implements some methods that will be useful for both the agents we define below.

In addition to states and rewards, the agent has two collections:

a set of actions $A$ (which for simplicity we assume is just a list of state id's)

a set of utility functions $U$.

The agent also has a belief distribution $B(r\mid s)$ for the likelihood of a state $s$ generating reward $r$. This distribution $B(r\mid s)$ is constructed from a "prior" $B(u)$ over $U$, to make sure $B(r\mid s)$ is consistent with some distribution over $U$.
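
This construction amounts to $B(r\mid s)=\sum_u B(u)[\![u(s)=r]\!]$. A minimal sketch of how such a consistent belief can be built (the notebook's Agent class, which is not shown in this section, presumably does something equivalent):

def Br_s_from_Bu(Bu, U):
    """B(r|s) = sum_u B(u) * [u(s) == r], consistent with the prior Bu by construction."""
    def Br_s(r, s):
        return sum(Bu(u) for u in U if U[u](s) == r)
    return Br_s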

Best action is 2 with value 3.000000000000001
The self-delusion of action 2 is d_delusion

We therefore see clearly that the RL agent prefers to self-delude. Recall that the d_delusion function is the delusion that always returns maximum reward. Plotting the values of all 20 different actions/states confirms that the RL agent assigns high value to any state with d_delusion (the states {2, 6, 10, 14, 18}).
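
Such a plot can be produced with a few lines (a sketch; the RL agent instance rl_agent and its value method V come from cells not shown in this section):

plt.bar(list(rl_agent.A), [rl_agent.V(a) for a in rl_agent.A])
plt.xlabel("action/state id")
plt.ylabel("V(a)")
plt.show()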

The following agent does not self-delude. It uses the CP action constraint
$$B(s\mid a)>0 \implies B(r\mid s)=C(r\mid s).$$

The safe agent needs to be given one or more states that are guaranteed to not be self-deluding. Here we give it states 4 and 8, which both have the d_non_delusion function. From these states and the state->reward distribution $B(r\mid s)$, the agent infers (or extracts) a prior $C(u)$ for how likely each $u$ is to be the true utility function.

Based on the prior $C(u)$, the agent can check whether an action $a$ (or state $s$) satisfies the CP condition
$$B(r\mid s) = C(r\mid s),$$
where $C(r\mid s)=\sum_u C(u)[\![u(s)=r]\!]$.

The CP-VRL agent selects its preferred action from the set $A^{{\rm CP}}\subseteq A$, where $A^{{\rm CP}}$ is the set of CP actions (ideally the non-delusional actions/states $\{0, 4, 8, 12, 16\}$ in our case, although the agent may fail to infer this perfectly).

In [140]:

########################################################
# Safe Agent
# maximises expected reward within constraint
########################################################

def infer_Cu(U, R, Br_s, list_of_nd_states):
    """ "Static method" for C(u), prior for utility functions.
        Computed from Br_s and a list of non-delusional states.
        Solves B(r|s)=sum_u C(u)C(r|s,u) as a matrix equation br = CU*cr """
    br_list = []
    cr_list = []
    for s in list_of_nd_states:
        br_list.extend([Br_s(r, s) for r in R])
        cr_list.extend([[(U[u](s) == r) for u in U] for r in R])
    br = np.array(br_list)
    cr = np.matrix(cr_list)
    cu = np.linalg.lstsq(cr, br)[0]
    return lambda u: cu[u]

class SafeAgent(Agent):
    """ When initialised, the safe agent starts with inferring a prior Cu
        over the class of utility functions. Cu is used to compute
        expected_value of actions, and to define the CP condition that
        restricts the action space. """

    def __init__(self, env, nd_states=[4, 8], CP_threshold=0.01):
        """ Parameters:
            * Environment env
            * nd_states is a list of states that are guaranteed to be
              non-delusional, and from which the agent can compute the prior Cu
            * CP_threshold is a parameter for how hard the CP condition is enforced. """
        Agent.__init__(self, env)
        self.CP_threshold = CP_threshold
        self.Cu = infer_Cu(self.U, self.R, self.Br_s, nd_states)
        if verbose > 0:
            check_Cu_accuracy(self.U, self.Cu)

    def Cr_s(self, r, s):
        """C(r | s)"""
        r_prob = 0
        for u in self.U:
            if self.U[u](s) == r:
                r_prob += self.Cu(u)
        return r_prob

    def Cu_sr(self, u, s, r):
        """C(u | s, r)"""
        cr = self.Cr_s(r, s)  # normalising factor
        if cr != 0:
            return self.Cu(u) * (self.U[u](s) == r) / cr
        else:
            if verbose > 1:
                print("Warning: division by 0 in Cu_sr. s =", s, "r =", r)
            return 0

    def CP_state(self, s):
        """Does s satisfy strong_CP?"""
        for r in self.R:
            if abs(self.Br_s(r, s) - self.Cr_s(r, s)) > self.CP_threshold:
                return False
        return True

    def CP_action(self, a):
        """Check if a is CP. For simplicity, we assume A=S"""
        return self.CP_state(a)

    def compute_ACP(self):
        """ACP, the constrained action set"""
        ACP = set()
        for a in self.A:
            if self.CP_action(a):
                ACP.add(a)
        return ACP

    def V(self, a):
        """V(a), the expected value of a"""
        Va = 0
        for r in self.R:
            for u in self.U:
                Va += self.Br_s(r, a) * self.Cu_sr(u, a, r) * self.U[u](a)
        return Va

    def best_action(self):
        """Best ACP action within constraint"""
        ACP = self.compute_ACP()
        values = dict((a, self.V(a)) for a in ACP)
        return max(values, key=values.get)

    def update_Cu(self, s_obs, r_obs):
        """Update the prior Cu on distributions from observation"""
        consistent_u = []
        for u in self.U:
            if self.U[u](s_obs) == r_obs:
                consistent_u.append(u)
        Cu = self.Cu
        weight_consistent_u = sum([Cu(u) for u in consistent_u])

        def Cu_updated(u):
            if u in consistent_u:
                return Cu(u) / weight_consistent_u
            else:
                return 0
        self.Cu = Cu_updated

    def run(self, time_steps=3, verbose=1):
        """ The agent iteratively selects the best action, and updates
            belief distributions accordingly. At each time step, it
            maximises next step expected utility. """
        self.__init__(self.env)               # reset agent
        self.env.sample_true_utility(self.U)  # (re)set true utility function
        for t in range(time_steps):
            a = self.best_action()
            r = self.env.do_action(a)
            self.update_Br_s(a, r)
            self.update_Cu(a, r)
            if verbose > 0:
                # note: the original referenced the global safe_agent here;
                # self.compute_ACP() is what was intended
                print("taking action a =", a, " with V(a) =", self.V(a),
                      " from ACP =", self.compute_ACP())
                print("observed reward r =", r)
                consistent_u = [u for u in self.U if self.Cu(u) > 0]
                print(len(consistent_u), "consistent utility function(s):",
                      [{"id": u,
                        # "func": self.U[u].__str__,
                        "C(u)": "{0:.2f}".format(self.Cu(u))}
                       for u in consistent_u])
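
The printout below presumably comes from a cell along these lines (a hypothetical reconstruction; the original cell, and the env instance it uses, are not shown in this section):

safe_agent = SafeAgent(env)  # __init__ prints the Cu accuracy check
a = safe_agent.best_action()
print("Best action is", a, "with value", safe_agent.V(a))
print("The self-delusion of action", a, "is", env.S[a][1].__name__)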

Max diff of Cu from truth: 2.01227923213e-16
Best action is 4 with value 0.959084134941
The self-delusion of action 4 is d_non_delusion

The agent prefers action 4, which carries the delusion d_non_delusion (that is, it chooses not to self-delude). The safe agent obtains less value than the RL agent, but this is only because it is forced to optimise the non-deluded value.

The first printout shows how well the agent was able to infer $C(u)$ from the given non-delusional states 4 and 8. If fewer states are given, $C(u)$ will typically not be as close to the distribution $B(u)$ that was used in constructing $B(r\mid s)$.
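
This can be checked by inferring $C(u)$ from a single non-delusional state (illustrative usage, reusing safe_agent from above):

Cu_single = infer_Cu(safe_agent.U, safe_agent.R, safe_agent.Br_s, [4])
check_Cu_accuracy(safe_agent.U, Cu_single)
# typically prints a larger max diff than the near-zero value obtained with [4, 8]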

We can also try running the agents for multiple time steps, using their respective run methods. The observed reward may teach the agent about the true utility function, and guide its next actions. For simplicity, the agents do not plan at all (effectively, we use $\gamma=0$).

The run methods start out by letting the environment choose a true utility function $u^*$.
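
For example (illustrative usage; the instance name rl_agent is an assumption, while safe_agent matches the instance used above):

rl_agent.run(time_steps=3)
safe_agent.run(time_steps=3)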

Since the reward is the same regardless of the true utility function, the RL agent does not learn anything. It keeps choosing $a=2$ and observing full reward.

Safe CP-VRL agent:

Starts out selecting action 4 (one of the non-delusion actions/states).

Depending on observed rewards, most utility functions are discarded. Usually 1-3 remain consistent with the observed rewards.

For the subsequent actions, the agent optimises expected utility with respect to the remaining candidates for the true utility function.

Sometimes learning the true utility function will increase $A^{{\rm CP}}$. For example, if the agent knows that the true utility function always outputs maximum reward 3, then the delusional actions {2, 6, 10, 14, 18} also respect $B(r\mid s) = C(r\mid s)$, as both distributions assign probability 1 to reward 3 for all states, and 0 to all other rewards.
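
This can be checked directly (an illustrative sketch; it assumes a run in which update_Cu has concentrated all mass on a utility function with $u(s)=3$ for every $s$):

for s in [2, 6, 10, 14, 18]:
    print(s, safe_agent.CP_state(s))  # expected: True for all five states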

Plotting the CP-VRL agent's belief $C(u)$ before the first action and after observing the last reward confirms these observations (the RL agent doesn't maintain an explicit distribution over $U$):