
Abstract

Recent research has demonstrated that human-generated reward signals can be effectively used to train agents to perform a
range of reinforcement learning tasks. Such tasks are either episodic---i.e., conducted in unconnected episodes of activity
that often end in either goal or failure states---or continuing---i.e., indefinitely ongoing. Another point of difference
is whether the learning agent highly discounts the value of future reward---a myopic agent---or conversely values future reward
appreciably. In recent work, we found that previous approaches to learning from human reward all used myopic valuation. This
study additionally provided evidence for the desirability of myopic valuation in task domains that are both goal-based and
episodic.

In this paper, we conduct three user studies that examine critical assumptions of our previous research: task episodicity,
optimal behavior with respect to a Markov Decision Process, and lack of a failure state in the goal-based task. In the first
experiment, we show that converting a simple episodic task to a non-episodic (i.e., continuing) task resolves some theoretical
issues present in episodic tasks with generally positive reward and---relatedly---enables highly successful learning with
non-myopic valuation in multiple user studies. The primary learning algorithm in this paper, which we call ``VI-TAMER'', is
the first algorithm to successfully learn non-myopically from human-generated reward; we also empirically show that such
non-myopic valuation facilitates higher-level understanding of the task. Anticipating the complexity of real-world problems,
we perform two subsequent user studies---one with a failure state added---that compare (1) learning when states are updated
asynchronously with local bias---i.e., states quickly reachable from the agent's current state are updated more often than
other states---to (2) learning with the fully synchronous sweeps across each state in the VI-TAMER algorithm. With these locally
biased updates, we find that the general positivity of human reward creates problems even for continuing tasks, revealing
a distinct research challenge for future work.
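The contrast between the two update schemes compared above can be illustrated on a toy problem. The sketch below is an assumption-laden stand-in, not the paper's VI-TAMER implementation: the chain task, state count, discount factor, and random-walk behavior policy are invented for the example. It contrasts full synchronous value-iteration sweeps over every state with asynchronous backups applied only near the agent's current position.

```python
import random

# Toy chain MDP (illustrative only): states 0..N-1, actions move left/right,
# reward 1.0 for every transition that lands in the rightmost "goal" state.
N = 10
GAMMA = 0.9
ACTIONS = (-1, +1)

def step(s, a):
    """Deterministic transition with reflecting boundaries; returns (next_state, reward)."""
    ns = max(0, min(N - 1, s + a))
    return ns, (1.0 if ns == N - 1 else 0.0)

def backup(V, s):
    """One Bellman backup: best one-step reward plus discounted next-state value."""
    return max(r + GAMMA * V[ns] for ns, r in (step(s, a) for a in ACTIONS))

def synchronous_vi(sweeps=100):
    """Synchronous value iteration: every state is backed up in every sweep."""
    V = [0.0] * N
    for _ in range(sweeps):
        V = [backup(V, s) for s in range(N)]
    return V

def locally_biased_vi(updates=2000, seed=0):
    """Asynchronous backups biased toward the agent's current position:
    only the occupied state is updated while the agent (here, a random
    walk as a stand-in for acting) moves through the task, so states far
    from its trajectory go stale."""
    rng = random.Random(seed)
    V = [0.0] * N
    s = 0
    for _ in range(updates):
        V[s] = backup(V, s)
        s, _ = step(s, rng.choice(ACTIONS))
    return V
```

Under synchronous sweeps the values converge to the true optimal values everywhere; under locally biased updates, only states the agent actually frequents receive accurate backups, which is the asymmetry the two later user studies probe.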

BibTeX Entry

@InProceedings{iui13-knox,
author = {W. Bradley Knox and Peter Stone},
title = {Learning Non-Myopically from Human-Generated Reward},
booktitle = {Proceedings of the International Conference on Intelligent User Interfaces (IUI)},
location = {Santa Monica, California},
month = {March},
year = {2013},
abstract = {Recent research has demonstrated that human-generated reward signals can be effectively used to train agents to perform a range of reinforcement learning tasks. Such tasks are either episodic---i.e., conducted in unconnected episodes of activity that often end in either goal or failure states---or continuing---i.e., indefinitely ongoing. Another point of difference is whether the learning agent highly discounts the value of future reward---a myopic agent---or conversely values future reward appreciably. In recent work, we found that previous approaches to learning from human reward all used myopic valuation. This study additionally provided evidence for the desirability of myopic valuation in task domains that are both goal-based and episodic.
In this paper, we conduct three user studies that examine critical assumptions of our previous research: task episodicity, optimal behavior with respect to a Markov Decision Process, and lack of a failure state in the goal-based task. In the first experiment, we show that converting a simple episodic task to a non-episodic (i.e., continuing) task resolves some theoretical issues present in episodic tasks with generally positive reward and---relatedly---enables highly successful learning with non-myopic valuation in multiple user studies. The primary learning algorithm in this paper, which we call ``VI-TAMER'', is the first algorithm to successfully learn non-myopically from human-generated reward; we also empirically show that such non-myopic valuation facilitates higher-level understanding of the task. Anticipating the complexity of real-world problems, we perform two subsequent user studies---one with a failure state added---that compare (1) learning when states are updated asynchronously with local bias---i.e., states quickly reachable from the agent's current state are updated more often than other states---to (2) learning with the fully synchronous sweeps across each state in the VI-TAMER algorithm. With these locally biased updates, we find that the general positivity of human reward creates problems even for continuing tasks, revealing a distinct research challenge for future work.
},
b2html_pubtype = {Refereed Conference},
wwwnote={<a href="http://www.iuiconf.org/">IUI</a>
}
}