arXiv on Feb. 26th

Real-time bidding (RTB) is one of the most important mechanisms in online
display advertising, where setting a proper bid for each page view plays a
vital role in achieving good marketing results. Budget-constrained bidding is a
typical scenario in RTB in which advertisers aim to maximize the total value of
winning impressions under a pre-set budget constraint. However, the optimal
strategy is hard to derive due to the complexity and volatility of the auction
environment. To address these challenges, in this paper, we formulate
budget-constrained bidding as a Markov Decision Process. Quite different from
prior model-based work, we propose a novel framework based on model-free
reinforcement learning which sequentially regulates the bidding parameter
rather than directly producing bids. Along this line, we further design a
reward function which deploys a deep neural network to learn an appropriate
reward and thus leads the agent to the optimal policy effectively; we also
design an adaptive $\epsilon$-greedy strategy which adjusts the exploration
behaviour dynamically and further improves performance. Experimental
results on a real dataset demonstrate the effectiveness of our framework.
( https://arxiv.org/abs/1802.08365 , 1343kb)
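The adaptive $\epsilon$-greedy idea above can be sketched as follows; the
paper's exact adaptation rule is not reproduced here, so the decay schedule and
class/parameter names below are illustrative assumptions only:

```python
import random

class AdaptiveEpsilonGreedy:
    """Illustrative adaptive epsilon-greedy selector (not the paper's rule).

    Exploration decays when the agent keeps improving and rises again
    when performance stagnates, bounded to [eps_min, eps_max].
    """

    def __init__(self, n_actions, eps_min=0.05, eps_max=0.95, decay=0.99):
        self.n_actions = n_actions
        self.eps = eps_max
        self.eps_min, self.eps_max, self.decay = eps_min, eps_max, decay
        self.best_reward = float("-inf")

    def select(self, q_values):
        # Explore with probability eps, otherwise take the greedy action.
        if random.random() < self.eps:
            return random.randrange(self.n_actions)
        return max(range(self.n_actions), key=lambda a: q_values[a])

    def update(self, episode_reward):
        # Improvement observed: exploit more; otherwise explore more.
        if episode_reward > self.best_reward:
            self.best_reward = episode_reward
            self.eps = max(self.eps_min, self.eps * self.decay)
        else:
            self.eps = min(self.eps_max, self.eps / self.decay)
```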

Single document summarization is the task of producing a shorter version of a
document while preserving its principal information content. In this paper we
conceptualize extractive summarization as a sentence ranking task and propose a
novel training algorithm which globally optimizes the ROUGE evaluation metric
through a reinforcement learning objective. We use our algorithm to train a
neural summarization model on the CNN and DailyMail datasets and demonstrate
experimentally that it outperforms state-of-the-art extractive and abstractive
systems when evaluated automatically and by humans.
( https://arxiv.org/abs/1802.08636 , 50kb)
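The training objective described above — globally optimizing ROUGE through
reinforcement learning over sentence-level extraction decisions — can be
sketched as a REINFORCE-style loss. The function and variable names below are
illustrative assumptions, not the paper's code:

```python
import numpy as np

def reinforce_loss(probs, sampled, reward):
    """REINFORCE-style loss for extractive summarization (illustrative).

    probs   : per-sentence extraction probabilities from the ranker
    sampled : binary extraction decisions sampled from those probabilities
    reward  : ROUGE score of the sampled summary vs. the reference

    Minimizing -reward * log p(sample) increases the probability of
    extraction patterns that achieve high ROUGE.
    """
    probs = np.asarray(probs, dtype=float)
    sampled = np.asarray(sampled)
    # Log-likelihood of each binary extraction decision.
    log_p = np.where(sampled == 1, np.log(probs), np.log(1.0 - probs))
    return -reward * log_p.sum()
```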

Although single-agent deep reinforcement learning has achieved significant
success thanks to the experience replay mechanism, its assumptions need to be
reconsidered in multiagent environments. This work focuses on stochastic
cooperative environments. We adapt a recently proposed weighted double
estimator and propose a multiagent deep reinforcement learning framework named
Weighted Double Deep Q-Network (WDDQN). To achieve efficient cooperation, a
\textit{Lenient Reward Network} and a \textit{Mixture Replay Strategy} are
introduced. By utilizing a deep neural network and the weighted double
estimator, WDDQN can not only reduce bias effectively but also be extended to
many deep RL scenarios with only raw pixel images as input. Empirically, WDDQN
outperforms an existing DRL algorithm (double DQN) and a multiagent RL
algorithm (lenient Q-learning) in both performance and convergence within
stochastic cooperative environments.
( https://arxiv.org/abs/1802.08534 , 1614kb)
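A weighted double estimator of the kind WDDQN builds on interpolates between
the single (max) estimator, which tends to overestimate, and the double
estimator, which tends to underestimate. The sketch below shows one common form
of such a target; the weighting rule and parameter `c` are illustrative
assumptions, not WDDQN's exact formulation:

```python
import numpy as np

def weighted_double_target(q_online, q_target, reward, gamma, c=1.0):
    """Weighted double estimator bootstrap target (illustrative sketch).

    q_online : next-state Q-values from the online network
    q_target : next-state Q-values from the target network
    Returns reward + gamma * (beta * single + (1 - beta) * double),
    mixing the overestimating single estimator with the
    underestimating double estimator.
    """
    q_online = np.asarray(q_online, dtype=float)
    q_target = np.asarray(q_target, dtype=float)
    a_star = int(np.argmax(q_online))       # action chosen by online net
    single = q_target.max()                 # single (max) estimator
    double = q_target[a_star]               # double estimator
    # Heuristic weight: larger spread in target values -> trust max more.
    spread = abs(q_target[a_star] - q_target.min())
    beta = spread / (c + spread)
    return reward + gamma * (beta * single + (1.0 - beta) * double)
```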

In recent years, Deep Reinforcement Learning has made impressive advances in
solving several important benchmark problems for sequential decision making.
Many control applications use a generic multilayer perceptron (MLP) for
non-vision parts of the policy network. In this work, we propose a new neural
network architecture for the policy network representation that is simple yet
effective. The proposed Structured Control Net (SCN) splits the generic MLP
into two separate sub-modules: a nonlinear control module and a linear control
module. Intuitively, the nonlinear control is for forward-looking and global
control, while the linear control stabilizes the local dynamics around the
residual of global control. We hypothesize that this will bring together the
benefits of both linear and nonlinear policies: improve training sample
efficiency, final episodic reward, and generalization of learned policy, while
requiring a smaller network and being generally applicable to different
training methods. We validated our hypothesis with competitive results on
simulations from OpenAI MuJoCo, Roboschool, Atari, and a custom 2D urban
driving environment, with various ablation and generalization tests, trained
with multiple black-box and policy gradient training methods. The proposed
architecture has the potential to improve upon broader control tasks by
incorporating problem specific priors into the architecture. As a case study,
we demonstrate much improved performance for locomotion tasks by emulating the
biological central pattern generators (CPGs) as the nonlinear part of the
architecture.
( https://arxiv.org/abs/1802.08311 , 2770kb)
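The split described above — a nonlinear control module plus a linear control
module whose outputs are summed into the action — can be sketched in a few
lines of NumPy. Layer sizes, initialization, and names here are illustrative
assumptions, not the paper's reference implementation:

```python
import numpy as np

class StructuredControlNet:
    """Minimal SCN-style policy sketch: action = nonlinear(s) + linear(s).

    The nonlinear term is a small tanh MLP (global, forward-looking
    control); the linear term K @ s + b stabilizes local dynamics.
    """

    def __init__(self, obs_dim, act_dim, hidden=32, seed=0):
        rng = np.random.default_rng(seed)
        # Nonlinear control module: one-hidden-layer tanh MLP.
        self.W1 = rng.normal(0.0, 0.1, (hidden, obs_dim))
        self.b1 = np.zeros(hidden)
        self.W2 = rng.normal(0.0, 0.1, (act_dim, hidden))
        self.b2 = np.zeros(act_dim)
        # Linear control module: gain matrix and bias.
        self.K = rng.normal(0.0, 0.1, (act_dim, obs_dim))
        self.b = np.zeros(act_dim)

    def act(self, s):
        s = np.asarray(s, dtype=float)
        nonlinear = self.W2 @ np.tanh(self.W1 @ s + self.b1) + self.b2
        linear = self.K @ s + self.b
        return nonlinear + linear  # the two sub-modules are summed
```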