These days, I am studying the following points in reinforcement learning:

function approximators for the value function and/or the policy; I concentrate on nonparametric approximators.

direct policy search

ways to help a reinforcement learner

obtaining a general-purpose, efficient software toolbox to solve the reinforcement learning problem.

Very brief background on reinforcement learning

In the reinforcement learning problem, an agent learns how to behave in its
environment. This is an extremely difficult problem, far from being
solved in general.
Reinforcement learning takes its roots in the study of the dynamics of
the behavior of living beings. At the end of the 19th century, Thorndike formulated the
law of effect which basically states that, in any animal, the
probability of emission of a behavior that is followed by favorable
consequences increases and, conversely, if it is followed by unfavorable
consequences, its probability of emission decreases. This principle
seems obvious to most of us; however, when it is followed rigorously
through to its various consequences, many of us disagree with them.
This may be the topic of excellent conversations but, as a researcher in
computer science, I find that it goes beyond my expected professional abilities.

As a computer scientist, my interest is that this very simple law of
the dynamics of behavior gives some insight into how one might try to
solve the reinforcement learning problem.
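
To make the connection concrete, here is a minimal sketch of the law of effect read as an update rule. Everything in it is an assumption made for illustration: the two behaviors, the numeric "consequence" of each, the softmax mapping from preferences to probabilities of emission, and the names `softmax` and `law_of_effect_update`. It is not a specific published algorithm.

```python
import math
import random


def softmax(preferences):
    """Turn real-valued preferences into probabilities of emission."""
    top = max(preferences)
    exps = [math.exp(p - top) for p in preferences]
    total = sum(exps)
    return [e / total for e in exps]


def law_of_effect_update(preferences, behavior, consequence, step_size=0.5):
    """Strengthen the emitted behavior after a favorable consequence (> 0),
    weaken it after an unfavorable one (< 0)."""
    updated = list(preferences)
    updated[behavior] += step_size * consequence
    return updated


rng = random.Random(0)
consequences = [+1.0, -1.0]   # hypothetical: behavior 0 is rewarded, behavior 1 punished
preferences = [0.0, 0.0]      # both behaviors start out equally likely

for _ in range(20):
    probs = softmax(preferences)
    # Emit a behavior according to the current probabilities.
    behavior = rng.choices([0, 1], weights=probs)[0]
    preferences = law_of_effect_update(preferences, behavior, consequences[behavior])

probabilities = softmax(preferences)
```

After these trials, the rewarded behavior has become far more probable than the punished one, whichever behaviors happened to be emitted along the way.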

How I got involved in reinforcement learning

Actually, I got interested in reinforcement learning thanks to Samuel Delepoulle,
whom I had the pleasure of advising during the completion of his PhD in
psychology. He had the excellent idea of using a reinforcement learning
algorithm (Q-learning, to be precise) to model an arm acquiring a
reaching movement (see
this paper and my publication
page for more details on this work).
To be a little bit more precise, the arm is made of 2 segments and
each joint (the shoulder and the elbow) is controlled by a pair of
antagonistic muscles. Each muscle is itself controlled by its own
Q-learner. This set of Q-learners has to learn to behave in order to
reach a certain target with the hand. The nice thing is that there is
absolutely no overall control of the 4 muscles and, still, the task is
performed.
We then tried to simulate more elaborate things, such as two
arms connected to a trunk (10 Q-learners in this case). Again, we
succeeded, and another nice thing here is that despite this 5-fold
increase in the number of agents, the (wall-clock) time for learning
the task was approximately the same, which means that each agent
learnt this task 5 times faster than the agents of the single arm!
We are continuing this line of work, aiming at simulating eye tracking
(see DYNAPP
project). In this ongoing work, we use function approximators rather
than crude tabular algorithms.
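
The single-arm setup described above (one tabular Q-learner per muscle, no overall controller) can be sketched roughly as follows. This is a toy illustration under stated assumptions, not the model from the cited work: the discretization of joint angles (`N_ANGLES`), the `START` and `TARGET` configurations, the binary relax/contract commands, the reward of 1 on reaching the target, the fact that each learner observes both joint angles, and all parameter values are hypothetical.

```python
import random

N_ANGLES = 5        # assumed number of discrete positions per joint
START = (0, 4)      # assumed (shoulder, elbow) configuration at episode start
TARGET = (3, 1)     # assumed configuration that puts the hand on the target
N_MUSCLES = 4       # shoulder flexor/extensor, elbow flexor/extensor


def step(state, actions):
    """Apply one binary command per muscle (0 = relax, 1 = contract)."""
    shoulder, elbow = state
    # Antagonistic pair: the joint moves by (flexor - extensor), within limits.
    shoulder = min(N_ANGLES - 1, max(0, shoulder + actions[0] - actions[1]))
    elbow = min(N_ANGLES - 1, max(0, elbow + actions[2] - actions[3]))
    next_state = (shoulder, elbow)
    done = next_state == TARGET
    return next_state, (1.0 if done else 0.0), done


class QLearner:
    """Tabular Q-learner driving a single muscle."""

    def __init__(self, rng, alpha=0.2, gamma=0.9, epsilon=0.1):
        self.rng, self.alpha, self.gamma, self.epsilon = rng, alpha, gamma, epsilon
        self.q = {(s, e): [0.0, 0.0]
                  for s in range(N_ANGLES) for e in range(N_ANGLES)}

    def act(self, state):
        values = self.q[state]
        if self.rng.random() < self.epsilon or values[0] == values[1]:
            return self.rng.randrange(2)   # explore, or break a tie at random
        return 0 if values[0] > values[1] else 1

    def update(self, state, action, reward, next_state, done):
        target = reward if done else reward + self.gamma * max(self.q[next_state])
        self.q[state][action] += self.alpha * (target - self.q[state][action])


def train(episodes=500, max_steps=50, seed=0):
    rng = random.Random(seed)
    learners = [QLearner(rng) for _ in range(N_MUSCLES)]
    for _ in range(episodes):
        state = START
        for _ in range(max_steps):
            actions = [learner.act(state) for learner in learners]
            next_state, reward, done = step(state, actions)
            # Every learner sees the same state and reward but updates only
            # its own table: nothing coordinates the four muscles.
            for learner, action in zip(learners, actions):
                learner.update(state, action, reward, next_state, done)
            state = next_state
            if done:
                break
    return learners


learners = train()
```

Each learner treats the other three muscles as part of its environment. Independent learners of this kind come with no general convergence guarantee, so the sketch only shows the structure of the decentralized setup.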

What I like about reinforcement learning (and machine learning in
general) is that things can be formalized quite nicely. It often
very quickly becomes mathematically very technical, but
still, intellectually, that's nice. Having been trained as a mere
computer scientist, these mathematical developments generally lead me
right to things I was never taught during my studies. But at
least, this formal aspect provides a path along which we are guided
towards good practices and good ways of thinking (might we also get
trapped in these ways of thinking? I do not know).
For some time now (and this effort still goes on), I have
thus been doing my best to understand some of the mathematics
underlying machine learning in general, and reinforcement learning in
particular. Things are made rather more complicated (but intellectually
more beautiful) by the fact that different branches of maths are required:
statistics, stochastic processes, and functional analysis, to name the most
important. This effort led me to go beyond the superficialities of
so-called neural networks and other exotic things of computer science,
and to concentrate on what these things really are, that is, what these
things are as computing machinery, to understand why they are prone to
doing things that are different from what we want them to do, and how to
remedy these apparent deficiencies (as I keep telling my
students: a computer is always, and only, doing what you are asking it
to do).