Continuous-time Markov decision processes are an important class of models in a
wide range of applications, from cyber-physical systems to synthetic
biology. A central problem is how to devise a policy to control the system in
order to maximise the probability of satisfying a set of temporal logic
specifications. Here we present a novel approach based on statistical model
checking and an unbiased estimation of a functional gradient in the space of
possible policies. The statistical approach has several advantages over
conventional methods based on uniformisation: it can also be applied when
the model is only available as a black box, and it does not suffer from state-space
explosion. The use of a stochastic gradient to guide our search considerably improves
the efficiency of learning policies. We demonstrate the method on a
proof-of-principle non-linear population model, showing strong performance in a
non-trivial task.
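
As a rough illustration of the kind of estimate involved (a generic score-function sketch over simulated trajectories; the symbols $\theta$, $\sigma$, $\varphi$ and the estimator form are illustrative assumptions, not notation taken from this paper):
\[
\nabla_\theta \, \mathbb{P}_\theta(\sigma \models \varphi)
  \;=\; \mathbb{E}_{\sigma \sim p_\theta}\!\left[\mathbf{1}\{\sigma \models \varphi\}\, \nabla_\theta \log p_\theta(\sigma)\right]
  \;\approx\; \frac{1}{N} \sum_{i=1}^{N} \mathbf{1}\{\sigma_i \models \varphi\}\, \nabla_\theta \log p_\theta(\sigma_i),
\]
where $\sigma_1, \dots, \sigma_N$ are trajectories simulated under a policy with parameters $\theta$ and $p_\theta$ is the induced trajectory distribution; each simulation also serves as a statistical model checking sample, and the Monte Carlo average is an unbiased estimate of the gradient.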