For example, the simulated environments can be used to teach robotic fingers to play an instrument, or pick and lift an object from the table. This is useful for folks interested in rapidly training intelligent robots over thousands of exercises, without having to rig up a relatively slow-moving physical bot, or before they have a chance to get hold of the hardware.

This Star Trek-style Holodeck approach is much faster and easier than training a robot in a physical environment – the resulting model can, of course, be later used to control a real-world machine, when it's ready.

Peter Welinder, a researcher at OpenAI, told The Register that “just as a real gym has different ‘environments’ – like a treadmill, a bench press, an exercise bike, and so on – the OpenAI Gym has environments for AI agents such as ‘make a toy figure walk’ or ‘make a car run up a slope.’”

Specifically, the latest environments simulate a Fetch robotic arm to push stuff around, and a ShadowHand to grip and manipulate things with robotic fingers.

All the new robotics environments use sparse rewards. Typically, RL models are rewarded little by little as they get closer to their goal. Each reward encourages the software, and indicates it is gradually learning to do the right thing. Sparse rewards, on the other hand, are only given when the code completes its goal.

Imagine telling a computer to make a sandwich. It's the difference between giving it reward points for getting two slices of bread, then more points for grabbing some ham, then more points for layering them, and just giving points for the sandwich when it's done.

“Let's take the arm pushing the puck as an example," said Welinder. "It tries to do some motion randomly, like just hitting the puck from the side. In the traditional RL setting, an oracle would give the agent a reward based on how close to the goal the puck ends up. The closer the puck is to the goal, the bigger the reward. So, in a way, the oracle tells the agent ‘you're getting warmer.’

“Sparse rewards essentially push this paradigm to the limit: the oracle only gives a reward if the goal is reached. The oracle doesn't say ‘you're getting warmer’ anymore. It only says: ‘You succeeded,’ or ‘You failed.’ This is a much harder setting to learn in, since you're not getting any intermediate clues.”
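The contrast Welinder describes can be sketched in a few lines of Python. This is an illustration only, not OpenAI's code: the function names, the 2D puck positions, the 0.05 success threshold, and the 0/-1 reward values are all assumptions made for the example.

```python
import math

def dense_reward(puck, goal):
    # "You're getting warmer": the reward rises as the puck nears the goal.
    return -math.dist(puck, goal)

def sparse_reward(puck, goal, threshold=0.05):
    # "You succeeded" or "You failed": 0.0 only once the puck is within
    # the (assumed) threshold of the goal, otherwise a flat -1.0.
    return 0.0 if math.dist(puck, goal) <= threshold else -1.0

goal = (1.0, 0.0)
for puck in [(0.0, 0.0), (0.5, 0.0), (0.9, 0.0), (1.0, 0.0)]:
    print(puck, dense_reward(puck, goal), sparse_reward(puck, goal))
```

Under the dense scheme every nudge towards the goal changes the score, whereas under the sparse scheme the first three positions are indistinguishable failures.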

Sparse reward learning is supposed to mirror the conditions for training robots in the real world. “For example, if I want my robot to pour wine into a glass I just tell it 'this is how much wine there should be in the glass,'" said Welinder. "I don't want to have to tell it 'first grab the bottle, then lift it up, tip it over the glass edge, pour until it reaches this level, hold for two seconds, stop.'"

To train robots via sparse rewards, OpenAI has also released an implementation of Hindsight Experience Replay (HER), an RL algorithm that learns from failed attempts by replaying them afterwards and treating whatever the robot actually achieved as if it had been the goal all along.
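The core trick behind HER can be sketched roughly as follows. This is a simplified illustration rather than OpenAI's released code: the `Transition` record, the `relabel_with_final_goal` helper, the puck positions, and the 0.05 threshold are hypothetical, and only the simplest relabelling strategy (reusing the final achieved position as the goal) is shown.

```python
import math
from typing import List, NamedTuple, Tuple

Point = Tuple[float, float]

class Transition(NamedTuple):
    achieved: Point  # where the puck was after this step
    goal: Point      # where the agent was asked to put it
    reward: float    # sparse reward for this step

def sparse_reward(achieved: Point, goal: Point, threshold: float = 0.05) -> float:
    # 0.0 on success, -1.0 otherwise (assumed convention for this sketch).
    return 0.0 if math.dist(achieved, goal) <= threshold else -1.0

def relabel_with_final_goal(episode: List[Transition]) -> List[Transition]:
    # Pretend the place the puck ended up was the goal all along,
    # then recompute every reward against that substitute goal.
    new_goal = episode[-1].achieved
    return [
        Transition(t.achieved, new_goal, sparse_reward(t.achieved, new_goal))
        for t in episode
    ]

# A failed attempt: the puck never gets near the real goal.
goal = (1.0, 0.0)
positions = [(0.1, 0.0), (0.3, 0.1), (0.6, 0.2)]
episode = [Transition(p, goal, sparse_reward(p, goal)) for p in positions]

# In hindsight, the same attempt is a success at reaching (0.6, 0.2).
hindsight = relabel_with_final_goal(episode)
replay_buffer = episode + hindsight
```

Storing both versions means that even an episode that failed at its original task contributes at least one successful example to learn from.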

Since the environments are open source, other developers can customize them to introduce new robot motions or different objects. OpenAI has also published a list of research ideas for developers interested in improving the HER algorithm, on page six of this technical report. ®