How Google Wants to Solve Robotic Grasping by Letting Robots Learn for Themselves

You are likely pretty good at picking things up. That’s nice. Part of the reason that you’re pretty good at picking things up is that when you were little, you spent a lot of time trying and failing to pick things up, and learning from your experiences. For roboticists who don’t want to wait through the equivalent of an entire robotic childhood, there are ways to streamline the process: at Google Research, they’ve set up more than a dozen robotic arms and let them work for months on picking up objects that are heavy, light, flat, large, small, rigid, soft, and translucent (although not all at once). We talk to the researchers about how their approach is unique, and why 800,000 grasps (!) is just the beginning.

Part of what makes animals so good at grasping things is our eyes, as opposed to just our hands. You can grab stuff with your eyes closed, but you’re much better at it if you watch your hand interacting with the object that you’re trying to pick up. In robotics, this is referred to as visual servoing, and in addition to improving grasping accuracy, it makes grasping possible when objects are moving around or changing orientation during the grasping process, a very common thing to have happen in those pesky “real-world situations.”

Image: Google Research

One of the robotic manipulators used in the data collection experiments. Each unit consisted of a 7-degree-of-freedom arm with a 2-finger gripper, and a camera mounted over the shoulder of the robot. The researchers say the camera recorded monocular RGB and depth images, but only the monocular RGB images were used for grasp success prediction.

Teaching robots this skill can be tricky, because there aren’t necessarily obvious connections between sensor data and actions, especially if you have gobs of sensor data coming in all the time (like you do with vision systems). A cleverer way to do it is to just let the robots learn for themselves, instead of trying to teach them at all. At Google Research, a team of researchers, with help from colleagues at X, tasked a 7-DoF robot arm with picking up objects in clutter using monocular visual servoing, and used a deep convolutional neural network (CNN) to predict the outcome of the grasp. The CNN was continuously retraining itself (starting with a lot of fail but gradually getting better), and to speed the process along, Google threw 14 robots at the problem in parallel. This is completely autonomous: all the humans had to do was fill the bins with stuff and then turn the power on.

“In essence, the robot is constantly predicting, by observing the motion of its own hand, which kind of subsequent motion will maximize its chances of success. The result is continuous feedback: what we might call hand-eye coordination. Observing the behavior of the robot after over 800,000 grasp attempts, which is equivalent to about 3000 robot-hours of practice, we can see the beginnings of intelligent reactive behaviors. The robot observes its own gripper and corrects its motions in real time. It also exhibits interesting pre-grasp behaviors, like isolating a single object from a group. All of these behaviors emerged naturally from learning, rather than being programmed into the system.”
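To make the feedback loop in that description concrete, here is a minimal sketch in Python. It is not the researchers' implementation: the real predictor is a CNN working from camera images, while `predict_success` below is a hypothetical stand-in that scores a motion by how close it ends to a known target. The function names, the 2D workspace, and the plain random sampling of candidates are all assumptions made for illustration.

```python
import random

def predict_success(gripper_xy, motion, target_xy):
    """Hypothetical stand-in for the learned predictor: returns a score in (0, 1].

    The real system scores motions from camera images; this toy version
    simply rewards motions that end close to a known target position.
    """
    end_x = gripper_xy[0] + motion[0]
    end_y = gripper_xy[1] + motion[1]
    dist_sq = (end_x - target_xy[0]) ** 2 + (end_y - target_xy[1]) ** 2
    return 1.0 / (1.0 + dist_sq)

def choose_motion(gripper_xy, target_xy, rng, n_candidates=64, max_step=0.1):
    """Sample candidate motions and return the one with the highest score."""
    candidates = [
        (rng.uniform(-max_step, max_step), rng.uniform(-max_step, max_step))
        for _ in range(n_candidates)
    ]
    return max(candidates, key=lambda m: predict_success(gripper_xy, m, target_xy))

def servo(start_xy, target_xy, steps=30, seed=0):
    """Closed loop: re-observe, re-score, and move a little at every step."""
    rng = random.Random(seed)
    position = start_xy
    for _ in range(steps):
        dx, dy = choose_motion(position, target_xy, rng)
        position = (position[0] + dx, position[1] + dy)
    return position

final_position = servo(start_xy=(0.0, 0.0), target_xy=(0.5, -0.3))
```

Because the motion is re-chosen at every step from the current position, the loop keeps correcting itself if an earlier step was poor, which is the "continuous feedback" the researchers describe.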

With 14 robots all working on this problem, a lot of data get collected a lot faster, but at the same time, a lot of unintentional variation gets introduced into the experiment. Cameras are positioned slightly differently, lighting is a bit different for each robot, and each of the compliant, underactuated two-finger grippers exhibits different types of wear, affecting performance:

Image: Google Research

What the grippers of the robots used for data collection looked like at the end of the experiments. The researchers say the robots “experienced different degrees of wear and tear, resulting in significant variation in gripper appearance and geometry.”

The upside to this is that the robots end up with a tolerance for things like minor hardware variation and camera calibration differences, making the grasping as a whole more robust. Even so, this method can’t be generalized too much, and is unlikely to work on significantly different hardware or in different grasping environments (like trying to pick stuff up off of a shelf). In future work, the researchers plan to explore increasing the diversity of the training setup to see how much more adaptable their technique can get. They’d also like to investigate how this method could be applied to “real world” robots that are “exposed to a wide variety of environments, objects, lighting conditions, and wear and tear.”

For more info, we spoke with Sergey Levine at Google Research about what they’ve been working on:

Sergey Levine: Like Dex-Net and the work at Brown, our work is predicated on the hypothesis that large datasets will have a transformative effect on robot capability. The principal difference between our work and these efforts is that we take a very direct and data-driven approach to a specific robotic problem—grasping—with minimal prior knowledge. Dex-Net uses a model-based approach and simulated data, while the Brown Million Objects Challenge has the substantially broader aim of collecting scans of a large number of objects (our approach doesn’t aim to collect scans, simply to learn to grasp from experience).

Why was the volume of data important, and what (if anything) could you have learned with more data?

We used between six and 14 arms at any given time (the number increased over the course of the experiment as more robots came online). We are still working to formally determine how much data is actually needed, but anecdotally, things started to pick up after about 200,000 grasps, and continued to improve up to 800,000 grasps (and seem likely to improve further with more data).

The volume is important for two reasons: (1) there are many possible geometric configurations of objects and grippers, and (2) additional data was always collected using the latest model, which was effective at picking out precisely those situations where the latest model was confident but incorrect, thereby adding samples to the dataset that could improve the latest model further.
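The second reason can be illustrated with a toy collection loop, sketched here in Python. Everything in it is an assumption for illustration: the five "grasp types," their hidden success rates, and the simple frequency-count model (which starts out optimistic about anything it has not tried) are hypothetical stand-ins for the real CNN and real grasps. The point is only to show how collecting data with the latest model sends attempts to exactly the places where that model is overconfident.

```python
import random

# Hidden from the learner: true success probability of each hypothetical grasp type.
TRUE_SUCCESS_RATE = [0.05, 0.10, 0.20, 0.90, 0.15]

def fit(dataset, n_types):
    """'Retrain' the toy model: estimate a success rate for each grasp type."""
    attempts = [0] * n_types
    successes = [0] * n_types
    for grasp_type, succeeded in dataset:
        attempts[grasp_type] += 1
        successes[grasp_type] += int(succeeded)
    # One pseudo-success per type, so an untried type is scored 1.0
    # (confident, and usually incorrect).
    return [(successes[i] + 1) / (attempts[i] + 1) for i in range(n_types)]

def collect(model, batch_size, rng):
    """Collect one batch using the latest model: always try its favorite type."""
    chosen = max(range(len(model)), key=lambda i: model[i])
    return [(chosen, rng.random() < TRUE_SUCCESS_RATE[chosen]) for _ in range(batch_size)]

def run(rounds=50, batch_size=20, seed=0):
    """Alternate between collecting a batch and refitting on all data so far."""
    rng = random.Random(seed)
    n_types = len(TRUE_SUCCESS_RATE)
    dataset = []
    model = fit(dataset, n_types)
    for _ in range(rounds):
        dataset.extend(collect(model, batch_size, rng))
        model = fit(dataset, n_types)
    return dataset, model

dataset, model = run()
attempts_per_type = [
    sum(1 for grasp_type, _ in dataset if grasp_type == i)
    for i in range(len(TRUE_SUCCESS_RATE))
]
```

In this sketch, each overrated grasp type gets tried, fails, and is marked down after the next refit, so the dataset fills in the model's blind spots before attempts settle on the type that actually works.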

How does your hardware design affect the technique (and success) of grasping objects? Why did you choose this particular gripper, and can the approach be adapted to any gripper?

The approach is straightforward to apply to any parallel jaw gripper, and can likely be adapted to other grippers and hands. The hardware was not designed specifically for this task; it was just the easiest hardware for us to get access to at the required volume. That said, the particular fingers we used with our gripper are well suited for picking various objects.

How can this work be generalized so that the technique could be useful to other manipulators in other environments?

It is likely that, in order to generalize to other manipulators, the system must be trained with a variety of manipulators and end effectors. The current system is a proof of concept. A practical application is likely to require more extensive training in a variety of environments, with a variety of backgrounds, and possibly in other settings (on shelves, in drawers, etc.), as well as a mechanism for higher-level direction to choose what to grasp, perhaps by constraining the sampled motor commands to specific parts of the workspace.
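One way to picture that last idea, constraining the sampled motor commands to part of the workspace, is the Python sketch below. It is an assumption-laden illustration rather than anything the researchers describe building: the function name, the rectangular region, the 2D motions, and the use of rejection sampling are all hypothetical choices.

```python
import random

def sample_constrained_motions(gripper_xy, region, rng,
                               n_candidates=32, max_step=0.1, max_tries=10000):
    """Sample motions whose end point stays inside a rectangular region.

    region is (x_min, x_max, y_min, y_max). Candidates that would end
    outside it are rejected, so whatever scores the candidates afterwards
    only ever sees grasps in the requested part of the workspace.
    """
    x_min, x_max, y_min, y_max = region
    accepted = []
    tries = 0
    while len(accepted) < n_candidates and tries < max_tries:
        tries += 1
        dx = rng.uniform(-max_step, max_step)
        dy = rng.uniform(-max_step, max_step)
        end_x = gripper_xy[0] + dx
        end_y = gripper_xy[1] + dy
        if x_min <= end_x <= x_max and y_min <= end_y <= y_max:
            accepted.append((dx, dy))
    return accepted

gripper = (0.05, 0.0)
region = (0.0, 0.2, -0.05, 0.05)  # hypothetical "grasp only from here" box
motions = sample_constrained_motions(gripper, region, random.Random(0))
```

A higher-level planner could then pick the region (say, the corner of the bin holding the object it wants) and leave the choice of the best motion within that region to the learned predictor.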

You can read a preprint of the paper “Learning Hand-Eye Coordination for Robotic Grasping with Deep Learning and Large-Scale Data Collection” by Sergey Levine, Peter Pastor, Alex Krizhevsky, and Deirdre Quillen, on arXiv.