Whether you are a photographer, a marketing manager, or a regular Internet user, chances are you have encountered visible watermarks many times. Visible watermarks are the logos and patterns often overlaid on digital images provided by stock photography websites, identifying the images’ owners while still allowing viewers to perceive the underlying content so that they can license the images that fit their needs. Watermarking is the most common mechanism for protecting the copyrights of the hundreds of millions of photographs and stock images offered online daily.

It’s standard practice to use watermarks on the assumption that they prevent consumers from accessing the clean images, ensuring there will be no unauthorized or unlicensed use. However, in “On The Effectiveness Of Visible Watermarks,” recently presented at the 2017 Conference on Computer Vision and Pattern Recognition (CVPR 2017), we show that a computer algorithm can get past this protection and remove watermarks automatically, giving users unobstructed access to the clean images the watermarks are intended to protect.

Left: example watermarked images from popular stock photography websites. Right: watermark-free version of the images on the left, produced automatically by a computer algorithm. More results are available below and on our project page. Image sources: Adobe Stock, 123RF.

As is often done with vulnerabilities discovered in operating systems, applications or protocols, we want to disclose this vulnerability and propose solutions in order to help the photography and stock image communities adapt and better protect their copyrighted content and creations. From our experiments, much of the world’s stock imagery is currently susceptible to this circumvention. As such, in our paper we also propose ways to make visible watermarks more robust to such manipulations.

The Vulnerability of Visible Watermarks
Visible watermarks are often designed to contain complex structures such as thin lines and shadows in order to make them harder to remove. Indeed, given a single image, it is extremely difficult for a computer to automatically detect which visual structures belong to the watermark and which belong to the underlying image. Removing a watermark manually is tedious as well: even with state-of-the-art editing tools, it may take a Photoshop expert several minutes to remove a watermark from a single image.

However, a fact that has been overlooked so far is that watermarks are typically added in a consistent manner to many images. We show that this consistency can be used to invert the watermarking process — that is, estimate the watermark image and its opacity, and recover the original, watermark-free image underneath. This can all be done automatically, without any user intervention, by only observing watermarked image collections publicly available online.

The consistency of a watermark over many images allows it to be removed automatically at massive scale. Left: input collection marked by the same watermark; middle: computed watermark and its opacity; right: recovered, watermark-free images. Image sources: COCO dataset, Copyright logo.

The first step of this process is identifying which image structures repeat across the collection. If a similar watermark is embedded in many images, the watermark becomes the signal in the collection and the images become the noise, and simple image operations can be used to pull out a rough estimate of the watermark pattern.
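To illustrate the kind of “simple image operations” this step can rely on, here is a minimal sketch in NumPy (our own illustration; the exact procedure in the paper differs in its details) that treats the watermark as the signal by taking a per-pixel median of image gradients across the collection:

```python
import numpy as np

def estimate_watermark(images):
    """Roughly estimate a consistent watermark from a stack of
    watermarked images (each HxWx3, watermark in the same position)."""
    # Gradients of each image along x and y.
    gx = np.stack([np.gradient(im, axis=1) for im in images])
    gy = np.stack([np.gradient(im, axis=0) for im in images])

    # Per-pixel median across the collection: structure that repeats
    # (the watermark) survives; varying image content is suppressed.
    wm_gx = np.median(gx, axis=0)
    wm_gy = np.median(gy, axis=0)

    # The gradient field (wm_gx, wm_gy) can then be integrated, e.g.
    # by Poisson reconstruction, into a rough matted watermark image.
    return wm_gx, wm_gy
```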

This provides a rough (noisy) estimate of the matted watermark (the watermark image times its spatially varying opacity, i.e., its alpha matte). To actually recover the image underneath the watermark, we need to know the watermark’s decomposition into its image and alpha matte components. For this, a multi-image optimization problem can be formed, which we call “multi-image matting” (an extension of the traditional, single-image matting problem), where the watermark (the “foreground”) is separated into its image and opacity components while reconstructing a subset of clean (“background”) images. This optimization produces very accurate estimates of the watermark components from as few as a few hundred images, and can deal with most watermarks used in practice, including ones containing thin structures, shadows or color gradients (as long as the watermarks are semi-transparent). Once the watermark pattern is recovered, it can be efficiently removed from any image marked by it.
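To make the setup concrete, the blending model underlying this decomposition is standard alpha compositing; the summary below uses our own notation, with a generic regularizer standing in for the specific terms used in the paper:

```latex
% Watermarking model for each image k in the collection:
%   J_k: watermarked image, I_k: clean image,
%   W: watermark image, \alpha: spatially varying opacity (alpha matte)
J_k(p) = \alpha(p)\, W(p) + \bigl(1 - \alpha(p)\bigr)\, I_k(p)

% Multi-image matting: jointly recover W, \alpha and the clean images,
% with \Phi a placeholder for regularization terms and \lambda its weight:
\min_{W,\,\alpha,\,\{I_k\}} \sum_k \bigl\| \alpha W + (1 - \alpha)\, I_k - J_k \bigr\|^2 + \lambda\, \Phi\bigl(W, \alpha, \{I_k\}\bigr)
```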

Here are some more results, showing the estimated watermarks and example watermark-free results generated for several popular stock image services. We show many more results in our supplementary material on the project page.

Making Watermarks More Effective
The vulnerability of current watermarking techniques lies in the consistency of watermarks across image collections. Therefore, to counter it, we need to introduce inconsistencies when embedding the watermark in each image. In our paper we looked at several types of inconsistencies and how they affect the techniques described above. We found, for example, that simply changing the watermark’s position randomly per image does not prevent removal of the watermark, nor do small random changes in the watermark’s opacity. But we found that introducing random geometric perturbations to the watermark — warping it when embedding it in each image — improves its robustness. Interestingly, very subtle warping is already enough to generate watermarks that this technique cannot fully defeat.

Flipping between the original watermark and a slightly, randomly warped version of it; such subtle warping can improve the watermark’s robustness.

This warping produces a watermarked image that is very similar to the original (top right in the following figure), yet if an attempt is now made to remove the watermark, it leaves very visible artifacts (bottom right):

In a nutshell, the reason this works is that removing the randomly warped watermark from any single image additionally requires estimating the warp field that was applied to the watermark for that image — a task that is inherently more difficult. Therefore, even if the watermark pattern can be estimated in the presence of these random perturbations (which by itself is nontrivial), accurately removing it without any visible artifacts is far more challenging.
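For concreteness, here is a hedged sketch of what embedding with a per-image random warp could look like; the smooth random displacement field and its parameters are illustrative choices of ours, not the procedure used in the paper:

```python
import numpy as np
from scipy.ndimage import gaussian_filter, map_coordinates

def embed_warped_watermark(image, watermark, alpha,
                           amplitude=1.0, smoothness=5.0):
    """Blend a watermark into `image` after a subtle random warp.

    image, watermark: HxW float arrays; alpha: HxW opacity in [0, 1].
    amplitude and smoothness control the (illustrative) warp field.
    """
    h, w = watermark.shape
    # Smooth random displacement field, drawn fresh for every image.
    dx = gaussian_filter(np.random.randn(h, w), smoothness) * amplitude
    dy = gaussian_filter(np.random.randn(h, w), smoothness) * amplitude

    yy, xx = np.meshgrid(np.arange(h), np.arange(w), indexing="ij")
    coords = [yy + dy, xx + dx]

    # Warp the watermark and its alpha matte consistently.
    w_warped = map_coordinates(watermark, coords, order=1, mode="reflect")
    a_warped = map_coordinates(alpha, coords, order=1, mode="reflect")

    # Standard alpha compositing with the warped watermark.
    return a_warped * w_warped + (1.0 - a_warped) * image
```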

Here are some more results on the images from above when using subtle, randomly warped versions of the watermarks. Notice again how visible artifacts remain when trying to remove the watermark in this case, compared to the accurate reconstructions that are achievable with current, consistent watermarks. More results and a detailed analysis can be found in our paper and project page.

Left column: Watermarked image, using subtle, random warping of the watermark. Right column: Watermark removal result.

This subtle random warping is only one type of randomization that can be introduced to make watermarks more effective. A nice feature of this solution is that it is simple to implement, and it already improves the robustness of the watermark to image-collection attacks while remaining mostly imperceptible. If more visible changes to the watermark across the images are acceptable — for example, introducing larger shifts in the watermark or incorporating other random elements in it — they may lead to even better protection.

While we cannot guarantee that there will not be a way to break such randomized watermarking schemes in the future, we believe (and our experiments show) that randomization will make watermarked collection attacks fundamentally more difficult. We hope that these findings will be helpful for the photography and stock image communities.

Acknowledgements
The research described in this post was performed by Tali Dekel, Michael Rubinstein, Ce Liu and Bill Freeman. We thank Aaron Maschinot for narrating our video.

Machine learning (ML) has become an increasingly powerful tool, one that can be applied to a wide variety of areas spanning object recognition, language translation, health and more. However, the development of ML systems is often restricted to those with computational resources and the technical expertise to work with commonly available ML libraries.

With PAIR — an initiative to study and redesign human interactions with ML — we want to open machine learning up to as many people as possible. In pursuit of that goal, we are excited to announce deeplearn.js 0.1.0, an open source WebGL-accelerated JavaScript library for machine learning that runs entirely in your browser, with no installations and no backend.

There are many reasons to bring machine learning into the browser. A client-side ML library can be a platform for interactive explanations, for rapid prototyping and visualization, and even for offline computation. And if nothing else, the browser is one of the world's most popular programming platforms.

While web machine learning libraries have existed for years (e.g., Andrej Karpathy's convnetjs), they have been limited by the speed of JavaScript, or have been restricted to inference rather than training (e.g., TensorFire). By contrast, deeplearn.js offers a significant speedup by exploiting WebGL to perform computations on the GPU, along with the ability to do full backpropagation.

The API mimics the structure of TensorFlow and NumPy, with a delayed execution model for training (like TensorFlow), and an immediate execution model for inference (like NumPy). We have also implemented versions of some of the most commonly-used TensorFlow operations. With the release of deeplearn.js, we will be providing tools to export weights from TensorFlow checkpoints, which will allow authors to import them into web pages for deeplearn.js inference.

You can explore the potential of this library by training a convolutional neural network to recognize photos and handwritten digits — all in your browser without writing a single line of code.

We're releasing a series of demos that show deeplearn.js in action. Play with an image classifier that uses your webcam in real time and watch the network’s internal representations of what it sees. Or generate abstract art videos at a smooth 60 frames per second. The deeplearn.js homepage contains these and other demos.

Our vision is that this library will significantly increase visibility and engagement with machine learning, giving developers access to powerful tools while simultaneously providing the everyday user with a way to interact with them. We’re looking forward to collaborating with the open source community to drive this vision forward.

Machine learning (ML) is a key strategic focus at Google, with highly active groups pursuing research in virtually all aspects of the field, including deep learning and more classical algorithms, exploring theory as well as application. We utilize scalable tools and architectures to build machine learning systems that enable us to solve deep scientific and engineering challenges in areas of language, speech, translation, music, visual processing and more.

As a leader in ML research, Google is proud to be a Platinum Sponsor of the thirty-fourth International Conference on Machine Learning (ICML 2017), a premier annual event supported by the International Machine Learning Society taking place this week in Sydney, Australia. With over 130 Googlers attending the conference to present publications and host workshops, we look forward to our continued collaboration with the larger ML research community.

If you're attending ICML 2017, we hope you'll visit the Google booth and talk with our researchers to learn more about the exciting work, creativity and fun that goes into solving some of the field's most interesting challenges. Our researchers will also be available to talk about and demo several recent efforts, including the technology behind Facets, neural audio synthesis with NSynth, a Q&A session on the Google Brain Residency program and much more. You can also learn more about our research being presented at ICML 2017 in the list below (Googlers highlighted in blue).

As a leader in natural language processing & understanding and a Platinum sponsor of ACL 2017, Google will be on hand to showcase research interests that include syntax, semantics, discourse, conversation, multilingual modeling, sentiment analysis, question answering, summarization, and generally building better systems using labeled and unlabeled data, state-of-the-art modeling and learning from indirect supervision.

If you’re attending ACL 2017, we hope that you’ll stop by the Google booth to check out some demos, meet our researchers and discuss projects and opportunities at Google that go into solving interesting problems for billions of people. Learn more about the Google research being presented at ACL 2017 below (Googlers highlighted in blue).

Recently, Google Machine Perception researchers, in collaboration with Daydream Labs and YouTube Spaces, presented a solution for virtual headset ‘removal’ for mixed reality in order to create a richer and more engaging VR experience. While that work could infer eye-gaze directions and blinks, enabled by a headset modified with eye-tracking technology, a richer set of facial expressions — which are key to understanding a person's experience in VR, as well as conveying important social engagement cues — was missing.

Today we present an approach to infer select facial action units and expressions entirely by analyzing a small part of the face while the user is engaged in a virtual reality experience. Specifically, we show that images of the user’s eyes captured from an infrared (IR) gaze-tracking camera within a VR headset are sufficient to infer at least a subset of facial expressions without the use of any external cameras or additional sensors.

Left: A user wearing a VR HMD modified with eye-tracking used for expression classification (note that no external camera is used in our method; this is just for visualization). Right: Inferred expression from eye images using our model. A video demonstrating the work can be seen here.

We use supervised deep learning to classify facial expressions from images of the eyes and surrounding areas, which typically contain the iris, sclera and eyelids, and may include parts of the eyebrows and top of the cheeks. Obtaining large-scale annotated data from such novel sensors is challenging, so we collected training data by capturing 46 subjects performing a set of facial expressions.

To perform expression classification, we fine-tuned a variant of the widely used Inception architecture with TensorFlow, using weights from a model trained to convergence on ImageNet. We attempted to partially remove variance due to differences in participant appearance (i.e., individual differences that do not depend on expression), inspired by the standard practice of mean image subtraction. Since this variance removal occurs within-subject, it is effectively a form of personalization. Further details, along with examples of eye images and results, are presented in our accompanying paper.
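The post leaves this personalization step at a high level; below is a minimal sketch of within-subject mean-image subtraction, assuming frames are grouped by a subject ID array (the names and data layout are ours):

```python
import numpy as np

def personalize(eye_images, subject_ids):
    """Remove per-subject appearance variance by subtracting each
    subject's mean eye image (within-subject mean subtraction).

    eye_images: NxHxWxC float array; subject_ids: length-N array.
    """
    out = np.empty_like(eye_images)
    for sid in np.unique(subject_ids):
        mask = subject_ids == sid
        # Appearance differences that don't depend on expression are
        # roughly captured by the subject's mean image; subtract it.
        out[mask] = eye_images[mask] - eye_images[mask].mean(axis=0)
    return out
```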

Results and Extensions
We demonstrate that the information required to classify a variety of facial expressions is reliably present in IR eye images captured by a commercial HMD sensor, and that this information can be decoded using a CNN-based method, even though classifying facial expressions from eye images alone is non-trivial even for humans. Our model’s inference can be performed in real time, and we show that it can be used to generate expressive avatars, which can function as expressive surrogates for users engaged in VR. This interaction mechanism also yields a more intuitive interface for sharing expression in VR than gestures or keyboard inputs.

The ability to capture a user’s facial expressions using existing eye-tracking cameras enables a fully mobile solution to facial performance capture in VR, without additional external cameras. This technology extends beyond animating cartoon avatars; it could be used to provide a richer headset removal experience, enhancing communication and social interaction in VR by transmitting far more authentic and emotionally textured information.

Acknowledgements
The research described in this post was performed by Steven Hickson (as an intern), Nick Dufour, Avneesh Sud, Vivek Kwatra and Irfan Essa. We also thank Hayes Raffle and Alex Wong from Daydream, and Chris Bregler, Sergey Ioffe and authors of TF-Slim from Google Research for their guidance and suggestions.

Google is always interested in solving complex engineering problems, and few are more complex than fusion. Physicists have been trying since the 1950s to control the fusion of hydrogen atoms into helium, which is the same process that powers the Sun. The key to harnessing this power is to confine hydrogen plasmas for long enough to get more energy out from fusion reactions than was put in. This point is called “breakeven.” If it works, it would represent a technological breakthrough, and could provide an abundant source of zero-carbon energy.

There are currently several large academic and government research efforts in fusion. Just to rattle off a few, in plasma fusion there are tokamak machines like ITER and stellarator machines like Wendelstein 7-X. The stellarator design actually goes back to 1951, so physicists have been working on this for a while. Oh yeah, and if you like giant lasers, there’s the National Ignition Facility, which uses lasers to generate X-rays to drive fusion reactions. So far, none of these has reached the economic breakeven point.

Yeah. Tri Alpha Energy has a unique scheme for plasma confinement called a field-reversed configuration that’s predicted to get more stable as the energy goes up, in contrast to other methods where plasmas get harder to control as you heat them. Tri Alpha built a giant ionized plasma machine, C-2U, that fills an entire warehouse in an otherwise unassuming office park. The plasma that this machine generates and confines exhibits all kinds of highly nonlinear behavior. The machine itself pushes the envelope of how much electrical power can be applied to generate and confine the plasma in such a small space over such a short time. It’s a complex machine with more than 1000 knobs and switches, an investment (not ours!) in exploring clean energy north of $100 million. This is a high-stakes optimization problem, dealing with both plasma performance and equipment constraints. This is where Google comes in.

End-on view of C-2U

Wait, why not just simulate what will happen? Isn’t this simple physics?

The “simple” simulations using magnetohydrodynamics don’t really apply. Even if these machines operated in that limit, which they very much don’t, the simulations make fluid dynamics simulations look easy! The reality is much more complicated: the ion temperature is three times larger than the electron temperature, so the plasma is far out of thermal equilibrium, and the fluid approximation is totally invalid. You have to track at least some of the trillion-plus individual particles, so the whole thing is beyond what we know how to do even with Google-scale compute resources.

So why are we doing this? Real experiments! With atoms not bits! At Google we love to run experiments and optimize things. We thought it would be a great challenge to see if we could help Tri Alpha. They run a plasma “shot” on the C-2U machine every 8 minutes. Each shot consists of creating two spinning blobs of plasma in the vacuum sealed innards of C-2U, smashing them together at over 600,000 miles per hour, creating a bigger, hotter, spinning football of plasma. Then they blast it continuously with particle beams (actually neutral hydrogen atoms) to keep it spinning. They hang on to the spinning football with magnetic fields for as long as 10 milliseconds. They’re trying to experimentally verify that these advanced beam driven field-reversed plasma configurations behave as expected by theory. If they do, this scheme could lead to net-energy-out fusion.

Now 8 minutes (the time it takes for C-2U to cool, recharge, and get ready for another 10 ms shot) sounds like a long time, but when you’re sitting in the control room during an experimental campaign, it goes by really quickly. There are a lot of sensor outputs to look at to try to figure out how the plasma was behaving. Before you know it, the power supplies are charged again, and they’re ready for another go!

What was that about optimization? What are you optimizing?

That’s the thing, it’s not completely obvious what good plasma performance is. Of course, Tri Alpha has some of the world’s best plasma physicists, but even they disagree on what “good” is. We can boil down the machine controls to “only” 30 parameters or so, but when you have to wait 8 minutes per experiment, it’s a pretty hard problem even with a concrete objective. Also, it’s not entirely known, day-to-day, what the reliable operating envelope of the machine is. And it keeps changing since the quality of the vacuum keeps changing and electrodes wear out and...

So we boil the problem down to “let’s find plasma behaviors that an expert human plasma physicist thinks are interesting, and let’s not break the machine while we’re doing it.” We developed the Optometrist Algorithm, which is sort of like Markov Chain Monte Carlo (MCMC), except that the likelihood function being explored lives in the plasma physicist’s mind rather than being explicitly written down. Just like getting an eyeglass prescription, the algorithm presents the expert human with machine settings and the associated outcomes, and they use their judgment to decide what is interesting and what is unhealthy for the machine. Their reactions could be “That initial collision looked really strong!” or “The edge biasing is actually working well now!” or “Wow, that was awesome, but the electrode current was way too high, let’s not do that again!” The key improvement we provided was a technique to search the high-dimensional space of machine parameters efficiently.
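Here is a schematic of that loop, with run_shot, propose, and ask_expert as hypothetical stand-ins for the machine, the parameter-perturbation step, and the expert’s choice (a sketch of the idea, not Tri Alpha’s actual control code):

```python
import random

def optometrist(initial_settings, run_shot, ask_expert, propose, n_iters=50):
    """Human-in-the-loop search over machine settings, in the spirit of
    the Optometrist Algorithm: "which looks better, one or two?"

    run_shot(settings) -> outcome of one plasma shot (sensor traces)
    ask_expert(a, b)   -> True if the expert prefers outcome b over a
    propose(settings)  -> randomly perturbed copy of the settings
    """
    reference = initial_settings
    reference_outcome = run_shot(reference)
    for _ in range(n_iters):
        candidate = propose(reference)  # explore nearby settings
        candidate_outcome = run_shot(candidate)
        # The "likelihood" lives in the physicist's head: like an eye
        # exam, they simply pick whichever of the two shots they prefer.
        if ask_expert(reference_outcome, candidate_outcome):
            reference, reference_outcome = candidate, candidate_outcome
    return reference

# Example perturbation over a dict of ~30 knob settings:
def propose(settings, scale=0.05):
    return {k: v * (1.0 + scale * random.uniform(-1, 1))
            for k, v in settings.items()}
```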

Oh, I like MCMC, it’s like the best thing ever!

I knew you’d like that bit. Using this technique, we actually found something really interesting. As we describe in our paper, we found a regime where the neutral particle beams dumping energy into the plasma were able to completely balance the cooling losses, and the total energy in the plasma actually went up after formation. It was only for about 2 milliseconds, but still, it was a first! Since rising energy due to neutral beam heating was not necessarily expected for C-2U, it would have been difficult to plug into an objective function. We really needed a human expert to notice. This was a classic case of humans and computers doing a better job together than either could have separately. You know how it is — when you think you have an optimization problem, and you optimize the objective, you usually just look at the result and say, “No no no, that’s not what I meant,” and you add some other term and repeat until you get sick of it?

That hasn’t happened to me. This week. Yet.

Yeah, so we just cut out that iteration and let the expert human use their judgment. This learning from human preferences is becoming a thing. Google and Tri Alpha made a pretty good team for it, for a really important problem.

So what now?

So actually, Tri Alpha learned everything they could have from C-2U and then dismantled it. They built a new machine called Norman (after their late co-founder Norman Rostoker) in the same warehouse. It’s much more powerful both in plasma acceleration and in neutral particle beams. It also has a more sophisticated system to confine the plasma in the central region. The pressure vessel, accelerators, and banks of capacitors and power supplies cover the building’s concrete floor.

They just achieved “first plasma” on it. They’re hoping, with our help, to verify this theoretical prediction that the plasma will actually behave better in the “burning plasma” regime. If they can do that over the next 18 months, it will be a lot more likely that the field-reversed configuration is a viable approach for breakeven fusion. In that case, Tri Alpha will try to build their follow-on design, an actual demonstration power generator. That one won’t fit in their warehouse!

Acknowledgements
On the Google side, we wish to thank John Platt, Michael Dikovsky, Patrick Riley and Ross Koningstein for their significant contributions to this work. We thank the Google Accelerated Science team for their continual support. We are also grateful to the entire team at Tri Alpha for giving us the opportunity to try our hand at optimization for this crucially important problem.

Machine learning can allow robots to acquire complex skills, such as grasping and opening doors. However, learning these skills requires us to manually program reward functions that the robots then attempt to optimize. In contrast, people can understand the goal of a task just from watching someone else do it, or simply by being told what the goal is. We can do this because we draw on our own prior knowledge about the world: when we see someone cut an apple, we understand that the goal is to produce two slices, regardless of what type of apple it is, or what kind of tool is used to cut it. Similarly, if we are told to pick up the apple, we understand which object we are to grab because we can ground the word “apple” in the environment: we know what it means.

These are semantic concepts: salient events like producing two slices, and object categories denoted by words such as “apple.” Can we teach robots to understand semantic concepts, so that they can follow simple commands specified through categorical labels or user-provided examples? In this post, we discuss some of our recent work on robotic learning that combines experience autonomously gathered by the robot, which is plentiful but lacks human-provided labels, with human-labeled data that allows a robot to understand semantics. We will describe how robots can use their experience to understand the salient events in a human-provided demonstration, mimic human movements despite the differences between human and robot bodies, and understand semantic categories, like “toy” and “pen,” to pick up objects based on user commands.

Understanding human demonstrations with deep visual features
In the first set of experiments, which appear in our paper Unsupervised Perceptual Rewards for Imitation Learning, our aim is to enable a robot to understand a task, such as opening a door, from seeing only a small number of unlabeled human demonstrations. By analyzing these demonstrations, the robot must understand what semantically salient event constitutes task success, and then use reinforcement learning to perform it.

Unsupervised learning on very small datasets is one of the most challenging scenarios in machine learning. To make this feasible, we use deep visual features from a large network trained for image recognition on ImageNet. Such features are known to be sensitive to semantic concepts, while maintaining invariance to nuisance variables such as appearance and lighting. We use these features to interpret user-provided demonstrations, and show that it is indeed possible to learn reward functions in an unsupervised fashion from a few demonstrations and without retraining.
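As a simplified illustration of the idea (the actual method also discovers intermediate steps of the task and selects the most discriminative features; the helper below is our own construction), a reward can be derived by comparing an observation’s pretrained deep features to those of the final demonstration frames:

```python
import numpy as np

def make_perceptual_reward(demo_frames, featurize):
    """Build a simple reward from unlabeled demonstration frames.

    featurize(frame) -> 1-D deep feature vector, e.g. activations from
    a network pre-trained for image recognition on ImageNet.
    """
    # Treat features of the last few demo frames as the "task done" state.
    goal = np.mean([featurize(f) for f in demo_frames[-5:]], axis=0)

    def reward(observation):
        # Higher reward the closer the observation's deep features are
        # to the goal features seen at the end of the demonstrations.
        return -np.linalg.norm(featurize(observation) - goal)

    return reward
```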

Example of reward functions learned solely from observation for the door opening tasks. Rewards progressively increase from zero to the maximum reward as a task is completed.

After learning a reward function from observation only, we use it to guide a robot to learn a door opening task, using only the images to evaluate the reward function. With the help of an initial kinesthetic demonstration that succeeds about 10% of the time, the robot learns to improve to 100% accuracy using the learned reward function.

Learning progression.

Emulating human movements with self-supervision and imitation
In Time-Contrastive Networks: Self-Supervised Learning from Multi-View Observation, we propose a novel approach to learn about the world from observation and demonstrate it through self-supervised pose imitation. Our approach relies primarily on co-occurrence in time and space for supervision: by training to distinguish frames from different times of a video, it learns to disentangle and organize reality into useful abstract representations.
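Concretely, this kind of supervision is typically implemented as a triplet loss over learned frame embeddings; the sketch below (with an illustrative margin value) pulls together frames that co-occur in time across views and pushes apart frames that look similar but come from different moments:

```python
import numpy as np

def time_contrastive_loss(anchor, positive, negative, margin=0.2):
    """Triplet loss for time-contrastive training.

    anchor:   embedding of a frame at time t (view 1)
    positive: embedding of the same moment t seen from another view
    negative: embedding of a nearby but different time from view 1
    """
    d_pos = np.sum((anchor - positive) ** 2)  # pull co-occurring frames together
    d_neg = np.sum((anchor - negative) ** 2)  # push temporal neighbors apart
    return max(0.0, d_pos - d_neg + margin)
```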

In a pose imitation task for example, different dimensions of the representation may encode for different joints of a human or robotic body. Rather than defining by hand a mapping between human and robot joints (which is ambiguous in the first place because of physiological differences), we let the robot learn to imitate in an end-to-end fashion. When our model is simultaneously trained on human and robot observations, it naturally discovers the correspondence between the two, even though no correspondence is provided. We thus obtain a robot that can imitate human poses without having ever been given a correspondence between humans and robots.

Self-supervised human pose imitation by a robot.

Striking evidence of the benefits of learning end-to-end is the many-to-one and highly nonlinear joint mapping shown above. In this example, the up-down motion involves many joints for the human, while only one joint is needed for the robot. We show that the robot discovered this highly complex mapping on its own, without any explicit human pose information.

Grasping with semantic object categories
The experiments above illustrate how a person can specify a goal for a robot through an example demonstration, in which case the robot must interpret the semantics of the task: the salient events and the relevant features of the pose. What if, instead of showing the task, the human simply wants to tell the robot what to do? This also requires the robot to understand semantics, in order to identify which objects in the world correspond to the semantic category specified by the user. In End-to-End Learning of Semantic Grasping, we study how a combination of manually labeled and autonomously collected data can be used to perform the task of semantic grasping, where the robot must pick up an object from a cluttered bin that matches a user-specified class label, such as “eraser” or “toy.”

In our semantic grasping setup, the robotic arm is tasked with picking up an object corresponding to a user-provided semantic category (e.g. Legos).

To learn how to perform semantic grasping, our robots first gather a large dataset of grasping data by autonomously attempting to pick up a large variety of objects, as detailed in our previous post and prior work. This data by itself can allow a robot to pick up objects, but doesn’t allow it to understand how to associate them with semantic labels. To enable an understanding of semantics, we again enlist a modest amount of human supervision. Each time a robot successfully grasps an object, it presents it to the camera in a canonical pose, as illustrated below.

The robot presents objects to the camera after grasping. These images can be used to label which object category was picked up.

A subset of these images is then labeled by human labelers. Since the presentation images show the object in a canonical pose, it is easy to then propagate these labels to the remaining presentation images by training a classifier on the labeled examples. The labeled presentation images then tell the robot which object was actually picked up, and it can associate this label, in hindsight, with the images that it observed while picking up that object from the bin.
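Here is a sketch of this hindsight labeling pipeline, with fit and predict as placeholders for whatever classifier is trained on the labeled presentation images (the structure is our own illustration, not the paper’s exact pipeline):

```python
def label_grasps_in_hindsight(episodes, human_labels, fit, predict):
    """Propagate a small set of human labels to every grasp episode.

    episodes: list of dicts, each with a 'presentation_image' (the
              canonical-pose shot taken after a successful grasp).
    human_labels: {episode_index: class_label} for a labeled subset.
    fit(images, labels) -> classifier; predict(clf, image) -> label.
    """
    labeled = sorted(human_labels)
    clf = fit([episodes[i]["presentation_image"] for i in labeled],
              [human_labels[i] for i in labeled])

    for i, episode in enumerate(episodes):
        # Presentation images are easy to classify (canonical pose); the
        # resulting label applies, in hindsight, to all the images the
        # robot observed while picking that object up from the bin.
        if i in human_labels:
            episode["label"] = human_labels[i]
        else:
            episode["label"] = predict(clf, episode["presentation_image"])
    return episodes
```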

Using this labeled dataset, we can then train a two-stream model that predicts which object will be grasped, conditioned on the current image and the actions that the robot might take. The two-stream model that we employ is inspired by the dorsal-ventral decomposition observed in the human visual cortex, where the ventral stream reasons about the semantic class of objects, while the dorsal stream reasons about the geometry of the grasp. Crucially, the ventral stream can incorporate auxiliary data consisting of labeled images of objects (not necessarily from the robot), while the dorsal stream can incorporate auxiliary grasping data that does not have semantic labels, allowing the entire system to be trained more effectively using larger amounts of heterogeneously labeled data. In this way, we can combine a limited amount of human labels with a large amount of autonomously collected robotic data to grasp objects based on the desired semantic category, as illustrated in the video below.
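Schematically, the two streams can be combined at decision time roughly as follows (a simplified sketch with hypothetical dorsal and ventral callables, not the paper’s exact factorization):

```python
def semantic_grasp_score(image, action, user_class, dorsal, ventral):
    """Score a candidate grasp action for a user-specified class.

    dorsal(image, action)  -> probability the grasp succeeds (geometry)
    ventral(image, action) -> dict of class probabilities for the object
                              that this action would pick up (semantics)
    """
    p_success = dorsal(image, action)             # "can I grasp here?"
    p_class = ventral(image, action)[user_class]  # "is it the right object?"
    return p_success * p_class

# The robot executes the highest-scoring candidate action, e.g.:
# best = max(actions, key=lambda a: semantic_grasp_score(img, a, "toy",
#                                                        dorsal, ventral))
```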

Future Work
Our experiments show how limited semantically labeled data can be combined with data that is collected and labeled automatically by the robots, in order to enable robots to understand events, object categories, and user demonstrations. In the future, we might imagine that robotic systems could be trained with a combination of user-annotated data and ever-increasing autonomously collected datasets, improving robotic capability and easing the engineering burden of designing autonomous robots. Furthermore, as robotic systems collect more and more automatically annotated data in the real world, this data can be used to improve not just robotic systems, but also systems for computer vision, speech recognition, and natural language processing that can all benefit from such large auxiliary data sources.

Of course, we are not the first to consider the intersection of robotics and semantics. Extensive prior work in natural language understanding, robotic perception, grasping, and imitation learning has considered how semantics and action can be combined in a robotic system. However, the experiments we discussed above might point the way to future work into combining self-supervised and human-labeled data in the context of autonomous robotic systems.

Acknowledgements
The research described in this post was performed by Pierre Sermanet, Kelvin Xu, Corey Lynch, Jasmine Hsu, Eric Jang, Sudheendra Vijayanarasimhan, Peter Pastor, Julian Ibarz, and Sergey Levine. We also thank Mrinal Kalakrishnan, Ali Yahya, and Yevgen Chebotar for developing the policy learning framework used for the door task, and John-Michael Burke for conducting experiments for semantic grasping.