This talk will discuss insights gathered over nearly thirty years of research on medical robotics and computer-integrated interventional medicine (CIIM), both at IBM and at Johns Hopkins University. The goal of this research has been the creation of a three-way partnership among physicians, technology, and information to improve treatment processes. CIIM systems combine innovative algorithms, robotic devices, imaging systems, sensors, and human-machine interfaces to work cooperatively with surgeons in the planning and execution of surgery and other interventional procedures. For individual patients, CIIM systems can enable less invasive, safer, and more cost-effective treatments. Because these systems can act as flight data recorders in the operating room, they can enable the use of statistical methods to improve treatment processes for future patients and to promote physician training. We will illustrate these themes with examples from our past and current work and will offer some thoughts about future research opportunities and system evolution.

Machine learning algorithms excel primarily in settings where an engineer can first reduce the problem to a particular function (e.g., an image classifier), and then collect a substantial number of labeled input-output pairs for that function. In stark contrast, humans are capable of learning from streams of raw sensory data with minimal external instruction. In this talk, I will argue that, in order to build intelligent systems that are as capable as humans, machine learning models should not be trained in the context of one particular application. Instead, we should be designing systems that can be versatile, can learn in unstructured settings without detailed human-provided labels, and can accomplish many tasks, all while processing high-dimensional sensory inputs. To do so, these systems must be able to actively explore and experiment, collecting data themselves rather than relying on detailed human labels.

My talk will focus on two key aspects of this goal: generalization and self-supervision. I will first show how we can move away from hand-designed, task-specific representations of a robot's environment by enabling the robot to learn high-capacity models, such as deep networks, for representing complex skills from raw pixels. Further, I will present an algorithm that learns deep models that can be rapidly adapted to different objects, new visual concepts, or varying environments, leading to versatile behaviors. Beyond such versatility, a hallmark of human intelligence is self-supervised learning. I will discuss how we can allow a robot to learn by playing with objects in the environment without any human supervision. From this experience, the robot can acquire a visual predictive model of the world that can be used for maneuvering many different objects to varying positions. In all settings, our experiments on simulated and real robot platforms demonstrate the ability to scale to complex, vision-based skills with novel objects.