1. INTRODUCTION

This article gives a brief overview of Rover, then focuses on our implementation of the human-robot interface utilizing the Intel® Perceptual Computing SDK for gesture and face detection. For a short introduction to Rover’s features, see the Intel® Developer Zone video from Game Developers Conference 2014 in San Francisco.

Until recently, robots were either confined behind the closed doors of large industrial manufacturing plants or demonized in movies such as Terminator, where they were depicted as destroyers of the human race. Both stereotypes contribute to an unfounded fear of self-operating machines losing control and harming the living. But now vacuum-cleaning and lawn-mowing robots, among others, are beginning a new trend: service robots as dedicated helpers in environments shared with humans. The miniaturization and cost-effective production of range and localization sensors on the one hand, and the ever-increasing compute power of modern processors on the other, enable the creation of smart, sensing robots for domestic use cases.

In the future, robots will require intelligent interactions with their environment, including adapting to human emotions. State-of-the-art hardware and software, such as the Intel Perceptual Computing SDK paired with the Creative* Interactive Gesture Camera, are paving the way for smarter, connected devices, toys, and domestic helpers [1, 2].

2. CUBOTIX ROVER

When Intel announced the Perceptual Computing Challenge in 2013, our team, Devy and Martin Wojtczyk, brainstormed possible use cases utilizing the Intel Perceptual Computing SDK. The combination of a USB-powered camera with an integrated depth sensor and an SDK that enables gesture recognition, face detection, and voice interaction led us to build an autonomous, mobile, gesture-controlled and sensing robot called Rover. We were very excited to be selected for an award [3]. Since then, we have launched the website http://www.cubotix.com with updates on Rover and are in the process of creating an open hardware community.

The Cubotix Rover is our attempt to use advanced robotic algorithms to transform off-the-shelf hardware into a smart home robot, capable of learning and understanding unknown environments without prior programming. Instead of unintuitive control panels, the robot is instructed through gestures, natural language, and even facial expressions. Advanced robotic algorithms make Rover location aware and enable it to plan collision-free paths.

2.1. Gesture Recognition

Hand gestures are a common form of communication among humans. Think of a police officer in the middle of a loud intersection in Times Square signaling stop with an open palm facing approaching traffic. Rover is equipped to recognize, respond to, and act on hand gestures captured through the 3D camera. You can set the robot in motion with a thumbs-up, and in response it will say “Let’s go!” The robot frowns when you gesture a thumbs-down. A high-five prompts Rover to crack a joke, such as “If I had arms, I would totally high-five you”. A peace sign prompts Rover to say “Peace”. These hand gestures and the resulting spoken responses are completely customizable and programmable.

2.2. Facial Recognition

Facial expression is perhaps the most revealing and honest means of communication. Recognizing these expressions and responding appropriately or inappropriately can mean the difference between forming a bond or a division with another human being. With artificial intelligence, the gap separating machines and humans can begin to close if robots are able to empathize. By capturing facial expressions through the camera, Rover can detect smiles or frowns and respond appropriately. Through its face detection algorithms, Rover knows when a human has come near and can greet them by saying “Hello my name is Rover. What’s your name?”, to which most people have responded just as they would to another human being, by saying “Hello I’m ________”. After initiating the conversation, Rover uses the Perceptual Computing SDK’s face analysis features to distinguish three possible states of the person in front of the camera (happy, sad, or neutral) and can respond with an appropriate empathetic expression: “Why are you sad today?” or “Glad to see you happy today!” Moreover, the SDK’s face recognition allows Rover to learn and distinguish between individuals for a personalized experience.

3. HARDWARE ARCHITECTURE

Figure 4: Rover's mobile LEGO* platform. Centrally located with glowing green buttons is the LEGO Mindstorms* EV3 microcontroller, which is connected to the servos that move the base. Also note the support structures and the locking mechanism to mount a laptop.

Rover uses widely accessible and affordable off-the-shelf hardware that many people may already own and can transform into a smart home robot. It consists of a mobile LEGO platform that carries a depth-camera and a laptop for perception, image processing, path-planning, and human-robot interaction. The LEGO Mindstorms* EV3 set is a great tool for rapid prototyping customized robot models. It includes a microcontroller, sensors, and three servos with encoders, which allow for easy calculation of travelled distances.
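As an illustration of the distance calculation mentioned above, the following sketch converts encoder counts into travelled distance. The wheel diameter (56 mm), the resolution of one encoder count per degree, and the two-wheel differential-drive layout are assumptions made for this example, not values taken from Rover, and all function names are hypothetical.

```cpp
#include <cmath>

// Assumed values for illustration only; measure the actual wheels before use.
constexpr double kPi = 3.14159265358979323846;
constexpr double kWheelDiameterMm = 56.0;       // assumption: wheel diameter
constexpr double kCountsPerRevolution = 360.0;  // assumption: one count per degree

// Distance travelled by one wheel, in millimetres, for a given encoder delta.
double wheelDistanceMm(long encoderCounts) {
    const double revolutions = encoderCounts / kCountsPerRevolution;
    return revolutions * kPi * kWheelDiameterMm;
}

// Forward distance of a differential-drive base: the mean of both wheels.
// Wheels turning in opposite directions (turning on the spot) cancel out.
double baseDistanceMm(long leftCounts, long rightCounts) {
    return 0.5 * (wheelDistanceMm(leftCounts) + wheelDistanceMm(rightCounts));
}
```

Under these assumptions, one full wheel revolution corresponds to the wheel circumference, roughly 176 mm.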

The Creative Interactive Gesture Camera attached to the EV3 contains a QVGA depth sensor and an HD RGB image sensor. The 0.5 ft to 3.5 ft operating range of the depth sensor allows for 3D perception of objects and obstacles in near range. It is powered solely by the USB port and doesn’t require an additional power supply, which makes it a good fit for mobile use on a robot. Rover’s laptop, an Ultrabook™ with an Intel® Core i7 processor and a touch screen, is mounted on top of the mobile LEGO platform and interfaces with the camera and the LEGO microcontroller. The laptop is powerful enough to perform face detection and gesture and speech recognition and to evaluate the depth images in soft real time to steer the robot and avoid obstacles. All depth images and encoder data from the servos are filtered and combined into a map, which the robot uses for indoor localization and collision-free path planning.
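To make the obstacle-avoidance idea concrete, here is a minimal sketch of a depth-based stop check. It is not Rover's actual obstacle detector: the frame layout (row-major depth values in millimetres), the central window, and the function name are assumptions for illustration. Only the operating range of roughly 0.5 ft to 3.5 ft comes from the text.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Sensor operating range in millimetres (0.5 ft and 3.5 ft, rounded).
constexpr std::uint16_t kMinRangeMm = 152;
constexpr std::uint16_t kMaxRangeMm = 1067;

// Returns true if any valid depth reading inside the central window of the
// frame is closer than stopDistanceMm. Readings outside the operating range
// (including 0, used here for "no data") are ignored.
bool obstacleAhead(const std::vector<std::uint16_t>& depthMm,
                   std::size_t width, std::size_t height,
                   std::uint16_t stopDistanceMm) {
    if (width == 0 || height == 0 || depthMm.size() != width * height) {
        return false;  // malformed frame: report nothing rather than guess
    }
    // Central window: the middle half of the rows and columns.
    for (std::size_t y = height / 4; y < height - height / 4; ++y) {
        for (std::size_t x = width / 4; x < width - width / 4; ++x) {
            const std::uint16_t d = depthMm[y * width + x];
            if (d < kMinRangeMm || d > kMaxRangeMm) continue;
            if (d < stopDistanceMm) return true;
        }
    }
    return false;
}
```

Restricting the check to a central window is one simple way to ignore objects beside the robot's path; a real detector would likely also filter noise across frames.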

Figure 6: Complete Rover assembly with the mobile LEGO* platform base, the Creative* Interactive Gesture Camera in the front and the laptop attached and locked in place.

4. SOFTWARE ARCHITECTURE

Figure 7: Rover's software architecture with most components for perception, a couple of planners, and a few application use cases. All of these building blocks run simultaneously in multiple threads and communicate with each other via messages. The green-tinted components utilize the Intel® Perceptual Computing SDK. All other modules are custom-built.

Rover’s control software is a multi-threaded application integrating a graphical user interface implemented in the cross-platform application framework Qt, a perception layer utilizing the Intel Perceptual Computing SDK, and custom-built planning, sensing, and hardware interface components. CMake*, a popular open-source build system, is used to find all necessary dependencies, configure the project, and create a Visual Studio* solution on Windows* [4, 5]. The application runs on an Ultrabook laptop running the Windows operating system and mounted directly on the mobile LEGO platform.

As shown in Figure 7, the application layer has three different use case components: the visible and audible Human-Robot Interface, an Exploration use case that lets Rover explore a new and unknown environment, and a smartphone remote control of the robot. The planning layer includes a collision-free path planner based on a learned map and a task planner that decides when the robot should move, explore, or interact with the user. A larger number of components form the perception layer, which is common for service robots because they have to sense their often unknown environments and respond safely to unexpected changes. Simultaneous Localization and Mapping (SLAM) and Obstacle Detection are custom-built and based on the depth images from the Perceptual Computing SDK, which also provides the functionality for gesture recognition, face detection, and speech recognition.

The following sections briefly cover the Human-Robot Interface and describe in more detail the implementation of gesture recognition and face detection for the robot.

4.1. User Interface

The human-robot interface of Rover is implemented as a Qt5 application [6]. Qt includes tools for window and widget creation and commonly used features, such as threads and futures for concurrent computations. The main window depicts a stylized face consisting of two buttons, one for the robot’s eyes and one for its mouth. Depending on the robot’s mood, the mouth forms a smile or a frown. When nobody interacts with the robot, it goes to sleep. When it detects a person in front of it, it wakes up and responds to gestures, which trigger actions. The robot’s main program launches several threads, each detecting a different Intel Perceptual Computing feature. It utilizes Qt’s central signal/slot mechanism for communication between objects and threads [7]. Qt’s future classes provide asynchronous speech output whenever the robot speaks [8].

4.2. Perception

The robot’s perception relies on the camera featuring a color and a depth sensor. The camera is interfaced through the SDK, which enables applications to easily integrate gesture, face, and speech recognition, as well as speech synthesis.

4.2.1. Gesture Recognition

Simple, easy-to-learn hand gestures, which are realized utilizing the SDK, trigger most of Rover’s actions. When a person shows a thumbs-up gesture, the robot will look happy, say “Let’s go!” and can start autonomous driving or another configured action. When the robot is shown a thumbs-down gesture, it will put on a sad face, vocalize its unhappiness, and, in its default configuration, stop mobile activities. When shown a high-five, it will crack a joke. Rover responds to all of the SDK’s default gestures, but here we will just focus on these three: thumbs-up, thumbs-down, and high-five.

Rover’s gesture recognition is implemented in a class GesturePipeline, which runs in a separate thread and is based on the class UtilPipeline from the SDK’s convenience library pxcutils and on QObject from the Qt framework. GesturePipeline implements the two virtual UtilPipeline functions OnGesture() and OnNewFrame() and emits a signal for each recognized gesture. The class also implements the two slots work() and cleanup(), which are required to move the pipeline into its own QThread. Therefore, the declaration of GesturePipeline is very simple and similar to the provided gesture sample [9, 10]:

Besides the empty default constructor and destructor, implementation in GesturePipeline.cpp is limited to the four methods mentioned above. The method work() is executed when the pipeline thread is started as a QThread object. It enables gesture processing from within UtilPipeline and runs its LoopFrames() method to process the camera’s images and recognize gestures in subsequent image frames. The implementation of work() is as follows:
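The original listings are not reproduced in this text. As a rough, library-free illustration of the structure described above, the sketch below mirrors the pattern without Qt or the SDK. PipelineStandIn is a hypothetical replacement for UtilPipeline that replays scripted frames, std::function members stand in for Qt signals, and the string gesture labels, EnableGesture(), SetScript(), and the signal names are assumptions, not the SDK's API.

```cpp
#include <functional>
#include <string>
#include <utility>
#include <vector>

// Hypothetical stand-in for the SDK pipeline base class: it replays a
// scripted list of frames instead of reading from a camera. An empty
// string means "no gesture in this frame".
class PipelineStandIn {
public:
    virtual ~PipelineStandIn() = default;
    void EnableGesture() { gestureEnabled_ = true; }
    void SetScript(std::vector<std::string> frames) { frames_ = std::move(frames); }
    // For each frame: report a gesture if present, then call OnNewFrame().
    // Stops when the script ends or OnNewFrame() returns false.
    void LoopFrames() {
        for (const std::string& label : frames_) {
            if (gestureEnabled_ && !label.empty()) OnGesture(label);
            if (!OnNewFrame()) break;
        }
    }
protected:
    virtual void OnGesture(const std::string& label) = 0;
    virtual bool OnNewFrame() = 0;
private:
    bool gestureEnabled_ = false;
    std::vector<std::string> frames_;
};

class GesturePipeline : public PipelineStandIn {
public:
    // Stand-ins for Qt signals, one per recognized gesture.
    std::function<void()> gestureThumbsUp;
    std::function<void()> gestureThumbsDown;
    std::function<void()> gestureHighFive;

    // Slot run when the worker thread starts: enable gestures, process frames.
    void work() { EnableGesture(); LoopFrames(); }
    // Slot run on shutdown: ends the frame loop at the next frame.
    void cleanup() { running_ = false; }
protected:
    void OnGesture(const std::string& label) override {
        if (label == "thumbs_up" && gestureThumbsUp) gestureThumbsUp();
        else if (label == "thumbs_down" && gestureThumbsDown) gestureThumbsDown();
        else if (label == "high_five" && gestureHighFive) gestureHighFive();
    }
    bool OnNewFrame() override { return running_; }
private:
    bool running_ = true;
};
```

In the Qt version described in the text, the std::function members would be declared as signals and work() and cleanup() as slots of a QObject subclass.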

The emitted Qt signals would have little effect if they weren’t connected to appropriate slots of the application’s main control thread MainWindowCtrl. MainWindowCtrl therefore declares a slot for each signal and implements the robot’s activities.

The implementation of the actions triggered by the above-mentioned gestures is fairly simple. The robot’s state variable is switched to RUNNING or STOPPED, and the robot’s mood is switched between HAPPY and SAD. Voice feedback is assigned accordingly and spoken asynchronously via SpeakAsync, a method utilizing the QFuture class of the Qt framework for asynchronous computation.
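A minimal sketch of these slots, assuming standard C++ in place of Qt: std::async and std::future stand in for QFuture, and "speaking" merely records the phrase. The slot names, the accessor methods, and the thumbs-down phrase are hypothetical. The state and mood names and the other two phrases come from the text.

```cpp
#include <future>
#include <mutex>
#include <string>
#include <vector>

enum class RobotState { STOPPED, RUNNING };
enum class Mood { HAPPY, SAD };

class MainWindowCtrl {
public:
    // Slots: each gesture updates state and mood and triggers voice feedback.
    void onThumbsUp() {
        state_ = RobotState::RUNNING;
        mood_ = Mood::HAPPY;
        SpeakAsync("Let's go!");
    }
    void onThumbsDown() {
        state_ = RobotState::STOPPED;
        mood_ = Mood::SAD;
        SpeakAsync("Oh no.");  // placeholder phrase, not from the original
    }
    void onHighFive() {
        SpeakAsync("If I had arms, I would totally high-five you");
    }

    // Starts "speaking" on another thread and returns immediately.
    void SpeakAsync(const std::string& text) {
        pending_.push_back(std::async(std::launch::async, [this, text] {
            std::lock_guard<std::mutex> lock(mutex_);
            spoken_.push_back(text);  // stand-in for speech synthesis
        }));
    }
    // Blocks until every queued phrase has been "spoken".
    void waitForSpeech() {
        for (auto& f : pending_) f.wait();
        pending_.clear();
    }
    std::vector<std::string> spoken() {
        std::lock_guard<std::mutex> lock(mutex_);
        return spoken_;
    }
    RobotState state() const { return state_; }
    Mood mood() const { return mood_; }

private:
    RobotState state_ = RobotState::STOPPED;
    Mood mood_ = Mood::HAPPY;
    std::mutex mutex_;
    std::vector<std::string> spoken_;
    // Declared last so pending tasks finish before the members they use die.
    std::vector<std::future<void>> pending_;
};
```

Keeping the returned futures matters here: a future obtained from std::async blocks in its destructor, so discarding it would make the call synchronous.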

The only missing piece between the signals of GesturePipeline and the slots of MainWindowCtrl is the setup procedure implemented in a QApplication object, which creates the GesturePipeline thread and the MainWindowCtrl object and connects the signals to the slots. The following listing shows how to create a QThread object, move the GesturePipeline to that thread, connect the thread’s start/stop signals to the pipeline’s work()/cleanup() methods and the gesture signals to the appropriate slots of the main thread.
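The original listing is not included here. The following sketch shows the same wiring idea with the C++ standard library only: a worker runs work() on its own thread, and its events reach the controlling thread through a thread-safe queue, which roughly approximates how Qt delivers signals across threads. EventQueue, GestureWorker, runAndCollect(), and the string events are all hypothetical names for this illustration.

```cpp
#include <atomic>
#include <condition_variable>
#include <mutex>
#include <queue>
#include <string>
#include <thread>
#include <utility>
#include <vector>

// Thread-safe queue: stand-in for cross-thread signal delivery.
class EventQueue {
public:
    void post(std::string event) {
        {
            std::lock_guard<std::mutex> lock(mutex_);
            events_.push(std::move(event));
        }
        ready_.notify_one();
    }
    // Blocks until an event is available, then returns it.
    std::string take() {
        std::unique_lock<std::mutex> lock(mutex_);
        ready_.wait(lock, [this] { return !events_.empty(); });
        std::string event = std::move(events_.front());
        events_.pop();
        return event;
    }
private:
    std::mutex mutex_;
    std::condition_variable ready_;
    std::queue<std::string> events_;
};

// Hypothetical worker: posts each scripted gesture, then a "finished" marker.
class GestureWorker {
public:
    GestureWorker(EventQueue& out, std::vector<std::string> script)
        : out_(out), script_(std::move(script)) {}
    void work() {
        for (const std::string& gesture : script_) {
            if (stop_.load()) break;
            out_.post(gesture);
        }
        out_.post("finished");
    }
    void cleanup() { stop_.store(true); }
private:
    EventQueue& out_;
    std::vector<std::string> script_;
    std::atomic<bool> stop_{false};
};

// Starts the worker thread, handles its events on the calling thread,
// then shuts the worker down and joins the thread.
std::vector<std::string> runAndCollect(std::vector<std::string> script) {
    EventQueue queue;
    GestureWorker worker(queue, std::move(script));
    std::thread thread([&worker] { worker.work(); });  // thread start -> work()
    std::vector<std::string> handled;
    for (;;) {
        std::string event = queue.take();
        if (event == "finished") break;
        handled.push_back(event);  // a real controller would call a slot here
    }
    worker.cleanup();
    thread.join();
    return handled;
}
```

In Qt, the queue and the dispatch loop are provided by the event loop of the receiving thread, so the application code only needs the connect calls described above.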

When Rover stands still and nobody interacts with it, it closes its eyes and goes to sleep. However, when a person shows up in front of the robot, Rover will wake up and greet them. This functionality is realized using the SDK’s face detector.

Face detection is implemented in a class FacePipeline that is structured very similarly to GesturePipeline and is based on the Face Detection sample in the SDK’s documentation [11]. It runs in a separate thread and is derived from the classes UtilPipeline and QObject. FacePipeline implements the virtual UtilPipeline function OnNewFrame() and emits one signal when at least one face is detected in the frame and another when no face is detected. It also implements the two slots work() and cleanup(), which are required to move the pipeline into its own QThread. Following is the declaration of FacePipeline:

The method OnNewFrame is called by UtilPipeline for every acquired frame. It queries the face analyzer module of the Intel Perceptual Computing SDK, counts the number of detected faces, and emits the appropriate signals.
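As a sketch of this per-frame logic, assuming plain C++ instead of Qt and the SDK: the queryFaces hook is a hypothetical replacement for the SDK's face analyzer query, FaceRect is an invented type, and the two std::function members stand in for the Qt signals.

```cpp
#include <cstddef>
#include <functional>
#include <vector>

struct FaceRect { int x, y, w, h; };  // hypothetical face bounding box

class FacePipeline {
public:
    // Stand-ins for the two Qt signals.
    std::function<void(std::size_t)> faceDetected;  // argument: face count
    std::function<void()> noFaceDetected;

    // Hypothetical hook that replaces the SDK face analyzer query.
    std::function<std::vector<FaceRect>()> queryFaces;

    // Called once per acquired frame. Counts the detected faces and emits
    // the matching signal. Returns true to keep the frame loop running.
    bool OnNewFrame() {
        const std::size_t count = queryFaces ? queryFaces().size() : 0;
        if (count > 0) {
            if (faceDetected) faceDetected(count);
        } else if (noFaceDetected) {
            noFaceDetected();
        }
        return true;
    }
};
```

Emitting exactly one of the two signals per frame keeps the receiving slots simple, since they never have to infer absence from a missing notification.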

The face detector slots update the robot’s sleep/awake state, its mood, and its program state. When no face is detected, a timer is launched that puts the robot to sleep unless the robot is carrying out a task. This keeps the methods easy to implement.
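One way to express this sleep logic is sketched below. The class and method names and the polling tick() are assumptions for illustration. A Qt implementation would more likely use a QTimer than explicit time stamps, which are used here only to keep the example deterministic.

```cpp
#include <chrono>

class SleepController {
public:
    using Clock = std::chrono::steady_clock;

    explicit SleepController(std::chrono::milliseconds timeout)
        : timeout_(timeout) {}

    // Slot: a face is visible, so wake up and cancel any pending timer.
    void onFaceDetected() {
        awake_ = true;
        timerRunning_ = false;
    }
    // Slot: no face in this frame. Starts the timer once; repeated calls
    // on later frames do not restart it.
    void onNoFaceDetected(Clock::time_point now) {
        if (!timerRunning_) {
            timerRunning_ = true;
            deadline_ = now + timeout_;
        }
    }
    void setBusy(bool busy) { busy_ = busy; }

    // Called periodically: falls asleep once the timer has expired,
    // unless the robot is carrying out a task.
    void tick(Clock::time_point now) {
        if (awake_ && timerRunning_ && !busy_ && now >= deadline_) {
            awake_ = false;
            timerRunning_ = false;
        }
    }
    bool awake() const { return awake_; }

private:
    std::chrono::milliseconds timeout_;
    Clock::time_point deadline_{};
    bool awake_ = false;  // the robot starts asleep
    bool timerRunning_ = false;
    bool busy_ = false;
};
```

If the timer expires while the robot is busy, it stays awake and falls asleep on the first tick after the task ends, provided no face has appeared in the meantime.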

Similar to the gesture recognizer, the main application creates the FacePipeline object, moves it into a Qt thread to run concurrently, and connects the face detector signals to the appropriate slots of the main control thread.

5. RESULTS

Based on our observations at recent exhibitions in the U.S. and Europe, including Mobile World Congress, Maker Faire, CeBIT, California Academy of Sciences, Robot Block Party, and Game Developers Conference, people are ready and excited to try interacting with a robot. Google’s official plunge into the world of artificial intelligence and robotics has inspired the general public to look deeper and pay attention to the future of robotics.

Figure 8: Rover at Mobile World Congress surrounded by a group of people.

Fear and apprehension have been replaced by curiosity and enthusiasm. Controlling a machine has predominantly been done through dedicated hardware, unintuitive control panels, and workstations. That boundary is dissolving now as humans can communicate with machines through natural, instinctual interactions thanks to advancing developments that allow localization and mapping and gesture and facial recognition. Visitors are astounded when they see they can control an autonomous mobile robot through hand gestures and facial expressions utilizing the Ultrabook, Intel Perceptual Computing SDK, and Creative Interactive Gesture Camera. We have encountered these responses across a very wide spectrum of people: young and old, men and women, domestic and international.

6. OUTLOOK

Unlike many consumer robots on the market today, Rover is capable of mapping out its environment without any external hardware like a remote control. It can independently localize specific rooms in a home like the kitchen, bathroom, and bedroom. If you’re at the office and need to check up on a sick child at home, you can simply command Rover to go to a specific room in your house without manually navigating it. Resembling a human, this robot has short- and long-term memory. Its long-term memory is stored in the form of a map that allows it to move independently. It can recognize and therefore maneuver around furniture, corners, and other architectural boundaries. Its short-term memory is capable of recognizing an object that unpredictably darts in front of the robot, prompting it to stop until the 3D camera no longer detects any obstacles in its path. We are looking forward to sharing further details about robot localization, mapping, and path-planning in future articles.

We see vast potential for the widespread use and adoption of Perceptual Computing technology. Professions and industries that embody the “human touch”, from healthcare to hospitality, may reap the most benefits from it. Fundamentally, as human beings we all seek to understand and be understood, and the best technologies are those that make life easier, more efficient, or enhanced in an impactful way. Simultaneous localization and mapping and gesture and facial recognition all working together blur the lines between humanity and machines, bringing us closer to the robots that can inhabit our realities and imaginations.

7. ABOUT THE AUTHORS

Figure 9: Devy and Martin Wojtczyk with Rover.

Devy Tan-Wojtczyk is co-founder of Cubotix. She brings over 10 years of business consulting experience with clients from UCLA, GE, Vodafone, Blue Cross of California, Roche, Cooking.com, and New York City Department for the Aging. She holds a BA in International Development Studies from UCLA and an MSW with a focus on Aging from Columbia University. For fun one weekend she led a newly formed cross-functional team consisting of an idea generator, two developers, and a designer in business and marketing efforts at the 48-hour HP Intel Social Good Hackathon, which resulted in a cash award in recognition of technology, innovation, and social impact. Devy was also competitively selected to attend Y Combinator's first ever Female Founders Conference.

Martin Wojtczyk is an award-winning software engineer and technology enthusiast. With his wife Devy he founded Cubotix (http://www.cubotix.com), a DIY community creating smart and affordable service robots for everybody. He graduated in computer science and earned his PhD (Dr. rer. nat.) in robotics from Technical University of Munich (TUM) in Germany after years of research in the R&D department of Bayer HealthCare in Berkeley. Speaking engagements include Google DevFest West, Mobile World Congress, Maker Faire, and many others in the international software engineering and robotics community. In the past 10 years he developed the full software stack for several industrial autonomous mobile service robots. He won multiple awards in global programming competitions, was recently featured on Makezine.com, and was recognized as an Intel Software Innovator.

Intel® RealSense™ Technology

First announced at CES 2014, Intel® RealSense™ technology is the new name and brand for what was Intel® Perceptual Computing technology, the intuitive user interface SDK with functions like speech recognition, gesture, hand and finger tracking, and facial recognition that Intel introduced in 2013. Intel RealSense Technology gives developers additional features including scanning, modifying, printing, and sharing in 3D plus major advances in augmented reality interfaces. With these new features, users can naturally manipulate scanned 3D objects using advanced hand- and finger-sensing technology.