Thursday, December 9, 2010

The natural user interface space has been very busy recently. Last month, after Microsoft launched Kinect for Xbox, I organized people interested in using Kinect on PC and founded the OpenKinect community. Since then the community has grown significantly, and we’ve seen some amazing tech demos and applications using libfreenect, the open source Kinect for PC drivers developed by OpenKinect community members.

A few weeks ago, I was contacted by PrimeSense co-founder Tamir Berliner, who wanted to discuss their plans and ways PrimeSense could engage with the community. (Microsoft licensed PrimeSense’s technology and based Kinect on the PrimeSense reference design.) After seeing the dozens of OpenKinect videos and the community enthusiasm, PrimeSense decided to move up their plans: yesterday they announced the OpenNI initiative and open sourced their middleware for natural interaction and the drivers for their depth cameras.

Tamir graciously agreed to let me interview him for my blog.

Joshua Blake: Thanks for agreeing to answer my questions. First, could you please give us a brief history of PrimeSense and its role in the natural interaction space?

Tamir Berliner: Thanks for having me :)

PrimeSense started from the founders’ understanding that we were bored with standard gameplay. We felt that computer games didn’t innovate any more but rather mainly used a different story to differentiate. We felt it was time for a change, and, understanding where natural interaction is headed, we decided to take the first step towards that vision in the games domain, with a 3D camera that was built for that purpose and at the right price point.

Josh: What is your role within PrimeSense?

Tamir: It’s been changing all along over the past 5 years. The main idea is to do whatever it takes to make sure the vision is being fulfilled and that we keep on striving towards Natural Interaction. In the past 5 years I’ve been the head of an R&D team and the first software product manager, I created our partners program, and now I’m running this amazing effort of contributing our best and latest to the developer community so you guys can do what we didn’t even start to imagine.

Josh: I know that any technology takes a long time to mature. For example, Microsoft Research’s Bill Buxton has been doing multi-touch research since before the Apple Macintosh came out in 1984, and there were others before him. Multi-touch has only become mainstream in the last few years. How long has research been going on in the motion tracking and pose estimation space?

Tamir: Over 20 years if you count the first time image processing was done in order to extract a human skeleton from an image, or the usage of two cameras to reconstruct a 3D image.

Josh: How long has PrimeSense been involved in this field? What have you guys been up to that has prepared you for today, when we are seeing mass adoption of motion tracking and natural user interfaces in the home?

Tamir: We started ‘running’ in Sept 2005. It took us 5 years to get the first product that uses our technology onto the shelves. From day one, our goal was computer games. We actually had first trials of the skeleton running in early 2006, which helped us sell the technology, explain what’s possible, and set up all the different parameters so that they best fit usage in the living room; we even made some tradeoffs to tailor it to computer gaming.

Josh: I have heard several people make the comment that what can be done with depth sensors has been done before with single RGB cameras and computer vision techniques. Many of those approaches require controlled environments and are not robust enough to use in any random living room, for example. How do depth sensors and PrimeSense technology change the game as far as the interfaces and interactions that are possible now? How does hardware cost play into the equation?

Tamir: The main problem PrimeSense has solved is building a robust, real-time, low-cost device that generates depth maps and can be configured to create a skeleton on a low-end PC. Our middleware takes less than one Atom CPU and provides a great skeleton model for two people in real time. It also allows you to control a fully fledged media center from 15ft away with mere hand gestures, which we keep putting through user testing so that everyone can use them at home.

Josh: We touched on pose estimation before. When thinking about human-computer interaction, what can we do with a skeleton model of humans in a real space that we cannot do with just raw depth data or point clouds?

Tamir: The raw data is great for testing and rapid prototyping of basic interaction. The main issue is that it usually requires the user to be too aware of the algorithms. Take, for example, the virtual touch screens that we’ve seen: you have to know exactly where to stand and how far your hand needs to reach. These cannot work in the living room or in standard home environments, as you will just walk through the virtual screen, interacting with the TV without wanting to. Also, imagine needing to get up every time you want to flip channels or change the volume. Another example is all the different ‘trainers’. Whether it’s titles like Dance Central or Your Shape, those cannot be made with just raw data; you have to know where the user is and what exactly he/she is doing.
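
To make the virtual touch screen point concrete, here is a minimal sketch of how such a screen might be built from raw depth data alone. It is only an illustration under stated assumptions: the function names, the depth frame layout (a 2D list of millimetre distances, with 0 meaning no reading), and the plane and thickness values are all hypothetical, and none of this is PrimeSense's or OpenKinect's actual code.

```python
# Hypothetical sketch: a "virtual touch screen" built on raw depth data only.
# depth_mm is assumed to be a 2D list of per-pixel distances in millimetres,
# where 0 means the sensor returned no reading for that pixel.

def touched_pixels(depth_mm, plane_mm, thickness_mm=50):
    """Return (row, col) for every pixel inside a thin slab ending at plane_mm."""
    near = plane_mm - thickness_mm
    hits = []
    for r, row in enumerate(depth_mm):
        for c, d in enumerate(row):
            if d > 0 and near <= d <= plane_mm:
                hits.append((r, c))
    return hits


def touch_point(depth_mm, plane_mm, thickness_mm=50, min_pixels=3):
    """Centroid (row, col) of the touching pixels, or None if too few touch."""
    hits = touched_pixels(depth_mm, plane_mm, thickness_mm)
    if len(hits) < min_pixels:
        return None
    row_mean = sum(r for r, _ in hits) / len(hits)
    col_mean = sum(c for _, c in hits) / len(hits)
    return (row_mean, col_mean)


if __name__ == "__main__":
    background = 2500  # wall roughly 2.5 m away
    frame = [[background] * 5 for _ in range(3)]
    for col in (1, 2, 3):
        frame[1][col] = 980  # a hand reaching into the slab at about 1 m
    print(touch_point(frame, plane_mm=1000))
```

Nothing in this sketch knows whether the pixels in the slab belong to a hand or to someone walking past, which is the limitation described above: the user has to stand in the right place and reach the right distance, and anything crossing the plane counts as a touch.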

Josh: You have described PrimeSense technology as enabling Natural Interaction. What does the phrase mean and where is it heading?

Tamir: The goal of Natural Interaction, from my point of view, would be to be able to take any person (even those who have never used a technological system) and have them be able to use it without any training. I think that Kinect and iPhone are both great steps in that direction, but we’ve got a lot to do both on the hardware side and on the software side. Merging the interfaces between all the different devices that you have and making them “know you” is a key step and requires tons of work. I believe this revolution will take at least 10 to 15 years till we can look back and say ‘what were we thinking with those mice and keyboards?!’

Josh: Is the future of natural interaction and 3D cameras limited to games and entertainment or will it see adoption in other fields as well?

Tamir: Natural Interaction is already coming at us from all different directions. You can see it with Wii, iPhone, and even in public restrooms, where the water tap understands you want to wash your hands and starts the water flow automatically. The more we see Natural Interaction being introduced into the market, the more you will see everyone using technology. The keyboard and mouse seem ‘normal’ today, but if you come to think of it, they are as natural to a human as a leash is to a dog.

3D cameras are of course one of the key enablers of Natural Interaction, just as speech and touch are, and as such I foresee this technology being incorporated in many more technological devices in different forms and as different input paradigms. We can see great examples from the community of what’s possible, and I believe this is merely the beginning. Another very important area where 3D cameras are coming into play is providing technology with “eyes”. Just as providing phones with accelerometers and GPS was a big step, so will this be, and some of the main targets will be technologies that need to move and navigate in the real world.

Josh: We’re already seeing a lot of early videos of these types of applications come out of the OpenKinect community. What was the reaction inside PrimeSense like to the really creative and diverse ways that people are using the Kinect depth sensor? I heard it caught you by surprise.

Tamir: The reaction was ‘Wow’. After 5 years of meeting with people and trying to explain what is possible, you finally see people who not only do things you’ve thought of, but do much more and create amazing applications that we didn’t even consider! It’s mind blowing to see that creativity is far greater when it’s in the hands of a community, and it’s one of the main reasons why PrimeSense has decided to provide the community with the work that it has invested a lot into. I don’t know if the people in the community will like what we’ve got to offer, but what led us to make the decision was that we’ve seen people starting to tackle issues that we’ve been working on for years now, so it would almost be a ‘sin’ not to contribute from our experience.

Josh: Clearly there is a quickly growing community that is extremely enthusiastic about the possibilities unlocked by this type of technology. (The OpenKinect mailing list has grown to over 1250 members in just four weeks.) Even so, Natural Interaction is still new and we are still at the beginning of the Natural User Interface revolution. What are the biggest challenges that we will see to adoption of Natural Interaction across the industry?

Tamir: I believe that the value of Natural Interaction is great, but only when it’s implemented properly. Only a small set of companies have been able to really create devices that both are worth the money and have provided a real Natural Interaction leap. As I mentioned, I see companies like Microsoft with the Kinect, Apple with the iPhone and Nintendo with the Wii as being the first, and I can only hope (as a consumer) to see everyone in the world understanding that this is merely the beginning. The number one issue for all of the devices is to simulate the interfaces that we already know from the ‘real world’. Wii did it with motion, iPhone did it with touch, and Kinect has taken it to a whole new level of full body integration into games and media control, with the great addition of speech detection.

Josh: This leads us to the OpenNI initiative. Tell us about Open Natural Interaction, who is involved, and the role you see it playing.

Tamir: I believe OpenNI to be a major key player in Natural Interaction. As I’ve mentioned, Natural Interaction consists of hardware devices that we’ll see coming from a variety of CE [Consumer Electronics] vendors, and software modules that will be developed by many different software providers. Having all of those comply with one standard ensures that any application that is developed can deploy on a large number of systems with an install base that is not fragmented. As the first step we’ve worked with some of our partners to create OpenNI, and at this point we’ve decided to launch it a bit earlier than expected due to the large number of developers already working. Our initial goal was to launch the OpenNI standard a few months from now. Many industry leaders are interested, and when they are ready they will join us in the management of the OpenNI consortium.

Josh: Along with OpenNI, PrimeSense released a whole suite of software, including some open source. What can the public do with this software, and how would you like to see the community engage with OpenNI? Are you open to people contributing to the software projects or extending it with new capabilities?

Tamir: The software is divided into two main parts, the first being the OpenNI standard, which is mainly the APIs to devices and image processing algorithms. Extending OpenNI and submitting changes is welcome. For the next versions, the OpenNI consortium will publish what will be included prior to closing up the version, so everyone can suggest introducing their changes to the official release.

The second part is NITE, which is PrimeSense’s middleware for its SoC (System on Chip) and includes a UI framework for controlling the TV from 15ft away without the need to start searching for the remote. This framework takes a bit of time to learn to use but provides great power to the developers, as it provides them with exactly what they need – events that are generated according to the user’s motions but that are still tailor made for their applications.

The other part NITE includes is the skeleton image processing module that enables the application developer to receive the joints of the user that initiated interaction (the way to start interaction is by standing with your hands to the sides in a ‘freeze’ position). After that you have a skeleton that sticks to you. In order to contribute even further we’ve also provided the “scene analyzer”, which segments the users from the background so you actually can enumerate the depth map and get the number of users. You can see both the Skeleton and Scene Analyzer in these videos on YouTube.

Josh: Those videos are awesome. I’m just imagining what I can do with those capabilities. Is there anything else you’d like to share or tell us about?

Tamir: Sure – I believe that until today the devices we’ve been using have made us learn a great deal about them before we could use them and gain their value. I’m pretty sure everyone who is reading this has at least 3 remotes sitting on their living room table, and at least once a week needs to help someone use their computer/media center/phone/etc. It’s time for that to change, and it’s up to us, the technologists, to make this revolution happen. It’s time for the devices to take the step of understanding what we want and making sure we get it, even without asking, when it’s a task as trivial as opening a door when we approach, turning off the lights when we leave the room, or even making sure we have hot water to shower with when we return from work or wake up in the morning, depending on what we normally do.

Having said that, we need to remember that localization now is not only changing the written language; it’s becoming the gestures that people are used to and the accent in speech detection. Even the difference between when a user wants to control and when they don’t is not so trivial anymore, as it’s not just “press the button”… It’s up to us to make sure that the users enjoy a new set of very compelling experiences that don’t break. It’s a huge, new, growing market, and I’d encourage anyone with technological skills to jump in and enjoy the creation of this new era.

Josh: I’m very excited about all of this. Thanks again for answering these questions and I’m looking forward to working with you to make some amazing natural user interfaces and natural interactions.