KAGURA: Blending Gesture and Music into a Bold New UI

Thanks to a host of mainstream innovations, the world is increasingly interested in perceptual computing. Intel is encouraging developers to explore this new field through the Intel® Perceptual Computing Challenge. This year’s Phase 2 grand prize winner, KAGURA, is a music app that delivers a fascinating spin on music creation, blending sampling with a drag-and-drop interface and an inspired use of camera depth to differentiate user functions. KAGURA lets anyone make music with a wholly new, intuitive, game-like ease (Figure 1).

Figure 1: Hit it! The KAGURA interface lets users activate instrument sounds with their bodies or even by “striking” them with objects.

Shunsuke Nakamura and his team at Shikumi Design, a forward-thinking outfit founded in 2005, wanted to design a system where PC sensors and musical enjoyment might overlap. Nakamura explained, “We wanted to target people who cannot play musical instruments but are interested in performing music. By having these people move their bodies, it is possible to go beyond practicing music. We wanted to let them produce and arrange.” The end result, KAGURA, is a drag-and-drop marvel of motion-based music creation. The application comes with a host of ready-made instrument sounds that users can place on the interface as icons that overlay the user’s live image as seen by the host system’s camera. When the user passes a hand over the icon, or perhaps whacks it with a mallet, that sound gets played.

KAGURA Form and Function

There’s more going on in KAGURA’s UI, and more subtle refinement of that interface, than may meet the eye. The most innovative aspect of KAGURA, though, may be its use of distance as measured by the 3D-based Creative Interactive Gesture Camera. Volume is controlled through the user’s distance from the camera, and when the user’s hand is less than 50 cm (about 20 inches) from the lens, the program enters a special mode (Figure 2) that makes it appear as if the hand is reaching below a water surface—a novel and visually striking element.

Figure 2: Part of the genius in KAGURA’s interface is its separation of several controls into a “near-mode” that is reachable only by putting a hand close to the camera and through a virtual sheet of water.

Diving into Water

Shikumi wanted a way to differentiate between “near” and “far” controls. This approach would deliver a simpler, more intuitive interface and meet Nakamura’s objective to eliminate the need for a manual or lengthy feature descriptions. The team decided that 50 cm from the camera was a good distance at which to place a virtual barrier separating the two modes. Nakamura originally wanted a “film” effect, but his team had previously developed the physics computations for rendering water, so they took the seemingly easier route and recycled their prior work. However, it didn’t turn out to be that easy.

“Our programmer had actually given up, saying, ‘This is impossible,’” noted Nakamura. “But we decided to push forward a bit more with a final effort, and then we finally achieved success. With the water effects, the result felt very good. Our process steps were to perform image processing, divide the depth data value into two at 50 cm from the camera, create wave-by-wave equations, distort the image based on wave height, and add color to the part of wave closest to the player.”

The near- and far-mode paradigms were one way to work around the accuracy challenges of KAGURA. Using distance, Shikumi segregated playing from object manipulation to give the user full functionality from a single UI screen and greatly reduce unintentional user errors. However, making the water effect (Figure 3) look convincing was no simple task.

Figure 3: A considerable amount of graphical abstraction and computation went into creating KAGURA’s water effect.

Nakamura presented the following explanation of how the designers compute waves:

A wave equation with damping force and restoring force is expressed as follows:

where u is the displacement (Figure 4), D is the damping coefficient, K is the stiffness coefficient, and c is the wave velocity. By means of a Taylor expansion,

the following approximate expression can be obtained:

Similarly for x and y,

If we assume , we obtain the following expression:
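The equations referenced in this derivation are not reproduced in the text. The block below is one reconstruction consistent with the stated definitions of u, D, K, and c. It is a sketch under assumptions, not Nakamura's exact formulas: the standard form of the damped wave equation, a backward difference for the damping term, and unit steps in time and space are all assumed here.

```latex
% Damped wave equation with a restoring force (assumed standard form)
\frac{\partial^2 u}{\partial t^2}
  = c^2 \left( \frac{\partial^2 u}{\partial x^2}
             + \frac{\partial^2 u}{\partial y^2} \right)
  - D \frac{\partial u}{\partial t} - K u

% Taylor expansion in time
u(t \pm \Delta t) = u(t) \pm \Delta t \, \frac{\partial u}{\partial t}
  + \frac{\Delta t^2}{2} \, \frac{\partial^2 u}{\partial t^2} \pm \cdots

% Adding the two expansions cancels the odd terms
\frac{\partial^2 u}{\partial t^2}
  \approx \frac{u(t + \Delta t) - 2u(t) + u(t - \Delta t)}{\Delta t^2}

% Similarly for x and y
\frac{\partial^2 u}{\partial x^2}
  \approx \frac{u(x + \Delta x) - 2u(x) + u(x - \Delta x)}{\Delta x^2},
\qquad
\frac{\partial^2 u}{\partial y^2}
  \approx \frac{u(y + \Delta y) - 2u(y) + u(y - \Delta y)}{\Delta y^2}

% Assuming \Delta t = \Delta x = \Delta y = 1 and
% \partial u / \partial t \approx u^{n} - u^{n-1}
u^{n+1}_{i,j} = 2u^{n}_{i,j} - u^{n-1}_{i,j}
  + c^2 \left( u^{n}_{i+1,j} + u^{n}_{i-1,j} + u^{n}_{i,j+1} + u^{n}_{i,j-1}
             - 4u^{n}_{i,j} \right)
  - D \left( u^{n}_{i,j} - u^{n-1}_{i,j} \right) - K u^{n}_{i,j}
```

In this form each new displacement depends only on its own two previous values and its four neighbors, which matches the description in Figure 4.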

Equivalent code may be as follows:

Figure 4: Each displacement u is calculated by its past values and neighbor values.

Refraction

We assume the distortion is approximately proportional to the gradient of the wave, ∇u (Figure 5).

Figure 5: If the camera image is set at the back of the wave, as shown here, the camera image will look distorted.
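One way to realize this proportionality is to shift each pixel's sampling position by the local wave gradient. The sketch below assumes a single-channel image, central differences for the gradient, nearest-neighbor sampling, and a hypothetical `strength` factor (pixels of shift per unit gradient); none of these details are confirmed by the article.

```cpp
#include <algorithm>
#include <cmath>
#include <cstddef>
#include <cstdint>
#include <vector>

// Clamp an integer coordinate to the range [0, hi].
int clampTo(int v, int hi) { return std::max(0, std::min(v, hi)); }

// Distort `image` by the gradient of the wave height field `wave`.
// Both buffers are row-major with w*h samples.
std::vector<std::uint8_t> refract(const std::vector<std::uint8_t>& image,
                                  const std::vector<float>& wave,
                                  int w, int h, float strength)
{
    // Wave height lookup with coordinates clamped at the border.
    auto u = [&](int x, int y) {
        return wave[static_cast<std::size_t>(clampTo(y, h - 1)) * w + clampTo(x, w - 1)];
    };

    std::vector<std::uint8_t> out(image.size());
    for (int y = 0; y < h; ++y) {
        for (int x = 0; x < w; ++x) {
            // Central-difference gradient of the wave height.
            const float gx = 0.5f * (u(x + 1, y) - u(x - 1, y));
            const float gy = 0.5f * (u(x, y + 1) - u(x, y - 1));
            // Shift the sampling position in proportion to the gradient.
            const int sx = clampTo(static_cast<int>(std::lround(x + strength * gx)), w - 1);
            const int sy = clampTo(static_cast<int>(std::lround(y + strength * gy)), h - 1);
            out[static_cast<std::size_t>(y) * w + x] =
                image[static_cast<std::size_t>(sy) * w + sx];
        }
    }
    return out;
}
```

Where the wave is flat the gradient is zero and the image passes through unchanged, so only rippling areas appear distorted.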

Implementation

The steps to the image processing are as follows:

1. Get the depth image.

The depth image can be obtained through the Intel® Perceptual Computing SDK. Each pixel of the depth image gives the distance from the camera in millimeters. The depth image can be mapped to color coordinates using a UV map.

2. Binarize the depth image.

For each pixel src(i), if src(i) < threshold (for example, 500 for the 50 cm barrier, since depth is in millimeters), we assume its position is in the near-mode region and assign dst(i) = 255. If not, we assume its position is in the far-mode region and assign dst(i) = 0.

3. Dilate the binary image.

Apply dilation to close small holes caused by depth noise and to expand the near-mode region so that the region looks clear.

4. Apply the wave equation.

The wave image is a floating-point image whose pixel values are normalized to the range [-1, 1]. If the position is in the near-mode region, the output value is forced to the upper limit of 1. If the position is in the far-mode region, the output value is calculated by the wave equation discussed above.

5. Apply the refraction.

From the wave and camera images, we can obtain a refracted image, as shown in Figure 5.

6. Color the region.

Color the far-mode region as water (Figure 6) so that the near-mode region looks clear.

Figure 6: The process steps between camera input and final effect rendering.
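Steps 2, 3, and 6, plus the near-mode forcing in step 4, can be sketched with the standard library alone. This is a hedged illustration rather than the shipped code: the article says OpenCV supplied the basic image processing, and the function names, the 3x3 dilation kernel, the `Rgb` struct, and the blend factor `alpha` below are all assumptions. Sensor-specific invalid-depth values are not handled, and the wave update and refraction are left out to keep the sketch short.

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <vector>

using Mask = std::vector<std::uint8_t>;  // 255 = near-mode, 0 = far-mode

// Step 2: threshold a depth image given in millimeters.
Mask binarize(const std::vector<std::uint16_t>& depthMm, std::uint16_t thresholdMm)
{
    Mask dst(depthMm.size());
    for (std::size_t i = 0; i < depthMm.size(); ++i)
        dst[i] = depthMm[i] < thresholdMm ? 255 : 0;
    return dst;
}

// Step 3: 3x3 dilation; a pixel becomes near-mode if it or any neighbor is near-mode.
Mask dilate3x3(const Mask& src, int w, int h)
{
    Mask dst(src.size(), 0);
    for (int y = 0; y < h; ++y) {
        for (int x = 0; x < w; ++x) {
            bool hit = false;
            for (int dy = -1; dy <= 1 && !hit; ++dy) {
                for (int dx = -1; dx <= 1 && !hit; ++dx) {
                    const int nx = x + dx, ny = y + dy;
                    hit = nx >= 0 && nx < w && ny >= 0 && ny < h
                          && src[static_cast<std::size_t>(ny) * w + nx] != 0;
                }
            }
            dst[static_cast<std::size_t>(y) * w + x] = hit ? 255 : 0;
        }
    }
    return dst;
}

// Step 4 (forcing part): pin near-mode samples to 1 and keep the rest within [-1, 1].
void forceNearMode(std::vector<float>& wave, const Mask& mask)
{
    for (std::size_t i = 0; i < wave.size(); ++i)
        wave[i] = mask[i] ? 1.0f : std::max(-1.0f, std::min(1.0f, wave[i]));
}

struct Rgb { std::uint8_t r, g, b; };

// Step 6: blend a water color into far-mode pixels; near-mode pixels are left untouched.
// alpha is expected in [0, 1].
void tintFarMode(std::vector<Rgb>& image, const Mask& mask, Rgb water, float alpha)
{
    auto mix = [alpha](std::uint8_t a, std::uint8_t b) {
        return static_cast<std::uint8_t>(a + alpha * (b - a) + 0.5f);
    };
    for (std::size_t i = 0; i < image.size(); ++i) {
        if (!mask[i])
            image[i] = Rgb{mix(image[i].r, water.r), mix(image[i].g, water.g),
                           mix(image[i].b, water.b)};
    }
}
```

With depth in millimeters, calling `binarize(depth, 500)` corresponds to the 50 cm barrier. Pinning near-mode samples to 1 on every frame means a hand crossing the barrier keeps disturbing the surrounding water, which then ripples outward through the wave update.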

Challenges Addressed During Development

Nakamura notes that his team “did not struggle much” when creating KAGURA; however, they did struggle with interface accuracy. Specifically, capturing an on-screen icon originally proved difficult because tracking a fingertip through a camera, in a virtual space located several feet away, was imprecise at best. Swinging to the other extreme by requiring much less accuracy would result in users unintentionally grabbing icons.

Naturally, the team sought to find a good compromise, but ultimately they had to decide on which side to err: precision or occasional unintended grabs? After much user testing, Nakamura finally opted for the latter. “We thought people would feel less stress with unintentional actions than intending to perform an action and failing at it.”

This decision yielded an interesting revelation: The object was not to eliminate errors for the user. Rather, the user needed to have an understanding of the intended outcome for an enjoyable experience. So long as the UI conveyed that an icon drag-and-drop should be possible, the user would be content with more than one attempt, provided that success was soon achieved.

Lessons Learned and Looking Ahead

Not surprisingly, the top lesson Nakamura passes on to aspiring gesture developers is to strive for more of a “general sense” than true gesture precision. “There will be constraints involving processing if precise actions are taken,” he said. “Rather than being accurate, it is important to be able to communicate what gesture is desired. Also, developers need to understand that this communication is part of the entertainment.”

Nakamura maintains that the touch screen model is no longer sufficient for some applications. With the advent of affordable 3D cameras and perceptual computing, designers need to start building depth into their models. Depth presents an additional stream of information that should be utilized to enable new functionality and improve experiences whenever possible. Thus, the image quality of the camera matters. As camera resolution and sensor quality improve, Nakamura expects developers to have more control and flexibility in their designs. Until then, the burden of finding workarounds remains on developers’ shoulders.

As for KAGURA, Nakamura would like to see the program grow richer, both in what it offers and the ways in which users can customize it. Currently, KAGURA offers only four musical background styles; this could be easily expanded. Similarly, users might be able to import their own plug-in instruments. Even the interface might change to allow users to bring their own background artwork.

Shikumi will continue to explore and grow its business with perceptual computing. With a grand-prize-winning application on their hands, Nakamura and his company are off to an inspiring start that he hopes will beckon many others to follow.

Resources and Tools

Shikumi made extensive use of the Intel® C++ Compiler (part of Intel® C++ Studio XE) and the Intel® Perceptual Computing SDK, which the developers found to be fast and optimal for the job. The base system of the program, such as device I/O, image processing, and sound processing, was written in C++. More internal elements, such as graphics, sounds, motion, effects, and user interactions, were written in Lua. The OpenCV library supplied basic image processing while OpenGL* served for graphics drawing.

Nakamura added, “When we use image processing outside of OpenCV, we use C++ for the basics because processing speed is required. When adjustment is necessary, we export the C++-based module to Lua and describe in Lua.”

About Shikumi Design and Shunsuke Nakamura

Shunsuke Nakamura is the founder and director of Shikumi Design, Inc., a software developer focused on design and interactive technology. Nakamura received his PhD in Art and Technology (Applied Art Engineering) from the Kyushu Institute of Design in 2004. That same year, he became an associate professor at the Kyushu Institute of Technology, where he continues to teach today. In 2005, Nakamura started Shikumi Design, bringing several of his junior researchers from the school. The group began winning industry awards in 2009, but taking the grand prize in Intel’s Perceptual Computing Challenge against 2,800 competitors across 16 countries marks their greatest competitive achievement so far.

Intel® RealSense™ Technology

Developers around the world are learning more about Intel® RealSense™ technology. Announced at CES 2014, Intel RealSense technology is the new name and brand for what was Intel® Perceptual Computing technology, the intuitive user interface SDK with functions such as gesture and voice that Intel brought to market in 2013. With Intel RealSense technology, users gain additional features, including the ability to scan, modify, print, and share in 3D, and the technology offers major advances in augmented reality interfaces. These new features will yield games and applications where users can naturally manipulate and play with scanned 3D objects using advanced hand- and finger-sensing technology.