VR Interactivity

The VR community, I’ve noticed, uses the word “interactivity” to mean several distinctly different things, each with different user experiences, different technologies, and different associated costs. If you understand these distinctions and you’re involved in VR, you will make better-informed decisions. This is especially true for understanding camera-based VR and how it differs from model-based VR.

These distinctions are described here. I’m not committed to the actual terms, only to the concepts behind them, so if you have better words, I welcome them. The big goal here, as with the last post on VR cinematography studies, is to help the VR community develop a common language.

OK. Here goes . . .

Navigational Interactivity — You are a Ghost

You are a ghost immersed in a VR experience. You can see and hear but cannot affect anything. Of course, this is the case in conventional cinema but it’s much more intense and uncanny in VR. There are two distinct levels of navigation in VR.

1. Rotational Navigation — “Looking Around”

You are a ghost anchored in your seat and you can only turn your head to look around. Such is the essence of “360 VR video” and it requires knowing your head’s orientation (pan, tilt, and roll), typically from sensors in the VR headset. But it’s not that simple. For one thing, our eyes (and we have two) are offset from the axes of rotation of our head and neck, so looking around a monoscopic 360 video in VR is an approximation. Monoscopic 360 video camera configurations and stitching are theoretically easier than stereoscopic 360 video camera configurations and stitching, but miss the awesomeness of 3D where it matters most, in the close-up, intimate zone.

I’ve been softening on the need for stereo 360 video the more I experience wonderful mono 360 VR work made cheap by folks in the creative community. I have not been softening on how many poorly-designed mono 360 camera configurations are out there resulting in unnecessarily expensive stitching.

Looking around in stereoscopic 3D requires special, more complicated cameras and stitching, and companies like Jaunt, Google/GoPro, Facebook, and presumably Nokia and Samsung, have cloud-based “assemblers”. It would be a win/win with the creative community if these assemblers were open to DIY hacks, like cleverly shooting multiple passes with fewer, cheaper cameras.

Rotational navigation can be represented by one (if monoscopic) or two (if stereoscopic) standard flat 2D frames. Typically these are stored as 360 degree by 180 degree “equirectangular” images. As such, they can be edited and manipulated using standard photo or video software tools rather than 3D modeling tools.
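To make the “flat 2D frame” point concrete, here is a minimal sketch of how a look direction could be mapped to a pixel in a 360 by 180 degree equirectangular image. The function name and the angle conventions (yaw 0 at the image center, pitch +90 at the top row) are my assumptions for illustration; real players and file formats may use different conventions.

```python
def direction_to_equirect_pixel(yaw_deg, pitch_deg, width, height):
    """Map a view direction to (column, row) in an equirectangular frame.

    Assumed convention (not from any particular player or format):
    yaw 0 is the image center and positive yaw turns right;
    pitch +90 is straight up (top row), pitch -90 is straight down.
    """
    # Wrap yaw into [-180, 180) so any heading lands inside the frame.
    yaw = (yaw_deg + 180.0) % 360.0 - 180.0
    # Clamp pitch to the valid range of the sphere.
    pitch = max(-90.0, min(90.0, pitch_deg))

    u = (yaw + 180.0) / 360.0   # 0..1 across the width
    v = (90.0 - pitch) / 180.0  # 0..1 down the height

    # Keep the indices inside the image at the far edges.
    col = min(int(u * width), width - 1)
    row = min(int(v * height), height - 1)
    return col, row
```

Because the mapping is just two linear scalings, cropping, color grading, or painting on the frame with ordinary 2D tools carries straight through to what the viewer sees when looking around.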

2. Positional Navigation — “Moving Around”

One of the first battle cries from the gamer community over camera-based VR was that you can’t sway your head back and forth and see around things. High-end VR headsets like the Rift and Vive have sensors for your head’s position in space (x, y, z), currently via external hardware. From the gamer community perspective, it’s a no-brainer to provide “sway around” imagery given a 3D computer graphic model.

“Position tracking”, a hot-phrase in VR, refers at least to being able to sway around and see properly corresponding points of view. For VR camera configurations, such imagery can be made computationally, but a good rule of thumb is that the amount of sway is limited to the diameter of the camera configuration. The more cameras in the configuration, the better the quality (that is, fewer artifacts from missing zones), but computing a viewpoint outside of this diameter requires “making things up” since no data exists.
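The rule of thumb can be sketched as a simple clamp on the tracked head position. This assumes, purely for illustration, that the usable viewing volume is a sphere centered on the camera configuration with the same diameter; the function name and that spherical simplification are mine, not a description of any shipping system.

```python
import math

def clamp_head_offset(offset_xyz, rig_diameter_m):
    """Keep a head offset (meters from rig center) inside the rig's diameter.

    Returns (clamped_offset, inside), where `inside` is False when the
    requested viewpoint fell outside the captured volume and had to be
    pulled back to its edge.
    """
    radius = rig_diameter_m / 2.0
    dist = math.sqrt(sum(c * c for c in offset_xyz))
    if dist <= radius:
        return tuple(offset_xyz), True
    # Outside the rig: no captured data exists there, so pull the
    # viewpoint back to the boundary instead of "making things up".
    scale = radius / dist
    return tuple(c * scale for c in offset_xyz), False
```

A small lean inside a 30 cm rig passes through unchanged, while a half-meter step is pulled back to 15 cm from center.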

The jury’s out about how much VR will be from a seat (perhaps a swivel chair, perhaps not), or room-scale (as exemplified by the Vive), or even around seemingly-unconfined spaces (large-scale VR like The Void, Zero Latency, or Atari founder Nolan Bushnell’s just-announced Modal VR). For anything greater than swaying in a seat, VR from a single stationary camera configuration won’t cut it.

“Volumetric video”, another hot-phrase in VR, is when video from one or more cameras is expanded from 2D pixels to 3D “point clouds” using either computation (like deriving depth by comparing the different points of view) or additional data (like from laser or infrared emitters, as initially popularized by the Microsoft Kinect and more recently, by Google’s Project Tango). One can “sweep” with one camera over time but may need to deal with time artifacts, or can “flash” with multiple cameras from different angles simultaneously, obviously a more expensive proposition. Either way, whatever you don’t flash or sweep (like under the table) results in missing information, or occlusions.
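As a rough sketch of the “computation” route, the snippet below turns a stereo disparity map into a point cloud using the textbook pinhole relation depth = focal length × baseline / disparity. The function names, the assumption that the principal point sits at the image center, and the plain nested-list input are all mine for illustration; production pipelines would use calibrated cameras and a library such as OpenCV.

```python
def depth_from_disparity(disparity_px, focal_px, baseline_m):
    """Depth in meters for one pixel, or None where no match was found."""
    if disparity_px <= 0:
        return None
    return focal_px * baseline_m / disparity_px

def pixel_to_point(col, row, depth_m, focal_px, cx, cy):
    """Unproject one pixel with known depth into camera-space (x, y, z)."""
    x = (col - cx) * depth_m / focal_px
    y = (row - cy) * depth_m / focal_px
    return (x, y, depth_m)

def disparity_map_to_point_cloud(disparity, focal_px, baseline_m):
    """Convert a 2D list of disparities (pixels) into a list of 3D points."""
    height, width = len(disparity), len(disparity[0])
    # Assumption: the principal point is the center of the image.
    cx, cy = (width - 1) / 2.0, (height - 1) / 2.0
    points = []
    for row, line in enumerate(disparity):
        for col, d in enumerate(line):
            z = depth_from_disparity(d, focal_px, baseline_m)
            if z is None:
                continue  # unmatched pixel: a hole in the cloud
            points.append(pixel_to_point(col, row, z, focal_px, cx, cy))
    return points
```

Pixels with no valid disparity simply produce no point, which is the missing information described above.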

“Light fields” is a specific volumetric video technique which, among other features, allows focus to change after the fact. Light field cameras, whether for panoramic VR or for conventional cinema, require multiple lenses or lens arrays, plus lots and lots of data to be captured and processed. Lytro currently makes the most prominent light field video cameras. They require a refrigerator-size processing computer. The results are impressive.

Unfortunately, “light fields”, as a term, is going the way of “hologram”, in that it’s used for anything “light fields-like” rather than its precise technical definition. I should maybe let this one go: most “filmmakers” today have never touched film.

Volumetric video allows relatively free moving around but is still nothing but a bunch of points in space. There is no “chair” you can move, no “ball” you can throw, and no way to blow things up, because there are no “things”, only points. Turning the points into things is an image recognition problem, a hard and classic AI one, with many banking on big data and deep learning to solve it.

Transformational Interactivity — Changing the Outcome

Perhaps the single biggest distinction between “cinema” and “games” is whether or not the user can change or transform the outcome, and it has deep roots in Hollywood versus Silicon Valley cultures. This is hardly news — I wrote about it in 1990 and 2004 — but it’s playing out in the day-by-day and dollar-by-dollar decisions being made in VR today. There are two distinct levels of transforming an outcome.

3. Predetermined Transformation — a New Dawn for Branching Movies?

Don’t laugh, really. Branching movies have mostly been a cheap gimmick, I think largely due to limited database sizes. Like the “interactive divorce” documentary with 15 minutes of “he said” and 15 minutes of “she said” which you can select blow by blow. WTF? You can watch the whole thing in a half hour, so why bother with interactivity? But consider that Buckminster Fuller claimed he could say everything he knew in 42 hours, and in 1975 he did, on video. Or ethnomusicologist Alan Lomax’s vision of a Global Jukebox of song and dance from around the world, thousands of hours’ worth. Or fundamentally new forms, ranging from massive variations of performance, perhaps via spontaneous method acting, to community-based group story-telling.

VR may provide a clean slate for creative exploration of interactive branching. We already see this in simple menu selection. Can the granularity be finer? Can short clips be tagged with metadata that allow users to interactively move through a narrative? Can a panoramic scene be digitally overpopulated with each person or object being a link to another scene?

Then there’s illusion, and the illusion of control. The story of the “world’s first interactive movie” is noteworthy. Made by Czech filmmaker Radúz Činčera for Expo ’67 Montreal, each seat in the theater had two buttons in front, one red and one green. Audience members could press either at the end of each scene, when a live moderator walked on stage and presented the next two options as storyboard-like still frames. The seats were numbered and the screen was surrounded with the correspondingly numbered red and green lights for everyone to see. No rigging the vote. But the rigging was in the story, which Činčera cleverly wrote such that each scene, no matter which option was chosen, would end with the same next two options. So rather than an exponential 2–4–8–16–32 scenes needed, the “branching” was 2–2–2–2–2, and the projectionist in the booth would simply swap the lens cap and audio feed from two projectors running in sync. “It was a comedy”, Činčera told me many years later.
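The arithmetic behind the trick is worth spelling out. The sketch below (function names are mine) counts the scenes that must be produced at each decision point for a true branching tree versus a story that always reconverges to the same next pair of options.

```python
def scenes_full_tree(choices, levels):
    """Scenes needed per level when every choice leads to unique content."""
    return [choices ** level for level in range(1, levels + 1)]

def scenes_reconverging(choices, levels):
    """Scenes needed per level when every path ends at the same next options."""
    return [choices] * levels

full = scenes_full_tree(2, 5)       # [2, 4, 8, 16, 32]
rigged = scenes_reconverging(2, 5)  # [2, 2, 2, 2, 2]
print(sum(full), "scenes versus", sum(rigged))  # 62 scenes versus 10
```

Five binary choices cost 62 scenes if the story truly branches, and only 10 if it reconverges, which is why two projectors running in sync were enough.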

4. Freeform Transformation — Models Required

The ability to move the virtual chair, throw a virtual ball, or blow things up ultimately requires computer models. It’s important to remember that “model” means just that, a representation which, by definition, requires semantic understanding of its elements. It’s the default state when using 3D engines like Unity or Unreal, where “chair” can be defined as it’s built up from primitive components.
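Here is a toy sketch of what “semantic understanding” buys you. The class names are hypothetical and this is not how Unity or Unreal represent objects internally; the point is only that once a “chair” exists as a named thing built from primitives, moving it is one operation, whereas a raw point cloud has nothing to grab.

```python
from dataclasses import dataclass, field

@dataclass
class Primitive:
    shape: str     # e.g. "box" or "cylinder"
    offset: tuple  # (x, y, z) relative to the owning object

@dataclass
class SceneObject:
    name: str        # the semantic label, e.g. "chair"
    position: tuple  # (x, y, z) in world space
    parts: list = field(default_factory=list)

    def move_by(self, dx, dy, dz):
        """Move the whole object; every part follows automatically."""
        x, y, z = self.position
        self.position = (x + dx, y + dy, z + dz)

    def world_parts(self):
        """World-space location of each primitive that makes up the object."""
        px, py, pz = self.position
        return [(p.shape, (px + p.offset[0], py + p.offset[1], pz + p.offset[2]))
                for p in self.parts]

chair = SceneObject("chair", (0.0, 0.0, 0.0), [
    Primitive("box", (0.0, 0.5, 0.0)),   # seat
    Primitive("box", (0.0, 1.0, -0.2)),  # back
])
chair.move_by(1.0, 0.0, 0.0)  # the seat and back both move with the chair
```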

It’s tempting to conclude that 3D computer models are the ultimate end game, where all camera-based material will be stored. For one thing, it may be the most efficient form of compression. The field of “image-based modeling” is lively and promising. It will likely happen. In the meantime, expect to see creative, illusion-rich, hybrid solutions.

Perhaps the biggest surprise in VR since Facebook acquired Oculus in early 2014 was how much cinematic, camera-based VR has stolen the show, at least in the eyes of the press and general public, even with its limited interactivity. Now that we have a vocabulary, we can say that the limited interactivity means rotational navigation but no positional navigation, and no transformation of the content via either branching or freeform model-based changes.

There’s something very persistent about “cinema”. In 2014 I gave a VR-related presentation at USC and polled the audience, which included several from the early VR community: who believes that all movies in the future will be stored as 3D models rather than as streams of 2D frames? Only two people raised their hands — and one of them was me.

AR, MR, Social, and Live

VR, as we know it today, is far from a stable medium. As augmented reality, mixed reality, social VR, and live VR enter the mix, expect surprises.

Some time not far in the future, we’ll look back at how things played out and it’ll all be obvious in retrospect. But for now, the best we can do is understand underlying principles as we move forward.