Wagner James Au reports on virtual worlds, VR & Internet culture

Tuesday, August 05, 2014

Watch This High Fidelity Avatar Animated by a Pixar Vet Imitate Real World Facial Movements in Near Real-Time

Philip Rosedale just sent me this new video demo of an avatar singing Christina Aguilera's "Beautiful" in High Fidelity, his Oculus Rift-compatible virtual world, and if you know all the art and technology behind it, you'll think it's pretty cool:

The singer is actually High Fidelity's Emily Donald (who has a lovely voice), and the avatar is imitating her actual face and lip movements in near real time, as tracked via a PC camera pointed at her. The avatar herself sort of looks like a character in a Pixar movie, and that's no surprise: The facial animations were created by High Fidelity's Ozan Serim, who was a longtime manager at Pixar before joining Philip's company. (Serim worked on Monsters University, Cars 2, Brave, and Toy Story 3 there.) The facial animations are more than enough to convey emotion, and the lip sync is just about perfect. (Bad lip sync remains a horrible problem in Second Life, not to mention other MMOs/machinima platforms.) As it happens, Philip and I were just e-mailing about how live music performance can be a compelling thing in virtual reality, so this video is a case study of that.

How was this shot, and what's the latency between her face movements and the avatar animations? Philip explains:

"This was done by Emily and Ozan (who is playing guitar) in our little back office here. Emily is in front of a Primesense Carmine camera, using Faceshift to detect animation." Philip adds: "Ozan has been doing amazing work designing avatar facial movements that map well to that sort of live camera, and this is the result."

As for this performance tracking in near real time: "The latency of both audio and avatar motion ends up being about the same at roughly 100 milliseconds, which is why you can't see any difference between the two. Technically speaking it comes from different reasons (the Primesense camera imposes about 85 milliseconds of delay between photons hitting the screen and us getting the tracking data but sends/receives at 60 frames per second, where the audio has about 40 milliseconds of delay on each of microphone and playback), but they end up getting to you at about the same time." Philip has argued before that 100 milliseconds is the target for magic to happen in VR, and that magic seems to be happening here.
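Philip's numbers can be sanity-checked with a quick back-of-the-envelope budget. This is purely illustrative arithmetic using the figures from his quote (the function names and the one-frame transport allowance are my own assumptions, not anything from High Fidelity's code):

```python
# Latency budget from the figures in Philip's quote (illustrative only).

CAMERA_TRACKING_DELAY_MS = 85         # Primesense capture -> tracking data
CAMERA_FRAME_INTERVAL_MS = 1000 / 60  # data sent/received at 60 fps

AUDIO_MIC_DELAY_MS = 40               # microphone capture buffering
AUDIO_PLAYBACK_DELAY_MS = 40          # playback buffering

def avatar_motion_latency_ms():
    # Tracking delay plus up to one frame of transport quantization.
    return CAMERA_TRACKING_DELAY_MS + CAMERA_FRAME_INTERVAL_MS

def audio_latency_ms():
    return AUDIO_MIC_DELAY_MS + AUDIO_PLAYBACK_DELAY_MS

if __name__ == "__main__":
    print(f"avatar motion: ~{avatar_motion_latency_ms():.0f} ms")
    print(f"audio:         ~{audio_latency_ms():.0f} ms")
```

Both paths land near the same ~100 ms ballpark, which is why the singing and the avatar's lips appear synchronized even though the delays come from different places.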

"Next," Rosedale adds, "we are working on getting her whole body moving!"

Comments

It looks pretty rough right now. Her head is tilted back a bit too far so we don't see her face full on. The eye movement was disconcerting too. But, overall it was impressive compared to SL. Given time this will look awesome.

Was the avatar singing Bohemian Rhapsody?
The avatar is in the rough stages (although Stroker seems quite attracted to it *cough*) but it's pretty impressive nonetheless.
I think Linden Lab has their work cut out for them.

As a vocalist, I think this technology would be amazing if the real time were truly real time. Even the slightest animation delay makes it feel more like watching an avatar sing karaoke than like an avatar singing in sync with the music. If you want it to be immersive, you're going to have to eliminate that delay altogether for a real, worthwhile, believable experience. We're a long way from that at the moment, but give it two years. Yay, Oculus teams!

From "crank the car, remove the crank, get in, and hope for the best" to "press a button on your key, the car starts, get in and drive away."

Technology is rough; then smooth. The folks at HiFi are visionaries with an attainable goal: Create an immersive and low latency world that raises the bar from what's available now. Better yet, from my standpoint, it will still allow users to create.

I really don't mind the cartoony, stylized approach... as proof of concept. Get that same latency with a realistic, detailed contemporary avatar, and my interest will be more than academic.

On a side note, EverQuest II did facial capture a few years back with the SOEmote feature. I never played with it much, so I don't know how well they did with the lip-sync. Put a voice morpher in the loop, and I would imagine that the timing gets a little dicey. I would trade a little more latency for a rock-solid sync.

I hear all the naysaying here. But what you're really looking at is advanced tech. The only reason the avatar looks this way is because this is what the creator chose for it. Think liquid mesh, full materials, great lighting, true gravity. For a platform still in alpha, I've seen far worse.
It's not about what you see here - that's restricted to the people involved. It's about what's possible (for both good and ill, unfortunately) - like Second Life, where the good go to create & build, and the bad follow to steal & destroy.

Metacam, I can't say for certain, but I think that'd only work if you could precisely measure the gap between audio and video processing... which is probably not consistent from second to second (hence Philip throwing a lot of qualifiers like "roughly" and "about").

I can think of approaches to analyze and match the mouth shape with the phonemes from the audio stream, but I don't know how much processing overhead that would add when you're measuring in milliseconds. "Good enough" might be as good as it gets.

Clearly you can get some good expression translation with good lighting conditions and some input smoothing from the raw camera data. Previous live demos suffered from lighting issues and unbuffered output making everyone want to barf into the uncanny valley.

As far as the comparison with SL lip sync goes, there isn't any. The Vivox software SL uses for voice only hands off a very basic "energy" variable via the ParticipantPropertiesEvent. The best you are ever going to get with that is a basic indication of how much to morph the mouth open.

I know people go on about phoneme detection and puppeting, and when you can't see the lips, it's pretty good. But for a true performance you want as accurate a translation of the face as possible. For Alpha this is looking really good.
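To make the limitation concrete: with only a single energy value per participant, about all a viewer can do is map loudness to a mouth-open weight. Here's a minimal sketch of that idea; the function name, thresholds, and smoothing scheme are all hypothetical illustrations, not the actual SL viewer code or Vivox API:

```python
# Hypothetical sketch: drive a mouth-open morph from a single 0..1 voice
# "energy" value, SL-style. All names and constants are illustrative.

def mouth_open_weight(energy, prev=0.0, floor=0.02, gain=1.5, smoothing=0.3):
    """Map a 0..1 voice energy sample to a 0..1 mouth-open morph weight.

    `floor` gates out background noise, `gain` makes quiet speech visible,
    and exponential smoothing keeps the mouth from fluttering every frame.
    """
    raw = 0.0 if energy < floor else min(1.0, energy * gain)
    return prev + smoothing * (raw - prev)
```

Note there's no phoneme information anywhere in this loop: the avatar's mouth can only open wider or narrower with volume, which is why SL lip movement never matches actual speech shapes.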

There's something vaguely creepy about it. I think it does skirt the edges of the uncanny valley a bit. I'd like to see a side-by-side comparison with her actual face, to see how well it's being tracked. We can see a bit of this in Philip Rosedale's demonstration at SVVR: http://youtu.be/gaWacrQuEcI?t=42m40s It seems like the avatar has a tendency to smile too much, a lot more than the source, which I think might also be going on here.

Also, I don't really understand how this is supposed to work with the Rift. Will it primarily just track the mouth?

What is not there is the streaming part, and the streaming part could be added by two or three student coders from a tech university in about a month or two. Streaming voice or streaming the captured animation data is really not that hard. You just send the packets with the data from the server to the client. It is worth watching the Unity presentation, as the tech presented there gives you a deeper insight.
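The "just send the packets" idea really is simple at its core: each frame of facial data is a small fixed-layout message the client can replay in order. A toy sketch of such a serializer, with an entirely invented field layout (frame number, timestamp, then a list of blendshape weights):

```python
# Toy sketch of streaming per-frame facial animation data. The wire format
# here (frame number + timestamp + float32 blendshape weights) is invented
# for illustration, not any real High Fidelity or SL protocol.
import struct

HEADER = struct.Struct("!Id")  # uint32 frame number, float64 timestamp

def pack_frame(frame_no, timestamp, weights):
    """Serialize one animation frame into bytes for the network."""
    body = struct.pack(f"!{len(weights)}f", *weights)
    return HEADER.pack(frame_no, timestamp) + body

def unpack_frame(data):
    """Parse a frame back out; the client uses the timestamp for ordering."""
    frame_no, timestamp = HEADER.unpack_from(data)
    n = (len(data) - HEADER.size) // 4  # remaining bytes / sizeof(float32)
    weights = struct.unpack_from(f"!{n}f", data, HEADER.size)
    return frame_no, timestamp, list(weights)
```

The hard parts in practice aren't the packets themselves but jitter buffering, packet loss, and keeping the animation stream in sync with the voice stream, which is exactly the ~100 ms budget discussed in the article.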

Looking a bit deeper, this did show up: https://www.youtube.com/watch?v=NFBv_ypyhiA
Live motion capture data streaming in Second Life, back in 2010. Nobody blinked when it became possible; nobody used it or saw the potential in it.

There are really a lot of impressive gadgets and tech around these days, particularly in Unity, because Unity is free, so all the students experiment with it.