Neural Networks, AI and Machine Learning Question

This is a pretty high-level question, and I've posted it here to start a discussion of ideas. I recently came upon the BabyX demo of a digital human that is capable of learning, responding emotionally, and interacting with the users communicating with her. She has vision and hearing. So my question is: where would we start implementing something like this in Unity? I'm thinking machine learning would be useful for this, and that it would require a system to feed object tracking from a camera into a Unity scene. I'd like to start with something extremely basic: a "brain" that controls "muscles". Any thoughts or discussion on approaches to this would be very welcome.

My goal with this would be to eventually create a digital companion. This would be a side project, not something I'd pursue commercially at first. Maybe down the road it would have some viability or commercial application, but today I'm more interested in learning about AI and machine learning and how to use them to control a digital avatar.
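To make the "brain controls muscles" idea concrete, here is a minimal sketch (all names and numbers are hypothetical, not taken from BabyX): a single scalar "arousal" state that stimuli push up, that decays each tick, and that gets squashed into activation values for a couple of named muscles.

```python
import math

class Brain:
    """A toy 'brain' that converts stimuli into muscle activations.

    Hypothetical sketch: real systems reportedly simulate
    neurotransmitter dynamics; here a single 'arousal' level is
    bumped by stimuli and decays back toward zero over time.
    """
    def __init__(self, decay=0.9):
        self.arousal = 0.0
        self.decay = decay

    def perceive(self, stimulus_strength):
        # Stimuli raise arousal; earlier arousal decays each tick.
        self.arousal = self.arousal * self.decay + stimulus_strength
        return self.activations()

    def activations(self):
        # Squash arousal into [0, 1] activations for two made-up 'muscles'.
        level = 1.0 / (1.0 + math.exp(-self.arousal))  # sigmoid
        return {"brow_raiser": level, "jaw_opener": level * 0.5}

brain = Brain()
for _ in range(3):
    out = brain.perceive(1.0)
print(out)
```

The point isn't the math, which is trivial, but the shape of the loop: stimulus in, internal state updated, activation values out, every frame.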

It's super hard to tell. That video is a demonstration that strikes me as "too good to be true": to what level are they really modelling all of that? I will say that if it's not hype, and that really is a functioning brain model driving emergent behavior, then it's one of the most incredible things I've ever seen. If it's just a bunch of rigged behaviors... frankly, it's still pretty impressive, though for different reasons.


This crossed my mind as well, but as I lay out the core of what I think he's doing, it doesn't seem too complex at a high level. It does dive into aspects that become extremely technical, so I'm not sure what could be "real" and what could be an "illusion". It's easy enough to see that he is driving musculature through input coming from this "brain", and it does appear to be procedural rather than pre-canned bone animation. The real question is what's actually going on within the brain itself. He claims to be simulating hormones/chemicals within certain regions of the brain, with receptors receiving that data and then transmitting it onward. It's that traversal process that has me in wonder. It's almost a black box to me, so it's hard to tell whether that aspect is formulated trickery or whether there are indeed complex processes going on. I'd like to see if I could emulate a much simpler version of this traversal of data from a "brain" to muscles driving a bound skin mesh. I could worry about feeding the brain data from video, images, etc. at a later point; initially, I would give the brain its data manually.

But then we're talking about "memory": storing all of the data fed into the brain, letting the brain reference that memory, and altering future actions based on it.
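A toy version of that memory idea might look like this; the stimulus names and the approach/avoid rule are made up purely for illustration:

```python
class MemoryBrain:
    """Toy memory: remembers the average outcome of each stimulus
    and biases the next reaction accordingly (hypothetical sketch)."""
    def __init__(self):
        self.memory = {}  # stimulus -> list of past outcomes

    def record(self, stimulus, outcome):
        # Store the outcome of an experience (positive or negative).
        self.memory.setdefault(stimulus, []).append(outcome)

    def expected(self, stimulus):
        outcomes = self.memory.get(stimulus)
        if not outcomes:
            return 0.0  # no experience yet: neutral expectation
        return sum(outcomes) / len(outcomes)

    def react(self, stimulus):
        # Approach stimuli remembered as positive, avoid negative ones.
        return "approach" if self.expected(stimulus) >= 0 else "avoid"

b = MemoryBrain()
b.record("loud_noise", -1.0)
b.record("smile", 1.0)
print(b.react("loud_noise"), b.react("smile"))  # avoid approach
```

Even something this crude demonstrates "future actions altered by stored experience", which is the behavior the thread is asking about.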

The real complexity is the machine learning, which wouldn't necessarily have a lot to do with Unity. It seems like a camera and a microphone connected to a very sophisticated machine learning model of a brain. I don't really see where Unity comes in, but maybe you have some ideas that it would be useful for?


Unity becomes relevant for the machine learning aspect. Within the last year or so, Unity has released machine learning functionality. I know little about it, but I've seen Unity demoing agents that are capable of learning. Unity would also play a big role in rendering the visual interface to these kinds of intelligence.

I'm interested in the application of something like that for a digital avatar.


The thing is, the responses to his actions are obviously hypersensitive. So they definitely cranked up the reactions and focused on very specific actions.

So the question to me is: did they basically train the machine to respond in a pre-planned fashion, or did they attempt to simulate the brain and achieve emergent behaviour?

I.e., did they train it to become agitated if it sees a face and then the face goes away, or did that behavior emerge?

It pretty much has to be that they trained 'become agitated', because actual emergent agitation is probably way beyond our current capabilities.


Fair enough, but I would say the best thing is to understand your tools first. What machine learning algorithms are relevant to models of human emotion? I would write some very simple machine learning algorithms first, because one way or the other you'll have to know fundamentally what's going on.

Thank you! This is fascinating, and very similar to the depth map produced by self-driving cars to let them navigate the real world. You have me thinking about this in an interesting way now. Instead of a depth map being fed to our AI, what about using some kind of AR face/body tracking input to feed a view of a living human to the digital avatar? I'm thinking of this in terms of my own goal, which is to allow a living human to realistically interact with a digital avatar on a screen.


How do you separate emergent behavior from planned behaviour? I think it's clear that one way or the other, the model was not 'born' the minute the demo started. In any case, unless one wants to simulate the entire evolutionary journey of life itself, quite a lot of pre-programmed instincts would be required.

I would imagine the best way to start off is to create a set of emotions with blend shapes that are driven by whatever your model comes up with, and go from there.
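As a rough sketch of that blend-shape suggestion, assuming a hand-authored (entirely hypothetical) mapping from emotion channels to shape weights:

```python
def emotion_to_blendshapes(emotions):
    """Map an emotion state (channel values in [0, 1]) to blend-shape
    weights. The mapping table below is invented for illustration;
    a real rig would use its own shape names and mix factors."""
    mapping = {
        "smile":      {"happy": 1.0, "surprised": 0.1},
        "brow_raise": {"surprised": 1.0, "afraid": 0.6},
        "frown":      {"sad": 1.0, "angry": 0.7},
    }
    weights = {}
    for shape, mix in mapping.items():
        w = sum(emotions.get(e, 0.0) * k for e, k in mix.items())
        weights[shape] = min(1.0, w)  # clamp to a valid blend-shape weight
    return weights

w = emotion_to_blendshapes({"happy": 0.8, "surprised": 0.5})
print(w)
```

Whatever model produces the emotion vector, this is the thin adapter layer that would drive the mesh (in Unity, e.g. via the skinned mesh's blend-shape weights).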

Completely agree about understanding the tools first. My weak point is the machine learning aspect and understanding the fundamentals one would need to simulate a brain. Of course I know I'll have to study the machine learning and integration aspects heavily if this is truly an endeavor I'll undertake, including how to connect the two in Unity. I'm not lacking knowledge of any of the other tools at my disposal.


I think it's important to note that he does say the AI has a memory; it learns and stores all of this. In this way the AI "grows up" to a degree and becomes smarter. It already has a good foundation when he boots the simulation up. I believe this is the 3rd generation of the AI baby he's worked on, so it's been learning for a decent time now.

Yeah, that sounded odd to me. He's not doing any kind of visual compositing. He's actually borrowing a common film CG technique in which the characters' faces are driven by simulated muscles rather than traditional bone structures. Traditionally, though, you can layer blend shapes, bones, and deformers on top to assist the skin deformation in problem areas. Most major 3D packages today have a built-in muscle system; the one in Maya is pretty awesome. Unity, however, doesn't have a muscle system that I'm aware of, so that would have to be developed for the editor.


The separation between emergent and pre-planned behavior is the difference between a machine that is actually learning and responding and really fancy interactive animatronics.

Don't let the 'brain' metaphor or the visuals trick you. Could you replace the 'brain' with a dozen nodes and a set of sliders?

If it's fancy interactive animatronics, it's still very good work. But it's something very different from what's being sold by the speaker.


I don't think he's pre-canned the actual movement or fed it default values. It appears jerky and uncontrolled at points, which leads me to believe the virtual muscles are being fed some kind of "raw" data. I put "raw" in quotes because, as he iterated on the digital baby, he clearly found ways to smooth the movement to a degree. The real question is how the data being fed to the muscles is produced, and how each muscle is told to contract or expand. That could be the aspect that involves some trickery.


Yeah, there's a bunch of different bits.

1: How is the animation itself being driven?
2: How is the behavior being driven?
3: What is the actual behavior we're watching?

I'm mostly focused on the 3rd point, which is probably the most exciting. #1 is where regardless of the other points, they did pretty extraordinary work. I think the answer to #2 is probably the most boring.


But you still haven't defined the difference between real and fake. That's what I'm trying to get at. You cannot call something fake when you haven't defined what 'fake' is. What specific claim seems false?

Do you think someone is at a keyboard controlling the 'AI'? I'm pretty sure that we've gotten to the point where you can make a loud noise and make an AI have some kind of reaction, without using a keyboard.

Anyway my point is that:

- It's highly unlikely the model 'learned' from absolutely nothing how to use its facial muscles.
- It's highly unlikely that the model does not have pre-programmed emotions and expressions.
- It's highly unlikely that the connection between stimulus and reaction is not pre-programmed to some extent.


... and, well, same for humans.


Ah yes thank you for clarifying. You are spot on and the simulation had to start somewhere, from some default set of "knowledge" (i.e. data). So now this has me wondering how does the Unity Machine Learning stuff start out? It's likely very wonky at first and has to be run through hundreds upon hundreds of iterations until it's competent at a task.


Agreed. I can clear up your point 1. It's clear if you've ever done film VFX work, which I have. The animation is fully driven by a muscle system, similar to the ones found in Maya or Houdini. These muscle systems have been around a long time, and you can script all kinds of interactions with them: contracting them, expanding them, bulging them, etc. He's driving the facial animation by feeding these muscles values for their movements. The muscles are bound to regions of the face and deform the skinned mesh, so it's relatively straightforward. Turning the head just means contracting one muscle on the side of the neck, or potentially other muscles in tandem. Where it gets crazy complex is all of these facial muscles working in synchronization, each moving just the right amount to raise the brows, widen the mouth, crinkle the nose, etc.

I agree that point 2 is kind of boring to think about. Is the behavior driven by real-time data from a series of very complex interactions of values, or is it just being fed a few simple, filtered inputs from a face-tracking camera? Who knows.

Your point 3 is the really interesting aspect. What is the exact chain of events leading to the overall behavior we're seeing here? How are the parts interrelated, and how do they interact with one another? How is the data for this behavior segmented and partitioned into the zones that require it? Like you, I have a ton of questions about this part. It's a black box to me; I don't know what's happening in that nebulous space.
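For point 1, the core of a linear-muscle deformer can be sketched in a few lines. The axis-plus-falloff representation here is a simplification I made up, not how Maya, Houdini, or BabyX actually implement their muscle systems:

```python
def deform_vertices(rest_positions, muscle_axis, activation, falloff):
    """Toy linear-muscle deformation: pull vertices along the muscle's
    contraction axis, scaled by an activation value in [0, 1] and a
    per-vertex falloff weight (a stand-in for skin binding weights)."""
    ax, ay, az = muscle_axis
    deformed = []
    for (x, y, z), w in zip(rest_positions, falloff):
        s = activation * w  # how strongly this vertex follows the muscle
        deformed.append((x + ax * s, y + ay * s, z + az * s))
    return deformed

rest = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)]
out = deform_vertices(rest, muscle_axis=(0.0, 1.0, 0.0),
                      activation=0.5, falloff=[1.0, 0.2])
# the first vertex (higher falloff weight) moves further than the second
```

The hard part the post describes, dozens of muscles firing in synchronization, is then "just" producing the right activation value per muscle per frame.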


I did define it; here's a paste of my earlier post.

So the question to me is - did they basically train the machine to respond in a pre planned fashion or did they attempt to simulate the brain and achieve emergent behaviour.

IE: did they train it to become agitated if it sees a face and then the face goes away, or did that behavior emerge?

The question comes down to how exactly they trained the machine, and the difference is in the scope of the behavior the machine can produce.

Is there essentially logic that says:

If Face & Happy then + HappyPoints

If HappyPointsNow < HappyPointsPrevious then Cry

Then you plug in Google face recognition to determine whether you see a face or not.

That's not particularly impressive. You need to very specifically define "Face & Happy", then map that to increased HappyPoints (or dopamine).
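That toy logic can be written out literally; the field names and the decay-when-absent rule are my own additions:

```python
def update_emotion(state, face_visible, face_happy):
    """Literal toy implementation of the sketch above: a happy face
    adds 'HappyPoints'; a drop versus the previous reading triggers
    crying. Deliberately simple, purely illustrative."""
    prev = state["happy_points"]
    if face_visible and face_happy:
        state["happy_points"] = prev + 1
    else:
        state["happy_points"] = max(0, prev - 1)  # decay with no happy face
    # "If HappyPointsNow < HappyPointsPrevious then Cry"
    state["crying"] = state["happy_points"] < prev
    return state

s = {"happy_points": 0, "crying": False}
s = update_emotion(s, face_visible=True, face_happy=True)    # gains a point
s = update_emotion(s, face_visible=False, face_happy=False)  # drop -> cries
```

Which is exactly the point being made: if the demo reduces to rules of this shape plus an off-the-shelf face detector, it's animatronics rather than emergence.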


That's a good question. Just to be clear, I have no knowledge of Unity's machine learning, though I did write a simple neural network once.

But anyway, how would the separation and connection between learning and instinct be managed? How do you separate learned data from 'known' data? That's already a difficult question, and it can probably only be answered by starting with a small, very simple prototype and learning by trial and error.
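One way to prototype that separation is to keep two sets of weights, one frozen ("instinct") and one trained. This single-unit example is a deliberately tiny, hypothetical sketch, not a claim about how any real system does it:

```python
class InstinctLearner:
    """Toy split between 'instinct' (a fixed weight) and 'learning'
    (an adjustable weight). A single linear unit:
    output = instinct_w * x + learned_w * x."""
    def __init__(self, instinct_w):
        self.instinct_w = instinct_w  # frozen: never updated
        self.learned_w = 0.0          # starts knowing nothing

    def predict(self, x):
        return self.instinct_w * x + self.learned_w * x

    def learn(self, x, target, lr=0.1):
        # Gradient step on the learned weight only; instinct stays fixed.
        error = self.predict(x) - target
        self.learned_w -= lr * error * x

model = InstinctLearner(instinct_w=0.5)
for _ in range(200):
    model.learn(1.0, target=2.0)
# learned_w adapts so the combined output approaches the target,
# while the instinct contribution is preserved unchanged
```

The instinct weight gives the model useful behavior from the first frame, exactly the "not born the minute the demo started" point made earlier, while learning only adjusts what sits on top of it.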

Then you grab all the inputs. The ML isn't the tricky part; the tricky part is associating emotions with facial mimics. Happiness itself can be represented by different states: you can be happy but surprised, and so on. A neural network can actually deal with that. Emotion states can be learned from human experience: a human would just need to feed in information about the emotion shown at a given facial state.

Once you have mimics trained against emotion states, the AI can replay them.

Now for the trickier part: how to associate with and react to the environment, for example responding to user interaction when someone speaks.

The best way is to look into chatbot tech.

The missing bit would be associating text and sentences with actual user responses. A typical chat log does not record human facial mimics, so you would need to collect such data somehow, and then, for example, teach the bot that someone who says "Get lost" is typically upset, annoyed, or angry. That gives the AI clues for replicating the mimic, or even responding: if someone is happy, the AI should use happy, or at most neutral, responses rather than angry ones.
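A crude placeholder for that "Get lost" idea, with made-up keyword lists standing in for a trained sentiment model:

```python
def classify_mood(utterance):
    """Crude keyword-based guess at the speaker's mood; a stand-in
    for a real trained sentiment model (markers are invented)."""
    angry_markers = {"get lost", "shut up", "hate"}
    happy_markers = {"thanks", "great", "love"}
    text = utterance.lower()
    if any(m in text for m in angry_markers):
        return "angry"
    if any(m in text for m in happy_markers):
        return "happy"
    return "neutral"

def respond(mood):
    # Mirror the speaker's mood, de-escalating when they're angry.
    return {"angry": "calm", "happy": "happy", "neutral": "neutral"}[mood]

print(respond(classify_mood("Get lost!")))  # calm
```

The mood label coming out of this layer is what would then feed the emotion state driving the avatar's mimics.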

