
Announcing face and emotion detection for Azure Media Analytics

Azure Media Analytics is a collection of speech and vision services offered with enterprise-grade scale, compliance, security, and global reach. The services offered as part of Azure Media Analytics are built on the core Azure Media Services platform components and are therefore ready to handle media processing at scale from day one. For the other media processors included in this announcement, see Milan Gada's blog post Introducing Azure Media Analytics.

We are very excited to announce the free public preview of the Azure Media Face Detector, and this blog will detail the use and output of this technology. This Media Processor (MP) can be used for people counting, movement tracking, and even gauging audience participation and reaction via facial expressions. You can access these features in our new Azure portal, through our APIs with the presets below, or using the free Azure Media Services Explorer tool. This service contains two features, face detection and emotion detection, and I'll be going over their details in that order.

Face detection

Face detection finds and tracks human faces within a video. Multiple faces can be detected and subsequently be tracked as they move around, with the time and location metadata returned in a JSON file. During tracking, it will attempt to give a consistent ID to the same face while the person is moving around on screen, even if they are obstructed or briefly leave the frame.

Note: This service does not perform facial recognition. An individual who leaves the frame or becomes obstructed for too long will be given a new ID when they return.

Input preset

Here’s an example of a JSON configuration preset for face detection only.
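A minimal face-detection-only preset can be very small, since the defaults handle detection and tracking. The sketch below reflects the preset shape at the time of this preview; consult the Media Services documentation for the authoritative schema:

```json
{
  "version": "1.0"
}
```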

Emotion detection

Emotion detection is an optional component of the Face Detection Media Processor that returns analysis of multiple emotional attributes for the detected faces, including happiness, sadness, fear, and anger. This data is currently returned as an aggregate computed over a customizable window and reported at a customizable interval.
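A preset enabling aggregate emotion detection might look like the sketch below. The option names (`mode`, `aggregateEmotionWindowMs`, `aggregateEmotionIntervalMs`) and values shown here are illustrative; verify them against the current Media Services documentation before use:

```json
{
  "version": "1.0",
  "options": {
    "mode": "aggregateEmotion",
    "aggregateEmotionWindowMs": "1000",
    "aggregateEmotionIntervalMs": "500"
  }
}
```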

Understanding the output

The face detection and tracking API provides high precision face location detection and tracking that can detect up to 64 human faces in a video. Frontal faces provide the best results, while side faces and small faces (less than or equal to 24x24 pixels) are challenging.

The detected and tracked faces are returned with coordinates (left, top, width, and height) indicating the location of faces in the image, as well as a face ID number used to track that individual. Face ID numbers may reset when a frontal face is lost or overlapped in the frame, so some individuals may end up assigned multiple IDs.

JSON reference

For the face detection and tracking operation, the output result contains the metadata from the faces within the given file in JSON format.

The face detection and tracking JSON includes the following attributes:

Version: This refers to the version of the Video API.

Timescale: “Ticks” per second of the video.

Offset: This is the time offset for timestamps. In version 1.0 of Video APIs, this will always be 0. In future scenarios we support, this value may change.

Framerate: Frames per second of the video.

Fragments: The metadata is chunked up into different segments called fragments. Each fragment contains a start, duration, interval number, and event(s).

Start: The start time of the first event in ‘ticks’.

Duration: The length of the fragment, in “ticks”.

Interval: The interval of each event entry within the fragment, in “ticks”.

Events: Each event contains the faces detected and tracked within that time duration. It is an array of arrays: the outer array represents one interval of time, and each inner array consists of zero or more events that happened at that point in time. An empty bracket [ ] means no faces were detected.

ID: The ID of the face that is being tracked. This number may inadvertently change if a face becomes undetected. A given individual should have the same ID throughout the overall video, but this cannot be guaranteed due to limitations in the detection algorithm (occlusion, etc.).

X, Y: The upper left X and Y coordinates of the face bounding box in a normalized scale of 0.0 to 1.0.

X and Y coordinates are relative to landscape always, so if you have a portrait (or upside-down, in the case of iOS) video, you'll have to transpose the coordinates accordingly.

Width, Height: The width and height of the face bounding box in a normalized scale of 0.0 to 1.0.

facesDetected: This is found at the end of the JSON results and summarizes the number of faces that the algorithm detected during the video. Because the IDs can be reset inadvertently if a face becomes undetected (e.g. face goes off screen, looks away), this number may not always equal the true number of faces in the video.
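Putting these attributes together, an abridged output might look like the fragment below. The numeric values here are invented for illustration; only the field names follow the reference above:

```json
{
  "version": 1,
  "timescale": 2997,
  "offset": 0,
  "framerate": 29.97,
  "fragments": [
    {
      "start": 6300,
      "duration": 2997,
      "interval": 999,
      "events": [
        [ { "id": 0, "x": 0.519, "y": 0.180, "width": 0.155, "height": 0.276 } ],
        [ { "id": 0, "x": 0.522, "y": 0.181, "width": 0.155, "height": 0.276 } ],
        [ ]
      ]
    }
  ],
  "facesDetected": 1
}
```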

We have formatted the JSON this way to set the APIs up for future scenarios, where it will be important to retrieve metadata quickly and manage a large stream of results. We use both fragmentation (breaking the metadata into time-based chunks, so you can download only what you need) and segmentation (breaking up the events if they get too large). Some simple calculations can help you transform the data. For example, if an event started at 6300 (ticks), with a timescale of 2997 (ticks/sec) and a framerate of 29.97 (frames/sec), then:

· Start/Timescale = 2.1 seconds

· Seconds x Framerate = 2.1 x 29.97 ≈ 63 frames
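These conversions are simple enough to wrap in small helper functions. This is a sketch to accompany the worked example above, not part of the service itself:

```python
def ticks_to_seconds(ticks, timescale):
    """Convert a tick count to seconds using the video's timescale (ticks/sec)."""
    return ticks / timescale

def ticks_to_frame(ticks, timescale, framerate):
    """Convert a tick count to the corresponding frame number."""
    return round(ticks * framerate / timescale)

# The worked example from above: start = 6300 ticks,
# timescale = 2997 ticks/sec, framerate = 29.97 frames/sec.
seconds = ticks_to_seconds(6300, 2997)       # ~2.1 seconds
frame = ticks_to_frame(6300, 2997, 29.97)    # frame 63
```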

Below is a simple example of extracting the JSON into a per frame format for face detection and tracking:
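As one possible approach (a Python sketch assuming the fragment/event layout described above; the field names come from the JSON reference, while the helper name `faces_per_frame` and the sample values are ours):

```python
import json

def faces_per_frame(results_json):
    """Flatten face-tracking output into a {frame_number: [face, ...]} dict.

    Each event entry within a fragment is 'interval' ticks after the previous
    one, starting at the fragment's 'start'; an empty list means no faces.
    """
    data = json.loads(results_json)
    timescale = data["timescale"]
    framerate = data["framerate"]
    frames = {}
    for fragment in data["fragments"]:
        start = fragment["start"]
        interval = fragment.get("interval", fragment["duration"])
        for i, event in enumerate(fragment.get("events", [])):
            ticks = start + i * interval
            frame_no = round(ticks * framerate / timescale)
            frames[frame_no] = event  # empty list => no faces at this frame
    return frames

# Minimal illustrative input (values invented for this sketch).
sample = {
    "timescale": 2997,
    "framerate": 29.97,
    "fragments": [{
        "start": 6300,
        "duration": 2997,
        "interval": 999,
        "events": [
            [{"id": 0, "x": 0.5, "y": 0.2, "width": 0.1, "height": 0.2}],
            [],
            [{"id": 0, "x": 0.52, "y": 0.2, "width": 0.1, "height": 0.2}],
        ],
    }],
}

per_frame = faces_per_frame(json.dumps(sample))
```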