Announcing: Motion detection for Azure Media Analytics

Azure Media Analytics is a collection of speech and vision services offered with enterprise-grade scale, compliance, security and global reach. The services in Azure Media Analytics are built on the core Azure Media Services platform components, so they are ready to handle media processing at scale from day one. For the other media processors included in this announcement, see Milan Gada's blog post, Introducing Azure Media Analytics.

We are very excited to announce the free public preview of the Azure Media Motion Detector, and this blog post details the use and output of this technology. This Media Processor (MP) can be used with static camera footage to identify where in the video motion occurs. Targeted at security video feeds, it can differentiate between real motion (such as a person walking into a room) and false positives (such as leaves in the wind, or shadow and light changes). This lets you generate security alerts from camera feeds without being spammed with endless irrelevant events, and extract moments of interest from extremely long surveillance videos.

Motion Detection

Motion detection takes a video and a JSON configuration as input and generates a JSON file of motion metadata, with timestamps and the bounding region where each event occurred.

You can access these features in our new Azure portal, through our APIs with the presets below or by using the free Azure Media Services Explorer tool.

Input Configuration

Currently, there are no input configuration options required, and you can use the preset below.

Output Format

A Motion Detection job returns a JSON file in the output asset that describes the motion alerts, and their categories, within the video.

Currently, motion detection supports only the generic motion category, which is referred to as type 2 in the output.

X and Y coordinates and sizes are listed as normalized floats between 0.0 and 1.0. Multiply the X coordinate and the width by the video width in pixels, and the Y coordinate and the height by the video height in pixels, to get the bounding box for the region of detected motion.
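As a minimal sketch of that conversion, the Python helper below scales a normalized box to pixel units. The function name `to_pixel_box` and the 1280x720 resolution are illustrative assumptions, not part of the service.

```python
def to_pixel_box(x, y, width, height, video_width, video_height):
    """Convert a normalized (0.0 to 1.0) box into integer pixel values.

    Hypothetical helper for illustration; it is not part of any Azure SDK.
    """
    left = round(x * video_width)            # X scales with the video width
    top = round(y * video_height)            # Y scales with the video height
    box_width = round(width * video_width)
    box_height = round(height * video_height)
    return left, top, box_width, box_height


# Example with an assumed 1280x720 video.
print(to_pixel_box(0.25, 0.5, 0.1, 0.2, 1280, 720))  # (320, 360, 128, 144)
```

Rounding to whole pixels is a choice made here for drawing boxes; keep the floating-point values if you need sub-pixel precision.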

Each output is split into fragments, which are subdivided into intervals, to define the data within the video. Fragments do not need to be of equal length, and a single fragment may span a long stretch of video when no motion is detected.

Understanding the Motion Detection output

Overview

The Motion Detection API provides indicators when there are objects in motion in a video with a fixed background (e.g. a surveillance video). Motion detection is trained to reduce false alarms, such as those caused by lighting and shadow changes. The algorithms currently have limitations with night vision videos, semi-transparent objects and small objects.

The output of this API is in JSON format and records both the time and the duration of motion detected in the video.

JSON Reference

The motion detection JSON has similar concepts to the face detection and tracking JSON. Because the motion detection JSON only needs to record when motion has happened, and the duration of motion, there are a few differences.

Version: This refers to the version of the Video API.

Timescale: “Ticks” per second of the video.

Offset: This is the time offset for timestamps, in “ticks”. In version 1.0 of Video APIs, this will always be 0. This value may change in scenarios supported by future versions.

Framerate: Frames per second of the video.
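To make the relationship between these fields concrete, here is a small Python sketch that converts a timestamp in ticks to seconds and to an approximate frame index. The function names and the sample values (a timescale of 30000 and a framerate of 30.0) are assumptions chosen for illustration, and the offset is treated as 0, as stated for version 1.0.

```python
def ticks_to_seconds(ticks, timescale):
    """Seconds elapsed, given ticks and the ticks-per-second timescale."""
    return ticks / timescale


def ticks_to_frame(ticks, timescale, framerate):
    """Approximate zero-based frame index for a timestamp in ticks."""
    return int(ticks * framerate / timescale)


# Hypothetical values: 30000 ticks per second, 30 frames per second.
print(ticks_to_seconds(60000, 30000))      # 2.0
print(ticks_to_frame(60000, 30000, 30.0))  # 60
```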

Width, Height: Refers to the width and height of the video in pixels.

Regions: The areas of your video in which you care about motion. In the current version of Video APIs, you cannot specify a region. Instead, motion is detected across the whole surface of the video.

ID: Identifies the region. In this version there is only one region, ID 0.

Rectangle represents the shape of the region you care about for motion. In this version, it is always a rectangle.

The region has dimensions in X, Y, Width and Height. The X and Y coordinates give the upper left corner of the region on a normalized scale of 0.0 to 1.0. The width and height give the size of the region on a normalized scale of 0.0 to 1.0. In the current version, X and Y are always 0 and Width and Height are always 1, so the region covers the whole frame.

Fragments: The metadata is divided into segments called fragments. Each fragment contains a start, duration, interval number and event(s). A fragment with no events means that no motion was detected between its start time and the end of its duration.

Start: The start timestamp in “ticks”.

Duration: The length of the event, in “ticks”.

Interval: The interval of each entry in the event, in “ticks”.

Events: Each fragment's events contain the motion detected within that fragment's duration.

Brackets [ ]: Each bracket represents one interval in the event. Empty brackets for that interval mean that no motion was detected.

Type: In the current version, this is always ‘2’ for generic motion. This label gives Video APIs the flexibility to categorize motion in future versions.

RegionID: As explained above, this will always be 0 in this version. This label gives Video APIs the flexibility to find motion in various regions in future versions.
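Putting the reference together, the Python sketch below walks a motion detection result and lists the time ranges, in seconds, in which motion was reported. The sample data is hand-made for illustration and is not real service output. The lower-case key spellings (such as `timescale`, `fragments`, `events` and `regionId`) are assumptions, so check them against an actual output file. The sketch also assumes that the Nth bracket in a fragment covers the Nth interval counted from the fragment's start.

```python
# Hand-made sample for illustration only; key spellings are assumptions.
sample = {
    "timescale": 30000,
    "fragments": [
        # A fragment with no events: no motion was detected in it.
        {"start": 0, "duration": 60000},
        {
            "start": 60000,
            "duration": 45000,
            "interval": 15000,
            "events": [
                [{"type": 2, "regionId": 0}],  # motion in the first interval
                [],                            # empty bracket: no motion
                [{"type": 2, "regionId": 0}],  # motion in the third interval
            ],
        },
    ],
}


def motion_intervals(output):
    """Return (start_seconds, end_seconds) for each interval with motion."""
    timescale = output["timescale"]
    result = []
    for fragment in output["fragments"]:
        interval = fragment.get("interval", 0)
        # Fragments without an events list contribute nothing.
        for index, entries in enumerate(fragment.get("events", [])):
            if not entries:
                continue  # empty bracket, so skip this interval
            start_ticks = fragment["start"] + index * interval
            end_ticks = start_ticks + interval
            result.append((start_ticks / timescale, end_ticks / timescale))
    return result


print(motion_intervals(sample))  # [(2.0, 2.5), (3.0, 3.5)]
```

A list like this could feed an alerting step or be used to cut clips from a long recording. Merging adjacent ranges into longer events is left out to keep the sketch short.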