Object Tracking in Deep Learning

Object tracking is a field within computer vision that involves following objects as they move across several video frames. In this article, we’ll address the difference between object tracking and object detection, and see how the introduction of deep learning greatly improved the accuracy and analytical power of object detection.

We’ll see some challenges of object tracking compared to static object detection, including re-identification, appearance and disappearance, and occlusion. In the Object tracking algorithms section, we’ll see three commonly used object tracking algorithms that use deep learning methods: SORT, GOTURN and MDNet. Read on to find out about object tracking with deep learning in the real world.

What Is Object Tracking?

Object tracking is a discipline within computer vision, which aims to track objects as they move across a series of video frames. Objects are often people, but may also be animals, vehicles or other objects of interest, such as the ball in a game of soccer. SORT, covered later in this article, is one tracking algorithm that achieves impressive results on top of a deep learning object detector.

Technically, object tracking starts with object detection: identifying objects in an image and assigning them bounding boxes. The object tracking algorithm assigns an ID to each object identified in the image, and in subsequent frames tries to carry this ID across and identify the new position of the same object.
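To make the idea of carrying an ID across frames concrete, here is a minimal Python sketch. It assumes boxes are given as (x1, y1, x2, y2) and uses a nearest-centroid rule, which is only one simple heuristic and not the method of any particular tracker. The function names and the max_dist threshold are hypothetical, chosen for illustration.

```python
import math
from itertools import count

def centroid(box):
    """Return the center (x, y) of a box given as (x1, y1, x2, y2)."""
    x1, y1, x2, y2 = box
    return ((x1 + x2) / 2.0, (y1 + y2) / 2.0)

def carry_ids(tracks, detections, next_id, max_dist=50.0):
    """Give each detection the ID of the nearest unused track, or a new ID.

    tracks: dict mapping ID -> box from the previous frame.
    detections: list of boxes found in the current frame.
    next_id: iterator that yields fresh IDs.
    max_dist: hypothetical gating distance in pixels (an assumption).
    """
    updated = {}
    free = dict(tracks)  # tracks not yet matched in this frame
    for det in detections:
        cx, cy = centroid(det)
        best_id, best_dist = None, max_dist
        for tid, box in free.items():
            tx, ty = centroid(box)
            dist = math.hypot(cx - tx, cy - ty)
            if dist < best_dist:
                best_id, best_dist = tid, dist
        if best_id is None:
            best_id = next(next_id)  # no track is close enough: treat as a new object
        else:
            del free[best_id]        # each track can be matched only once
        updated[best_id] = det
    return updated

ids = count(1)
frame1 = carry_ids({}, [(10, 10, 50, 50), (200, 200, 240, 260)], ids)
frame2 = carry_ids(frame1, [(205, 204, 245, 264), (14, 12, 54, 52)], ids)
```

In this toy run, the two objects keep their IDs in the second frame even though the detector reports them in a different order. Tracks left in free after the loop would correspond to objects that disappeared or were occluded.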

There are two main types of object tracking:

Offline object tracking—object tracking on a recorded video where all the frames, including future activity, are known in advance.

Online object tracking—object tracking done on a live video stream, for example, a surveillance camera. This is more challenging because the algorithm must work fast, and it is not possible to take future frames and combine them into the analysis.

Object Tracking vs Object Detection

Object detection has evolved substantially in the past two decades, with the move from traditional statistical or machine learning approaches to deep learning approaches based on Convolutional Neural Networks (CNNs). The introduction of deep learning dramatically improved the accuracy and analytical power of object detection.

To some, object tracking is simply an extension of object detection. The creators of a popular algorithm called Simple Online and Realtime Tracking (SORT) argue that modern object detection algorithms can do most of the work of detecting objects and re-identifying them in subsequent frames, so that object tracking can be reduced to simple heuristics.

Others have developed extensive object tracking algorithms that work in tandem with object detection, and apply deep learning techniques to carry an identified object over into the next video frames.

Challenges of object tracking compared to static object detection:

Re-identification—connecting an object in one frame to the same object in the subsequent frames

Appearance and disappearance—objects can move into or out of the frame unpredictably and we need to connect them to objects previously seen in the video

Occlusion—objects are partially or completely occluded in some frames, as other objects appear in front of them and cover them up

Identity switches—when two objects cross each other, we need to discern which one is which

Motion blur—objects may look different due to their own motion or camera motion

Viewpoints—objects may look very different from different viewpoints, and we have to consistently identify the same object from all perspectives

Scale change—objects in a video can change scale dramatically, due to camera zoom for example

Illumination—lighting changes in a video can have a big effect on how objects look and can make it harder to consistently detect them

Object Tracking Algorithms

In this section, we’ll introduce three popular object tracking algorithms that use deep learning methods: SORT, GOTURN and MDNet.

Simple Online and Realtime Tracking (SORT)

SORT is an object tracking algorithm that relies mainly on the output of an underlying object detection engine, and it can plug into any object detection algorithm. The algorithm tracks multiple objects in real time, associating the objects in each frame with those detected in previous frames using simple heuristics. For example, SORT matches new detections to existing tracks by maximizing the IoU (intersection-over-union) between their bounding boxes in neighboring frames. It predicts where each track should appear next with a Kalman filter and solves the assignment with the Hungarian algorithm.
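The IoU-based association can be sketched in a few lines of Python. This is a simplified illustration, not the SORT implementation: it pairs boxes greedily by decreasing IoU, whereas SORT itself uses a Kalman filter for motion prediction and the Hungarian algorithm for optimal assignment. The function names and the 0.3 threshold are assumptions made for this example.

```python
def iou(a, b):
    """Intersection-over-union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

def associate(track_boxes, det_boxes, iou_threshold=0.3):
    """Greedily pair tracks with detections in order of decreasing IoU.

    Returns (matches, unmatched_tracks, unmatched_detections) as index lists.
    """
    pairs = sorted(
        ((iou(t, d), ti, di)
         for ti, t in enumerate(track_boxes)
         for di, d in enumerate(det_boxes)),
        reverse=True,
    )
    matches, used_t, used_d = [], set(), set()
    for score, ti, di in pairs:
        if score < iou_threshold:
            break  # remaining pairs overlap too little to be the same object
        if ti in used_t or di in used_d:
            continue
        matches.append((ti, di))
        used_t.add(ti)
        used_d.add(di)
    unmatched_t = [i for i in range(len(track_boxes)) if i not in used_t]
    unmatched_d = [i for i in range(len(det_boxes)) if i not in used_d]
    return matches, unmatched_t, unmatched_d

tracks = [(0, 0, 10, 10), (20, 20, 30, 30)]
detections = [(21, 21, 31, 31), (1, 0, 11, 10), (50, 50, 60, 60)]
matches, lost, new = associate(tracks, detections)
```

Here each track is paired with the detection that overlaps it most, and the third detection, which overlaps nothing, is reported as unmatched and would start a new track.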

Generic Object Tracking Using Regression Network (GOTURN)

GOTURN is trained by comparing pairs of cropped frames from thousands of video sequences. In the first frame (“previous frame”), the location of the object is known, and the frame is cropped to twice the size of the bounding box around the object, with the object centered.

The algorithm then tries to predict the location of the same object in the second frame (“current frame”). The same double-sized bounding box is used to crop the second frame. A Convolutional Neural Network (CNN) is trained to predict the location of the bounding box in the second frame.
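The cropping step described above can be illustrated with a short Python sketch. It only computes the search region, a box centered on the object and twice its size, and clips it to the frame. The function name is hypothetical, and edge handling in a real GOTURN implementation may differ (for example, padding instead of clipping).

```python
def search_region(box, frame_w, frame_h, scale=2.0):
    """Return a crop (x1, y1, x2, y2) centered on box and `scale` times its size.

    The crop is clipped to the frame boundaries (an assumption of this sketch).
    """
    x1, y1, x2, y2 = box
    cx, cy = (x1 + x2) / 2.0, (y1 + y2) / 2.0
    w, h = (x2 - x1) * scale, (y2 - y1) * scale
    return (
        max(0.0, cx - w / 2.0),
        max(0.0, cy - h / 2.0),
        min(float(frame_w), cx + w / 2.0),
        min(float(frame_h), cy + h / 2.0),
    )

prev_box = (100, 80, 140, 120)  # known object location in the previous frame
region = search_region(prev_box, 640, 480)
# The same region is cropped from both the previous and the current frame,
# and the two crops are what the network receives as input.
```

Because the object usually moves only a little between neighboring frames, a region twice the size of the previous box is likely to still contain it in the current frame.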

Multi-Domain Network (MDNet)

Multi-Domain Network (MDNet) is a CNN architecture that won the VOT2015 challenge. The objective of MDNet is to reduce the amount of training needed to track an object in a new video. The strategy is to split the network into two parts. The first part is a set of shared layers that acts as a generic feature extractor: it trains over multiple training sequences and learns to distinguish objects from their background. The second part is domain-specific: it is trained on one particular sequence and learns to identify the target object within that sequence’s frames.

So MDNet makes it possible to modify the weights of only the last few CNN layers during training, reducing computation time significantly.
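The idea of updating only the last part of a network can be shown with a toy Python sketch. This is not the MDNet architecture: the shared part is a tiny fixed linear layer standing in for pretrained CNN layers, and the head is a single logistic unit. All names, weights and samples are made up for illustration.

```python
import math

# Shared part: fixed weights standing in for pretrained, frozen CNN layers.
SHARED_WEIGHTS = [[0.5, -0.2], [0.1, 0.8]]

def shared_features(x):
    """Apply the frozen shared layer (linear map followed by ReLU)."""
    return [max(0.0, sum(w * xi for w, xi in zip(row, x))) for row in SHARED_WEIGHTS]

def predict(head, x):
    """Head: logistic score for target (near 1) vs. background (near 0)."""
    z = sum(w * f for w, f in zip(head, shared_features(x)))
    return 1.0 / (1.0 + math.exp(-z))

def fine_tune_head(samples, steps=200, lr=0.5):
    """Train only the head with gradient descent; shared weights are never touched."""
    head = [0.0, 0.0]
    for _ in range(steps):
        for x, label in samples:
            feats = shared_features(x)
            error = predict(head, x) - label  # log-loss gradient w.r.t. the logit
            head = [w - lr * error * f for w, f in zip(head, feats)]
    return head

# Toy samples for one video: (input, 1 = target, 0 = background).
samples = [([2.0, 0.0], 1), ([0.0, 2.0], 0)]
head = fine_tune_head(samples)
```

Only the two head weights are updated for the new video, so the cost of each training step is much lower than updating every layer. In a deep learning framework, the equivalent step is typically to mark the shared layers as non-trainable.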

Object Tracking with Deep Learning in the Real World

In this article, we explained the basics of modern object tracking, which relies on deep learning architectures, primarily Convolutional Neural Networks. When you start working on computer vision projects and using libraries and deep learning frameworks like OpenCV, TensorFlow, Keras and PyTorch to run and fine-tune these models, you’ll run into some practical challenges:

Tracking experiment source code, configuration and hyperparameters

CNNs have many variations that can impact performance. You’ll need to run many experiments to discover the best hyperparameter values for your problem. Organizing, tracking and sharing experiment data can be a challenge.

Scaling experiments on-premise or in the cloud

CNNs require a lot of computing power, so to run many experiments you’ll need to scale up across multiple machines and GPUs. Provisioning machines and setting them up to run deep learning projects is time-consuming. Machines will also sit idle at times, which wastes resources.

Managing training data

Object tracking algorithms may need to train on thousands of videos, with training sets ranging from gigabytes to petabytes in size. You need to copy and re-copy this data to each training machine, which takes time and hurts productivity.

MissingLink is a deep learning platform that can help you automate these operational aspects of CNNs and computer vision, so you can concentrate on building winning image recognition experiments.