YOLO (like other object detection algorithms) gives us a list of detections for each frame, but doesn't assign a unique identifier to those detections. This means that on the next frame you do not know whether this red car is the same one:

This is our problem.

What we were looking for was a way to enrich the YOLO detections with a unique ID for each object, so that we could track them across the scene.

YOLO input

```
currentlyTrackedObjects = []
for each frame:
    // 1. Try to match the currently tracked objects with the detections of this frame
    for each detection:
        matchedTrackedObject = doesMatchExistingTrackedObject(detection, currentlyTrackedObjects)
        if matchedTrackedObject:
            // If it matches, update the tracked object's position
            matchedTrackedObject.update(detection)
        else:
            // 2. Assign unmatched detections to new tracked objects
            currentlyTrackedObjects.add(new TrackedObject(detection))
    // 3. Clean up tracked objects that didn't match any detection
    for each trackedObject in currentlyTrackedObjects:
        if isUnmatched(trackedObject):
            trackedObject.remove()
```

We immediately saw that one of the challenges would be defining the doesMatchExistingTrackedObject() function that compares a detection with the tracked objects: how do we determine whether they correspond to the same object?

This led us to write a distance() function, which compares the positions of two detections (the current detection and a candidate from the next frame) and determines their relative distance. If they are considered close enough, we can match them. (We used the center of the bounding box to compute this distance.)
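As an illustration, a center-based distance() check could look like the sketch below. The function names, the (x, y, width, height) box format, and the 50-pixel threshold are assumptions for the example, not the exact code we used:

```python
import math

def center(bbox):
    """Center point of a bounding box given as (x, y, width, height)."""
    x, y, w, h = bbox
    return (x + w / 2, y + h / 2)

def distance(detection_bbox, tracked_bbox):
    """Euclidean distance between the centers of two bounding boxes."""
    (x1, y1), (x2, y2) = center(detection_bbox), center(tracked_bbox)
    return math.hypot(x1 - x2, y1 - y2)

def does_match(detection_bbox, tracked_bbox, threshold=50):
    """Consider two boxes the same object if their centers are close enough
    (threshold in pixels is an arbitrary assumption here)."""
    return distance(detection_bbox, tracked_bbox) < threshold
```

The threshold would in practice depend on the video resolution and on how fast objects move between frames.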

This early implementation was already pretty good, correctly matching ~80% of the detections, but it still had lots of re-assignments (when we lose track of an object and assign it a new ID even though it is the same object).

At that point we had some ideas on how to improve it:

By keeping a memory of the unmatched items for a few frames instead of removing them immediately (the detection algorithm sometimes misses an object for a few frames).

The idea behind this is to be able to predict the next position of a tracked object when it is missing from the next frame: the object moves to its "theoretical" position and is more likely to be re-matched on a later frame.
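A minimal sketch of that prediction step, assuming boxes as (x, y, width, height) tuples and a per-frame (dx, dy) velocity estimated from past frames (names and values are illustrative):

```python
def predict_next_bbox(bbox, velocity):
    """Move an unmatched tracked object to its theoretical next position
    by applying its last known per-frame velocity (dx, dy)."""
    x, y, w, h = bbox
    dx, dy = velocity
    return (x + dx, y + dy, w, h)

# A tracked object missed by the detector for a few frames keeps moving:
bbox = (100, 50, 40, 30)   # (x, y, width, height)
velocity = (5, 0)          # pixels per frame, estimated from past frames
for _ in range(3):         # 3 frames without a matching detection
    bbox = predict_next_bbox(bbox, velocity)
# bbox is now (115, 50, 40, 30)
```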

Non-real-time tracking: algorithms that run on an existing video slower than real time (i.e., they take more than one frame interval to compute the next state of the tracker).

Real-time tracking: tracking algorithms that can run in real time (necessary for self-driving cars, robots, ...).

We noticed that the real-time algorithms use only the detection inputs (produced by YOLO, for example), whereas the non-real-time trackers also use information from the image frames to gather more data for tracking. Also, most of the published papers and projects focus on single-object tracking rather than MOT (multiple object tracking).

Historically, almost no algorithm worked on the detection output alone: detections weren't as good or as fast before the recent progress of neural-network-based detectors such as YOLO, so trackers needed to extract more data from the image frames to do their job. This is now changing, and tracking algorithms are getting simpler and faster as detections get better. This technique is called "tracking by detection".

“Due to recent progress in object detection, tracking-by-detection has become the leading paradigm in multiple object tracking.”

That said, there may be tracking approaches using image data that lead to better results than pure tracking by detection (for example color-based tracking or particle tracking), but we haven't looked into them, as they would be more complex to run and would likely not run in real time.

We decided to focus on real-time tracking only: if it works in real time, it will also work on a pre-recorded video. (We kept in mind that for the game app we still have the option of using non-real-time tracking algorithms.)

Our problem is also simplified because we mainly track cars and have a fixed camera viewpoint.

The basis of the algorithm is the same as ours: it compares one frame with the next using dimensions such as the position and size of the bounding box, and computes a velocity vector. It does bring some novelties compared to our approach:

It uses a Kalman filter to compute the velocity: the Kalman filter essentially does some math to smooth the velocity/direction estimate by comparing the predicted state with the real detection given by YOLO (we believe it also smooths the size of the predicted bounding boxes).
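To illustrate the idea (this is not SORT's actual implementation, which tracks the full bounding-box state), here is a minimal constant-velocity Kalman filter for a single coordinate; the process and measurement noise values are arbitrary assumptions:

```python
import numpy as np

# State = [position, velocity]; we only measure position (e.g. from YOLO).
F = np.array([[1.0, 1.0],    # position += velocity each frame
              [0.0, 1.0]])
H = np.array([[1.0, 0.0]])   # we observe position only
Q = np.eye(2) * 0.01         # process noise (assumed)
R = np.array([[5.0]])        # measurement noise (assumed)

x = np.array([[0.0], [0.0]])  # initial state: position 0, velocity 0
P = np.eye(2) * 100.0         # high initial uncertainty

def kalman_step(x, P, measurement):
    # Predict where the object should be...
    x_pred = F @ x
    P_pred = F @ P @ F.T + Q
    # ...then correct the prediction with the real detection.
    y = np.array([[measurement]]) - H @ x_pred   # innovation
    S = H @ P_pred @ H.T + R
    K = P_pred @ H.T @ np.linalg.inv(S)          # Kalman gain
    x_new = x_pred + K @ y
    P_new = (np.eye(2) - K @ H) @ P_pred
    return x_new, P_new

# Feed in noisy positions of an object moving ~10 px/frame:
for z in [10, 21, 29, 41, 50]:
    x, P = kalman_step(x, P, z)
# x[1] now approximates the velocity (~10 px/frame), smoothed over the noise.
```

The point is that the filter balances the prediction against each noisy measurement, instead of trusting either one fully.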

It uses an assignment cost matrix computed as the intersection-over-union (IOU) distance between each detection and all predicted bounding boxes from the existing targets (which puts all the dimensions into a normalized matrix). The best matches are then computed using the Hungarian algorithm, which solves this assignment problem efficiently.
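A sketch of this matching step, using scipy's implementation of the Hungarian algorithm (linear_sum_assignment); the boxes and the (x, y, width, height) format are made-up examples:

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def iou(a, b):
    """Intersection-over-union of two (x, y, width, height) boxes."""
    ax, ay, aw, ah = a
    bx, by, bw, bh = b
    ix = max(0, min(ax + aw, bx + bw) - max(ax, bx))
    iy = max(0, min(ay + ah, by + bh) - max(ay, by))
    inter = ix * iy
    union = aw * ah + bw * bh - inter
    return inter / union if union > 0 else 0.0

tracked = [(0, 0, 10, 10), (50, 50, 10, 10)]     # predicted boxes of existing targets
detections = [(52, 48, 10, 10), (1, 1, 10, 10)]  # this frame's YOLO detections

# Cost matrix of IOU distances: a perfect overlap has cost 0.
cost = np.array([[1 - iou(t, d) for d in detections] for t in tracked])
rows, cols = linear_sum_assignment(cost)  # Hungarian algorithm
# tracked[rows[i]] is matched with detections[cols[i]]
```

The Hungarian algorithm guarantees the globally optimal one-to-one assignment, unlike a greedy nearest-neighbor match.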

It also takes into account the score of the detections (how confident YOLO is about each detection), which we didn't use. It could be interesting to see whether that helps: the tracker could choose between two close detections based on that score.

We also found some limitations during the exploration of this approach:

It does not handle re-entering: if the tracker loses track of an object (generally because YOLO misses the detection for a few frames), it assigns the object a new ID when it picks it up again. This is bad for us, as in the game it means the masking is lost…

The velocity computation is not based on several frames: we found that with our algorithm it was better to compute the velocity model as an average over a few previous frames.
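For example, averaging the velocity over the last few center positions instead of only the last frame-to-frame step (a sketch with hypothetical names; the window of 3 steps is an assumption):

```python
def average_velocity(positions, n=3):
    """Velocity (dx, dy) averaged over the last n frame-to-frame steps,
    instead of using only the most recent step."""
    steps = min(n, len(positions) - 1)
    if steps < 1:
        return (0.0, 0.0)
    (x_old, y_old), (x_new, y_new) = positions[-steps - 1], positions[-1]
    return ((x_new - x_old) / steps, (y_new - y_old) / steps)

# Center positions of a tracked object over the last frames; the last step
# is noisy, but the averaged velocity stays close to the true motion.
history = [(0, 0), (10, 0), (20, 0), (33, 0)]
print(average_velocity(history))  # → (11.0, 0.0)
```

Averaging makes the predicted position less sensitive to one jittery detection.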

Out of the box, the results were not that great. The main problem is the high number of identity switches (since it does not handle re-entering). But it does perform better in some cases where our tracker loses tracking.

Also (and this is true for all trackers of the MOT benchmark), they are optimized for persons, not cars. We didn't try it with persons, as we haven't shot footage of persons yet, but we can hope that it performs far better than our algorithm for that use case.

Even if we were a bit disappointed with the raw results after playing with it (changing some parameters), we took away some ideas that would help improve our algorithm, like integrating Kalman filters to make better predictions.

We also started to notice that the algorithms from the MOT challenge were optimized for persons, and might not work as well for our use case of cars.

NOTE: some improvements have been made on SORT with https://github.com/nwojke/deep_sort , which uses deep learning with a model trained on pedestrians to handle occlusion / re-entering scenarios. We didn't test it, but it is reported to also run in real time.

Surprisingly great! We think that out of the box it may not be as good as our current tracker in some cases, because YOLO misses detections quite often, which triggers lots of re-assignments with this tracker… but it has huge potential for improvement if we add prediction and re-entering features.

Based on the previous learnings, we integrated into our tracker the distance() function of the IOU paper instead of reasoning on Euclidean distances. This led to lots of improvements and made the tracker much more reliable.

We also found that with some objects like trucks, YOLO sometimes makes double detections of the same object. We fixed that by not necessarily assigning all detections to tracked items, which avoided some ID re-assignments as well.
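A sketch of that idea, with hypothetical names and an assumed IOU threshold: a detection that overlaps a tracked object which has already claimed a detection is simply dropped, instead of spawning a new tracked item:

```python
def iou(a, b):
    """Intersection-over-union of two (x, y, width, height) boxes."""
    ax, ay, aw, ah = a
    bx, by, bw, bh = b
    ix = max(0, min(ax + aw, bx + bw) - max(ax, bx))
    iy = max(0, min(ay + ah, by + bh) - max(ay, by))
    inter = ix * iy
    union = aw * ah + bw * bh - inter
    return inter / union if union > 0 else 0.0

def assign_detections(tracked, detections, threshold=0.3):
    """Greedy assignment: each tracked object claims at most one detection.
    A further detection that still overlaps an already-claimed object (a
    likely double detection, e.g. on a truck) is discarded instead of
    becoming a new tracked item."""
    matches, new_items = {}, []
    for i, det in enumerate(detections):
        best = next((j for j, obj in enumerate(tracked)
                     if iou(det, obj) > threshold), None)
        if best is None:
            new_items.append(i)    # no overlap with anything -> new object
        elif best not in matches:
            matches[best] = i      # first matching detection wins
        # else: duplicate detection of an already-matched object -> dropped
    return matches, new_items
```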

One limitation of node-moving-things-tracker is that it was mostly tested on tracking cars, so it could be over-optimized for this use case and perform badly on others. It was also only tested with a fixed camera viewpoint.

To improve it further, it could be a good idea to work on the prediction framework by integrating Kalman filters, and also to integrate the detection confidence given by YOLO, which isn't used at the moment.

Another path for improvement would be to use machine learning techniques to learn from a vast variety of traffic footage; the only time-consuming part would be producing a large enough training set.