Object Detection: Inspecting your dataset first

When we work with a public dataset, we tend to take its quality for granted: it is probably well constructed, and if it were not, someone would already have spotted the problem and suggested a fix. Take the Pascal VOC dataset as an example. You would do what most people do: combine the 2007 and 2012 datasets, use Train+Val as the training set and Test as the validation set, and jump right into training your latest object detection model.

Custom Dataset

But what if you curated your own dataset, or your customer provided it? You should take a look at it and identify any issues early, before you train your model. I used to do that with Bash scripts and sometimes even Perl (does anyone still remember Perl?). Or you could just use Python, like everyone else. Better still, use a Python library such as pandas to analyze and visualize the data.

The very first customer I worked with on a custom model provided me with the dataset. I asked them to deliver it in the Pascal VOC format, using VoTT, a great labeling tool by Microsoft. I'll write another post on using VoTT and on how to split the labeling workload across multiple engineers and still assemble the dataset easily. In the beginning, there was a lot of back and forth with the customer on the quality of the dataset, and eventually I wrote some simple scripts to generate a report on the dataset and see whether it was good enough for the project.

In this article, we will use the good old Pascal VOC dataset as an example.
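The step that builds the CSV is not shown in this article. As a rough sketch of what it might look like, the snippet below flattens one Pascal VOC style XML annotation into one row per object. The function name `annotation_to_rows`, the column names other than `imagename`, and the sample XML are assumptions made for illustration, not the author's script.

```python
import csv
import io
import xml.etree.ElementTree as ET

# Made-up annotation in the usual Pascal VOC layout (illustration only).
SAMPLE_XML = """<annotation>
  <filename>000001.jpg</filename>
  <object>
    <name>dog</name>
    <pose>Left</pose>
    <truncated>1</truncated>
    <difficult>0</difficult>
    <bndbox><xmin>48</xmin><ymin>240</ymin><xmax>195</xmax><ymax>371</ymax></bndbox>
  </object>
  <object>
    <name>person</name>
    <pose>Left</pose>
    <truncated>1</truncated>
    <difficult>0</difficult>
    <bndbox><xmin>8</xmin><ymin>12</ymin><xmax>352</xmax><ymax>498</ymax></bndbox>
  </object>
</annotation>"""


def annotation_to_rows(xml_text, year, imageset):
    """Return one dict per <object> in a VOC-style annotation (hypothetical helper)."""
    root = ET.fromstring(xml_text)
    imagename = root.findtext("filename")
    rows = []
    for obj in root.findall("object"):
        box = obj.find("bndbox")
        rows.append({
            "year": year,
            "imageset": imageset,
            "imagename": imagename,
            "class": obj.findtext("name"),
            "pose": obj.findtext("pose"),
            "truncated": int(obj.findtext("truncated")),
            "difficult": int(obj.findtext("difficult")),
            # Some annotation tools write float coordinates, so go through float().
            "xmin": int(float(box.findtext("xmin"))),
            "ymin": int(float(box.findtext("ymin"))),
            "xmax": int(float(box.findtext("xmax"))),
            "ymax": int(float(box.findtext("ymax"))),
        })
    return rows


rows = annotation_to_rows(SAMPLE_XML, year=2007, imageset="train")

# Write the rows as CSV (to an in-memory buffer here; a real script would use a file).
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=list(rows[0]))
writer.writeheader()
writer.writerows(rows)
print(buffer.getvalue())
```

In a real run you would loop over every annotation file of every year and image set and append all rows to a single CSV.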

Import CSV into Pandas

Once the CSV file is created, we can import it into pandas and do some visual analysis. But first, it is good practice to print the output of head() and info() to get a quick understanding of the CSV file we created earlier. You will see that each line is one detection of an object, with the coordinates of the bounding box, the class, the difficult flag, the pose, and whether the object is truncated. You will also see the year of the image set and the image name. Since each image can contain more than one object, the same image may appear on multiple lines.
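A minimal sketch of that first look, assuming a CSV with one row per detection. Only the `imagename` column name comes from this article; the other column names and the sample rows are made up for illustration, and an in-memory string stands in for the real file.

```python
import io

import pandas as pd

# Made-up sample rows; a real run would pass the CSV file path to read_csv.
CSV_TEXT = """year,imageset,imagename,class,pose,truncated,difficult,xmin,ymin,xmax,ymax
2007,train,000001.jpg,dog,Left,1,0,48,240,195,371
2007,train,000001.jpg,person,Left,1,0,8,12,352,498
2007,val,000002.jpg,train,Unspecified,0,0,139,200,207,301
2012,train,2008_000008.jpg,horse,Left,0,0,53,87,471,420
"""

df = pd.read_csv(io.StringIO(CSV_TEXT))

# First rows: check that the columns and values look as expected.
print(df.head())

# Column dtypes and non-null counts: a quick way to spot missing values.
df.info()
```

Note that 000001.jpg appears twice because it contains two objects, which is why later counts need to distinguish between detections and images.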

We can do a quick count of how many images there are per image set. Notice that we use drop_duplicates('imagename') to get the correct count, since an image with several objects appears on several lines.
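The difference between counting rows and counting images can be sketched as follows. This assumes "image set" means the release year and that the column is called `year`; both are assumptions, and the data is made up.

```python
import pandas as pd

# Made-up detections: 000001.jpg holds two objects, 2008_000008.jpg holds two.
df = pd.DataFrame({
    "year": [2007, 2007, 2007, 2012, 2012],
    "imagename": ["000001.jpg", "000001.jpg", "000002.jpg",
                  "2008_000008.jpg", "2008_000008.jpg"],
    "class": ["dog", "person", "train", "horse", "person"],
})

# Counting rows directly counts detections, not images.
detections_per_year = df.groupby("year").size()

# Keep one row per image first, then count.
images_per_year = df.drop_duplicates("imagename").groupby("year").size()

print(detections_per_year)
print(images_per_year)
```

If the same file name could occur in more than one year, passing both columns, as in `drop_duplicates(["year", "imagename"])`, would be the safer choice.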

Next, we look at the split of images into training, validation, and test sets across all years. Most people who use the Pascal VOC dataset lump train and val together as the training set and use test as the validation set while training the model. This increases the number of training samples.
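One way to tabulate that split, as a sketch on made-up data. The `imageset` column name and the `role` mapping are assumptions for illustration.

```python
import pandas as pd

# Made-up detections; a.jpg contains two objects.
df = pd.DataFrame({
    "year": [2007, 2007, 2007, 2007, 2012, 2012],
    "imageset": ["train", "train", "val", "test", "train", "val"],
    "imagename": ["a.jpg", "a.jpg", "b.jpg", "c.jpg", "d.jpg", "e.jpg"],
})

# One row per image, so images with several objects are counted once.
images = df.drop_duplicates("imagename")

# Images per year and per train/val/test split.
split_by_year = pd.crosstab(images["year"], images["imageset"])

# The common practice described above: train+val for training, test for validation.
role = images["imageset"].map(
    {"train": "training", "val": "training", "test": "validation"}
)
role_counts = role.value_counts()

print(split_by_year)
print(role_counts)
```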

We can also look across all detections and check the number of objects per class. Here, we can see that the Pascal VOC dataset has an overwhelming number of person objects compared with every other class. You would probably want to use image augmentation to increase the training samples of the other classes. And if this is a custom dataset, you might want to collect more data to make the classes more balanced.
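A sketch of the per-class count on made-up data, assuming the class label is stored in a column named `class`. The ratio at the end is just one simple way to put a number on the imbalance.

```python
import pandas as pd

# Made-up detections, deliberately skewed towards "person".
df = pd.DataFrame({
    "class": ["person", "person", "dog", "person", "car", "person", "dog"],
})

# Objects per class, most frequent first.
class_counts = df["class"].value_counts()

# The same counts as a fraction of all detections.
class_share = df["class"].value_counts(normalize=True)

# Ratio between the largest and the smallest class.
imbalance_ratio = class_counts.max() / class_counts.min()

print(class_counts)
print(class_share.round(2))
print(imbalance_ratio)
```

With matplotlib installed, `class_counts.plot.bar()` turns the same counts into a bar chart.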

As a further check on individual classes, we can look at the train/val/test split for each class. We want a good split so that the validation set is a good representation of the training set across all classes.
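The per-class split can be sketched with a cross-tabulation. The column names `class` and `imageset` and the data are assumptions for illustration.

```python
import pandas as pd

# Made-up detections: only "person" is present in every split.
df = pd.DataFrame({
    "class": ["person", "person", "person", "person", "dog", "dog", "car"],
    "imageset": ["train", "train", "val", "test", "train", "train", "test"],
})

# Detections per class (rows) and split (columns).
per_class = pd.crosstab(df["class"], df["imageset"])

# The same table as row fractions, so classes of different sizes are comparable.
per_class_share = pd.crosstab(df["class"], df["imageset"], normalize="index")

# Classes that are missing entirely from at least one split.
classes_with_gaps = per_class[(per_class == 0).any(axis=1)].index.tolist()

print(per_class)
print(per_class_share.round(2))
print(classes_with_gaps)
```

A class that shows up in `classes_with_gaps` cannot be evaluated properly, or cannot be learned at all, depending on which split it is missing from.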

And finally, for any new dataset, you will want a sanity check on the labels and bounding boxes. In the script, you can spot-check an arbitrary image by changing objToInspect.
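A rough sketch of such a visual check using Pillow. Only the name `objToInspect` comes from this article; treating it as a row index, the column names, and the blank canvas that stands in for the real image file are all assumptions.

```python
import pandas as pd
from PIL import Image, ImageDraw

# Made-up detections for illustration.
df = pd.DataFrame({
    "imagename": ["000001.jpg", "000001.jpg", "000002.jpg"],
    "class": ["dog", "person", "train"],
    "xmin": [48, 8, 139],
    "ymin": [240, 12, 200],
    "xmax": [195, 352, 207],
    "ymax": [371, 498, 301],
})

# Assumed here to be the row index of the detection whose image we inspect.
objToInspect = 0
imagename = df.iloc[objToInspect]["imagename"]

# With the real dataset you would load the file with Image.open(...).
# A blank canvas stands in for the image so the sketch runs on its own.
img = Image.new("RGB", (353, 500), "white")
draw = ImageDraw.Draw(img)

# Draw every labeled box that belongs to the selected image.
boxes = df[df["imagename"] == imagename]
for _, det in boxes.iterrows():
    corners = [int(det["xmin"]), int(det["ymin"]), int(det["xmax"]), int(det["ymax"])]
    draw.rectangle(corners, outline="red")
    draw.text((corners[0] + 3, corners[1] + 3), det["class"], fill="red")

# img.show() opens a viewer; img.save("inspect.png") writes the result to disk.
```

Boxes that are shifted, swapped between classes, or much larger than the object tend to be obvious at a glance in the rendered image.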

Final words

This script is by no means an exhaustive check on your dataset, but I hope it helps you create a base for your own checks of your dataset.
