Algorithmia Blog - Deploying AI at scale

October 26, 2017

Deep Dive into Object Detection with Open Images, using Tensorflow

The new Open Images dataset gives us everything we need to train computer vision models, and just happens to be perfect for a demo! Tensorflow’s Object Detection API and its ability to handle large volumes of data make it a perfect choice, so let’s jump right in…

Open Images is a dataset created by Google that has a significant number of freely licensed annotated images. Initially it contained only classification annotations, or in simpler terms it had labels that described what, but not where. After a major version update to 2.0, more annotations were added – of particular importance was the introduction of object detection annotations. These new annotations not only described what was in a picture, but where it was located, by defining the bounding box (bbox) coordinates for specific objects in an image.

The object detection dataset consists of 545 trainable labels. These labels consist of everything from Bagels to Elephants – a major step up compared to similar datasets such as the Common Objects in Context dataset, which contains only 90 labels for comparison. Not only that, but the labels in Open Images contain a hierarchical structure. This means it’s even possible to create specialist classifiers for individual subsections of the whole dataset, wow!

This tutorial will describe the steps in detail of how to create your own object detector trained on the Open Images dataset and how to export it to the Algorithmia marketplace.

Before we go any further, we should let you know about some caveats regarding this demo.

Caveats

This deep dive tutorial assumes that you have a good working knowledge of git, python, bash, and conventional linux operations. Our example is strictly defined within the debian/linux operating system environment; however, with some tweaking it should work for most other environments.

The complete dataset is ~6.2 TB downloaded and uncompressed. You might want to tweak our image downloader to resize images as they come in.

The Tensorflow framework is super memory hungry. It will expect to have sufficient host memory to run, otherwise it will crash with difficult-to-decipher exceptions. It’s recommended you have at least 32GB of RAM, although you can use scratch space instead.

The Open Images dataset is comprehensive and large, but many of its classes are unbalanced, which affects our precision of underrepresented classes. As this is an introductory tutorial, we leave more comprehensive dataset improvements, such as SMOTE, to the reader.

Tensorflow Object Detection

As the namesake suggests, the extension enables Tensorflow users to create powerful object detection models using Tensorflow’s directed compute graph infrastructure. It’s crazy powerful, but a little difficult to use as the documentation is a bit light. In this article we’ll walk you through each step and describe why.

In the Open Images dataset, all data is formatted in the CSV format. CSV is great for having a low footprint and easy for spreadsheets to parse. However, as a format it isn’t very human readable and there are other alternatives that are easier to work with programmatically. For these reasons we decided to convert our annotations and images files into JSON, so we can work with them in a simpler fashion.

It should also be mentioned that although the annotations file contains 600 different labels, only 545 of them are strictly trainable. We’re going to need to cross-reference with the trainable_classes.txt file to keep only the trainable labels.

The image index file contains the image url and ID for every image in the entire dataset, even images that don’t contain bbox annotations!
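The CSV-to-JSON conversion and the trainable-label filter described above can be sketched roughly as follows. This is only an illustration: it assumes the bounding-box CSV has a header row with `ImageID`, `LabelName`, `XMin`, `XMax`, `YMin`, and `YMax` columns, and the function name, output keys, and the `/m/aaa`-style labels are hypothetical stand-ins.

```python
import csv
import io
import json


def annotations_to_json(csv_file, trainable_labels):
    """Turn bbox annotation rows into a list of dicts, keeping only trainable labels."""
    trainable = set(trainable_labels)  # set membership keeps the filter cheap per row
    output = []
    for row in csv.DictReader(csv_file):
        if row["LabelName"] not in trainable:
            continue  # label exists in the annotations but is not trainable
        output.append({
            "id": row["ImageID"],
            "label": row["LabelName"],
            "x_min": float(row["XMin"]),
            "x_max": float(row["XMax"]),
            "y_min": float(row["YMin"]),
            "y_max": float(row["YMax"]),
        })
    return output


# Tiny in-memory stand-in for the real annotations file (assumed column layout).
sample_csv = io.StringIO(
    "ImageID,Source,LabelName,Confidence,XMin,XMax,YMin,YMax\n"
    "img1,human,/m/aaa,1,0.1,0.5,0.2,0.6\n"
    "img2,human,/m/zzz,1,0.0,1.0,0.0,1.0\n"
)
annotations = annotations_to_json(sample_csv, ["/m/aaa"])
print(json.dumps(annotations, indent=2))
```

With the real files you would pass an open file handle instead of the `io.StringIO` sample and dump the result to disk with `json.dump`.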

Translating Class Definitions

The trainable_classes.txt file contains encoded labels, which is totally fine for training but can be a headache during evaluation. Let’s quickly use the class_descriptions.csv file to create a translated trainable classes file.

As you can see, we perform a simple string replacement (with filter) for each element, in exactly the same format as the original trainable_classes.txt file. This will help us considerably when it comes time for evaluation and inference, so it’s good that we got it out of the way first.
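For readers without the original script at hand, here is a minimal sketch of that translation step. It assumes class_descriptions.csv has two header-less columns (encoded label, readable name) and that the trainable classes file lists one encoded label per line; `translate_classes` and the sample labels are hypothetical.

```python
import csv
import io


def translate_classes(trainable_lines, descriptions_file):
    """Replace each encoded label with its readable name, preserving the original order."""
    lookup = {row[0]: row[1] for row in csv.reader(descriptions_file) if len(row) >= 2}
    translated = []
    for line in trainable_lines:
        code = line.strip()
        if code in lookup:  # the "filter": skip labels that have no description
            translated.append(lookup[code])
    return translated


# In-memory stand-ins for class_descriptions.csv and trainable_classes.txt.
descriptions = io.StringIO("/m/aaa,Bagel\n/m/bbb,Elephant\n/m/ccc,Unused\n")
trainable = ["/m/bbb\n", "/m/aaa\n"]
translated = translate_classes(trainable, descriptions)
print("\n".join(translated))
```

Because the output keeps the same order and one-label-per-line layout, it can be written straight back out as the translated trainable classes file.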

def dedupe(seq):
    seen = set()
    seen_add = seen.add
    return [x for x in seq if not (x in seen or seen_add(x))]

We then follow suit with our image index file by again translating CSV rows into JSON elements. It should be noted that the image indices file contains vast quantities of image related metadata, however, in our circumstance we only care for the image id and the URL.

Filtering is done by constructing an output array consisting only of image indices whose ids have bounding box annotations; all other elements are removed.

# Let's check each image and only keep it if its ID has a bounding box annotation associated with it.
def filter_image_index(dataset, ids):
    output_list = []
    for element in dataset:
        if element['id'] in ids:
            output_list.append(element)
    return output_list

We then construct an easier to use primitive by refactoring our annotations, grouping them based on image ids. We call these grouped elements “points” for clarity.
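One way to picture the grouping into “points” is the sketch below. The dictionary keys (`id`, `label`, `x_min`, and so on) and the function name are assumptions made for illustration rather than the exact schema used in our scripts.

```python
from collections import defaultdict


def group_into_points(annotations):
    """Group flat annotation dicts by image id into one "point" per image."""
    grouped = defaultdict(list)
    for annotation in annotations:
        # Everything except the image id describes a single bounding box.
        box = {key: value for key, value in annotation.items() if key != "id"}
        grouped[annotation["id"]].append(box)
    return [{"id": image_id, "annotations": boxes} for image_id, boxes in grouped.items()]


annotations = [
    {"id": "img1", "label": "/m/aaa", "x_min": 0.1, "x_max": 0.5, "y_min": 0.2, "y_max": 0.6},
    {"id": "img2", "label": "/m/bbb", "x_min": 0.0, "x_max": 1.0, "y_min": 0.0, "y_max": 1.0},
    {"id": "img1", "label": "/m/bbb", "x_min": 0.3, "x_max": 0.9, "y_min": 0.1, "y_max": 0.4},
]
points = group_into_points(annotations)
print(points[0])
```

Each point then carries everything needed to build one training example: a single image id plus all of its boxes.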

Image Downloading

As many of you might have realized, downloading ~660k web scaled images is a monstrous task. Thankfully downloading images is partially an asynchronous task, which is something we can take advantage of by multi-threading our application.

First, let’s look at our parallel processing function as it’s not quite the standard multiprocessing.pool.starmap affair. We like using this specific version since visualizing our code performance is something that matters to us for long running scripts such as this. Essentially what’s important to note is that the array parameter denotes the iterable you plan to parallel map over, and function denotes the function you plan to parallelize.

# This is a nice parallel processing tool that uses tqdm
# to help visualize time-to-completion.
from concurrent.futures import ProcessPoolExecutor, as_completed
from tqdm import tqdm

def parallel_process(array, function, n_jobs=16, use_kwargs=False, front_num=3):
    """
        A parallel version of the map function with a progress bar.
        Args:
            array (array-like): An array to iterate over.
            function (function): A python function to apply to the elements of array
            n_jobs (int, default=16): The number of cores to use
            use_kwargs (boolean, default=False): Whether to consider the elements of array as dictionaries of
                keyword arguments to function
            front_num (int, default=3): The number of iterations to run serially before kicking off the parallel job.
                Useful for catching bugs
        Returns:
            [function(array[0]), function(array[1]), ...]
    """
    # Start with an empty list so that front_num=0 doesn't leave `front` undefined
    front = []
    # We run the first few iterations serially to catch bugs
    if front_num > 0:
        front = [function(**a) if use_kwargs else function(a) for a in array[:front_num]]
    # If we set n_jobs to 1, just run a list comprehension. This is useful for benchmarking and debugging.
    if n_jobs == 1:
        return front + [function(**a) if use_kwargs else function(a) for a in tqdm(array[front_num:])]
    # Assemble the workers
    with ProcessPoolExecutor(max_workers=n_jobs) as pool:
        # Pass the elements of array into function
        if use_kwargs:
            futures = [pool.submit(function, **a) for a in array[front_num:]]
        else:
            futures = [pool.submit(function, a) for a in array[front_num:]]
        kwargs = {
            'total': len(futures),
            'unit': 'it',
            'unit_scale': True,
            'leave': True
        }
        # Print out the progress as tasks complete
        for f in tqdm(as_completed(futures), **kwargs):
            pass
    out = []
    # Get the results from the futures.
    for i, future in tqdm(enumerate(futures)):
        try:
            out.append(future.result())
        except Exception as e:
            out.append(e)
    return front + out

Looking at our download function, we can see that it uses a global save_directory_path defined later in our script, which denotes the directory in which we plan to save our files. Unfortunately in python, most parallel mapping tools do not support “constant” parameter inputs, and in this case it made the most sense to provide this variable as a script specific global.

Our downloader function primarily uses the requests library and attempts to download each image from its URL. In this example if for any reason an exception is thrown, we skip that image. Obviously there are situations where this approach is substandard, so use at your own risk.

The successfully downloaded image is saved as a binary stream to a file with its name defined by the image id. This makes it easier to search and load images quickly and efficiently.
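As a rough illustration of the download-and-skip behaviour described above, here is a sketch that swaps in the standard library's `urllib.request` in place of requests, and passes the save directory as an argument instead of a global, so the snippet runs on its own. The function name, the `.jpg` extension, and the element keys (`id`, `url`) are assumptions.

```python
import os
import tempfile
import urllib.request
from pathlib import Path


def download_image(element, save_directory_path, timeout=10):
    """Fetch element['url'] and save it as <id>.jpg; return the path, or None on any failure."""
    target = os.path.join(save_directory_path, element["id"] + ".jpg")
    try:
        with urllib.request.urlopen(element["url"], timeout=timeout) as response:
            data = response.read()
        with open(target, "wb") as handle:
            handle.write(data)  # raw bytes, written exactly as received
    except Exception:
        return None  # mirror the "skip on any exception" policy described above
    return target


# Demonstration with file:// URLs so that no network access is needed.
with tempfile.TemporaryDirectory() as workdir:
    source = Path(workdir) / "source.bin"
    source.write_bytes(b"fake image bytes")
    ok = download_image({"id": "img1", "url": source.as_uri()}, workdir)
    missing = download_image({"id": "img2", "url": (Path(workdir) / "nope.bin").as_uri()}, workdir)
    saved_bytes = Path(ok).read_bytes()

print(os.path.basename(ok), missing)
```

Returning None for skipped images makes it easy to drop failed downloads from the result list afterwards.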

Whoa, that’s gonna take a while! Make sure that you don’t have bandwidth caps before downloading. ~660k images is a lot of images and we advise you to double check that you have enough storage space to cope.

Image Verification and Dimension Reduction

Now we have a ton of images, but they are all different sizes, and some of them might be broken! Let’s go ahead and verify them, but instead of verifying and resizing in two separate commands, let’s get efficient and combine the verification and resize operations.

For each label element we check its image for validity. First we inspect it to ensure that nothing is broken; if it is intact, we re-scale it if necessary. If an output directory is not defined, we overwrite the original file.

If anything goes wrong during image processing, we know that the image is not formatted correctly and we filter it out of our label’s list.

Note: Our thumbnail dimensions are set to reduce training cost but aren’t of any particular “standard”. We set something small as to reduce the overhead when creating TFRecords. Some object detection networks are designed to work with a number of image dimensions and aspect ratios, but resizing here is not strictly necessary for training. It does help, though.

Run that process for the training, testing, and validation sets and we’re almost there. If you want to preserve the original files, provide a resized_directory path variable which will define where we save the resized/verified images to.

Defining the Label Map

Tensorflow requires a label_map protobuffer file for evaluation. This object essentially just maps a label index (an integer value used in training) to a label keyword. If you train without an evaluation step you can avoid this; however, it will help when performing inference later.

There are no available writing tools to generate label_map files for Tensorflow, and for large label sets like ours it can be super cumbersome to write one manually. Because of this we decided to create an automated string replacement tool that satisfies the label map format requirements.
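A string-templating approach along those lines might look like the sketch below. It is not our exact tool: `build_label_map` is a hypothetical name, and the snippet assumes the usual `item { id, name, display_name }` text layout with ids starting at 1, since 0 is reserved for the background class.

```python
def build_label_map(encoded_labels, display_names):
    """Render one label_map `item` block per class, numbering ids from 1."""
    items = []
    for index, (code, name) in enumerate(zip(encoded_labels, display_names), start=1):
        # Note: names containing a single quote would need escaping; omitted in this sketch.
        items.append(
            "item {\n"
            f"  id: {index}\n"
            f"  name: '{code}'\n"
            f"  display_name: '{name}'\n"
            "}\n"
        )
    return "\n".join(items)


label_map = build_label_map(["/m/aaa", "/m/bbb"], ["Bagel", "Elephant"])
print(label_map)
```

Writing the returned string to a `.pbtxt` file gives a label map whose ids follow the order of the trainable classes file.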

The last step before we start constructing our model is to create TFRecord files.

TFRecord Creation

Tensorflow records are an interesting construct. They’re used nearly universally across Tensorflow objects as a dataset storage medium, and harbour a bunch of complexity, but the documentation on using your own dataset is sparse.

Thankfully we did all the hard work for you. This section will walk you through everything you need to start using a Tensorflow record!

First we must generate a “class number” or label index integer for each label. These integers are used directly by the neural network’s cross-entropy loss function, which is used to gauge the performance of the network in the classification task. We define the class number based on the order in which they are defined in the trainable_classes file.

To create a record for an object detection project, we need a few components. Some are on a per image basis while some are per annotation.

Unfortunately the API for creating “examples” or single elements in a TFRecord is a bit convoluted. You don’t provide an array of annotations, but instead a series of arrays for each individual component of an annotation. For these “per annotation” components, we include bounding box coordinates, the label’s “text” or definition, and a unique integer value to denote that particular class.

Note: If using your own dataset, make sure that your bounding box coordinates are relative to the image coordinates, rather than absolute. If your dataset’s annotation data is defined in absolute coordinates, make sure you convert them to relative coordinates before resizing your images! We almost got burned by that, learn from us 😀
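To make the note concrete, here is a small worked example with made-up pixel values. It shows that a box expressed in relative coordinates still lands on the same region after the image is resized, which is why the conversion has to use the original dimensions. The helper names are hypothetical.

```python
def to_relative(box, image_width, image_height):
    """Convert (x_min, y_min, x_max, y_max) in pixels to fractions of the image size."""
    x_min, y_min, x_max, y_max = box
    return (x_min / image_width, y_min / image_height,
            x_max / image_width, y_max / image_height)


def to_absolute(box, image_width, image_height):
    """Convert a relative box back to pixel coordinates for a given image size."""
    x_min, y_min, x_max, y_max = box
    return (x_min * image_width, y_min * image_height,
            x_max * image_width, y_max * image_height)


# A box in a 400x200 image, converted using the ORIGINAL dimensions.
relative_box = to_relative((100, 50, 300, 200), 400, 200)

# After shrinking the image to 200x100, the same relative box is still valid.
resized_pixels = to_absolute(relative_box, 200, 100)

print(relative_box)    # (0.25, 0.25, 0.75, 1.0)
print(resized_pixels)  # (50.0, 25.0, 150.0, 100.0)
```

Had we divided the original pixel coordinates by the resized dimensions instead, every value would have come out twice as large and fallen outside the image.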

First, we load the image file for this particular point and encode it as a byte array. It’s important to load the image this way since the object detection API’s internal image handling logic is fragile and may not represent images how you would expect them to.

Next, we iterate over the point object’s annotations and create element arrays. It should be noted that if any of the object arrays are missing or not the same length, Tensorflow will throw a bunch of exceptions.
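The “series of arrays” idea can be sketched in plain Python as follows. The feature key names follow the convention we understand the Object Detection API to expect, while the point schema, the sample labels, and the function name are assumptions. Wrapping the lists in Tensorflow feature objects is left out so the snippet runs without Tensorflow installed.

```python
def build_feature_lists(point, class_numbers):
    """Build one parallel list per annotation component for a single image."""
    features = {
        "image/object/bbox/xmin": [],
        "image/object/bbox/xmax": [],
        "image/object/bbox/ymin": [],
        "image/object/bbox/ymax": [],
        "image/object/class/text": [],
        "image/object/class/label": [],
    }
    for annotation in point["annotations"]:
        label = annotation["label"]
        features["image/object/bbox/xmin"].append(annotation["x_min"])
        features["image/object/bbox/xmax"].append(annotation["x_max"])
        features["image/object/bbox/ymin"].append(annotation["y_min"])
        features["image/object/bbox/ymax"].append(annotation["y_max"])
        features["image/object/class/text"].append(label.encode("utf8"))
        features["image/object/class/label"].append(class_numbers[label])
    # Every list grows by exactly one entry per annotation, so they stay the same length.
    return features


# Class numbers follow the order of the trainable classes file, starting at 1.
trainable_classes = ["/m/aaa", "/m/bbb"]
class_numbers = {label: index for index, label in enumerate(trainable_classes, start=1)}

point = {"id": "img1", "annotations": [
    {"label": "/m/bbb", "x_min": 0.1, "x_max": 0.5, "y_min": 0.2, "y_max": 0.6},
    {"label": "/m/aaa", "x_min": 0.3, "x_max": 0.9, "y_min": 0.1, "y_max": 0.4},
]}
features = build_feature_lists(point, class_numbers)
print(features["image/object/class/label"])
```

Appending to all six lists inside the same loop iteration is what keeps them aligned; building them in separate passes makes length mismatches much easier to introduce.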

Now that we’ve created what is analogous to a “row” in a database, we should write the data to a file – a TFRecord file!

While we have the write logic contained within the script’s main procedure for brevity, it could easily be placed in a separate function if you’re so inclined.

Great, we’re almost there now. We compiled the individual processing scripts into a series of gists for your use. Be sure to run these steps multiple times to create the Training, Testing, and Validation TFRecord files as we’ll need them for our next step.

Step 2: Setting up the Object Detection API

So all of our data is formatted properly into TFRecords files, and we’re just about ready to begin training. At this point we should start introducing elements from the object detection API.

The object detection API contains a couple of useful scripts that we can take advantage of. Namely, the eval.py and train.py scripts in the main directory. Installation is a bit of a pain though, so we’ll walk you through a quick setup to get things moving.

Setting up our Environment

First, you’ll need to get your system dependencies in place, so boot up a terminal and follow along!

That should get you along nicely. Now let’s make sure our environment variables are set (you may want to permanently set them, more on this here):

# while still in the models/research directory
export PYTHONPATH=$PYTHONPATH:`pwd`:`pwd`/slim
# and if you're using tensorflow-gpu and haven't set your cuda path yet:
export LD_LIBRARY_PATH=/usr/local/cuda/lib64

Great, now we’re fully set up. If you want to run a quick test to make sure everything works, try running this:

# while still in the models/research directory
python object_detection/builders/model_builder_test.py

Transfer Learning

Object detection is a difficult challenge that necessitates the use of deep learning techniques. This normally requires that we train a model with potentially hundreds of layers and millions of parameters! As you might imagine even our 660k image dataset would most likely be insufficient.

Thankfully there’s a solution! All object detection model configurations in the Object Detection API support transfer learning. What this means is that we’re able to take an existing pre-trained image classifier (which is trained on millions of images), and use it to jump start our detector.

Exactly how transfer learning works is beyond the scope of this deep dive, but to get a more intuitive understanding I recommend you check out the link above.

Great, so we can use pre-trained models, but where do we get them from?

Good question! Deep in the object detection API repository you can find this handy guide, which describes each classifier model. All of them are easy to swap in and out which is very convenient for testing.

So go ahead and download one of these files, and unzip them to a special directory – this will help us later.

Configuring our Object Detection Schema

We’ve accomplished a lot here and we’re almost ready to start training, but first we need to configure our graph buffer configuration.

In the Object Detection API, the standard way of defining a model for training is by creating or tweaking a config file. This file defines how tensorflow interprets your request, where to take data from and where to save data to.
There is a bunch of information that’s contained within this file, so let’s break it down into manageable chunks.

This is the start of the model configuration. We’re using the faster_rcnn object detection template here, which is where the faster_rcnn object comes from. This can be replaced with other architectures by contrasting with this page, but in this demo we’ll only be looking at faster_rcnn.

num_classes is the total number of classification labels, with 0 denoting the background class.

The image_resizer is important, and there are two main types of resizing, fixed_shape_resizer and keep_aspect_ratio_resizer. Image dimensionality is important for object detection. It should be noted that fixed_shape_resizer will pad the minor dimension instead of skewing or warping, which greatly improves stability in the face of natural web images.

Lots of boilerplate stuff right? Still, it’s important for tensorflow to understand exactly how to construct its computational graph, and exposing that level of detail gives you more fine grained control when you need it.

batch_size – this defines the number of work elements in your batch. Tensorflow requires a fixed number and doesn’t take into consideration GPU memory or data size. This number is highly dependent on your GPU hardware and image dimensions, and isn’t strictly necessary for quality results. Tensorflow requires each input array to have the same dimensionality, which means that any batch_size > 1 requires an image_resizer of fixed_shape_resizer. For more information on batching, check out this link.

optimizer – this is super important as it defines how your weights get updated by backpropagation. The default mode is a standard momentum_optimizer which is a flexible version of stochastic gradient descent (SGD). This works great for most kinds of systems, but for large sparse arrays like our output array the adam optimizer works best. If you want to check out the other options, look at this file.

fine_tune_checkpoint – here we define the directory and filename prefix of our pre-trained model file. This is why we saved the file in a directory all on its own. Don’t worry about the fact that you don’t have a model.ckpt file, Tensorflow will figure it out.

from_detection_checkpoint: True – not described in any of the documentation, but required for your pretrained object detection checkpoint to work correctly. If you use a pure “classification” checkpoint, leave this as false.

batch_queue_capacity – another important parameter. Tensorflow contains a streaming pipeline that allows you to load a reservoir of training batches into memory, but its size isn’t dynamically set from your available host memory. This number defaults to 300, which proved too high for our high performance training machine even with our images dramatically downscaled. Adjust accordingly.

gradient_clipping_by_norm – this is necessary to avoid exploding gradients. We set the value of 10 through experimentation but it can be adjusted.

data_augmentation_options – setting some augmentation options can dramatically increase our dataset’s size, while improving the robustness of our detector. For information as to what options are available, take a look at this file.

Ok, so we’ve got the training_config completed to our liking, our GPU is happily able to chug along now with no nasty OOM errors. Let’s take a look at the eval_config:

eval_config: {
  num_examples: 3000
  num_visualizations: 20
}

Much shorter right? The default parameters are actually for the most part OK here, especially since the evaluation step is mostly for visualizing generalization and robustness. If your system begins to hang when both training & evaluation steps are running, it might be worth it to reduce the num_examples value. If you want to take a look at the whole list of options, check out the eval file.

Pretty simple right? We set shuffle to false because we want to see how the network improves from one evaluation to the next, but you can set that to true if you’d rather get a more stochastic result.

Ok, so far we’ve manipulated and formatted our dataset metadata, downloaded, verified and resized all of our image files and created our record files. We’ve loaded and prepared the object detection API, and now created our config file.

Step 3: Training and Production

Everything is setup to begin training, but first let’s describe the training and evaluation process quickly.

There are two important scripts in the object detection API directory: eval.py and train.py. It’s true that we don’t need to run the eval.py script as it doesn’t contribute to training, however, it provides us with invaluable training insight that can be easily viewed and shared using Tensorboard. Describing how to get Tensorboard setup is outside of the scope of this example, however, the documentation in the link above should be more than enough to get you started.

The following scripts are used at the command line, and should be run in separate terminal sessions. We recommend using the screen tool for simplicity.

The training_output directory will contain the all-important checkpoint files necessary for inference and serving once your model is sufficiently trained. Logging to stderr means that you’ll have a more verbose output, which is useful for debugging.

With all of those scripts running, you’re on your way to training your neural network! Training may take some time, so make sure to check back with your running Tensorboard instance to inspect the generalization of your model. It should also be noted that the object detection API will not stop when it “runs out of data”, the best way to detect when it’s completed a single pass is when the average precision begins to flat line.

With Tensorboard we can even check out some sample images and see what our evaluation looks like at a glance.

Sweet, looks like we actually trained something that’s able to detect things. Let’s look at putting this into production.

Frozen Graph Generation

Awesome, we’re almost at the finish line now. We’ve trained our model and we like the results, but we can’t easily use our model files for inference in their current format.

Tensorflow has a concept known as exporting a metagraph. Freezing a graph allows us to combine the model structure (the configuration file) along with the weight and gradient data into a single binary protobuffer file.

For most inference techniques, we do that by executing a script called export_inference_graph.py which again is found in the object_detection repository.

After that’s done, you now have this frozen_inference_graph.pb file in your frozen directory. Ignore the rest of the gobbledygook in there and upload it to the data API, along with our previously defined label_map.pbtxt so we can convert our encoded classes into things like cat, dog, and apple.

Serving Inferences with Algorithmia

We have everything we need now to create a useful algorithm on Algorithmia! Our first step is for you to create a new algorithm and define its language as a python3 algorithm for Tensorflow support. Make sure to state that our algorithm requires access to the internet and requires a GPU for processing, or our inferences will take a boatload of time.

Let’s look at our actual algorithm file now. We’ll break it up into chunks and talk about each section individually.

We need a couple of extra files from the object_detection repository to get things to work, namely the label_map_util.py and string_int_label_map_pb2.py scripts. Both files are provided in our repository.

As per all of our standard Python algorithms – we define any constant, reused parameters in advance, particularly files and algorithms that we may be interacting with multiple times. By defining everything in advance we make it easier to change things later.

We also describe the AlgorithmError object, this helps us throw more concise exceptions.

This is our standard Tensorflow object detection preload snippet. Pay close attention to how path_to_model is used to setup the detection_graph object. As you can see it is defined through the global tf object, which makes further refinement of this process tricky.

Our label map gets converted into a category_index, which is useful for easy label lookups in our inference function.

This is a very important component that reduces Tensorflow’s memory hogging nature. It also reduces bottlenecks and OOM errors when running the inference script on algorithmia. If per_process_gpu_memory_fraction is not defined, it defaults to 1.

Defining the allow_growth variable means that we only allocate as much GPU memory as strictly necessary.

This is the big meat and potatoes. This is our main inference function, so let’s unpack this.

We define the GPU memory fraction to an easy 0.6, but it can be adjusted as necessary. We format our image data into a Numpy array, and extract its dimensions for the inference process. We then extract Tensorflow tensor handles that are defined in the output of our graph. After that we actively run the inference step by using the sess.run function.

The inference step is by far the most time consuming process, but after that’s complete we can format the results into a useful form. We filter out boxes with a confidence score less than min_score and format the rest into an easy to parse JSON format.
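The filtering and formatting step can be sketched like this. The snippet uses plain Python lists in place of the arrays returned by the session, and assumes boxes arrive as normalized `[ymin, xmin, ymax, xmax]` values; the function name and output keys are hypothetical, not the exact schema of our algorithm.

```python
import json


def format_detections(boxes, scores, classes, category_index, min_score):
    """Keep detections scoring at least min_score and return JSON-friendly dicts."""
    results = []
    for box, score, class_id in zip(boxes, scores, classes):
        if score < min_score:
            continue  # too uncertain to report
        y_min, x_min, y_max, x_max = box
        results.append({
            "label": category_index[class_id]["name"],
            "confidence": float(score),
            "bounding_box": {"x0": x_min, "y0": y_min, "x1": x_max, "y1": y_max},
        })
    return results


# Stand-in values; real ones would come from the inference step.
category_index = {1: {"id": 1, "name": "Bagel"}, 2: {"id": 2, "name": "Elephant"}}
boxes = [[0.1, 0.2, 0.5, 0.6], [0.0, 0.0, 1.0, 1.0]]
scores = [0.92, 0.30]
classes = [2, 1]

detections = format_detections(boxes, scores, classes, category_index, min_score=0.5)
print(json.dumps(detections, indent=2))
```

Converting each score with `float()` keeps the output serializable even when the inputs are numpy scalars.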

As you might have noticed we return our results here as updates to our mutable list output instead of a regular return. We’ll show you why we do this in our apply function later.

If the user requires a graphic result, we can use our bounding box on image algorithm to quickly create a graphical representation of our detection results. By using this logic we can quickly create images just like:

Finally, let’s look at our apply function, the heart of any algorithm on Algorithmia. In this function we are provided with an input which can be of multiple types. We first must process this input into an expected schema type which is what the first half of the function is doing.

However, as you might notice, we’re using some multiprocessing functionality, specifically a managed list and a Process. Why would we ever want to use a multiprocessing suite for what is essentially a sequential algorithm?

Tensorflow today is defined with the global variable tf. When the function inference exits, the variable still contains its set properties and values. One of these values is our GPU memory context, which is only released when the tf variable is released. Because of this, we can run into issues with Tensorflow not releasing GPU memory when it should, which can cause lots of complications later on down the road. By running the Tensorflow application in a separate process, and then letting that process exit, we kill the Tensorflow GPU memory context without influencing performance!
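The pattern itself is independent of Tensorflow, so here is a minimal sketch using only the standard library. `fake_inference` is a hypothetical stand-in for the real inference function, and the snippet explicitly requests the `fork` start method, which assumes a Linux-style environment like the one used throughout this post.

```python
import multiprocessing as mp


def fake_inference(image_name, output):
    """Stand-in for the real inference function: report results via the shared list."""
    output.append({"image": image_name, "boxes": []})


def run_isolated(target, *args, timeout=60):
    """Run target(*args, output) in a child process and return what it appended to output."""
    ctx = mp.get_context("fork")  # assumption: a platform where fork is available
    with ctx.Manager() as manager:
        output = manager.list()  # managed list, visible to both parent and child
        worker = ctx.Process(target=target, args=(*args, output))
        worker.start()
        worker.join(timeout)
        if worker.is_alive():
            worker.terminate()
            worker.join()
            raise RuntimeError("inference did not finish in time")
        if worker.exitcode != 0:
            raise RuntimeError("inference process exited with an error")
        # Copy the results out before the manager shuts down.
        return list(output)


results = run_isolated(fake_inference, "cat.jpg")
print(results)
```

Everything the child allocated, including any GPU context in the real algorithm, is released by the operating system once the child process exits, while the parent keeps only the copied results.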

After we extract our results from our managed list, we can quickly finish off with some post processing and return it!

And finally, at the bottom of this script, you can see that we run the load_model() script in a global state. This means that we pre load the frozen graph into host memory, which dramatically reduces API request latency and variability.

And that’s it. We’re done! If you want to see a working demo algorithm of this object detector take a look here.