Intro

The goal of this project is to plan a path for a car to make its way through
highway traffic. You have at your disposal a map consisting of a set of
waypoints. The waypoints follow the center of the road, and each comes with
its normal vector. You also know the positions and velocities of nearby
vehicles. Your car needs to obey the speed limit of 50 MPH (22.35 m/s), not
collide with other traffic, keep its acceleration within certain limits, and
minimize jerk (the time derivative of acceleration). The path that you need to
compute is a set of successive Cartesian coordinates that the car will visit,
one exactly every 0.02 seconds.

Here's the result I got. It's not always optimal, but I used large safety
margins and did not spend much time tweaking it.


Path planning

Udacity recommends using minimum-jerk trajectories defined in Frenet
coordinate space to solve the path planning problem. This approach, however,
struck me as suboptimal and hard to do well because of the distance distortions
caused by the nonlinearity of the coordinate transforms between Cartesian and
Frenet space and back. Therefore, I decided to reuse the code I wrote for
doing model-predictive control. However, instead of using the actuations
computed by the algorithm, I used the model states themselves as the source of
the coordinates that the vehicle should visit.

The controller follows a trajectory defined as a polynomial fitted to the
waypoints representing the center of one of the lanes. In the first step, it
computes 75 points starting from the current position of the car. In each
successive step, it takes 25 points of the previously computed trajectory
that the car has not yet visited, estimates the vehicle's state at the 25th
point, and uses this estimate as input to the solver. The points produced this
way extend the path. This cycle repeats every 250 ms, with the target speed
given by a rule of roughly the following shape (\( k \) and \( d_{\text{safe}} \)
are assumed hand-tuned constants):

where \( v_l \) is the speed of the leader and \( d_l \) is the car's distance
from it. Subtracting a proximity penalty tends to make the speed "bounce" less
than using fractional terms does. I was too lazy to tune a PID controller for
this task.
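
A toy sketch of the replanning cycle itself; the solver below is a
straight-line stand-in rather than the actual MPC code:

DT = 0.02     # seconds between consecutive path points
N_TOTAL = 75  # points computed in the first step
N_KEPT = 25   # unvisited points carried over each cycle

def toy_solver(state, n_points, v_target):
    # Stand-in for the MPC solver: extends the path along a straight line.
    x, y = state
    return [(x + v_target * DT * (i + 1), y) for i in range(n_points)]

def replan(path, n_visited, v_target):
    # Keep the unvisited points, use the state at the last kept point as the
    # seed for the solver, and extend the path with the newly computed points.
    kept = path[n_visited:n_visited + N_KEPT]
    seed = kept[-1]
    return kept + toy_solver(seed, N_TOTAL - N_KEPT, v_target)

path = toy_solver((0.0, 0.0), N_TOTAL, 20.0)      # initial 75 points
path = replan(path, n_visited=10, v_target=20.0)  # one 250 ms cycle later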

Lane selection

The optimal lane is selected using a simple finite state machine with cost
functions associated with state transitions. Every three seconds, the algorithm
evaluates these functions for all reachable states and chooses the cheapest
move.

Lane Changing FSM

The cost function for keeping the current lane (the KL2KL transition)
penalizes the difference in speed between the vehicle and the leader (if any)
as well as their proximity. The one for changing lanes (the KL2CL transition)
does a similar calculation for the target lane. Additionally, if a follower is
close enough, it adds a further penalty proportional to the follower's speed.

The start-up state allows the car to ramp up to the target speed before the
actual lane-change logic kicks in. Doing so avoids the erratic behavior that
the penalization of speed differences would otherwise cause.
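
A toy sketch of the cost evaluation; the weights and thresholds below are
made up for illustration, not the project's actual values:

def keep_lane_cost(ego_v, leader_v, leader_dist):
    # Penalize the speed difference with the leader and their proximity.
    return abs(ego_v - leader_v) + 100.0 / max(leader_dist, 1.0)

def change_lane_cost(ego_v, leader_v, leader_dist, follower_v, follower_dist):
    # Same calculation for the target lane, plus a follower penalty.
    cost = abs(ego_v - leader_v) + 100.0 / max(leader_dist, 1.0)
    if follower_dist < 15.0:
        cost += follower_v     # penalty proportional to the follower's speed
    return cost

costs = {'KL2KL': keep_lane_cost(20.0, 18.0, 30.0),
         'KL2CL': change_lane_cost(20.0, 22.0, 60.0, 19.0, 25.0)}
best_move = min(costs, key=costs.get)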

Conclusions

Using the MPC approach has the advantage of letting you tune more parameters of
the trajectory. You can choose how much acceleration and jerk you are willing
to tolerate, increasing or decreasing the perceived aggressiveness of lane
changes.

I also tried incorporating the predicted trajectories of nearby vehicles into
the solver's cost function, but the car ended up going out of its lane,
accelerating or braking aggressively to avoid collisions. That is something
you might do while driving in the real world, but it breaks the rules of this
particular game. Another problem was that I used a Naïve Bayes classifier
to predict the behavior of other vehicles, and it was not accurate enough. The
mispredictions made the car behave erratically, trying to avoid collisions with
nonexistent obstacles.

Intro

I have recently stumbled upon two articles (1, 2) about running TensorFlow on
CPU setups. Out of curiosity, I decided to check how the kinds of models I use
behave in such situations. As you will see below, the results were somewhat
unexpected. I did not put in the time to investigate what went wrong, and my
attempts to reason about the performance problems are pure speculation.
Instead, I just ran my models with a bunch of different threading and OpenMP
settings that people typically recommend on the Internet and hoped to get a
drop-in alternative to my GPU setup. In particular, I did not convert my
models to use the NCHW format as recommended by the Intel article. This
data format conversion seems to be particularly important, and people
report performance doubling in some cases. However, since my largest test case
uses transfer learning, applying the conversion is a pain. If you happen to know
how to optimize the settings better without major tweaking of the models, please
do drop me a line.

TensorFlow settings

The GPU flavor was compiled with CUDA support; the CPU version was configured
with only the default settings; the MKL flavor uses the MKL-ML library that
TensorFlow's configuration script downloads automatically.

The GPU and the CPU setups run with the default session settings. The other
configurations change the threading and OpenMP settings on a case-by-case
basis. I use the following annotations when talking about the tests:

[xC,yT] means the KMP_HW_SUBSET environment variable set to xC,yT and both
the interop and intraop thread counts set to 1.

[affinity] means the KMP_AFFINITY environment variable set to
granularity=fine,verbose,compact,1,0 and the interop thread count set to
2.

More information on controlling thread affinity is here, and
this is an article on managing thread allocation.
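
For reference, a minimal sketch of how these annotations translate into code
with TensorFlow 1.x; the values below mirror the [4C,2T] annotation with both
thread counts at 1:

import os
import tensorflow as tf

os.environ['KMP_HW_SUBSET'] = '4C,2T'   # the [xC,yT] part
# For the [affinity] runs instead:
# os.environ['KMP_AFFINITY'] = 'granularity=fine,verbose,compact,1,0'

config = tf.ConfigProto(inter_op_parallelism_threads=1,
                        intra_op_parallelism_threads=1)
session = tf.Session(config=config)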

Tests and Results

The test results are the times it took to train one epoch, normalized to the
result obtained using the ti-gpu configuration: a score of around 20 means
that the setting is 20 times slower than the baseline.

LeNet - CIFAR10

The first test uses the LeNet architecture on the CIFAR-10 data. The MKL
setup ran with [4C,2T] on ti and [affinity] on m4. The results are
pretty surprising because the model consists almost exclusively of the
operations that Intel claims to have optimized. The fact that ti ran faster
than m4 might suggest that there is some synchronization issue in the
graph-handling algorithms preventing it from processing a bunch of tiny images
efficiently.

Road Sign Classifier

The second test is my road sign classifier. It uses mainly 2D convolutions
and pooling, but they are interleaved with hyperbolic tangent activations as
well as dropout layers. This probably prevents the graph optimizer from
grouping the MKL nodes together, resulting in frequent data format conversions
between NHWC and Intel's SIMD-friendly format. Also, ti scored better than
m4 for the MKL version but not for the plain CPU implementation. This would
suggest inefficiencies in the OpenMP implementation of threading.

Image Segmentation - KITTI (2 classes)

The third and fourth tests run a fully convolutional neural network
based on VGG16 for an image segmentation project. Apart from the usual
suspects, this model uses transposed convolutions to handle learnable
upsampling. The tests differ in the size of the input images and in the sizes
of the weight matrices handled by the transposed convolutions. For the KITTI
dataset, the ti-mkl config ran with [intraop=6, interop=6] and m4-mkl with
[affinity].

Image Segmentation - Cityscapes (29 classes)

For the Cityscapes dataset, ti-mkl ran with [intraop=6, interop=6] and
m4-mkl with [intraop=44, interop=6]. The MKL config was as fast as the
baseline CPU configs only for the dataset with fewer classes and thus smaller
upsampling layers. The slowdown for the dataset with more classes can probably
be explained by the difference in the handling of the transposed convolution
nodes.

Conclusions

It was an interesting experience that left me with mixed feelings. On the one
hand, the best baseline CPU implementation was, at worst, two to four times
slower than the Amazon P2 with only compiler optimizations. That is a much
better outcome than I had expected. On the other hand, the MKL support was a
disappointment. To be fair, that is in large part probably because of my
refusal to spend enough time tweaking the parameters, but hey, it was supposed
to be a drop-in replacement, and I don't need to do any of this when using a
GPU. Another reason is that TensorFlow probably has too few MKL-based kernels
to be worth using in this mode, and the frequent data format conversions kill
the performance. I have also noticed the MKL setup making no progress with
some threading configurations despite all the cores being busy. I might have
hit the Intel Hyperthreading bug.

Notes on building TensorFlow

The GPU versions were compiled with GCC 5.4.0, CUDA 8.0 and cuDNN 6. The ti
configuration used CUDA capability 6.1, whereas the p2 configuration used 3.7.
The compiler flags were:

ti: -march=core-avx-i -mavx2 -mfma -O3

p2: -march=broadwell -O3

The CPU versions were compiled with GCC 7.1.0 with the following flags:

ti: -march=skylake -O3

m4: -march=broadwell -O3

I tried compiling the MKL version with the additional -DEIGEN_USE_MKL_VML flag
but got worse results.

The MKL library is poorly integrated with TensorFlow's build system. For
some strange reason, the configuration script creates a link to libdl.so.2
inside the build tree, which results in the library being copied to the final
wheel package. Doing so is a horrible idea because, in glibc, libdl.so mostly
provides an interface to libc.so's private API, so a system update may break
the TensorFlow installation. Furthermore, the way in which it figures out which
library to link against is broken. The configuration script uses the locate
utility to find all files named libdl.so.2 and picks the first one from the
list. Now, locate is not installed on Ubuntu or Debian by default, so if you
did not do:

]==> sudo apt-get install locate
]==> sudo updatedb

at some point in the past, the script will be killed without an error message,
leaving the source tree unconfigured. Moreover, the first pick is usually the
wrong one. If you run a 64-bit version of Ubuntu with multilib support, the
script will end up choosing a 32-bit version of the library. I happen to hack
on glibc from time to time, so in my case, it ended up picking one that was
cross-compiled for a 64-bit ARM system.

I have also tried compiling Eigen with full MKL support as suggested in
this thread. However, Eigen's and MKL's BLAS interfaces seem to be out of
sync. I attempted to fix the situation but gave up when I noticed Eigen
passing floats, using incompatible data types, to MKL functions expecting
complex numbers. I will continue using the GPU setup, so fixing all that and
doing proper testing was way more effort than I was willing to make.

Note 14.07.2017: My OCD gained the upper hand again and
I figured it out. Unfortunately, it did not improve the numbers at all.

What is it about?

Semantic segmentation is the process of dividing an image into sets of pixels
sharing similar properties and assigning one of the pre-defined labels to each
of these sets. Ideally, you would like to get a picture such as the one
below. It's the result of blending color-coded class labels with the original
image. This sample comes from the CityScapes dataset.

Segmented Image

Segmentation Classes

How is it done?

Figuring out object boundaries in an image is hard. There's a variety of
"classical" approaches taking colors and gradients into account that obtained
encouraging results; see this paper by Shi and Malik for example.
However, in 2015 and 2016, Long, Shelhamer, and Darrell
presented a method using Fully Convolutional Networks that significantly
improved the accuracy (the mean intersection-over-union metric) and the
inference speed. My goal was to replicate their architecture and use it to
segment road scenes.

A fully convolutional network differs from a regular convolutional network in
that it has the final fully-connected classifier stripped off. Its goal is to
take an image as input and produce an equally-sized output in which each pixel
is represented by a softmax distribution describing the probability of that
pixel belonging to a given class. I took this picture from one of the papers
mentioned above:

Fully Convolutional Network

For the results presented in this post, I used the pre-trained VGG16 network
provided by Udacity for the beta test of their Advanced Deep Learning Capstone.
I took layers 3, 4, and 7 and combined them in the manner described in the
picture below, which, again, is taken from one of the papers by Long et al.

Upscaling and merging

First, I used 1x1 convolutions on top of each extracted layer to act as
local classifiers. After doing that, these partial results are still 32, 16,
and 8 times smaller than the input image, so I needed to upsample them (see
below). Finally, I used a weighted addition to obtain the result. The authors
of the original paper report that, without the weighting, the learning process
diverges.
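
A minimal TF 1.x sketch of this merging scheme; the tensor shapes and the
0.01/0.0001 scaling factors are assumptions for illustration:

import tensorflow as tf

num_classes = 2
# Stand-ins for the extracted VGG16 tensors (shapes assume a 256x256 input).
vgg_layer3 = tf.placeholder(tf.float32, [None, 32, 32, 256])
vgg_layer4 = tf.placeholder(tf.float32, [None, 16, 16, 512])
vgg_layer7 = tf.placeholder(tf.float32, [None, 8, 8, 4096])

def classify(layer):
    # 1x1 convolution acting as a local classifier
    return tf.layers.conv2d(layer, num_classes, 1, padding='same')

def upsample(layer, factor):
    # Learnable upsampling via transposed convolution
    return tf.layers.conv2d_transpose(layer, num_classes, 2 * factor,
                                      strides=factor, padding='same')

out = upsample(classify(vgg_layer7), 2)
out = tf.add(out, 0.01 * classify(vgg_layer4))      # weighted addition
out = upsample(out, 2)
out = tf.add(out, 0.0001 * classify(vgg_layer3))    # weighted addition
out = upsample(out, 8)                              # back to the input size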

Learnable Upsampling

Upsampling is done by applying a process called transposed convolution. I will
not describe it here because this post over at cv-tricks.com
does a great job of that. I will just say that transposed convolutions
(just like the regular ones) use learnable weights to produce their output. The
trick here is the initialization of those weights. You don't use the truncated
normal distribution; instead, you initialize the weights in such a way that the
convolution operation performs a bilinear interpolation. It's easy and
interesting to test whether the implementation works correctly: when fed an
image, it should produce the same image but n times larger.
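
A sketch of the initialization, assuming the [height, width, channels_out,
channels_in] filter layout used by tf.nn.conv2d_transpose:

import numpy as np

def bilinear_weights(size, channels):
    # Build a 2D bilinear interpolation kernel of the given size.
    factor = (size + 1) // 2
    center = factor - 1 if size % 2 == 1 else factor - 0.5
    og = np.ogrid[:size, :size]
    kernel = ((1 - abs(og[0] - center) / factor) *
              (1 - abs(og[1] - center) / factor))
    weights = np.zeros((size, size, channels, channels), dtype=np.float32)
    for i in range(channels):
        weights[:, :, i, i] = kernel   # interpolate each channel separately
    return weights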

Datasets

I was mainly interested in road scenes, so I played with the KITTI Road
and CityScapes datasets. The first one has 289 training images with two
labels (road/not road) and 290 testing samples. The second one has 2975
training, 500 validation, and 1525 testing pictures taken while driving around
large German cities. It has fine-grained annotations for 29 classes (including
"unlabeled" and "dynamic"). The annotations are color-based and look like the
picture below.

Picture Labels

Even though I concentrated on those two datasets, both the training and the
inference software are generic and can handle any pixel-labeled dataset. All
you need to do is create a new source_xxxxxx.py file defining your custom
samples. The definition is a class that contains seven attributes, including:

image_size - self-evident; both horizontal and vertical dimensions need to
be divisible by 32

num_classes - the number of classes that the model is supposed to handle

label_colors - a dictionary mapping a class number to a color; used for
blending the classification results with the input image
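
A hypothetical source_example.py showing these attributes; the values are
made up for a two-class road dataset:

class ExampleSource:
    image_size = (256, 512)            # both dimensions divisible by 32
    num_classes = 2
    label_colors = {
        0: (255, 0, 0),                # class 0: not road
        1: (255, 0, 255),              # class 1: road
    }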

Normalization

Typically, you would normalize the input dataset such that its mean is at zero
and its standard deviation is at one. Doing so significantly improves the
convergence of the gradient optimization. In the case of the VGG model, the
authors just zeroed the mean without scaling the variance (see section 2.1 of
the paper). Assuming that the model was trained on the ImageNet dataset, the
mean values for each channel are \( \mu_R = 123.68 \), \( \mu_G = 116.779 \),
and \( \mu_B = 103.939 \). The pre-trained model provided by Udacity already
has a pre-processing layer handling these constants. Judging from the way it
does so, it expects plain BGR scaled between 0 and 255 as input.
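
A minimal sketch of that preprocessing, assuming BGR input scaled between 0
and 255:

import numpy as np

# Per-channel ImageNet means in BGR order
VGG_MEAN_BGR = np.array([103.939, 116.779, 123.68], dtype=np.float32)

def preprocess(image_bgr):
    # Zero the mean without scaling the variance
    return image_bgr.astype(np.float32) - VGG_MEAN_BGR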

Label Validation

Since the network outputs softmaxed logits for each pixel, the training labels
need to be one-hot encoded. According to the TensorFlow documentation,
each row of labels needs to be a proper probability distribution. Otherwise,
the gradient calculation will be incorrect and the whole model will diverge.
So, you need to make sure that you're never in a situation where you have all
zeros or multiple ones in your label vector. I have made this mistake so many
times that I decided to write a checker script for my data source modules. It
produces examples of training images blended with their pixel labels to verify
that the color maps have been defined correctly. It also checks every pixel in
every sample to see if the label rows are indeed valid. See here for the
source.
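
A minimal version of the per-pixel check might look like this:

import numpy as np

def labels_valid(labels):
    # labels: one-hot array of shape [height, width, num_classes]; every
    # pixel's label row must sum to exactly one.
    return bool(np.all(labels.sum(axis=-1) == 1))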

Initialization of variables

Initialization of variables is a bit of a pain in TensorFlow. You can use the
global initializer if you create and train your model from scratch. However,
when you want to do transfer learning - load a pre-trained model and extend
it - there seems to be no convenient way to initialize only the variables that
you created. I ended up doing acrobatics like this:
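
(A sketch of the idea rather than the exact snippet: snapshot the variable
list before adding the new layers and initialize only the difference.)

import tensorflow as tf

old_vars = set(tf.global_variables())      # variables of the loaded model

# ... build the new layers here; a dense layer stands in for them ...
new_layer = tf.layers.dense(tf.placeholder(tf.float32, [None, 4]), 2)

new_vars = [v for v in tf.global_variables() if v not in old_vars]
sess = tf.Session()
sess.run(tf.variables_initializer(new_vars))   # leave pre-trained vars alone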

Training

For training purposes, I reshaped both the labels and the logits in such a way
that I ended up with 2D tensors for both. I then used
tf.nn.softmax_cross_entropy_with_logits as the measure of loss and minimized
it with AdamOptimizer and a learning rate of 0.0001. The model trained on the
KITTI dataset for 500 epochs, at 14 seconds per epoch on my GeForce GTX
1080 Ti. The CityScapes dataset took 150 epochs to train, at 9.5 minutes per
epoch on my GeForce vs. 25 minutes per epoch on an AWS P2 instance. The model
exhibited some overfitting. However, the visual results seemed tighter the
longer it trained. In the picture below, the top row contains the ground truth
and the bottom one the inference results (TensorBoard rocks! :).
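
A sketch of the loss setup described above, with a single convolution standing
in for the full FCN:

import tensorflow as tf

num_classes = 29
images = tf.placeholder(tf.float32, [None, 256, 512, 3])
labels = tf.placeholder(tf.float32, [None, 256, 512, num_classes])
logits = tf.layers.conv2d(images, num_classes, 1)   # stand-in for the FCN

# Reshape both labels and logits to 2D tensors of per-pixel rows.
logits_flat = tf.reshape(logits, [-1, num_classes])
labels_flat = tf.reshape(labels, [-1, num_classes])
loss = tf.reduce_mean(
    tf.nn.softmax_cross_entropy_with_logits(labels=labels_flat,
                                            logits=logits_flat))
train_op = tf.train.AdamOptimizer(learning_rate=0.0001).minimize(loss)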

CityScapes Validation Examples

CityScapes Validation Loss

CityScapes Training Loss

Results

The inference (including image processing) takes 80 milliseconds per image on
average for CityScapes and 27 milliseconds for KITTI. Here are some examples
from both datasets. The model seems to be able to distinguish a pedestrian from
a bike rider with some degree of accuracy, which is pretty impressive!

A month or so ago, I wrote a post about installing TensorFlow 1.1.0 on
Jetson TX1. This post is an update for 1.2.0, which has one additional issue on
top of the ones discussed previously. The problem is that Eigen is missing some
template specializations when used on ARM. The bug has been fixed, but
you need to make the TensorFlow build use the fixed version.

Intro

I had expected a smooth ride with this one, but it turned out to be quite an
adventure, and not of the pleasant kind. To be fair, the likely reason why it's
such a horror story is that I was bootstrapping Bazel - the build software that
TensorFlow uses - on an unsupported system. I spent more time figuring out the
dependency issues related to that than working on TensorFlow itself. This post
was initially supposed to be a rant on the Java dependency hell. However, in
the end, my stubbornness gained the upper hand, and I did not go to sleep until
it all worked, so you have a HOWTO instead.

Prerequisites

Jetson TX1

You'll need the board itself and the following installed on it:

Jetpack 3.0

L4T 24.2.1

CUDA Toolkit 8.0.34-1

cuDNN 5.1.5-1+cuda8.0

Dependencies

A Java Development Kit

First of all, you'll need a Java compiler and related utilities. Just type:

]==> sudo apt-get install default-jdk

It would not have been worth a separate paragraph, except that the version that
comes with the system messes up the CA certificates. You won't be able to
download things from GitHub without overriding SSL warnings. I fixed that by
installing ca-certificates and ca-certificates-java from
Debian.

Protocol Buffers

You'll need the exact two versions mentioned below. No other versions will work
down the road. I learned this fact the hard way. Be sure to call autogen.sh
on the master branch first - it needs to download gmock, and the link in the
older tags points nowhere.

gRPC Java

Building this one took me a horrendous amount of time. At first, I thought that
the whole package was needed. Apart from problems with the protocol buffer
versions, it has some JNI dependencies that are problematic to compile. Even
after I had successfully produced them, they had interoperability issues with
other dependencies. After some digging, it turned out that only one
component of the package is actually required, so the whole effort was
unnecessary. Of course, the source needed patching to make it build on
aarch64, but I won't bore you with that. Again, make sure you use the
v0.15.0-jetson-tx1 tag - no other tag will work.

TensorFlow

Note 18.08.2017: For TensorFlow 1.3.0, see the
v1.3.0-jetson-tx1 tag. I have tested it against JetPack 3.1, which
fixes the CUDA-related bugs. Note that the kernel in this version of JetPack
has been compiled without swap support, so you may want to add
--local_resources=2048,0.5,0.5 to the Bazel command line if you want to
avoid the out-of-memory kills.

Patches

The version of the CUDA toolkit for this device is somewhat handicapped: nvcc
has problems with variadic templates, and compiling some kernels that use
Eigen makes it crash. I found that adding:

#define EIGEN_HAS_VARIADIC_TEMPLATES 0

to the problematic files makes the problem go away. A constructor with an
initializer list seems to be an issue in one of the cases as well. Using the
default constructor instead, and then initializing the array elements one by
one, makes things go through.

Also, the cuBLAS API seems to be incomplete. It defines only 5 GEMM (General
Matrix-Matrix Multiplication) algorithms, where the newer patch releases
of the toolkit define 8. TensorFlow enumerates them by name to experimentally
determine which one is best for a given computation, and the code notes that
they may fail under perfectly normal circumstances (i.e., on a GPU older than
sm_50). Therefore, simply omitting the missing algorithms should be perfectly
safe.

Memory Consumption

The compilation process may take considerable amounts of RAM - more than the
device has available. The documentation advises using only one execution thread
(the --local_resources 2048,.5,1.0 param for Bazel) so that you don't get
OOM kills. This is unnecessary most of the time, though, because the memory
fills up completely only during the last 20% of the compilation steps. Instead,
I used an SD card as a swap device.

]==> sudo mkswap /dev/mmcblk1p2
]==> sudo swapon /dev/mmcblk1p2

At peak times, the entire RAM and around 7.5GB of swap were used. However, at
most 5 to 6 compilation threads were in the D state (uninterruptible sleep
due to IO) at any given time, with 2 to 3 being runnable.

Compilation

Then clone my repo containing the necessary patches and configure the source. I
used the system version of Python 3, located at /usr/bin/python3, with its
default library in /usr/lib/python3/dist-packages. The answers to the
CUDA-related questions follow from the versions listed in the prerequisites
above.

The CPU and the GPU share the memory controller, so the GPU does not have the
4GB all to itself. On the upside, you can use the CUDA unified memory model
without penalties (no memory copies).

Benchmark

I ran two benchmarks to see if things work as expected. The first one was my
TensorFlow implementation of LeNet training on and classifying the MNIST
data. The training code ran twice as fast on the TX1 as on my 4th-generation
Carbon X1 laptop. The second test was my slightly enlarged implementation of
Sermanet applied to classifying road signs. The convolution part of the
training process took roughly 20 minutes per epoch, which is a factor-of-two
improvement over the performance of my laptop. The pipeline was implemented
with a large device in mind, though, and expected 16GB of RAM. The TX1 has
only 4GB, so the swap speed was a bottleneck here. Based on my observations of
the processing speed of individual batches, I can speculate that a further
factor-of-two improvement is possible with a properly optimized pipeline.

The algorithm

I took courses on probability, statistics, and Monte Carlo methods while I
was at school, but the ubiquity of algorithms based on randomness still
amazes me. Imagine that you are a robot moving in a 2D space. You have a map of
the area, and you know your rough initial position in it. Every step you take,
you get data from the controls (speed and angular speed) that is quite noisy.
You can also sense obstacles within a certain range and with a certain
precision, relative to your current position and heading. How do you figure
out where you are?

One good strategy for solving this problem is to use a particle filter. You
start by creating N "particles," or guesses as to where you are, by drawing the
x and y positions as well as the heading from a Gaussian distribution around
your initial estimate. Then, every step you take, for every particle, you:

move it according to the data you got from the controls, taking the noise
into account;

match the sensor data to the landmarks on the map from the perspective of
the particle;

assign a weight to the particle based on how well the observation matches the
map.

Finally, you draw, with replacement, N particles from the initial set with
probability proportional to the weights. The particles that match the
observation well will likely be drawn multiple times, and those that don't are
unlikely to be drawn at all. The repetitions are not a problem because the
movement is noisy, so they will diverge after the next step you take. The
particle with the highest weight is your best estimate of your actual position
and heading.
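
A minimal sketch of the resampling step:

import numpy as np

def resample(particles, weights):
    # Draw N particles with replacement, with probability proportional
    # to the weights.
    weights = np.asarray(weights, dtype=float)
    idx = np.random.choice(len(particles), size=len(particles),
                           p=weights / weights.sum())
    return [particles[i] for i in idx]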

The result

I did an experiment on the Udacity data, and the approach using 1000
particles turned out to work well compared to using just one. The average
deviation from the ground truth was around 10 cm. Using one particle
effectively ignores the observation data and relies only on the controls. You
can follow the blue diamond in the video below to see how fast the effects of
the noise accumulate. Both cases use the same noise values.

The project

I try to avoid publishing my code solving homework assignments, but this
Udacity SDC project is generic enough to be useful in a wider context. So here
you have it. The task was to fuse radar and lidar measurements using two kinds
of Kalman filters to estimate the trajectory of a moving bicycle. The
unscented filter uses the CTRV model, tracking the position, speed, yaw, and
yaw rate, whereas the extended filter uses the constant velocity model.

The Unscented Filter result

Both algorithms performed well, with the CTRV model predicting the velocity
significantly better. The values below are the RMSEs of the predictions against
the ground truth. The first two values represent the position and the last two
the velocity.

The code you need to implement yourself depends on the sensor, the model, and
the type of filter you use. For instance, for the CTRV model and a lidar
measurement, you only need to specify the projection matrix and the sensor
noise covariance.
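
A sketch of both, assuming a (px, py, v, yaw, yaw rate) state layout and
made-up noise values:

import numpy as np

# Lidar reports (px, py); project them out of the 5D CTRV state.
H = np.array([[1.0, 0.0, 0.0, 0.0, 0.0],
              [0.0, 1.0, 0.0, 0.0, 0.0]])   # projection matrix

std_px, std_py = 0.15, 0.15                 # assumed lidar noise (meters)
R = np.diag([std_px ** 2, std_py ** 2])     # sensor noise covariance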

Board bring-up

I started playing with the FRDM-K64F board recently. I want to use it as a
base for a bunch of hobby projects. The start-up code is not that different from
the one for Tiva, which I describe here - it's the same Cortex-M4
architecture after all. Two additional things need to be taken care of, though:
flash security and the COP watchdog.

The K64F MCU restricts external access to a bunch of resources by default.
It's a great feature if you want to ship a product, but it makes debugging
impossible. The Flash Configuration Field (see section 29.3.1 of the datasheet)
defines the default security and boot settings.

The challenge

Here's another cool project I have done as part of Udacity's self-driving
car program. There were two problems to solve. The first one was to find the
lane lines and compute some of their properties. The second one was to detect
and draw bounding boxes around nearby vehicles. Here's the result I got:

Detecting lane lines and vehicles

Detecting lanes

The first thing I do, after correcting for camera lens distortion, is apply a
combination of Sobel operators and color thresholding to get an image of
edges. This operation makes the lines more pronounced and therefore much easier
to detect.

Edges

I then get a bird's-eye view of the scene by applying a perspective transform
and produce a histogram of all the white pixels located in the bottom half of
the image. The peaks in this histogram indicate the presence of mostly vertical
lines, which is what we're looking for. I detect all these lines using a
sliding-window search. I start at the bottom of the image and move towards the
top, adjusting the horizontal position of each successive window to the average
x coordinate of all the pixels contained in the previous one. Finally, I fit a
parabola to all these pixels. Out of all the candidates detected this way, I
select the pair that is closest to being parallel and is roughly where a lane
line would be expected.

The orange area in the picture below visualizes the histogram, and the red boxes
with blue numbers in them indicate the positions of the peaks found by the
find_peaks_cwt function from scipy.
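
A sketch of the peak search, with a stand-in image and an assumed widths
range for find_peaks_cwt:

import numpy as np
from scipy.signal import find_peaks_cwt

warped = np.zeros((720, 1280), dtype=np.uint8)    # stand-in warped image
histogram = np.sum(warped[warped.shape[0] // 2:, :], axis=0)
peaks = find_peaks_cwt(histogram, np.arange(80, 100))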

Bird's eye view - histogram search

Once I have found the lanes in one video frame, locating them in the next one
is much simpler - their position does not change by much. I just take all the
pixels from the close vicinity of the previous detection and fit a new
polynomial to them. The green area in the image below denotes the search range,
and the blue lines are the newly fitted polynomials.

Bird's eye view - vicinity search

I then use the equations of the parabolas to calculate the curvature. The
program that produced the video above uses cross-frame averaging to make the
lines smoother and to vet new detections in successive video frames.
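
For reference: for a parabola fitted as \( x = Ay^2 + By + C \), the radius of
curvature at a point \( y \) is
\( R = \left(1 + (2Ay + B)^2\right)^{3/2} / \left|2A\right| \), which then
needs rescaling from pixel units to meters.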

Vehicle detection

I detect cars by dividing the image into a bunch of overlapping tiles of
varying sizes and running each tile through a classifier to check whether it
contains a car or a fraction of a car. In this particular solution, I used a
linear support vector machine (LinearSVC from sklearn). I also wrapped
it in CalibratedClassifierCV to get a measure of confidence. I rejected
predictions of cars that were less than 85% certain. The classifier trained on
data harvested from the GTI, KITTI, and Udacity datasets, for which I
collected around 25 times more background samples than cars to limit the
occurrence of false-positive detections.
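
A sketch of that classifier setup, with stand-in feature arrays:

import numpy as np
from sklearn.svm import LinearSVC
from sklearn.calibration import CalibratedClassifierCV

features = np.random.rand(100, 1764)   # stand-in HOG feature vectors
labels = np.tile([0, 1], 50)           # 1 = car, 0 = background

clf = CalibratedClassifierCV(LinearSVC())
clf.fit(features, labels)
probs = clf.predict_proba(features)[:, 1]
is_car = probs >= 0.85                 # reject predictions below 85%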

As far as image features are concerned, I use only
Histograms of Oriented Gradients, with parameters that are essentially the
same as the ones presented in this paper dealing with the detection of humans.
I used OpenCV's HOGDescriptor to extract the HOGs. The reason is that it can
compute the gradients taking all of the color channels into account (see
here); other libraries typically lack this capability, limiting you to a form
of grayscale. The training set consists of roughly 2M images of 64 by 64
pixels.
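
A sketch of the extraction step; the window, block, stride, cell, and bin
parameters are assumptions roughly matching that setup scaled to 64x64 tiles:

import cv2
import numpy as np

# winSize, blockSize, blockStride, cellSize, nbins
hog = cv2.HOGDescriptor((64, 64), (16, 16), (8, 8), (8, 8), 9)

tile = np.zeros((64, 64, 3), dtype=np.uint8)   # stand-in image tile
features = hog.compute(tile).ravel()           # 1764-dimensional vector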

Tiles containing cars

Since the samples the classifier trains on contain pictures of fractions of
cars, the same car is usually detected multiple times in overlapping tiles.
Also, the types of background differ quite a bit, and it's hard to find images
of all the possible things that are not cars. Therefore, false positives are
quite frequent. To combat these problems, I use heat maps averaged across five
video frames. Every pixel that has fewer than three detections per frame on
average is rejected as a false positive.

Heat map

I then use OpenCV's connectedComponentsWithStats to find the connected
components and get centroids and bounding boxes for the detections. The
centroids are used to track the objects across frames and to smooth the
bounding boxes by averaging them over the 12 previous frames. To further reject
false positives, an object needs to be classified as a car in at least 6 out of
12 consecutive frames.
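
A sketch of the filtering and component extraction, with a stand-in detection
history:

import cv2
import numpy as np

heat_frames = np.zeros((5, 720, 1280), dtype=np.float32)  # 5-frame history
heat = heat_frames.mean(axis=0)
mask = (heat >= 3).astype(np.uint8)    # < 3 detections/frame => rejected
n, labels, stats, centroids = cv2.connectedComponentsWithStats(mask)
boxes = stats[1:, :4]                  # x, y, width, height per component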

Conclusions

The topic is pretty fascinating, and the results I got could be significantly
improved by:

making performance improvements here and there (using C++, parallelizing the
video processing, and so on)

I learned a lot of computer vision techniques and had plenty of fun doing this
project. I also spent a lot of time reading the code of OpenCV. It has a lot of
great tutorials, but its API documentation is lacking.

The project

A neural network learned how to drive a car by observing how I do it! :) I must
say that it's one of the coolest projects I have ever done. Udacity
provided a simulator program where you had to drive a car for a while on two
tracks to collect training data. Each sample consisted of a steering angle and
images from three front-facing cameras.

The view from the cameras

Then, in the autonomous driving mode, you are given an image from the central
camera and must send back an appropriate steering angle, such that the car does
not go off-track.

An elegant solution to this problem was described in a paper by nVidia
from April 2016. I managed to replicate it in the simulator, though not without
issues. The key takeaways for me were:

The importance of making sure that the training data sample is balanced. That
is, making sure that some categories of steering angles are not
over-represented.

The importance of randomly jittering the input images (see the sketch after
this list). To quote another paper: "ConvNets architectures have built-in
invariance to small translations, scaling and rotations. When a dataset does
not naturally contain those deformations, adding them synthetically will yield
more robust learning to potential deformations in the test set."

Not over-using dropout.
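
A toy sketch of the jittering idea; the shift range and the angle-correction
factor are made up, not the values used in the project:

import cv2
import numpy as np

def jitter(image, angle):
    # Shift the image horizontally by a random number of pixels and apply
    # a proportional steering correction.
    shift = np.random.randint(-25, 26)
    m = np.float32([[1, 0, shift], [0, 1, 0]])
    shifted = cv2.warpAffine(image, m, (image.shape[1], image.shape[0]))
    return shifted, angle + 0.004 * shift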

The model needed to train for 35 epochs. Each epoch consisted of 24 batches of
2048 images with on-the-fly jittering. It took 104 seconds to process one epoch
on Amazon's p2.xlarge instance and 826 seconds to do the same on my laptop.
What took an hour on a Tesla K80 GPU would have taken over 8 hours on my
laptop.

Results

Below are some sample results. The driving is not very smooth, but I blame that
on me not being a good driving model ;) The second track is especially
interesting because it differs from the one the network was trained on.
Interestingly enough, a MacBook Air did not have enough juice to run both the
simulator and the model, even though the model is fairly small. I ended up
having to create an ssh tunnel to my Linux laptop.