Some love for the Money Pit

While doing the breakneck bring-up of Onyx X for Mech Warfare, the Money Pit has been collecting dust in a closet.
With the release of the NVIDIA Jetson TX2, I figured it's time to do something about that!
I'm designing and training a neural network to use in autonomous navigation. Here are some pictures from the data set I'm labeling. (The dirty secret of machine learning is that all the hard work is in training data preparation!)

Again: These are manually labeled. We'll see how far the network can take it in a later post.

The target is one or both of two challenges: Robomagellan, and the SparkFun AVC.

Re: Some love for the Money Pit

I have taken lessons from a few models and folded them into a "fully convolutional semantic segmentation" model that I'm currently training.

Using a single high-end GPU, training can take many hours. If I were to train on the Jetson, it might take days, so I'd rather pay Amazon a nickel an hour to rent a GPU compute instance. The spot price for a "K80"-based instance is about $0.12/hour. The spot price for 8x GPUs is about $2 (which, you will note, is more than 8x the single-GPU price) and the price for 16x GPUs is about $3.50, although these prices fluctuate over the day. I've found that many models don't get a 16x speed-up on 16 GPUs, but this "semantic segmentation" model actually does, probably because there's so much work involved in every image.
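A quick back-of-the-envelope comparison of cost per effective GPU-hour, using the approximate spot prices above (the scaling efficiencies are illustrative assumptions, except for the roughly linear 16x scaling observed for this model):

```python
# Rough cost-per-effective-GPU-hour for the spot prices quoted above.
# Scaling efficiencies are assumptions for illustration, not measurements.
configs = [
    # (gpus, spot_price_usd_per_hour, assumed_scaling_efficiency)
    (1, 0.12, 1.00),
    (8, 2.00, 0.90),   # many models lose some efficiency at 8x
    (16, 3.50, 1.00),  # this segmentation model reportedly scales ~linearly
]

for gpus, price, eff in configs:
    effective_gpus = gpus * eff
    print(f"{gpus:2d}x GPU: ${price / effective_gpus:.3f} per effective GPU-hour")
```

So even at a higher hourly rate, the 16x instance can be a reasonable deal per unit of work done, as long as the model actually scales.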

Here's an example image, and its current classification. Note that it's started figuring out that traffic cones are of interest, just barely. By comparison, the training "label" image is also here:

Other interesting facts:

My target input video is 320x240.

The actual training images I use are random 256x160 subsets of the labeled 320x240 images I have, and those subsets may include the "don't care" border on various sides. I do this to make the model more robust to feature position within the image. I also use two separate "dropout" layers to make sure the model doesn't over-fit the training images. Finally, there's a "bypass" path where higher-resolution but less-processed data gets fed forward to the final "classification" convolution layer, to help determine more specific boundaries between feature areas. The highest-layer features are just 16x10, so feeding 64x40 information into that layer helps with precisely locating the border between different segment areas.
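The random-crop augmentation could be sketched like this with NumPy (the sizes match what I use; the function and variable names are just for illustration):

```python
import numpy as np

CROP_W, CROP_H = 256, 160   # training crop size
FULL_W, FULL_H = 320, 240   # labeled image size

def random_crop(image, label, rng=np.random):
    """Cut a random 256x160 window out of a 320x240 image/label pair.

    The crop may land on the "don't care" border on any side, which
    makes the model more robust to where features sit in the frame.
    """
    assert image.shape[:2] == (FULL_H, FULL_W)
    x = rng.randint(0, FULL_W - CROP_W + 1)   # 0..64 inclusive
    y = rng.randint(0, FULL_H - CROP_H + 1)   # 0..80 inclusive
    return (image[y:y + CROP_H, x:x + CROP_W],
            label[y:y + CROP_H, x:x + CROP_W])
```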

What I find encouraging is that the net seems to have learned that the green bushes/trees and gray branches are not okay to drive on, but the green grass and gray pavement are.

If you don't know neural networks, that probably makes no sense :-)

Here are some training images:

Note the gray "don't care" color painted over the rover that was hosting these videos. I don't want to burn the image of Jesse's rover into the network :-)
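Those "don't care" pixels have to be excluded from the training loss, or the network would be penalized for whatever it predicts there. A minimal sketch of how that masking could work (the label value and function names are my assumptions, not from the actual training code):

```python
import numpy as np

DONT_CARE = 255  # hypothetical label value for "don't care" pixels

def masked_pixel_loss(log_probs, labels):
    """Mean per-pixel negative log-likelihood, skipping "don't care" pixels.

    log_probs: (H, W, num_classes) log-softmax output of the network
    labels:    (H, W) integer class per pixel, DONT_CARE where unlabeled
    """
    mask = labels != DONT_CARE
    # Pick the log-probability of the true class at each labeled pixel.
    picked = log_probs[mask, labels[mask]]
    return -picked.mean()
```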

Re: Some love for the Money Pit

I wonder what will happen if I add stereo input at some point ...

Right now, the global shutter genlocked camera situation for Jetson TX2 is really bad. As in, there are none supported yet. Something changed in the device tree between TX1 and TX2 so that the cameras painstakingly developed for TX1 all need driver updates :-(

Re: Some love for the Money Pit

Another picture.

This is from the AVC, where the "hoop" is a green steel pipe hoop to drive through (this particular rover just kind-of carried it along) and the "ramp" is a blue ramp that will let your rover leave the ground; each of these extra goals earns extra points.
Both of them are surprisingly challenging for the general segmentation algorithm to handle well, because the dark green becomes blackish against the sky (and thus looks like light poles, which are not of interest) and the blue ramp is flattish and dirty, and thus looks a fair bit like the pavement it's placed on.
With more input data, I'll probably get there (I still have tons of video footage to label!)
Or I'll use a "detect" model instead of a "segment" model for these cases. That might be useful for safety, too: "detect" on a person/pedestrian/child would let the rover stop or avoid as appropriate.

Re: Some love for the Money Pit

Thinking about how the Money Pit failed at last year's Robogames: the one mechanical failure (not a software bug, not a poorly tightened electrical screw connection) was when the AX-12A servos used to steer each individual wheel were overloaded in the tall, grippy grass.

I can upgrade to XM430-W350-T servos. At a ... "small" ... additional cost.
Or I can add software to attempt to "wiggle" the wheels when steering.
Or I can add a third pair of wheels in the center, balanced to take more of the weight (say, 50%, and 25% each on front/back.)
This center pair of wheels wouldn't need steering, but it'd be nice to keep them driven and receive encoder feedback.
The question then becomes: Do I feel up to machining yet another pair of suspension arms? We'll see.

Re: Some love for the Money Pit

Nice project. Have you considered using a micro-linear-actuator for steering? These seem to be a bit more expensive than the ax12, but might handle the load better. If you can rig it to only use 1 actuator per 2 wheels, it might end up cheaper.

One trick I see the "big guys" using when training self-driving cars is to use 3 cameras, not in stereo, but at 3 angles: left / straight / right. This way, if the left view looks more like "what you want," you can go left, or if it looks "more dangerous," you can go right. You don't need to predict what "going left" might look like, and you can train with "going left" data without actually going off track.

I guess this is more for where you want to just predict "left/go/right" rather than doing a per-pixel assignment followed by some reasoning on those pixels, but I like the simplicity of the approach.
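The resulting decision rule is almost trivially small. A toy sketch, assuming each of the three fixed views has already been scored by some trained "drivability" classifier (the scoring itself is not shown; the names here are mine):

```python
def pick_steering(score_left, score_center, score_right):
    """Steer toward whichever of the three fixed camera views looks most
    drivable. Scores are hypothetical drivability estimates in [0, 1]
    from a classifier run on each view; no stereo, no per-pixel labels."""
    scores = {"left": score_left, "straight": score_center, "right": score_right}
    return max(scores, key=scores.get)
```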

Re: Some love for the Money Pit

If you can rig it to only use 1 actuator per 2 wheels, it might end up cheaper.

Ah! The history of this project is that it has four independent suspensions with four independently steered wheels.
I considered linear actuators (four of them) in the beginning, but couldn't find a good mechanical solution for it that I liked.

Here are some pictures of the rover on my desk right now:

That display with the Raspberry Pi underneath it will be replaced with the Jetson TX2. I haven't yet decided on what carrier board to use; I might just use the NVIDIA board that came with it, although it's kind-of big.

I like the simplicity of the approach

Certainly! Another option is to use a single fish-eye camera and reproject to three separate views, which I can then run inference on. With the resolution of modern sensors and the small image size I use, that could totally work well. Also: I actually started out by cutting out 32-wide by 128-tall sections and classifying the bottom-center of those into "stop" and "go" only. (And the "big boy" term for this is "open space segmentation," if I get my buzzwords right :-)
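That strip-cutting step could be sketched like this (strip size from above; the bottom-aligned, non-overlapping layout is an assumption about how the frame was divided):

```python
import numpy as np

STRIP_W, STRIP_H = 32, 128  # strip size, as described above

def cut_strips(image):
    """Slice an image into bottom-aligned 32x128 vertical strips, one per
    32-pixel column band. Each strip would then be classified "stop" or
    "go" for the patch of ground ahead of that column."""
    h, w = image.shape[:2]
    return [image[h - STRIP_H:h, x:x + STRIP_W]
            for x in range(0, w - STRIP_W + 1, STRIP_W)]
```

On a 320x240 frame this yields ten strips, i.e. a ten-way "open space" answer per frame.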