There haven’t been too many updates as of late because I’ve been working on a paper and presentation…and I’m happy to announce that both were accepted for SPIE’s Smart Structures and NDE for Industry 4.0! So if you’ve got time to kill in March, you should drop by and listen to me blather for 20 minutes about a Big Data problem and how an Actor-based architecture makes it easier to handle. Hope to see you there!

Speaking of Actor-based systems, the latest changes to Myriad include a bugfix to the Canny edge detector and a new auto-thresholding option for the same, so please update as soon as possible.

I’ve also added initial support for oversampling to correct class imbalances, based on the SMOTE (Synthetic Minority Over-sampling TEchnique) algorithm, which may be of interest. Myriad was written to help with Region Of Interest (ROI) detection applications, which often have many more negative (not-ROI) samples than positive (ROI) ones. The Myriad toolset currently uses RUS (Random UnderSampling) to correct this imbalance by randomly discarding members of the majority class. The SMOTE algorithm takes the opposite tack and corrects the imbalance by oversampling: generating synthetic members of the minority class.

I haven’t fully integrated SMOTE into the toolset yet, but it’s available for the DIY-er crowd. My game plan is to make it an option for cross-validation in Trainer. Intuitively, it makes sense to me to apply SMOTE after we’ve split the data into testing and training subsets, i.e. we apply SMOTE to the training set rather than the original dataset. Otherwise our synthetic samples would make it into the testing set, and we’d be testing models for their ability to classify real and “fake” data when what we really want is to test their ability to classify real data alone.
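Myriad itself is Java, but the ordering is easier to see in a few lines of Python. This is only a sketch using scikit-learn and the third-party imbalanced-learn package, with a made-up dataset standing in for real scan data; the point is just that SMOTE is fit on the training split only, so the test split stays all-real.

```python
# Sketch only: apply SMOTE *after* the train/test split so synthetic samples
# never leak into the test set. The dataset here is synthetic stand-in data.
from sklearn.datasets import make_classification
from sklearn.linear_model import SGDClassifier
from sklearn.model_selection import train_test_split
from imblearn.over_sampling import SMOTE

# Imbalanced toy data: ~95% negative (not-ROI), ~5% positive (ROI).
X, y = make_classification(n_samples=2000, n_features=20,
                           weights=[0.95, 0.05], random_state=0)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0)

# Oversample the minority class in the training split only.
X_train_res, y_train_res = SMOTE(random_state=0).fit_resample(X_train, y_train)

model = SGDClassifier(random_state=0).fit(X_train_res, y_train_res)

# Evaluation sees only real samples with the original imbalance.
print("accuracy on real test data:", model.score(X_test, y_test))
```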

If you’ve looked at any of the videos, tutorials, etc. I put together for Myriad, you’ll eventually find something about building distributed applications through Akka’s remoting feature. It’s pretty straightforward to use in Myriad Desktop – just update an address and you’re good to go – but what actually lives on the remote system accepting these calls anyway?

So with that in mind, I’ve put together a little project to show the “what.” ActorPool reads a configuration file, starts up one stage in a Myriad processing pipeline, and waits for incoming data. The idea is that you’d have it start up automatically with the remote system and it’d be available to provide a little extra horsepower when processing data in a Myriad application. In the initial commit I’ve included support for the two stages in Myriad that tend to require the most horsepower – the sliding window stage and the actual Region Of Interest (ROI) detection stage. Also included is a Passive-Aggressive model trained to detect damage in C-scans, bundled with Myriad’s new Canny edge detection algorithm.

To use it, just edit application.conf to your liking; in particular, choose which stage to run in the actorpool config block. Basic usage is java -jar /path/to/jar /path/to/application.conf, e.g. java -jar target/actorpool-1.0-SNAPSHOT-allinone.jar application.conf. If all goes to plan, ActorPool will log a message with its Myriad address.

One of the habits I’ve formed using PySpark is to have a look at the source code, not just to get a better feel for what’s changed in a new release but also to supplement / supplant the documentation. For instance, if you peruse the docs for serialization, you might be left with the impression that your only choices are fast-but-limited datatype support or always-works-but-slow, when there are actually quite a few serializers available. In fact you can get the best of both worlds if you use the AutoSerializer, which defaults to the faster MarshalSerializer-style serialization when supported but falls back to PickleSerializer when required. I’ve had good results combining AutoSerializer with CompressedSerializer, which compresses / decompresses the data on the fly:
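A minimal sketch of that kind of setup, assuming a standalone local SparkContext (the app name and RDD here are arbitrary):

```python
# Sketch: plug PySpark's (undocumented) serializers into the SparkContext.
# CompressedSerializer zlib-compresses whatever its wrapped serializer
# produces; AutoSerializer uses marshal when it can and pickle otherwise.
from pyspark import SparkContext
from pyspark.serializers import AutoSerializer, CompressedSerializer

sc = SparkContext(master="local[2]", appName="compressed-auto-demo",
                  serializer=CompressedSerializer(AutoSerializer()))

rdd = sc.parallelize(range(100000)).map(lambda x: (x, x * x))
print(rdd.take(5))

sc.stop()
```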

As always with [un-/under-]documented API features you should be cognizant of the fact that they could disappear without any warning, but that said these serializers are used elsewhere in PySpark so you’re probably good…for a while anyway. 🙂

Just a future reminder for myself and anyone else mildly surprised by this configuration option: if you want concurrent jobs in Spark Streaming, you need to set spark.streaming.concurrentJobs to a value greater than 1 (the default), e.g.
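A minimal sketch of how I’d set it in code; the value of 4 and the app name are arbitrary, and the same key can also be passed at submit time with spark-submit --conf:

```python
# Sketch: bump spark.streaming.concurrentJobs above its default of 1.
# The setting is undocumented, and the value of 4 here is arbitrary.
from pyspark import SparkConf, SparkContext
from pyspark.streaming import StreamingContext

conf = (SparkConf()
        .setAppName("concurrent-jobs-demo")
        .set("spark.streaming.concurrentJobs", "4"))

sc = SparkContext(conf=conf)
ssc = StreamingContext(sc, batchDuration=5)  # 5-second micro-batches
```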

The foray into PySpark continues. This week I’ve been tuning a PySpark + Kafka streaming app and read through a few excellent pointers on reasonable settings, e.g. Cloudera’s entries. That second one was particularly interesting while I tinkered with optimizing executors, and led me to playing with this setup with dynamic allocation enabled:

1. Use 5 cores per executor, e.g. --executor-cores 5. Let’s call it coresPerExecutor for this writeup.

2. Figure out the maximum number of executors you plan on using. We’ll call this numExecutors.

3. Figure out how much memory is available per node in your system. This is memPerNode, in GB.

4. Set the memory per executor, in GB, according to the formula int(round(coresPerExecutor * (memPerNode - 1) / numExecutors)) (sketched in code below).
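Here’s a rough sketch of that last step in Python; the cluster numbers are made up, so plug in your own:

```python
# Sketch: back-of-the-envelope executor memory from the formula above.
from math import ceil

cores_per_executor = 5      # --executor-cores
num_executors = 22          # maximum executors you plan on using
mem_per_node_gb = 64.0      # memory on the smallest node, in GB

# Set aside ~1 GB per node for JVM/Spark overhead, then split the rest
# according to the ratio of cores per executor to total executors.
raw = cores_per_executor * (mem_per_node_gb - 1) / num_executors

executor_memory_gb = int(round(raw))        # the conservative version
executor_memory_gb_ceil = int(ceil(raw))    # slightly more generous

print("--executor-memory %dG (round) or %dG (ceil)"
      % (executor_memory_gb, executor_memory_gb_ceil))
```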

So where do these numbers come from? The first big one is the number of cores per executor, which is essentially the number of cores per JVM if you like. Intuitively you might just set this to the number of cores available on a given node, but Cloudera found that going above 5 cores per executor tends to run into HDFS I/O throughput limits and hurt performance. So it’s a good starting point, but obviously you’d want to try playing with this number for your configuration.

Next up is the number of executors. If you’re lucky enough to be the sole occupant on your cluster, let me say I’m jealous and are you hiring? In this case, you can basically set this to whatever you want. If your use case is more like mine where you expect other jobs to be running at the same time, you might want to dial this back a bit to be a good neighbor. Again, you’ll probably need to experiment a bit to get a nice answer.

Item #3 should be the easiest number to come up with – if all your nodes are homogeneous it’s a single number. If they’re heterogeneous you’ll probably want to find the node with the least memory and use that.

Finally we get to figuring out how much memory to use per executor. Intuitively it might make sense that the memory per executor should be related to the ratio of cores per executor to the total (maximum) number of executors in the Spark application, but why lop off 1GB in the formula? Cloudera found that setting aside around 1GB for overhead (JVM/Spark) was appropriate so that’s what we’re doing in this formula as well. We then round and convert the float to an integer to pass to Spark in configuration.

I should point out that this formula tends to be a little on the conservative side, e.g. Cloudera’s example comes up with 19GB per executor and this formula comes up with 18GB. You might want to tinker with replacing round with ceil and see how that goes.

Even if the calculation is a little conservative, the results so far have been promising. Anecdotally, I started with 300 executors in the streaming app. After coming up with this formula I was able to drop that down to 150 with no reduction in throughput; dropping to 110 executors came with around a 10% drop in throughput. You could of course make the reasonable case that I started with a very poorly-optimized app and this was all low-hanging fruit, and I’d probably agree. But I think that having a back-of-the-envelope calculation handy that maybe starts me off with a slightly-optimized application is still worth a look. 🙂

I’ve been working on an application with PySpark and Kafka lately, and I’ve run into issues trying to push NumPy and Pandas objects around in Spark. PySpark uses Py4J to connect Python to the underlying JVM, and Py4J doesn’t serialize NumPy arrays, data types, etc. for performance reasons (IIRC). There’s probably a better way to handle this but a quick solution is to just convert the “un-serializable” objects into their vanilla Python equivalents, which is what to_primitive does.
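If you want to roll your own, a minimal sketch of the idea might look like the following. It only covers the common NumPy and Pandas cases, and the recursion into containers is an assumption about what your data looks like:

```python
# Sketch: recursively convert NumPy/Pandas objects into plain Python types
# so they survive the trip through Py4J / Spark. Only common cases handled.
import numpy as np
import pandas as pd

def to_primitive(obj):
    if isinstance(obj, np.ndarray):
        return obj.tolist()
    if isinstance(obj, np.generic):           # NumPy scalars (np.int64, np.float32, ...)
        return obj.item()
    if isinstance(obj, pd.DataFrame):
        return [to_primitive(rec) for rec in obj.to_dict(orient="records")]
    if isinstance(obj, pd.Series):
        return to_primitive(obj.to_dict())
    if isinstance(obj, dict):
        return {to_primitive(k): to_primitive(v) for k, v in obj.items()}
    if isinstance(obj, (list, tuple, set)):
        return [to_primitive(v) for v in obj]
    return obj                                 # already a vanilla Python type

# e.g. to_primitive(np.arange(3)) -> [0, 1, 2]
```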

Just call to_primitive(some_unserializable_object) before you start tossing it around in Spark.

I’m very pleased to announce that Myriad was selected as a finalist in the Spring 2017 round of AIGrant! Whether we win or not is secondary; I’m grateful for the opportunity and the chance to increase project visibility just by being listed. I’m hopeful that with increased visibility we’ll be able to get more relevant data from calibration standards, actual inspection data, etc. and make more useful models for detecting damage. Each finalist was asked to put together a two minute video demonstration for their entry, and you’ll find Myriad’s embedded below.

Many thanks to Nat Friedman, the judges, and the sponsors for the opportunity!

Serialization. Saving and distributing models just became much easier. The new ROIBundle class bundles the model and any preprocessing operation into a single binary blob. The bundle automatically runs its preprocessor on any incoming data so you don’t have to set preprocessing operations manually. No more naming your models sgd_canny2.myr just to keep track of what you need to properly format data!

Better OpenCL support. AMD’s Aparapi project seems to be in a bit of a lull, so I’ve switched to the Syncleus fork. Among other things it’s dropped the requirement to bundle a native library with Myriad. The NASA snapshot bundles libraries for Linux and Windows, but now that everything is done through Maven we can add OS X to the list of supported platforms.

Bug fixes. Speaking of OpenCL, I fixed a memory leak that should hopefully make convolution kernels a little friendlier to your GPU. There were a few other bugs to be squashed in and around other parts of the toolkit.

Both Desktop and Trainer have been updated to work with Myriad 2.0, and have themselves been tagged as 2.0 Snapshots just to keep everything straight. Be sure to use like with like: models trained with Myriad 1.0 will continue to work with Desktop 1.0 and Trainer 1.0, but will not work with 2.0 and vice versa. Myriad 2.0 does have legacy read and write methods so if you’re comfortable with Java you could write your own converter utility, or if there’s any interest I could whip something up.

The main user-facing changes in both are related to the ROIBundle and its bundled preprocessor. In Desktop you don’t have to bother configuring the Data Preprocessing stage if you’re using a Myriad-based machine learning model. It’s still available if e.g. you’re calling Python or MATLAB and you want to continue doing your data preparation in Desktop.

In Trainer, loading a model now sets the preprocessor to “bundled,” meaning that it’ll use whatever’s set for that particular bundle. When you save a trained model, Trainer creates a new ROIBundle with the model and its preprocessor all wrapped up in a single package. It’s still pretty efficient storage-wise; typically a model and a single preprocessor fit into a few kB of space.

Myriad might be a Java toolkit but that doesn’t mean you can’t use MATLAB, Python, R, or anything else with it for that matter. The NASA Snapshot has support for redirecting standard input and output so anything that can print to the screen and read input will work. The API entry for ExternalROIFinder has all the details but basically all you need to do is have your code print myriad: True when it finds a Region Of Interest and the library takes care of the rest.
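To make the contract concrete, here’s a minimal sketch of such a script in Python. The only detail taken from the description above is the myriad: True reply; the line-oriented, comma-separated input format, the threshold decision rule, and printing myriad: False for non-ROIs are all assumptions for illustration, so check the ExternalROIFinder API entry for the real details.

```python
# Sketch of an external ROI finder driven over stdin/stdout.
# Only the "myriad: True" reply is given; everything else is assumed.
import sys

THRESHOLD = 0.5  # made-up decision rule standing in for a real model

for line in sys.stdin:
    line = line.strip()
    if not line:
        continue
    values = [float(v) for v in line.split(",")]
    if max(values) > THRESHOLD:   # swap in your model's prediction here
        print("myriad: True")
    else:
        print("myriad: False")    # assumed reply for non-ROIs
    sys.stdout.flush()            # reply promptly so Myriad isn't left waiting
```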

Here’s a quick video demonstration – I trained a model in scikit-learn to detect damage in ultrasonic data and wrote a simple Python app to handle input and output.

This was done in an early build of Myriad Desktop – more recent versions include support for configuring a Python or MATLAB interpreter instead of making you browse to the binary yourself.

If your app takes a while to get started, you can either structure it as a long-running app to amortize the startup cost or simply create more Myriad workers to run it. That way, even if each individual call to your code takes a while to complete, the rest of the data processing pipeline isn’t being held up.