science

Over the holidays I was a little peeved at how inaccurate the weather forecast seemed to be (I swear the uncertainty was +/- 10 degrees C), and that got me wondering how machine learning would fare. The basic model came together pretty quickly and today I’ve got a bare-bones demo up on PythonAnywhere for a few locales. The plot on the right compares an early (i.e. pre-tweaking) model’s predictions for mean temperature vs. actual temperature readings for a weather station in Petawawa, Ontario. I’ve tweaked the model’s settings a bit since this plot so hopefully I’ve squeezed a bit more accuracy out of it, but the standard disclaimers apply: work in progress, don’t plan outdoor weddings or tuna fishing expeditions around its forecast, etc.

The current version of the weather station data from the National Climatic Data Center is fairly straightforward to read with Python and pandas, but the “legacy” file format as used on the old NNDC site and elsewhere is a bit more work. So in case it’s useful here’s a bit of code that I’ve had some luck with; just call the read_ncdc function to get a slightly cleaned up pandas DataFrame for your number-crunching pleasure.

Lately at work I’ve been running a fair number of factorial experiments with the excellent pyDOE package: pyDOE generates the full factorial design matrix, I use a “templated” input file to generate every combination of conditions, and I run each input through LAMMPS or whatever else I’m running at the time. That’s followed by many regular expressions and pandas sessions for data analysis, but that’s a story for another day.

While I was at it I noticed that although Python has pyDOE (and similar functionality exists for Octave, MATLAB, R, etc.), I couldn’t find anything similar for Java. So with that in mind I cobbled something together and put it up on GitHub. If none of this really makes any sense yet, have a look at the demo app – given a list of conditions and the values they can take, it spits out a list of every possible combination of those conditions.
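To make the idea concrete, here’s a minimal Python sketch of a full factorial expansion built on the standard library’s itertools.product. It isn’t pyDOE’s API or the Java library’s; the full_factorial function and the condition names and values are made up for illustration.

```python
from itertools import product


def full_factorial(conditions):
    """Return every combination of the given conditions as a list of dicts.

    `conditions` maps a condition name to the list of values it can take.
    """
    names = list(conditions)
    value_lists = [conditions[name] for name in names]
    return [dict(zip(names, values)) for values in product(*value_lists)]


# Hypothetical conditions for a templated simulation input file
conditions = {
    "temperature": [300, 400],
    "pressure": [1.0, 2.0, 5.0],
    "potential": ["eam", "lj"],
}

runs = full_factorial(conditions)
print(len(runs))  # 2 * 3 * 2 = 12 combinations
print(runs[0])
```

Each dict in runs can then be fed to whatever fills in the templated input file.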

If you haven’t updated NDIToolbox since last time, it’s worth doing it now. Here’s where we are today.

Better support for UTWin data files, including preliminary support for compressed waveforms. That last one’s still highly experimental but let me know if it works for you; I don’t have access to a lot of sample data files for testing.

Squashed bugs, including better handling of memory errors when running a plugin.

(Developers) A new report module which provides a quick-and-easy way of generating simple PDF reports.

Source code has already been updated; binaries will follow shortly. I’ll have more to say on the report module in a later post.

Development on TRI’s nondestructive evaluation data analysis software NDIToolbox has slowed of late as we’ve gotten closer to our goal for functionality and as we get ready to do an honest-to-goodness field test later this year on a QA line. Nevertheless I’m still plugging away at it whenever I get the chance, and today I’ve got the latest and greatest available with two new features: support for multiple datasets in Winspect data files and a new “batch mode.”

The batch mode feature lets you run an NDIToolbox plugin on a set of input files, optionally spawning multiple processes to speed things up. If you have a ton of data files and you’re doing the same number crunching over and over, just point NDIToolbox to the files and the plugin and let it do it for you. You don’t have to convert your data files to HDF5 before using batch mode; as long as the file formats are supported by NDIToolbox it’ll fetch the data and run the plugin automatically. More info on batch mode is available here from my mirror of the NDIToolbox docs. If you’re going to use batch mode’s multiprocessing, be sure to read up on the requirements (basically, don’t have really huge data files).
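For a rough feel of what “run the same function over a pile of files, optionally in several processes” looks like, here’s a sketch using Python’s multiprocessing module. This is not NDIToolbox’s batch mode code; run_batch, summarize and the file names are hypothetical stand-ins.

```python
import multiprocessing


def summarize(path):
    """Stand-in for a plugin: return the file name and its length in characters."""
    return path, len(path)


def run_batch(func, paths, processes=1):
    """Apply func to every path, optionally across several worker processes."""
    if processes > 1:
        # func must be defined at module level so it can be sent to the workers
        with multiprocessing.Pool(processes) as pool:
            return pool.map(func, paths)
    return [func(path) for path in paths]


if __name__ == "__main__":
    # Hypothetical input files; summarize never opens them
    files = ["scan_001.csc", "scan_002.sdt", "scan_003.hdf5"]
    for name, result in run_batch(summarize, files, processes=2):
        print(name, result)
```

Each worker process holds its own copy of whatever it’s working on, which is one reason really huge data files and multiprocessing don’t mix well.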

As usual, I’d recommend using the conventional Python version of NDIToolbox if you can. If you’re on Windows and don’t want to install Python (or you want to run from a thumb drive), the Downloads section of NDIToolbox’s Bitbucket page has a Windows installer and a compiled version available, no Python required.

If you’re writing a plugin there’s one additional step required to support the new batch mode. Since more than a few nondestructive testing system file formats like UTWin’s CSC or WinSpect’s SDT can have multiple datasets in a single file, batch mode will send your plugin a dict of all the datasets it finds in a given input file. So you’ll need a bit of code to see if you’ve been passed a single dataset (conventional user interface) or a container full of datasets (batch mode). There are a few ways to do this but one of the most straightforward is to look for a “keys” attribute like so.
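A minimal sketch of that check, assuming a hypothetical run_plugin entry point and a stand-in process function; plain lists take the place of the NumPy arrays a real plugin would receive.

```python
def process(dataset):
    """Stand-in for the real number crunching: double every sample."""
    return [2 * sample for sample in dataset]


def run_plugin(data):
    """Handle either a single dataset or a container of named datasets."""
    if hasattr(data, "keys"):
        # Batch mode: an associative container, so process each dataset in it
        return {name: process(data[name]) for name in data.keys()}
    # Conventional user interface: a single dataset
    return process(data)


print(run_plugin([1, 2, 3]))
print(run_plugin({"waveform": [1, 2], "amplitude": [3]}))
```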

You could also just check to see if you were passed an actual dict, courtesy of isinstance(). I’d recommend against doing that for now though – better to just assume it’s an associative container of some sort rather than hard-wiring an expectation of an actual dict.

I haven’t had the chance to do much NDIToolbox work in the past month or so while I’ve been working on another project in the lab – it did involve lasers and a chance to play with C++ after many years’ absence so I’m not complaining. I did just push out an update this week that might be of interest if you’ve been running into memory problems. Hopefully this version’s a little more thoughtful when it comes to releasing memory it no longer needs.

Also in this version, I’ve added preliminary support for ultrasonic gate functions in the MegaPlot presentation. The functionality’s always been there but I’ve had it disabled until now while I was working out how to apply gates to three-dimensional data; I’m not 100% satisfied with the implementation but thought I’d enable it and come back to it later.

Update Wed Dec 26 12:23:44 CST 2012: managed to sneak some more work in on the project before the end of the year. It’s not in the documentation yet, but I’ve added exporting slices of a data file. Handy if you’re only interested in a subset of a much larger data file.

Finally, if you’ve ever wanted to just see screenshots and read about all of this NDIToolbox stuff instead of having to download everything, I’ve put a mirror of the current documentation up on the site. Have a look at the Quick Start for a primer on what NDIToolbox does, and the Plugins page to find out about…plugins. Developers might also be interested in how to write plugins. Sample plugin code is available that demonstrates how to write a server-based plugin, and how to combine Python with Java or C++.

NDIToolbox has been able to generate B-scans from ultrasonic data for a while now, but you had to know how to take slices of data. I just added a switch in the Megaplot presentation that will now do it for you automatically. Here’s what I’m talking about:

For comparison, here’s the usual Megaplot presentation:

Just check/uncheck Plot Conventional B-scans in the Plot menu to switch back and forth.

Also available for testing is a new NDIToolbox installer for the Windows binary distribution. Tested to work under Windows 7 and 8. Available from the NDIToolbox Downloads page. As always I recommend downloading the Python version rather than the precompiled binary since it’s so much easier to keep up to date, but it might come in handy if you don’t want to install a bunch of dependencies and just want to get started right away.

Another update to NDIToolbox today: I’ve just added the ability to import data from a couple of ultrasonic NDT systems. These imports are still a little flaky because a) we haven’t finalized the HDF5 format we’ll be using in NDIToolbox and b) proprietary binary file formats being what they are, but for what it’s worth I’ve used them on the data I could get my hands on from some immersion tank scans done here @ TRI World HQ and elsewhere and they will at least let you display your data, so it’s a start. Hopefully I’ll be able to improve their functionality and add a few more importers as the project goes on.

Other recent but decidedly less interesting changes:

Support for manual garbage collection – if you’re playing with large data and you get warnings about being out of memory, you can opt to clear some out to keep working. I’ll be implementing HDF5 slicing at some point so this is a temporary work-around.

Fixed a bug in plugins – plugin support folders can now contain Python modules.

All data retrieval functions are now in a separate module (models/datio.py) so your code can use them directly.

As always, the source code is up on Bitbucket and a Windows binary is available as well. These changes will also find their way into NDIToolbox Labs – we’re still plugging away on integrating the Automated Data Analysis (ADA) Toolkit into Labs but making progress.

Today we forked NDIToolbox into a new project, NDIToolbox Labs. We’ll be using Labs for experimental features before they go into the main NDIToolbox repository. Read on for more background.

One of my main responsibilities at TRI is working with Subject Matter Experts (SMEs), basically gurus in one particular field or another. A big part of my job is in helping SMEs write code from scratch or port it from MATLAB, R, C/C++, etc. to Python. The SME works out an algorithm, I code it up in Python, rinse and repeat.

I started the Labs fork because in many cases the SMEs aren’t familiar with Python or unit testing, but I didn’t want to slow their efforts down by insisting on tests for inclusion in the main NDIToolbox repository. Labs will be the unstable branch of NDIToolbox – stuff might break but it’s where all the cool new features will be. Once the SME is more or less satisfied with how their code is working in Labs, I’ll add the requisite tests and whatnot and port it to NDIToolbox stable.

One of the first new additions will be “Automated Defect Analysis,” a suite of code designed to read data and automatically locate anomalies in the sensor data. Instead of having an inspector scroll through 100 miles of pipeline inspection data for example, you’d let ADA read the data on its own and let it present you with a report of where it thinks cracks and pits were found.

Update Fri Aug 31 13:18:28 CDT 2012: Computational Tools’ first alpha of the ADA Toolkit has been added to NDIToolbox Labs. Although it is functional and can run ADA Models (specialized NDIToolbox plugins), it’s still in the early stages of development. I’ve also put together a Windows binary if you’d like to check it out and don’t have Python installed. Be sure also to download the ADA Model ZIP from the same page to see ADA Toolkit put through its paces, or copy the URL to the clipboard and download/install from ADA Toolkit itself.

Next up will probably be some Probability Of Detection (POD) models that you’d use to simulate an inspection to find out if the inspection would actually be able to detect the anomalies you’re interested in finding. Going back to the pipeline inspection example, a POD model might tell you whether a lower resolution scan would suffice to find damage, saving you time and money in the inspection.

So to summarize: if you don’t need the latest and greatest, I’d recommend sticking with NDIToolbox stable. If you need the latest and greatest, try NDIToolbox Labs but expect bugs.

I promised to post the plugin code I wrote for NDIToolbox that demos one way to use NumPy over a network, and here you go. The scenario here is that you’ve written a plugin that uses scikits-image; rather than make your users install scikits-image you’ve decided to host the number-crunching code yourself and have plugins that “phone home” for the results. After looking at your options you’ve decided XML-RPC and SimpleXMLRPCServer look promising.

As before, the server isn’t all that different from the sample code illustrating SimpleXMLRPCServer. You’ve decided to write a single “edge detection” server app that provides a few different edge detection algorithms through XML-RPC; your NDIToolbox plugins will send the server their NumPy data and the server sends back the detected edges for replot in NDIToolbox.
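Here’s a stripped-down sketch of what such a server could look like, written against Python 3’s xmlrpc.server module (where SimpleXMLRPCServer now lives). To keep it dependency-free, the image travels as a list of rows (what ndarray.tolist() would give you for a 2D array) and gradient_edges is a crude stand-in for the scikits-image call; both that function and start_server are made up for illustration.

```python
import threading
import xmlrpc.client
from xmlrpc.server import SimpleXMLRPCServer


def gradient_edges(image):
    """Stand-in edge detector: absolute difference between horizontal neighbours.

    `image` is a list of rows, e.g. the result of ndarray.tolist() on a 2D array.
    """
    return [[abs(row[i + 1] - row[i]) for i in range(len(row) - 1)]
            for row in image]


def start_server(host="127.0.0.1", port=0):
    """Start the XML-RPC server in a background thread; return (server, url)."""
    server = SimpleXMLRPCServer((host, port), logRequests=False)
    server.register_function(gradient_edges, "gradient_edges")
    threading.Thread(target=server.serve_forever, daemon=True).start()
    actual_host, actual_port = server.server_address
    return server, "http://{}:{}".format(actual_host, actual_port)


if __name__ == "__main__":
    server, url = start_server()  # port 0 lets the OS pick a free port
    client = xmlrpc.client.ServerProxy(url)
    image = [[0, 0, 9, 9], [0, 9, 9, 0]]
    print(client.gradient_edges(image))
    server.shutdown()
    server.server_close()
```

On the plugin side the returned list of rows would go back into a NumPy array (numpy.array(result)) before replotting.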

The first plugin you’ll write is a Sobel edge detection plugin – it’s a nice place to start with scikits-image since it doesn’t take any arguments.

You do list the URL to your edge detection server as a config option so that the user has the chance to update it if necessary.

Quick aside – in an NDIToolbox plugin, any field you add to the plugin’s config dict is presented to the user as an editable field. When the user runs the plugin, NDIToolbox brings up an additional config window that allows them to edit the configuration as required.

Here’s the before and after of the Sobel plugin on Windows:

The next edge detection algorithm your server provides is the Canny algorithm – unlike Sobel, there are some parameters that the user can configure in this method, so we’ll add them to the plugin’s config dict to make them user-editable.
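As a rough illustration of the config dict idea, here’s a sketch of a plugin skeleton. The class is hypothetical and leaves out the real NDIToolbox plugin base class; the parameter names (sigma, low_threshold, high_threshold), the placeholder server URL, and the assumption that edited values come back as strings are all mine.

```python
class CannyEdgeDetectionPlugin:
    """Hypothetical plugin skeleton; the real NDIToolbox base class is omitted."""

    def __init__(self):
        # Every key in the config dict is assumed to appear as an editable field
        self.config = {
            "server_url": "http://localhost:8000",  # placeholder URL
            "sigma": "1.0",
            "low_threshold": "0.1",
            "high_threshold": "0.2",
        }

    def canny_arguments(self):
        """Convert the user-edited strings into numbers for the server call."""
        return {key: float(value)
                for key, value in self.config.items()
                if key != "server_url"}


plugin = CannyEdgeDetectionPlugin()
plugin.config["sigma"] = "2.5"  # as if the user had edited the field
print(plugin.canny_arguments())
```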

And here’s the before and after on the same ultrasonic C-scan data as shown above for the Sobel edge detection.

(The Canny results aren’t as nice as the Sobel, but we can always tweak the input parameters until we get better results.)

To distribute your new NDIToolbox plugins, you could just provide your users with a copy of both plugin Python files and ask them to copy them to their NDIToolbox plugins folder, but NDIToolbox will do that for them automatically if you package them properly as ZIP files. To do that for the Sobel plugin for example, create a new ZIP sobel_edge_detection_plugin.zip and add sobel_edge_detection_plugin.py and a README to the archive such as the following.
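If you’d rather script the packaging step than build the archive by hand, something along these lines should do it with Python’s zipfile module. The package_plugin helper is hypothetical and the README text is only a placeholder, not the README layout NDIToolbox expects.

```python
import os
import tempfile
import zipfile


def package_plugin(plugin_path, readme_text, archive_path):
    """Bundle a plugin module and a README into a ZIP archive."""
    with zipfile.ZipFile(archive_path, "w", zipfile.ZIP_DEFLATED) as archive:
        # Store the plugin at the top level of the archive, without its folder path
        archive.write(plugin_path, arcname=os.path.basename(plugin_path))
        archive.writestr("README", readme_text)
    return archive_path


if __name__ == "__main__":
    workdir = tempfile.mkdtemp()
    plugin_file = os.path.join(workdir, "sobel_edge_detection_plugin.py")
    with open(plugin_file, "w") as handle:
        handle.write("# plugin code goes here\n")
    archive_file = package_plugin(
        plugin_file,
        "Sobel edge detection plugin (placeholder README)\n",
        os.path.join(workdir, "sobel_edge_detection_plugin.zip"),
    )
    with zipfile.ZipFile(archive_file) as archive:
        print(archive.namelist())
```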

Once we’ve created the sobel_edge_detection_plugin.zip file, we can directly provide the archive to the user and have them perform a local installation. The other option is to host the archive on a server and allow the user to perform a remote installation such as the Linux user shown below.