Friday, December 21, 2012

Big Data projects often involve a lot of stovepipe processing, and visualizing data flows is a powerful way to convey data provenance to the end user, and even allow control of the involved processes.

There are a number of tools available to visualize data flows, but most suffer from some limitation. Some allow labeling of nodes but not edges. Some have no provision for placing the node label inside the node. Most use general-purpose graph layout algorithms such as "gravity", whereas dataflow diagrams have distinct start nodes and end nodes and are better represented mostly orthogonally, either top-down or left-right. (The well-known dataflow language G from LabVIEW is left-right, but top-down is better suited to web browsers.) Graphviz can generate a nice top-down layout, but on the web it can at this time produce only static images (albeit dynamically), with no mouseovers to facilitate drill-downs into processes or data stores.

Dagre, a Javascript library built on top of the D3.js visualization toolkit, is very well-suited to visualizing Big Data data flows. Below is an example (contrived, but illustrates the idea):

Saturday, December 1, 2012

Apache Thrift, originally developed by Facebook, is an immensely useful general-purpose inter-process communication (IPC) code-generation tool and library. Although it supports a variety of IPC mechanisms, sockets are its primary conduit, so it is naturally language agnostic; its code generator can target a dozen different languages.

I use it to provide PHP web interfaces to monitor and control C++ scientific/industrial semi-embedded systems (desktop PCs loaded with data acquisition and control hardware).
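For flavor, the service for such a monitoring interface might be declared in Thrift IDL along these lines (a hypothetical sketch with made-up names, not from an actual project; the Thrift compiler would generate the matching PHP client and C++ server stubs from it):

```
service SystemMonitor {
  double getTemperature(1: i32 channelId),
  list<double> getWaveform(1: i32 channelId),
  void startAcquisition(),
  void stopAcquisition()
}
```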

Sometimes, those PCs are running Windows. With the recent 0.9.0 release, Apache Thrift support for Windows is leaps and bounds beyond what it used to be, but it's still "only" 98% of the way there. Here are the missing steps:

First of all, the good news: use on Windows no longer requires Cygwin or MinGW, despite what the outdated documentation states.

You will, however, still need to compile the Thrift libraries yourself if you plan to use Thrift with a compiled language such as C++. Thankfully, the Thrift distribution comes with a Microsoft Visual C++ .sln solution file. The thing to know, however, is that it is a Visual C++ 2010 .sln file and will not work with Visual C++ 2008. You can use Visual Studio 2012, but recall that Visual Studio 2012 does not work with XP, which I still use for development because of both data acquisition hardware drivers and some legacy software development tools (for some legacy codebases). Thankfully, you can use the freely available Visual C++ 2010 Express, which is still available for download even though Visual Studio 2012 has been released. To download an ISO (to preserve your ability to reinstall in the future) instead of a stub/Internet download, select the "All-in-One ISO" option.

The \thrift-0.9.0\lib\cpp\thrift.sln solution contains two projects: libthrift and libthriftnb. libthriftnb is for the non-blocking server; to use it, you must link in both libthrift and libthriftnb, as well as instantiate TNonblockingServer instead of TSimpleServer. Note that "non-blocking" means non-blocking from the client's perspective. On the server side, the call server->serve() still blocks. To make either TNonblockingServer or TSimpleServer non-blocking from the server code's perspective, just wrap it inside a new boost::thread().

Compiling libthriftnb is trickier. First, it requires libevent. To compile libevent, open Start->All Programs->Microsoft Visual Studio 2010 Express->Visual Studio Command Prompt (2010), navigate to the libevent directory, and run nmake -f Makefile.nmake. Second, libthriftnb pulls in Thrift library code that does #include <tr1/functional>; since Visual C++ 2010 doesn't provide that TR1 header, you can just replace it with <boost/functional.hpp>.

Tuesday, October 30, 2012

If you are using a 32-bit Windows installer, it is not straightforward to have it install a 64-bit driver. There are at least two reasons why you might be in this situation:

Your installer software is not the latest (or maybe doesn't even have a 64-bit version yet) ... or ...

Most of your components are 32-bit with just one or two that you want to differentiate 32 vs 64 bit.

The problem arises because when shelling out to msiexec.exe from a 32-bit installer (in the case of InstallShield, whether from InstallScript or from a Custom Action), the 32-bit C:\Windows\SysWOW64\msiexec.exe gets executed instead of the 64-bit C:\Windows\System32\msiexec.exe.

The basic answer comes from TechNet, and the VB.Net code below is adapted from it with a slight improvement. By compiling the VB.Net code into an executable and shelling out to that as an intermediary, the 32-bit world can be escaped. The slight improvement in the code below is that it preserves quotes around quoted arguments, such as pathnames with spaces.

Then the InstallScript to invoke it is below. It detects whether the OS is 64-bit, and if so installs the 64-bit drivers via the VB.Net code above (which is compiled to an executable cmd64.exe); otherwise, it installs the 32-bit drivers.

Below is the bit of magic to draw the XML data up into Javascript memory space. Assuming there is a Javascript constructor called ChartSeries that takes four parameters (name, array of x values, array of y values, color), the code below uses XSL to shove the x values inline, in comma-separated form, into the Javascript.

Friday, October 19, 2012

Using a Radeon 7970 as a GPGPU, I was running into some seeming limitations on how quickly I could download data off of the board into main CPU RAM. There seemed to be about a 20 MB/sec limit for the board, which is of course nowhere near the 16 GB/sec limit of PCIe 3.0 x16. It turns out the limitation is for a single work unit (out of the 2048 work units/processors on the board). It also turns out that because writes to global memory (i.e. memory sharable with the CPU host) are so expensive, it can often become more important to parallelize the memory writes than to parallelize the computations! To me, this was counterintuitive because I envisioned writes to shared memory as being serial and fast, but they instead seem to be on some kind of time multiplex across the multiple work units.

Consider the following code that computes the first 1024 Fibonacci numbers, and does so 1024 times over:

Clearly the memory transfer is taking the bulk of the time, and the computation of Fibonacci numbers hardly any time at all. The way to speed it up is to speed up the memory write, but what could possibly be faster than async_work_group_copy()? It turns out there is a bit of intelligent cache maintenance going on behind the scenes. If we can write to buff[] from multiple work units, then async_work_group_copy() can pull the data from the memory associated with multiple work units, and it goes much faster.
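The revised kernel might look something like the following hypothetical sketch (illustrative names and sizes, not the article's actual code): several work-items fill the local buffer in parallel, then one async_work_group_copy() streams it to global memory in bulk.

```c
#define N 1024
#define WORK_ITEMS 8

__kernel void fib(__global float *buff) {
    __local float local_buff[N];
    int lid = get_local_id(0);

    // Each work-item fills a strided subset of the output slots, so the
    // writes to local memory are spread across WORK_ITEMS work-items.
    for (int i = lid; i < N; i += WORK_ITEMS) {
        float a = 0.0f, b = 1.0f;
        for (int j = 0; j < i; ++j) { float t = a + b; a = b; b = t; }
        local_buff[i] = a;   // i-th Fibonacci number (as a float)
    }
    barrier(CLK_LOCAL_MEM_FENCE);

    // One bulk local-to-global copy; with the buffer populated by
    // multiple work-items, the transfer runs much faster.
    event_t e = async_work_group_copy(buff, local_buff, N, 0);
    wait_group_events(1, &e);
}
```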

With 8 work units, this runs at 122 MB/sec.
That's a 6x speedup for increasing the number of work units by 8x! We could no doubt speed it up even more by increasing the look-ahead to increase the number of work units.

Recall that when we commented out the computation completely it was only 38 MB/sec, so the speedup is from parallelizing the memory writes, not from parallelizing the computation.

Tuesday, October 2, 2012

Desktop supercomputing is now cheap, mainstream, and mature. Using GPGPU (General Purpose computing on a Graphics Processing Unit), you can write C programs that execute 25x as fast as a high-end desktop computer alone for just $500 more.

The OpenCL standard, started in 2008, is now mature. It provides a way for C/C++ programs on Windows and Linux to compile and load special OpenCL C programs onto GPGPUs, which are just off-the-shelf high-end graphics cards that videogame enthusiasts usually buy. When you buy one of these cards for your supercomputing project, expect lots of snickers from your purchasing or shipping/receiving department when it arrives with computer videogame monsters on the box.

As an example, the approx. $500 Radeon 7970 has 2048 processing cores on it, each capable of double-precision floating point, running at about 1 GHz, and executing on average one double-precision floating-point operation per clock cycle. Double precision is actually new to this generation of Radeon, and the PDF of the OpenCL standard hasn't even been updated yet to include the data type, even though the SDK's API header files have been.

Using the freeware GPU Caps software, the Radeon 7970 by itself (without assistance from my desktop computer's 3.3 GHz Intel i5 2500) clocks in at 25x the computational power of the four-core (single-processor) Intel i5 by itself.

To get a dual-processor Intel motherboard and a second Intel processor is a $1000 increment, and that's only a 2x speedup; a 25x speedup for a $500 increment isn't just a better deal, it's a new paradigm. As Douglas Engelbart said, a large enough quantitative change produces a qualitative change.

Up to four such cards can be ganged together in a single computer for a total 100x speedup. But since each card is physically three slots wide (to accommodate the built-in liquid cooling and fans) even though it has just one PCIe connector, you will need a special rack-mount motherboard to go to that extreme (note I have not tried this!).

By comparison, going 100x in the other direction -- to a computer with 1% of the computational power of my desktop i5 -- would require going back 15 years to a Pentium II. So a four-Radeon system represents a sudden 15-year leap into the future.

Sunday, September 9, 2012

libav, which is intended for codecs, also serves as a nice signal-processing library for scientific applications, as it has assembly FFT routines optimized for various processors, including x86 SSE and ARM NEON. Written in C, for C, it can be a little tricky to use from C++. The first trick is simply to wrap the #includes in extern "C" { } so the C++ compiler uses C linkage.

The second trick is that to use the nice C++ array types std::vector or boost::multi_array, it is necessary to use a custom allocator class that calls libav's av_malloc() and av_free(). The reason is that these ensure the 32-byte alignment that libav assumes is present (without checking every time) when it utilizes SIMD instructions. Without a custom allocator, std::vector and boost::multi_array just use new[], which does not allocate aligned memory, and libav, in the process of ANDing addresses, ends up running past the end of the buffer and generating a segmentation fault.

The code below uses a custom allocator adapted from The C++ Standard Library -- a Tutorial and Reference. The advantage of using std::vector or boost::multi_array, of course, is automated memory management using the Resource Acquisition Is Initialization pattern/idiom, similar to C++ auto_ptr (which can't be used for arrays because it is hard-coded to delete instead of delete[]). Although std::vector is used below, the same allocator works equally well with boost::multi_array.

Tuesday, August 28, 2012

The Samsung 11.8, information about which came out of the Apple lawsuit, is a good match for in-the-field scientific/data acquisition/NDT applications. Why? Because it will be the first tablet with an ARM Cortex-A15 processor, specifically the Samsung Exynos 5.

It's just an ARM processor, right? ARM means extremely low power consumption but mediocre computational performance, right? Wrong -- ARM has grown up, maintaining its extremely low power consumption while catching up to Intel performance. The Cortex-A15 is a huge step forward in that regard. It features NEON, the ARM equivalent of Intel MMX/SSE/AVX vector processing, which speeds up by multiples the signal and image processing used in scientific computing. And the NEON in the Cortex-A15 can do double precision. Oddly, the current-generation Samsung Galaxy tablet, the 10.1, dropped NEON even though its even older predecessor had it. But the Samsung 11.8's Cortex-A15 NEON has 128-bit ALUs instead of the 64-bit ALUs of earlier ARM NEON processors.

The Samsung 10.1 has thus far escaped injunction from Apple, so there is hope the 11.8 will as well. The iPad won't get Cortex-A15 until the iPad 5. There is hope that the Samsung 11.8 will be unveiled tomorrow at the Unpacked event in Berlin.

Then, thanks to stackoverflow.com, I learned that smatch retains just pointers into the source string "helloworld.jpg", not actual substrings. So when the temporary object string("helloworld.jpg") gets released, the smatch pointers are left pointing to released (and thus undefined) memory.

The correct code is:

// Good code
smatch sm1;
string s1("helloworld.jpg");
regex_search(s1, sm1, regex("(.*)jpg"));
cout << sm1[1] << endl;
Now, in real code you wouldn't pass a hard-coded temporary object, but during debug/development you might to test out how regex handles different scenarios, which is what I was trying to do when I ran into this.

Thursday, June 21, 2012

With HTML5 canvas and its built-in ability to rotate text, I came across (i.e. learned the hard way) the limitations of bitmap fonts, such as MS Sans Serif, in XP. XP has trouble rotating bitmap fonts at small point sizes, while Windows 7 does not. So the following HTML5 canvas code works under Windows 7 & Firefox but fails under XP & Firefox:

Sunday, June 17, 2012

I was initially excited about the UI design philosophy of Win8 Metro.
But then I realized that HTML5 can do 95% of what Metro can, and also be
truly cross-platform.

I see HTML5 as the cross-platform holy grail that developers have been
seeking since the WORA days of Java 15 years ago. First it was supposed
to be Java, but Microsoft embraced and extinguished it, and besides, it
had too big a download footprint (and a clumsy download process to
boot). Then Flash was supposed to be the universal small-footprint
solution. It was just about to take off when Apple extinguished it by
not supporting it at all (completely skipping the "embrace" step). Then
Microsoft finally decided to stop holding back .NET from web development
-- the purpose for which it seemingly was originally designed but never
delivered upon until Silverlight. But by then Windows market share was
too small for Microsoft to force a Windows-only solution on the web
world.

Even when it comes to CPU-intensive signal & image processing, Javascript seems to be "fast enough". E.g., these guys show a real-time 2-D FFT. Admittedly, the combination of SSE/AVX and multi-threading/multi-core would have provided a 30x speedup, but I've been playing around with real-time 2D graphics in JavaScript and have been amazed at its performance. I was even guilty of premature optimization -- I started out coding double-buffered graphics with two Canvases and ended up throwing out the double-buffering because with just one Canvas there was no flicker.

On today's processors, Javascript will be "fast enough" for many applications. E.g., I do scientific software for a living, and I'm partitioning the work into what has to be done natively -- mostly the acquisition and crunching of tens of gigabytes of data at a time -- vs. what can be done cross-platform -- the final post-processing of tens of megabytes of pre-processed data.

Wednesday, February 22, 2012

It's not enough to move data stores from Program Files to ProgramData when upgrading applications from XP to Vista/W7. It's also necessary, during application installation, to set the permissions on those files to read/write for the group "Users". Otherwise, Vista/W7 will silently virtualize those files into per-user shadow copies under the Users folder.
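One way an installer can grant those permissions is to shell out to Windows' icacls tool (a sketch, assuming a hypothetical C:\ProgramData\MyApp data folder):

```shell
rem Grant the built-in Users group inheritable Modify rights.
rem (*S-1-5-32-545 is the locale-independent SID for the Users group;
rem (OI)(CI) makes the grant inherit to files and subfolders.)
icacls "C:\ProgramData\MyApp" /grant *S-1-5-32-545:(OI)(CI)M
```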