Intro

The core competence in doing science is reproducibility. Everyone who has
achieved cold fusion in their kitchen knows that; so why would Software Defined
Radio developers settle for anything less than a fully integrable, graphical
workflow (that, of course, can easily be scripted)?

Currently available (and in a really badly documented state, to make the SDR
guys feel as much at home as the cold fusion experts) is the code for all the
tools that you'll need to define bigger tasks and run them locally.

The Workflow

Now, without delving too deep into the software architecture below, the graphical workflow for running flowgraphs should look something like this:

Now, how does that work in detail?

Let's move along a very simplistic example, so we don't get confused in the process.

Developing Applications in GRC

Now, a flowgraph that we want to test with a lot of different parametrizations should run with at least one of them; so let's create such a flow graph:

As you can see, absolutely no magic happens here: We have a constant source for which we set the value based on our value variable:

The same happens for the number of samples (as limited by the head block) and
the length variable.

Then there's the sink. It's a plain, old, boring, vector_sink, available
everywhere. I decided to name it value_sink.

Phew. That was actually a bit boring. So let's now have a look at what my GNU Radio branch actually introduces into the GRC:

Defining Tasks using task_frontend

Clicking that button saves the Flow Graph to a file, and runs task_frontend
with that file:

What we see here is the configuration tab of the task definition user
interface. The instruction "RUN_GRC" is selected, an option that will embed the
source code of our GRC file into the task and only generate a python file when
running.

Now, since we know (ok, we just assume this for the sake of this example) that
we have used the companion to create a flow graph whose python implementation
we install as a module in our out-of-tree module, we want to change that to
"RUN_FG":

I've already filled in my own module's name, mtb, and the name of the python
module, extraction_test_topblock.

Now that we have the task defined as is, we should have a look at the available
sinks. These are the points where we actually extract data out of the flow
graph after running it. Analyzing the flow graph, the task frontend has already filled in every sink it could find.

Now, having defined only one sink in our flow graph, this is adequate. However, you might have multiple vector sinks, and if you're not planning on analyzing data from every one of them, you might just want to remove some sinks.

Ok. Now the interesting part: The parametrization tab. As you can see, it's a table of available parameters. Now, by default these all are set to "STATIC", meaning that at benchmarking time, that specific value will be set and kept constant for all runs.

However, as you can see, I've set the type to LIN_RANGE and LIST respectively. They work as follows:

When using LIN_RANGE, one specifies a triplet of numbers: (start, stop, number_of_steps); at benchmarking time, number_of_steps values equidistantly spread over the interval [start, stop] will be set.

When using LIST, the supplied text will be evaluated and converted to a list, which will be stored in the task file. Valid entries are 1,10,100, things like range(100), or any other python expression that can be evaluated and converted to a list.
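To make those semantics concrete, here is a small sketch in plain Python of how such parametrization entries could expand into concrete value lists; the function name and layout are invented for illustration, not taken from the actual task_frontend code:

```python
def expand_parameter(ptype, spec):
    """Expand one parametrization entry into the list of values to sweep.

    ptype is one of "STATIC", "LIN_RANGE" or "LIST"; the semantics
    mirror the description above, not the real task_frontend internals.
    """
    if ptype == "STATIC":
        return [spec]  # one constant value, kept for all runs
    if ptype == "LIN_RANGE":
        start, stop, number_of_steps = spec
        if number_of_steps == 1:
            return [start]
        step = (stop - start) / (number_of_steps - 1)
        # equidistant values over the closed interval [start, stop]
        return [start + i * step for i in range(number_of_steps)]
    if ptype == "LIST":
        # any python expression that evaluates to something list-like,
        # e.g. "1,10,100" or "range(100)"
        return list(eval(spec))
    raise ValueError("unknown parametrization type: %r" % ptype)
```

For example, expand_parameter("LIN_RANGE", (0.0, 1.0, 5)) yields [0.0, 0.25, 0.5, 0.75, 1.0], and expand_parameter("LIST", "range(3)") yields [0, 1, 2].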

Now, let's run the task from the File menu!

Visualizing Results

After a while, once all 150 measurement points have been run (3 values for length times 50 for value), the visualization user interface will appear:

Now, if we don't select a variable from the top list, we can look at the different values our parameters have. I chose the length parameter here, which, to little surprise, assumes the values of 5, 20 and 30. You can also select multiple lines, but since our value parameter is only between 0 and 1, we won't see much here.

Now, let's find out if our experiment is sane:
We select value_sink, of which we know that the sink value (reduced to a single number by applying a mean) should always assume the number that is set for the flow graph parameter called value; therefore, we select value in the parameters list:

Luckily, GNU Radio still works, and the number we put in is the number we get.

Architecture, Networking and Scripting

As you can see, the task_frontend tool has functionality to load and store tasks as JSON; the simple reason is that you might not always want to run tasks locally, for example if you have little time but many PCs at hand, or if you want to test the same flow graph on a lot of different machines. Now, to understand how network based execution can take place, I'll explain in tomorrow's blog post how the whole system is set up, how to use it to generate versatile flow graphs, and how to use my python classes in your own applications.

As the midterm evaluations are upon us, I should take the time to share my progress (or lack thereof) with you.
Wishing that more of my original plans would have worked out, this feels quite unsatisfactory for me; so here's my state and what I'm planning to do about it:

Benchmarking

Whole-application benchmarks

Like gr-benchmark, the key to these is the clever usage of what is available as Performance Counters from within applications. This means that we should be able not only to read how much time each block has consumed in comparison to the total time, but also how much of that was spent doing actual computation, searching for tags, etc. As described, this calls for a general approach to make it easier for block developers to measure and publish these values.
Sadly, this hasn't gotten very far.

Data Extraction Blocks

The remote_agent (see below) automatically extracts vector_sink_*s from the flowgraphs employed and collects their data; this seemed so much more intuitive and non-repetitive that I have, for now, abandoned the idea of writing new extraction blocks. However, with the current agent structure, any property of a flow graph can be measured.

Infrastructure

This was where most of my time went so far -- I've made a few bad design choices, so I'm right now in the process of rewriting most of my dispatcher infrastructure to be zeroMQ-based instead of relying on execnet.

RPC Framework

RPC with python is rather easy. Execnet has a function remote_exec(obj), which takes a python string, a method name or a module name, and executes it on a remote target after bootstrapping there autonomously (i.e. you only need SSH access; execnet will transfer itself and the module source code over there without your help). Sadly, I relied too much on things working remotely like they do when I use a local gateway. There is one caveat, though: remote_exec does not guarantee that state is kept between calls. That basically means that everything you want to do has to be done in one standalone module file. Also, the execnet communication method relies on channels -- which are fine, but things like "this gateway dropped out, you can't use the channel anymore to communicate" mean a lot of exception handling everywhere, and execnet's channel model does not seem to have been written with continuously running servers, especially multithreaded ones, in mind.

Since that was clearly not the direction I wanted to take, I have reduced my execnet usage to two things:

bootstrap_agent

Which will test if you can import gnuradio (or anything else you ask for) on the remote machine, can download and transfer the pybombs package to the remote, extract such packages to a common path, and call pybombs to install GNU Radio into a definable prefix.
This prefix can later be used when setting the python library search path as well as the binary and shared object paths.

remote_agent

The actual zeroMQ server. It has multiple ZMQ sockets, so I might have to explain a few ZMQ concepts first:

ZMQ sockets behave somewhat like sockets, done right. You can have different types of sockets for different types of information flow, and ZMQ will make sure that data gets where you want it

ZMQ Request/Reply Sockets: Like their unixoid brothers, these are point-to-point connections. Functionally, these behave like you would expect from a client (Request) or server (Reply) socket

ZMQ Push/Pull Sockets: These behave more like a global round robin queue: A number of Pull sockets may each consume messages from the Push socket, popping them from the queue; alternatively, multiple Push sockets might feed a single Pull one

ZMQ Publish/Subscribe Sockets: Think GNU Radio message passing. An arbitrary number of subscribers might subscribe to a publisher, and all published messages reach all subscribers (they have individual queues)
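A quick sketch of the Push/Pull round-robin behaviour using pyzmq (assuming pyzmq is installed; the inproc transport keeps it self-contained):

```python
import zmq

ctx = zmq.Context.instance()

# the dispatcher side: one PUSH socket handing out tasks
push = ctx.socket(zmq.PUSH)
push.bind("inproc://tasks")

# two workers, standing in for two pulling agents in a pool
pulls = []
for _ in range(2):
    sock = ctx.socket(zmq.PULL)
    sock.setsockopt(zmq.RCVTIMEO, 2000)  # fail instead of blocking forever
    sock.connect("inproc://tasks")
    pulls.append(sock)

# four tasks get distributed round-robin over the two PULL sockets
for n in range(4):
    push.send_json({"task": n})

# each worker drains its share of the queue
received = sorted(sock.recv_json()["task"] for sock in pulls for _ in range(2))
```

The Publish/Subscribe pair works analogously, except that every subscriber gets a copy of every message instead of the messages being divided up.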

So there's three interfaces (four if you count the execnet bootstrap) in my remote_agent:

Using execnet, the remote_agent gets bootstrapped and an instance of the class of the same name gets started

Upon instantiation, a reply socket binds to the address received via the original execnet channel

The remote agent receives commands on that reply channel; typically, these cover things like setting a unique name, setting the python paths, but especially setting up addresses for the dispatcher's push socket

The Pull socket (given the appropriate control socket command) connects to the dispatcher's Push socket; this way, multiple remote_agents can easily form a pool for tasks that need to be computed by only one agent (e.g. speeding up a laaarge BER curve by letting the computers in your lab run the simulation for parts of the noise power range each)

The Subscribe socket attaches (on command) to the dispatcher's Publish socket. This is convenient for all-have-to-do-this tasks, like running your new trellis module on the six different computing platforms and 2 VMs for testing and benchmarking

Task Storage

Everything is done in JSON (ZMQ even has nice wrapper functions that can receive and send JSON and give you python objects); so storing tasks is directly done in that format and loadable from disk.
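So a task file is just a JSON document along these lines; the field names here are invented for illustration -- the real schema is whatever task_frontend happens to write:

```python
import json

# a hypothetical task, mirroring the options from the frontend example
task = {
    "instruction": "RUN_FG",
    "module": "mtb",
    "top_block": "extraction_test_topblock",
    "parametrization": {
        "value": {"type": "LIN_RANGE", "spec": [0.0, 1.0, 50]},
        "length": {"type": "LIST", "spec": "[5, 20, 30]"},
    },
    "sinks": ["value_sink"],
}

# serializing for disk storage or for a ZMQ send_json call ...
blob = json.dumps(task, indent=2)
# ... and loading again round-trips cleanly between machines
restored = json.loads(blob)
```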

Result Gathering

Results from the remote agent contain the ID of that agent; however, consistent ways to concentrate and tabularize these JSON strings/python dictionaries have yet to be found.

Integration

The integration objective is to extend the GNU Radio Companion in such a way that defining benchmarking parameter sweeps is possible from within GRC. This means that I want to be able to choose "distributed benchmark" as a generate option and get an interface to define the overall parameters to test, plus a way of defining which machines are subject to these benchmarks.

For now, I have settled on an approach that will work with the default "no GUI" generate option: GRC generates a top_block subclass. By walking through the members of an instance of that, we can find the vector_sinks and extract data from them. For parametrization, users can use the normal variable blocks, and the remote_agent translates parametrization tasks into calling the setters of these variables prior to running the block.
Note that this is not the desired level of integration, since it neither allows the user to define which variables should take a range of values, nor assists them by offering a list of the existing variables.
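The member walk itself is plain duck typing; here it is sketched with stand-in classes, since a GNU Radio install isn't assumed (the real code walks an instance of the generated top_block subclass in the same way):

```python
class FakeVectorSink:
    """Stand-in for a GNU Radio vector sink: anything with a data() method."""
    def __init__(self, items=()):
        self._items = list(items)

    def data(self):
        return tuple(self._items)


class FakeTopBlock:
    """Stand-in for a GRC-generated top_block subclass."""
    def __init__(self):
        self.value_sink = FakeVectorSink([0.5] * 4)
        self.samp_rate = 32000  # plain attributes get skipped by the walk


def find_sinks(top_block):
    # collect every member that quacks like a vector sink
    return {name: member for name, member in vars(top_block).items()
            if callable(getattr(member, "data", None))}


sinks = find_sinks(FakeTopBlock())
```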

The most elegant way to do that is, from my current point of view, to extend the Cheetah template used to generate the python files, adding @property decorators along with custom decorators for range setting; this is in the "where do I integrate that into GRC without messing up the concept too badly" phase.
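Roughly, the extended template could emit something like the following per variable block; every name here is hypothetical -- this is the idea, not actual Cheetah output:

```python
# registry of variables that may take a range of values, filled at
# class-definition time by the custom decorator below
SWEEPABLE = {}


def sweepable(name, default_range=None):
    """Custom decorator: mark a generated property as range-settable."""
    def mark(prop):
        SWEEPABLE[name] = default_range
        return prop
    return mark


class top_block_sketch:
    """Roughly what the extended template could generate for a variable."""
    def __init__(self):
        self._value = 0.5

    @sweepable("value", default_range=(0.0, 1.0, 50))
    @property
    def value(self):
        return self._value

    @value.setter
    def value(self, new_value):
        self._value = new_value


tb = top_block_sketch()
tb.value = 0.9  # the setter the remote_agent would call per run
```

At benchmarking time, the frontend could then offer exactly the variables found in SWEEPABLE for range definition, instead of guessing.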

Feedback on my Problem Synopsis

Bogdan mailed me some feedback. Since he brought up interesting aspects, I asked him if he'd be okay with me sharing his mail and my reply on this blog; he said yes, so here we go:

Hi Marcus,

I like the approach you take by looking at what real life users will want from gnuradio when transitioning from an academic perspective to realtime system. Measurements are always not enough to understand the specifics of your system so I'm looking forward to see how your project provides measurements to the gnuradio user.

Building on block computing performance measurements there is one thing I would like to see in gnuradio and that is a flowgraph optimizer.

To be more specific, a flowgraph optimizer would try to adapt the parameters of the blocks (e.g. the data chunk passed to each block) in order to optimize one/more parameter(s) of a flowgraph (e.g. overall processing time). In a normal way this optimizer should be run just once to determine the optimum parameters that will be used subsequently. If we see the problem to solve from a general perspective, the optimizer would fall in the category of multi-objective optimization, which has numerous solutions and has been thoroughly discussed in academia and industry (gaming is usually doing multi-objective optimization through AI). Another real-life example would be the optimizer in the 4Nec2 antenna simulation program that uses AI to optimize the antenna when a set of objectives (variables) is set by the user, e.g. minimum SWR, Z close to a value, etc.

In my opinion gnuradio will really benefit from such an optimizer as the values of block parameters can provide quite different end results.

Not sure if this can be part of your GSOC project but I thought it worth mentioning to you and gnuradio users on this list. Maybe can be part of the next GSOC.

Thanks,

Bogdan

A few Thoughts on How GNU Radio inherently optimizes

This brought up a lot of interesting points, so here's my reply:

Hi Bogdan,

thanks for your comment :)

Such an optimizer would be really, really fancy.

In a way, though, GNU Radio already does this when running a flow
graph: It just asks blocks to compute a reasonable amount of items to
fill up the downstream buffer. This (conceptually simple) approach is
actually one of its great strengths, because it just keeps the
computer "busy" as much as possible.

There might be space for optimization, though, I agree: Maybe it would
be better for some blocks just to wait longer (and thus, not utilize
the CPU) if it was computationally beneficial to work on larger
chunks, as long as there are enough other blocks competing for CPU power.

However, this leads to the problem of balancing average throughput
with latency.

What the GNU Radio infrastructure does to approach this is actually
quite simple:
1. Although it might be "fastest" to process all data at once, buffer
lengths set a natural limit to the chunk size, and thus latency. So we
have an upper boundary.
2. It is best to be as close as possible to that upper boundary. To
achieve that, block developers are always encouraged to consume as
many input items and produce as much output as possible, even if the
overhead of having multiple (general_)work calls is minute. This
ensures that adjacent blocks don't get asked to produce / consume
small item chunks (which would happen if they were in a waiting state
and a small number of items was produced or a small amount of output
buffer was marked as read).

Optimizing this will be hard. Maybe one could profile the same
flowgraph with a lot of different settings of per-block maximum output
chunk sizes, but I do believe this will give little more information
than what the block developer already knew when he optimized the
block in the first place. If he didn't optimize, his main interest
will be whether his block poses a problem at all; for that, standard
settings should be employed.

To give developers an API to inform the runtime of item amount
preference, different methods exist, however. I'll give a short
rundown of them.

1. Most notable are the fixed_rate properties of gr::block, as
implemented in sync_block, and the decimator and
interpolator block types
2. If your block will only produce multiples of a certain number of
items, set_output_multiple is a method that will potentially
decrease the overhead introduced by pointless calls to forecast and/or
work.
3. In hardware optimization, alignment is often the performance
critical factor. To account for that, set_alignment was introduced.
It works very similarly to set_output_multiple, but does not
enforce the multiples; instead, it sets an unaligned flag if
non-multiple consumption occurred. The runtime will always try to
ensure that the start of your current item chunk is memory-aligned
to a certain item multiple. If, however, less was produced, your
block might still be called, to keep the data flowing.
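To sketch what I mean (plain Python, a loose stand-in for the real
logic in block_executor, not actual GNU Radio code):

```python
def trim_noutput(navail, output_multiple=1, alignment=1):
    """Mimic, roughly, how the scheduler could pick noutput_items.

    Returns (noutput_items, aligned); noutput_items == 0 means
    work() would not be called at all.
    """
    # set_output_multiple is enforced: work() only sees exact multiples
    n = navail - navail % output_multiple
    if n == 0:
        return 0, False
    # set_alignment is only a preference: round down when possible,
    # otherwise call anyway and flag the call as unaligned
    aligned = n - n % alignment
    if aligned > 0:
        return aligned, True
    return n, False
```

E.g. trim_noutput(10, output_multiple=4) gives (8, True),
trim_noutput(3, output_multiple=4) gives (0, False), and
trim_noutput(3, alignment=8) gives (3, False) -- the
unaligned-but-still-called case that keeps the data flowing.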

To properly apply these flags, you'll basically need a human
understanding of what the block does. It may, nevertheless, be very
helpful to understand how well your block performs with different item
chunk sizes. To realize that, some mechanism to change scheduling
behavior is needed.

I will look into that; I think it should be possible to manipulate
the block_executors into changing their forecasting/work calling
behavior at runtime, but I'm quite sure that this will bring new code
into the main tree [1].

All in all, right now I'm really stuck with what I actually want to
improve with the performance analysis of GNU Radio flowgraphs offered
by performance counters/gr-perf-monitorx, because they address many of
these issues already. Your execution-per-item over chunk size idea is
excellent!

I'll really have to take a deeeep look at block_executor and the tpb scheduler to tell. If I decide to add functionality that introduces significant runtime overhead or changes too much of the internal behaviour, no one will be pleased, so I might take this slow and will have to discuss it with experienced core developers. I'm not very hesitant when it comes to fiddling with in-tree source code, but my workings almost never make it to the public, because I always figure they don't address a problem properly, or break too much in comparison to what they can possibly improve.

Conclusions

To my pleasant surprise, I've found a real extension of the measurement scope of the currently existing performance counters. This will actually help me find a first point of attack for gr-benchmark, which is proving to be more versatile than I thought when writing my proposal.

While writing my application for Google Summer of Code, I was actually considering a lot of things that I had wanted to tackle in GNU Radio for quite a while. Amongst optimization, improved documentation, distributable flow graphs and a lot of other things, one issue stood out:

GNU Radio is actually easy to use, but it's hard to prove that you're using it properly. To explain this, we'd have to take a look at the purposes GNU Radio serves today:

1. GNU Radio is used for academic purposes. That is, almost always for starters, for creating simulations of some communication system, but very often later for easy conversion of a simulated system to a real-world software defined radio.
Here, proving that you're getting the right results is absolutely crucial. After all, this is science, and you can hardly call it progress if you can't document, reproduce and communicate your findings in a way that makes your audience confident in your credibility.

2. GNU Radio has an ever-growing audience of communication applications. This means that people buy hardware and invest a lot of engineering time into developing real-time operational systems.
This leads to a high demand for optimization; GNU Radio already tackles that with approaches like the Vector Optimized Library of Kernels, which greatly increases the performance of specific numeric operations on large sets of data.

The following will try to put my understanding of the state of the art into words, explaining as I move along these topics what I plan for the Measurement Toolbox.

To illustrate, I'll introduce the idea of a purely simulated transceiver with a channel model that defines noise as additive.

To create the BER-over-SNR graph, the user can do a lot of things in GNU Radio:

Build the simulation flow graph, and add instrumentation sinks for BER, as well as GUI input widgets (e.g. sliders) to vary the noise power. Then, change the noise power manually, note down the BER value that is shown in the graphical sink, rinse, repeat. In the end, use the gathered data to generate the graph. Obviously, this is tedious, and a little error prone.

Build the simulation flow graph, equip it with a head block to limit execution duration, use a file sink to write away the BER, and run that flowgraph repeatedly with different noise powers. This can, for example, be done by constructing the flow graph in GRC and executing the resulting python program with varying command line options. This sounds a lot nicer, because we can script that to generate the table of data necessary for graph generation.

Use the same flow graph, but instead of running the program repeatedly from the command line, modify the generated python file to set the noise power. This sounds much more elegant and will increase speed, but we have to realize that now the researcher has not only to define the flow graph, but also to find suitable conditions or hooks to decide when to extract the current BER and set a new noise power. That might not be that hard, but it will lead to flow graphs that no longer match the GRC file, and you will always have to write (or at least copy&paste) update-and-extract code for each simulation you might ever run.
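The scripted variant boils down to a loop like the following sketch; run_once here is a toy stand-in for one execution of the generated program (a real script would instead spawn it with varying command line options and parse the BER from its output file):

```python
import random


def run_once(noise_power, nbits=10000, seed=0):
    """Toy stand-in for one flowgraph execution, returning a BER."""
    rng = random.Random(seed)
    # toy channel: bit-flip probability grows with noise power
    p_flip = min(0.5, noise_power / 10.0)
    errors = sum(rng.random() < p_flip for _ in range(nbits))
    return errors / float(nbits)


# the sweep that would otherwise be done by hand, slider by slider
table = [(noise, run_once(noise)) for noise in (0.1, 0.5, 1.0, 2.0)]
```

The resulting table of (noise power, BER) pairs is exactly the input needed for graph generation.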

In the end, as a user, you don't really want to do this. What you'd want is more along the lines of:

Define some characteristics that should be changed. This could be the aforementioned noise power, but it could also be something less numeric - maybe different impairment models, etc.

Define ranges for these characteristics, and define conditions on which measurement will end, e.g. number of processed samples, running time, total errors occurred, etc. Make that definition flexible yet clear.

Define some properties to observe. Again, this is not only limited to BER, and it might be an item stream, or a message port, it could also be a function to probe or some other metric.

You want your system to collect and keep the data itself. No writing down. No figuring out later that having the average is nice, but you should also have calculated the standard deviation or the like. Storage is cheap. You wanted these characteristics, and we can assume there's not going to be a flood of values, so why not store them all?

Ignore where your measurements will run. A test case with some hundred different parametrizations (e.g. you vary noise power and frequency behaviour of the channel and try to find out how the combination of these effects affects your system) might take some time to run - why not use a room full of computers to run portions of the overall workload in parallel?

Select data to visualize and get graphs. Export these graphs to your favourite graphics format, or paste the data into a file that your colleague might use for his own purposes.

Enter the Measurement Toolbox: Defining these test cases, getting the data out of the running flow graph into a database and extracting it from there later is a core milestone. Integration into the GNU Radio companion and common export formats make the understanding, sharing and publishing easier; as you will see in the next section, the ability to run on remote computers will come for free.

A lot of work is currently going into efforts at speeding up the GNU Radio infrastructure
(VOLK, the gr-trellis GSoC project, ongoing work on the buffer architecture of GNU Radio to allow smarter and directly device-accessible buffers, and a lot more).
This is becoming a more vital part of the GNU Radio project as the limits of realizable systems approach real-world medium access control, GNU Radio is seeing more usage on embedded devices, and GNU Radio's capabilities have reached a point where using GNU Radio-based SDRs is competitive with using hardware transceivers.
The objectives of many GNU Radio users have shifted from building proof-of-concept systems and simulation software to high-throughput systems.

However, to optimize the computational performance (or just performance from here on), one has to know where time is spent. This should happen on different levels:

A component level. In GNU Radio, this means looking at which blocks use how much of the total CPU time.

An algorithmic level. This means that certain implementations should be compared. That implies looking at things like VOLK kernels in isolation, and finding out how well they work.

A systemic level. A computer running a set of GNU Radio blocks is not really executing things fully in parallel without any performance effects: How far does increasing buffer size, and thus latency, increase performance due to en bloc processing? What are the performance hits happening when changing between block threads, each using a part of the CPU caches?

I will say a few words on each of these problems; because things get a little technical here, I assume you have worked with GNU Radio before and know that a block is an instance of a subclass of gr::block, and that its general_work method gets called in order to produce items.

The most interesting question here is: Of the total CPU time used by GNU Radio, how much time does each block consume?

To understand a bit of the problems involved, we must go through a little background:

If you look at GNU Radio, the standard (and only fully featured) scheduler (as
of today!) is the thread-per-block scheduler. What happens here is that the GNU
Radio runtime spawns a thread for each block instance your flowgraph uses.

Initially, these threads are in a waiting state, because none of them have
input, which means that aside from the sources, none can do anything. The
source blocks' threads are notified, and each block_executor calls its
block's general_work, supplying the amount of available space in the output
buffers as noutput_items (or at least a reasonable portion of it).

Having finished its work, the block then notifies the downstream blocks' threads
that things have changed; and since the whole output buffer is free, the
downstream blocks can start working away. The moment they finish, GNU Radio
knows how many items they consumed, freeing space for items produced by the
sources.

Thus, as soon as the first items have propagated from source to sink, the whole
flowgraph is running at maximum throughput: While a downstream block is still
processing data, an upstream block might already be producing more items.
The more CPU cores you have, the more blocks can execute in parallel -- as long as
there are no bottlenecks.

As soon as there is a single point of congestion, as soon as buffers upstream of that
block are full, and as soon as items from the downstream buffers got completely emptied,
our application becomes single-threaded.

So obviously, there was big interest in finding out how blocks interact and how much time they
spend computing. Sadly, performance measurements in multi-threaded applications are quite non-trivial
if not done from within.

Thus, ControlPort was introduced. It offers the ability to get performance counters (along with other
control information) out of the scheduler framework using RPC calls.
Aside from some installation hassle, it works quite nicely - and that's what makes it so powerful.

Together with the UIs for ControlPort, one is able to understand where CPU cycles go when executing a whole flowgraph.
This alone is vastly useful; bundled together with the ability to evaluate algorithmic performance, it went into an automatable framework: gr-benchmark.

gr-benchmark can execute a flowgraph and collect performance data; it can condense this into JSON files and even has an upload automatism to submit things to http://stats.gnuradio.org for analysis.
What it lacks is the ability to run GNU Radio applications under an easy-to-define set of conditions, integrate well with other tools, and automatically distribute workload: Issues I plan to tackle.

As soon as a bottleneck has been identified, developers may strive to optimize the computations involved.
Digital signal processing in general and GNU Radio in particular leave great room for optimization: Often, vast amounts of high-bandwidth data have to be processed under near real-time conditions,
and often, mathematically basic operations (such as convolutions and dot products) leave much space for optimization.

Luckily, modern hardware offers means to improve performance: from cached RAM through preemptive computation up to rather massive SIMD instructions. Compilers also try to optimize as much as possible, and assembler code often looks massively different from what the source code would dictate if compiled verbatim.

GNU Radio tries to make use of these features by employing kernels. These are implementations of an algorithm, either the plain-C++ implementation or one employing hardware optimization through certain compiler intrinsics. This means that for each VOLK protokernel, there are multiple realizations of the same functionality: one that should run on every platform, and a specialized one for each supported hardware type.

GNU Radio comes with a tool, volk_profile, that executes every kernel implementation (i.e. the generic C++ one and all machine-specific implementations that could be compiled for the current machine) and measures the time they consume, to find the optimum kernel for each functionality.
This leads to interesting results: Sometimes, the generic implementation outperforms all optimized versions on an SSE4 machine; on other SSE4 enabled CPU models, things are different.

This shows that testing optimized code on various machines is a necessity. gr-benchmark is capable of uploading VOLK performance data; however, this is less helpful when developing kernels, and will only point at the need for improvement once kernels have been widely deployed.

As a result, I specified that the Measurement Toolbox will be able to distribute the same benchmark to a list of hosts, each sporting a different machine type. Results must be gathered centrally.

Also, performance of vector kernels might be quite sensitive to things like cache performance and cache usage (possibly leading to cache misses and the like). Getting a little more statistical background (primarily standard deviation) on performance measurements might prove handy when optimizing systems.