Reblogged from the Lanedo GmbH blog: Would you want to invest hours or days into automake logic without a use case? For two of the last software releases I did, I was facing this question. Let me give a bit of background. Recently the documentation generation for Beast and Rapicorn fully switched over to Doxygen. [...]

During a conference some while ago, Jacob Appelbaum gave a talk on the usefulness of the Tor project, allowing you to browse anonymously, liberating speech online, enabling web access in censored countries, etc. Jacob described how the anonymizing Tor network consists of many machines world wide that use encryption and run the Tor software, which are [...]

In the last few days I finished reading the “Black Swan” by Nassim Nicholas Taleb. Around last January I saw Günther Palfinger mentioning it in my G+ stream looked it up and bought it.

At first, the book seemed to present some interesting ideas on error statistics and the first 20 or 30 pages are giving good examples for conscious knowledge we posses but don’t apply in every day actions. Not having a trading history like the author, I found reading further until around page 100 to be a bit of a drag. Luckily I kept on, because after that Taleb started to finally get interesting for me.

Once upon a time…
One of the lectures I attended at university touched on black box analysis (in the context of modelling and implementation for computer programs). At first of course the usual and expected or known input/output behavior is noted, e.g. calculus it may perform or pattern recognition or any other domain specific function. But in order to find out hints about how it’s implemented, short of inspecting the guts which a black box won’t allow for, one needs to look at error behavior. I.e. examine the outputs in response to invalid/undefined/distorted/erroneous/unusual inputs and assorted response times. For a simple example, read a sheet of text and start rotating it while you continue reading. For untrained people, reading speed slows down as the rotation angle increases, indicating that the brain engages in counter rotation transformations which are linear in complexity with increasing angles.

At that point I started to develop an interest in error analysis and research around that field, e.g. leading to discoveries like the research around “error-friendliness” in technological or biological systems or discoveries of studies on human behavior which implies corollaries like:

Displaying heuristic behavior, people must make errors by design. So trying to eliminate or punish all human error is futile, aiming for robustness and learning from errors instead is much better.

Perfectionism is anti-evolutionary, it is a dead end not worth striving for. For something “perfect” lacks flexibility, creativity, robustness and cannot be improved upon.

A Black Swan?
Now “Black Swan” defines the notion of a high-impact, low-probability event, e.g. occurring in financial trading, people’s wealth or popularity – events from an extreme realm. That’s in contrast to normally distributed encounters like outcomes of a dice game, people’s body size or the number of someone’s relatives – encounters from a mediocre realm.

From Mediocre…
Here’s a short explanation for the mediocre realms. Rolling a regular dice will never give a number higher than 6 no matter how often it’s thrown. In fact, the more it’s thrown, the more even it’s numbers are distributed and the clearer its average emerges. Measuring people’s weight or number of relatives shows a similar pattern to throwing a dice, the more measurements are encountered the more certain the average becomes. Any new encounter is going to have lesser and lesser impact on the average of the total as the number of measurements increases.

To Extreme…
On the other hand there are the extreme realms. In trading or wealth or popularity, a single encounter can outweigh the rest of the distribution by several orders of magnitude. Most people have an annual income of less than $100k, but the tiny fraction of society that earns more in annual income possesses more than 50% of the entire distribution of wealth. A similar pattern exists with popularity, only very few people are so popular that they’re known by hundreds of thousands or maybe millions of people. But only very very few people are super popular so they’re known by billions. Averaging over a given set only works for so long, until a high-impact “outlier” is encountered that dominates the entire distribution. Averaging the popularity of hundreds of thousands of farmers, industrial workers or local mayors cannot account for the impact on the total popularity distribution by the encounter of a single Mahatma Gandhi.

On ErrorsTaleb is spending a lot of time in the book on condemning the application of the Gauss distribution in fields that are prone to extreme encounters especially economics. Rightfully so, but I would have enjoyed learning more about examples of fields that are from the extreme realms and not widely recognized as such. The crux of the inapplicability of the Gauss distribution in the extreme realms lies in two things:

Small probabilities are not accurately computable from sample data, at least not accurately enough to allow for precise decision making. The reason is simple, since the probabilities of rare events are very small, there simply cannot be enough data present to match any distribution model with high confidence.

Rare events that have huge impact, enough impact to outweigh the cumulative effect of all other distribution data, are fundamentally non-Gaussian. Fractal distributions may be useful to retrofit a model to such data, but don’t allow for accurate predictability. We simply need to integrate the randomness and uncertainty of these events into our decision making process.

Aggravation in the Modern Age
Now Taleb very forcefully articulates what he thinks about economists applying mathematical tools from the mediocre realms (Gauss distribution, averaging, disguising uncertain forecasts as “risk measurements”, etc) to extreme realm encounters like trade results and if you look for that, you’ll find plenty of well pointed criticism in that book. But what struck me as very interesting and a new excavation in an analytical sense is that our trends towards globalisation and high interconnectedness which yield ever growing and increasingly bigger entities (bigger corporations, bigger banks, quicker and greater popularity, etc) are building up the potential for rare events to have higher and higher impacts. E.g. an eccentric pop song can make you much more popular these days on the Internet than TV could do for you 20 years ago. A small number of highly interconnected banks these days have become so big that they “cannot be allowed to fail”.

We are all Human
Considering how humans are essentially functioning as heuristic and not precise systems (and for good reasons), every human inevitably will commit mistakes and errors at some point and to some lesser or larger degree. Now admitting we all error once in a while, exercising a small miscalculation during grocery shopping, buying a family house, budgeting a 100 people company, leading a multi-million people country or operating a multi-trillion currency reserve bank has of course vastly different consequences.

What really got me
So the increasing centralisation and increasing growth of giant entities ensures that todays and future miscalculations are disproportionally exponentiated. In addition, use of the wrong mathematical tools ensures miscalculations won’t be small, won’t be rare, their frequency is likely to increase.

Notably, global connectedness alerts the conditions for Black Swan creation, both in increasing frequency and increasing impact whether positive or negative. That’s like our modern society is trying to balance a growing upside down pyramid of large, ever increasing items on top of its head. At some point it must collapse and that’s going to hurt, a lot!

Take Away
The third edition of the book closes with essays and commentary that Taleb wrote after the the first edition and in response to critics and curios questions. I’m always looking for relating things to practical applications, so I’m glad I got the third edition and can provide my personal highlights to take away from Taleb’s insights:

Avoid predicting rare eventsThe frequency of rare events cannot be estimated from empirical observation because of their very rareness (i.e. calculation error margin becomes too big). Thus the probability of high impact rare events cannot be computed with certainty, but because of the high impact it’s not affordable to ignore them.

Focus on impact but not probability
It’s not useful to focus on the probability of rare events since that’s uncertain. It’s useful to focus on the potential impact instead. That can mean to identify hidden risks or to invest small efforts to enable potentially big gains. I.e. always consider the return-on-investment ratio of activities.

Rare events are not alike (atypical)
Since probability and accurate impact of remote events are not computable, reliance on rare impacts of specific size or around specific times is doomed to fail you. Consequently, beware of others making related predictions and/or others relying them.

Strive for variety in your endeavorsAvoiding overspecialization, learning to love redundancy as well as broadening one’s stakes reduces the effect any single “bad” Black Swan event can have (increases robustness) and variety might enable some positive Black Swan events as well.

What’s next?
The Black Swan idea sets the stage for further investigations, especially investigation of new fields for applicability of the idea. Fortunately, Nassim Taleb continues his research work and has meanwhile published a new book “Antifragile – Things that Gain from Disorder”. It’s already lying next to me while I’m typing and I’m happily looking forward to reading it.

The notion of incomputable rare but consequential events or “errors” is so ubiquitous that many other fields should benefit from applying “Black Swan”- or Antifragile-classifications and corresponding insights. Nassim’s idea to increase decentralization on the state level to combat escalation of error potentials at centralized institutions has very concrete applications at the software project management level as well. In fact the Open Source Software community has long benefited from decentralized development models and through natural organization avoided giant pitfall creation that occur with top-down waterfall development processes.

Algorithms may be another field where the classifications could be very useful. Most computer algorithm implementations are fragile due to high optimization for efficiency. Identifying these can help in making implementations more robust, e.g. by adding checks for inputs and defining sensible fallback behavior in error scenarios. Identifying and developing new algorithms with antifragility in mind should be most interesting however, good examples are all sorts of caches (they adapt according to request rates and serve cached bits faster), or training of pattern recognition components where the usefulness rises and falls with the variety and size of the input data sets.

Conclusion
The book “Black Swan” is definitely a highly recommended read. However make sure you get the third edition that has lots of very valuable treatment added on at the end, and don’t hesitate to skip a chapter or two if you find the text too involved or side tracking every once in a while. Taleb himself gives advice in several places in the third edition about sections readers might want to skip over.

Have you read the “Black Swan” also or heard of it? I’d love to hear if you’ve learned from this or think it’s all nonsense. And make sure to let me know if you’ve encountered Black Swans in contexts that Nassim Taleb has not covered!

About

Determine candidates and delete from a set of directories containing aging backups. As a follow up to the release of sayebackup.sh last December, here’s a complimentary tool we’re using at Lanedo. Suppose a number of backup directories have piled up after a while, using sayebackup.sh or any other tool that creates time stamped file names:

Which file should be deleted once the backup device starts to fill up? Sayepurge parses the timestamps from the names of this set of backup directories, computes the time deltas, and determines good deletion candidates so that backups are spaced out over time most evenly. The exact behavior can be tuned by specifying the number of recent files to guard against deletion (-g), the number of historic backups to keep around (-k) and the maximum number of deletions for any given run (-d). In the above set of files, the two backups from 2011-07-07 are only 6h apart, so they make good purging candidates, example:

For day to day use, it makes sense to use both tools combined e.g. via crontab. Here’s a sample command to perform daily backups of /etc/ and then keep 6 directories worth of daily backups stored in a toplevel directory for backups:

Resources

Usage

Usage: sayepurge.sh [options] sources...
OPTIONS:
--inc merge incremental backups
-g <nguarded> recent files to guard (8)
-k <nkeeps> non-recent to keep (8)
-d <maxdelet> maximum number of deletions
-C <dir> backup directory
-o <prefix> output directory name (default: 'bak')
-q, --quiet suppress progress information
--fake only simulate deletions or merges
-L list all backup files with delta times
DESCRIPTION:
Delete candidates from a set of aging backups to spread backups most evenly over time, based on time stamps embedded in directory names. Backups older than <nguarded> are purged, so that only <nkeeps> backups remain. In other words, the number of backups is reduced to <nguarded> + <nkeeps>, where <nguarded> are the most recent backups. The puring logic will always pick the backup with the shortest time distance to other backups. Thus, the number of <nkeeps> remaining backups is most evenly distributed across the total time period within which backups have been created. Purging of incremental backups happens via merging of newly created files into the backups predecessor. Thus merged incrementals may contain newly created files from after the incremental backups creation time, but the function of reverse incremental backups is fully preserved. Merged incrementals use a different file name ending (-xinc).

See Also

Performance of a C++11 Signal System

First, a quick intro for for the uninitiated, signals in this context are structures that maintain a lists of callback functions with arbitrary arguments and assorted reentrant machinery to modify the callback lists and calling the callbacks. These allow customization of object behavior in response to signal emissions by the object (i.e. notifying the callbacks by means of invocations).

Over the years, I have rewritten each of GtkSignal, GSignal and Rapicorn::Signal at least once, but most of that is long a time ago, some more than a decade. With the advent of lambdas, template argument lists and std::function in C++11, it became time for me to dive into rewriting a signal system once again.

So for the task at hand, which is mainly to update the Rapicorn signal system to something that fits in nicely with C++11, I’ve settled on the most common signal system requirements:

Signals need to support arbitrary argument lists.

Signals need to provide single-threaded reentrancy, i.e. it must be possible to connect and disconnect signal handlers and re-emit a signal while it is being emitted in the same thread. This one is absolutely crucial for any kind of callback list invocation that’s meant to be remotely reliable.

Signals should support non-void return values (of little importance in Rapicorn but widely used elsewhere).

Signals can have return values, so they should support collectors (i.e. GSignal accumulators or boost::signal combiners) that control which handlers are called and what is returned from the emission.

Signals should have only moderate memory impact on class instances, because at runtime many instances that support signal emissions will actually have 0 handlers connected.

For me, the result is pretty impressive. With C++11 a simple signal system that fullfils all of the above requirements can be implemented in less than 300 lines in a few hours, without the need to resort to any preprocessor magic, scripted code generation or libffi.

I say “simple”, because over the years I’ve come to realize that many of the bells and whistles as implemented in GSignal or boost::signal2 don’t matter much in my practical day to day programming, such as the abilities to block specific signal handlers, automated tracking of signal handler argument lifetimes, emissions details, restarts, cancellations, cross-thread emissions, etc.

Beyond the simplicity that C++11 allows, it’s of course the performance that is most interesting. The old Rapicorn signal system (C++03) comes with its own set of callback wrappers named “slot” which support between 0 and 16 arguments, this is essentially mimicking std::function. The new C++11 std::function implementation in contrast is opaque to me, and supports an unlimited number of arguments, so I was especially curious to see the performance of a signal system based on it.

I wrote a simple benchmark that just measures the times for a large number of signal emissions with negligible time spent in the actual handler.

I.e. the signal handler just does a simple uint64_t addition and returns. While the scope of this benchmark is clearly very limited, it serves quite well to give an impression of the overhead associated with the emission of a signal system, which is the most common performance relevant aspect in practical use.

Without further ado, here are the results of the time spent per emission (less is better) and memory overhead for an unconnected signal (less is better):

Signal System

Emit() in nanoseconds

Static Overhead

Dynamic Overhead

GLib GSignal

341.156931ns

0

0

Rapicorn::Signal, old

178.595930ns

64

0

boost::signal2

92.143549ns

24

400 (=265+7+8*16)

boost::signal

62.679386ns

40

392 (=296+6*16)

Simple::Signal, C++11

8.599794ns

8

0

Plain Callback

1.878826ns

-

-

Here, “Plain Callback” indicates the time spent on the actual workload, i.e. without any signal system overhead, all measured on an Intel Core i7 at 2.8GHz. Considering the workload, the performance of the C++11 Signals is probably close to ideal, I’m more than happy with its performance. I’m also severely impressed with the speed that std::function allows for, I was originally expecting it to be at least a magnitude larger.

The memory overhead gives accounts on a 64bit platform for a signal with 0 connections after its constructor has been called. The “static overhead” is what’s usually embedded in a C++ instance, the “dynamic overhead” is what the embedded signal allocates with operator new in its constructor (the size calculations correspond to effective heap usage, including malloc boundary marks).

The reason GLib’s GSignal has 0 static and 0 dynamic overhead is that it keeps track of signals and handlers in a hash table and sorted arrays, which only consume memory per (instance, signal, handler) triplet, i.e. instances without any signal handlers really have 0 overall memory impact.

Summary:

If you need inbuilt thread safety plus other bells and can spare lots of memory per signal, boost::signal2 is the best choice.

For tight scenarios without any spare byte per instance, GSignal will treat your memory best.

If you just need raw emission speed and can spare the extra whistles, the C++11 single-file simplesignal.cc excels.

For the interested, the brief C++11 signal system implementation can be found here: simplesignal.ccThe API docs for the version that went into Rapicorn are available here: aidasignal.hh

PS: In retrospect I need to add, this day and age, the better trade-off for Glib could be one or two pointers consumed per instance and signal, if those allowed emission optimizations by a factor of 3 to 5. However, given its complexity and number of wrapping layers involved, this might be hard to accomplish.

About

Due to popular request, I’m putting up a polished version of the backup script that we’ve been using over the years at Lanedo to backup our systems remotely. This script uses a special feature of rsync(1) v2.6.4 for the creation of backups which share storage space with previous backups by hard-linking files.The various options needed for rsync and ssh to minimize transfer bandwidth over the Internet, time-stamping for the backups and handling of several rsync oddities warranted encapsulation of the logic into a dedicated script.

Usage

Usage: sayebackup.sh [options] sources...OPTIONS: --inc make reverse incremental backup --dry run and show rsync with --dry-run option --help print usage summary -C <dir> backup directory (default: '.') -E <exclfile> file with rsync exclude list -l <account> ssh user name to use (see ssh(1) -l) -i <identity> ssh identity key file to use (see ssh(1) -i) -P <sshport> ssh port to use on the remote system -L <linkdest> hardlink dest files from <linkdest>/ -o <prefix> output directory name (default: 'bak') -q, --quiet suppress progress information -c perform checksum based file content comparisons --one-file-system -x disable crossing of filesystem boundaries --version script and rsync versionsDESCRIPTION: This script creates full or reverse incremental backups using the rsync(1) command. Backup directory names contain the date and time of each backup run to allow sorting and selective pruning. At the end of each successful backup run, a symlink '*-current' is updated to always point at the latest backup. To reduce remote file transfers, the '-L' option can be used (possibly multiple times) to specify existing local file trees from which files will be hard-linked into the backup.Full Backups: Upon each invocation, a new backup directory is created that contains all files of the source system. Hard links are created to files of previous backups where possible, so extra storage space is only required for contents that changed between backups.Incremental Backups: In incremental mode, the most recent backup is always a full backup, while the previous full backup is degraded to a reverse incremental backup, which only contains differences between the current and the last backup.RSYNC_BINARY Environment variable used to override the rsync binary path.

See Also

For a while now, I’ve been maintaining my todo lists as backlogs in a Mediawiki repository. I’m regularly deriving sprints from these backlogs for my current task lists. This means identifying important or urgent items that can be addressed next, for really huge backlogs this can be quite tedious.

A SpecialPage extension that I’ve recently implemented now helps me through the process. Using it, I’m automatically getting a filtered list of all “IMPORTANT:”, “URGENT:” or otherwise classified list items. The special page can be used per-se or via template inclusion from another wiki page. The extension page at mediawiki.org has more details.

Like every year, I am driving to Berlin this week to attend LinuxTag 2012 to attend the excellent program. If you want to meet up and chat about projects, technologies, Free Software or other things, send me an email or leave a comment with this post and we will arrange for it.

About Testbit Tools The ‘Testbit Tools’ package contains tools proven to be useful during the development of several Testbit and Lanedo projects. The tools are Free Software and can be redistributed under the GNU GPLv3+.

This release features the addition of buglist.py, useful to aid in report and summary generation from your favorite bugzilla.

What’s this? Wikihtml2man is an easy to use converter that parses HTML sources, normally originating from a Mediawiki page, and generates Unix Manual Page sources based on it (also referred to as html2man or wiki2man converter). It allows developing project documentation online, e.g. by collaborating in a wiki. It is released as free software under the GNU GPLv3. Technical details are given in its manual page: Wikihtml2man.1.

Why move documentation online? Google turns up a few alternative implementations, but none seem to be designed as a general purpose tool. With the ubiquituous presence of wikis on the web these days and the ease of content authoring they provide, we’ve decided to move manual page authoring online for the Beast project. Using Mediawiki, manual pages turn out to be very easily created in a wiki, all that’s then needed is a backend tool that can generate Manual Page sources from a wiki page. Wikihtml2man provides this functionality based on the HTML generated from wiki pages, it can convert a prerendered HTML file or download the wiki page from a specific URL. HTML has been choosen as input format to support arbitrary wiki features like page inclusion or macro expansion and to potentially allow page generation from other wikis than MediaWiki. Since wikihtml2man is based purely on HTML input, it is of course also possible to write the Manual Page in raw HTML, using tags such as h1, strong, dt, dd, li, etc, but that’s really much less convenient to use than a regular wiki engine.

What are the benefits? For Beast, the benefits of moving some project documentation into an online wiki are:

We increase editability by lifting review requirements.

We are getting quicker edit/view turnarounds, e.g. through use of page preview functionality in wikis.

We allow assimilation of user contributions from non-programmers for our documentation.

Easier editability may lead to richer documentation and possibly better/frequently updated documentation.

What are the downsides? We have only recently moved our pages online and still need to gather some experience with the process. So far possible downsides we see are:

Sources and documentation can more easily get out of sync if they don’t reside in the same tree. We hope to be mitigating this by increasing documentation update frequencies.

Confusion about revision synchronization, with the source code using a different versioning system than the online wiki. We are currently pondering automated re-integration into the tree to counteract this problem.

How to use it? Here’s wikihtml2man in action, converting its own manual page and rendering it through man(1):

Have feedback or questions? If you can put wikihtml2man to good use, have problems running it or other ideas about it, feel free to drop me a line about it. Alternatively you can also add your feedback and any feature requests to the Feature Requests page (a forum will be created if there’s any actual demand).

What’s related? We would also like to hear from other people involved in projects that are using/considering wikis to build production documentation online (e.g. in manners similar to Wikipedia). So feel free to leave a comment about your project if you do something similar.