Reservoir sampling is a great random sampling algorithm every data engineer should know. It extracts a random sample of a specified size from a large, unbounded dataset: one you cannot pull into memory all at once, and whose final size you don't know while taking the sample.

Over on Wikipedia you can check out a nice explanation of the idea behind reservoir sampling. It presents the simplest reservoir sampling algorithm. Let me regurgitate it in C#:
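Something along these lines (a minimal sketch of Algorithm R; the class and method names are mine, not from the Wikipedia article):

```csharp
using System;
using System.Collections.Generic;

public static class Sampling
{
    // Algorithm R: maintain a reservoir of k items over a stream of unknown length.
    public static List<T> ReservoirSample<T>(IEnumerable<T> stream, int k, Random rng)
    {
        var reservoir = new List<T>(k);
        int i = 0;
        foreach (var item in stream)
        {
            if (i < k)
            {
                // The first k items are taken with absolute certainty.
                reservoir.Add(item);
            }
            else
            {
                // Item i+1 replaces a random reservoir slot with probability k / (i + 1).
                int j = rng.Next(i + 1); // uniform in [0, i]
                if (j < k)
                    reservoir[j] = item;
            }
            i++;
        }
        return reservoir;
    }
}
```

Note the single pass and O(k) memory: the stream is never materialised, which is the whole point when the data won't fit in memory.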

To some this algorithm may seem botched, as the last items have the lowest likelihood of being selected at their step. However, the overall likelihood of selection is distributed evenly across the entire dataset. This is because the items that are more likely to be selected early on (or selected with absolute certainty, for the first k samples, where k = sample count) are also more likely to be replaced by a successor sample later.

Jeffrey Scott Vitter has published some alternative implementations that optimise the performance of the sampling.

It was about time I geeked it up and set up a DIY media centre for my home. I decided to go for the cheap and cheerful hardware option to power my media centre: the Raspberry Pi. Only £25 (for the actual computer)! This post shares part 1 of my journey of setting up a Raspberry Pi media centre: from the research to the implementation and retrospection. I'm very happy with the results.

Choosing the SD Card

After countless hours researching on the internet which SD cards work best with the Raspberry Pi, I finally reached a decision: the “SanDisk 16GB 45MB/s Extreme SDHC Card (SDSDX-016G-X46)”. Most forums and guides will say you should spend the extra cash on Class 10 cards over lower-class cards, as this will give you a nice (and much needed) performance boost all round. But choosing the cards with the highest read/write speeds doesn’t necessarily mean they will outperform lower-speed cards during the typical workload of a Raspberry Pi session. Some key points to keep in mind when deciding on an SD card:

Read/write speed ratings on SD cards refer to sequential throughput. This is great if you think you will be running applications that load/save large chunks of data on the SD card (e.g. reading videos from the SD card).

My media centre doesn’t need to perform these types of operations on the SD card: like most people’s, the videos for my media centre will reside on external storage (SD cards have limited space).

Typically computers will randomly access the storage device hosting the OS and personal document space (especially when reading/writing to the swap space). So what we really want to look for is throughput based on random access tests.

The Raspberry Pi’s SD interface limits sequential read/write speeds anyway. From my own research, the max read/write throughput is about 20MB/s on the Pi.

Cheaply manufactured SD cards aren’t built for continuous random reads/writes, and so they tend to break (there are some tricks to increase the lifetime of your SD card by configuring the OS, covered in part 2 of this series).

My personal choice was the “SanDisk 16GB 45MB/s Extreme SDHC Card (SDSDX-016G-X46)”, because it ticked all the boxes: UHS-1 and Class 10; 16GB, which is more than enough space; and compatibility (tested OK with both Raspbian and OpenELEC). Plus I stumbled on this (recent enough) great benchmark review with Android: http://www.androidcentral.com/sd-card-showdown-testing-sandisk-extreme

Peripherals

I outlawed buying a conventional PC keyboard and mouse for the Raspberry Pi because A) it would detract from its home-media-centre feel, B) it would look lame, and C) it would be just too inconvenient to handle a keyboard and mouse on a couch. So I turned to all-in-one wireless keyboard and trackpad devices.

The Rii Mini just seemed too small, and a lot of reviews complained about its poor range (it can struggle to get more than 1m in some cases). The K400 has a whopping 10m range and good reviews (working with the Raspberry Pi). The K400’s size is also perfect: smaller than a conventional computer keyboard but not too small.

Power and USB

The Raspberry Pi requires at least 700mA of power to run. But once you start attaching other USB devices that don’t provide their own power source, you won’t have enough power to run your rig. What is the solution? Buy a USB hub powered using its own power source. Some key points to think about while you shop for your USB Hub:

You can use the USB hub to power your Raspberry Pi so you only have a single power plug for the whole rig. This is a nice and tidy option. If you choose to power the Pi via the USB then consider:

Ensure the power adaptor provides enough amps for the pi + all additional devices.

Keep away from USB hubs that supply back-voltage. These hubs back-power the Pi by supplying power through the connected USB port. This is dangerous because, unlike the micro-USB power input on the Pi, the normal-sized USB ports are not protected by a fuse. So if your USB hub supplies an upstream back-voltage, you need to ensure there is some sort of fuse between the USB hub and the Pi, otherwise your Pi will fry.

The PiHub is a great option because it’s backed by the Raspberry Pi Foundation (no worries about compatibility and quality), the proceeds go toward the Foundation’s development, and it powers the Pi plus all USB devices with a single beefy power adaptor.

I didn’t choose the PiHub because it’s damn ugly. Too much of an eyesore for my lounge. In the end I went for the Plugable 7-port USB 2.0 hub, because of its ample power supply, its copious positive reviews with the Raspberry Pi, its less offensive looks, and the absence of back-voltage on the upstream connection.

Also, since I already owned a Pi power adaptor, I kept using it to power the Pi (so I have two power supplies for my rig). This is probably a bit overkill, but there is practically no risk of under-supplying the Pi when adding extra devices to my USB hub. All the cords from my Pi are hidden at the back of the TV, so the extra cord clutter is out of view.

Interval trees are an efficient ADT for storing and searching intervals. It’s an ADT that probably doesn’t make the top 10 most commonly used collections in computer science. I often see code where a list/array-like structure is used to store and search interval data. Sometimes that is fine – even preferred for the sake of simplicity – but only in cases where the code is rarely run and doesn’t have to handle large volumes of data.

My C# implementation of the IntervalTree has the following features:

It uses generics.

It is mutable.

It is backed by a self-balancing BST (AVL).

It is XML- and binary-serializable.

It supports duplicate intervals.

Originally I wrote the tree so that the end point selector was simply a lambda. However, lambdas are not serializable, so I changed the selector to an interface (and implemented my own XML serialization methods to handle the interface).
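To illustrate the core idea behind an augmented-BST interval tree, here is a stripped-down, unbalanced sketch over int intervals (my actual implementation adds AVL balancing, generics, duplicates and serialization; all names here are illustrative):

```csharp
using System;
using System.Collections.Generic;

// One BST node per interval, ordered by low endpoint, augmented with the
// maximum high endpoint found anywhere in its subtree.
class IntervalNode
{
    public int Low, High, Max;
    public IntervalNode Left, Right;
    public IntervalNode(int low, int high) { Low = low; High = high; Max = high; }
}

class SimpleIntervalTree
{
    private IntervalNode root;

    public void Add(int low, int high) { root = Insert(root, low, high); }

    private static IntervalNode Insert(IntervalNode node, int low, int high)
    {
        if (node == null) return new IntervalNode(low, high);
        if (low < node.Low) node.Left = Insert(node.Left, low, high);
        else node.Right = Insert(node.Right, low, high);
        node.Max = Math.Max(node.Max, high); // maintain the subtree max endpoint
        return node;
    }

    // Collect all stored intervals overlapping [low, high].
    public List<(int Low, int High)> Search(int low, int high)
    {
        var results = new List<(int, int)>();
        Search(root, low, high, results);
        return results;
    }

    private static void Search(IntervalNode node, int low, int high,
                               List<(int, int)> results)
    {
        if (node == null || node.Max < low) return; // no overlap possible below here
        Search(node.Left, low, high, results);
        if (node.Low <= high && low <= node.High) results.Add((node.Low, node.High));
        if (node.Low <= high) Search(node.Right, low, high, results);
    }
}
```

The Max annotation is what lets the search prune whole subtrees, giving the logarithmic average-case query the list/array approach lacks.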

If your interval collections can afford to be immutable, then a better-performing solution is a centered interval tree. Although the centered interval tree accomplishes the same average complexity as the augmented BST, tree construction is generally faster.

In your quest to find out how you can support XML serialization for types that contain interfaces, you may often find yourself coming to the same answer: you cannot serialize interfaces. That is true, but you can work around it, and I will present two methods.
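To give a flavour of the kind of workaround involved (this is a sketch with made-up types, and not necessarily either of the two methods verbatim): you can hide the interface-typed member from the serializer and expose a concrete-typed proxy property in its place.

```csharp
using System;
using System.IO;
using System.Xml.Serialization;

public interface IShape { double Area(); }

public class Circle : IShape
{
    public double Radius { get; set; }
    public double Area() => Math.PI * Radius * Radius;
}

public class Drawing
{
    // XmlSerializer cannot serialize the interface-typed member directly...
    [XmlIgnore]
    public IShape Shape { get; set; }

    // ...so we expose a concrete-typed proxy property it can handle.
    // (Illustrative only: this flattens the design to one known concrete type.)
    [XmlElement("Shape")]
    public Circle ShapeForXml
    {
        get => Shape as Circle;
        set => Shape = value;
    }
}
```

The obvious cost is that the proxy pins the member to one concrete type, which is exactly the kind of implication discussed below.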

Of course this has several implications. For example, instantiating these types becomes more complicated and could lead to trickier wiring/construction problems. Another point is that the class itself doesn’t guarantee it is XML-serializable: if the class is declared with a type (T) of an interface (or a non-serializable class), then it won’t be XML-serializable. But this should be nothing to be surprised about: the .NET Framework’s XML-serializable generic types (e.g. List<T>) behave like this too.

Today’s boredom led me to solving another Programming Praxis problem:

Search every power of two below 2^10000 and return the index of the first power of two in which a target string appears. For instance, if the target is 42, the correct answer is 19 because 2^19 = 524288, in which the target 42 appears as the third and fourth digits, and no smaller power of two contains the string 42.

The naive solution is simple: keep doubling a number (starting at 1) until you find a digit sequence that contains the target. Of course, once you get past 2^64, storing the number as a primitive type will not suffice, and you need a way to store larger numbers.
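The naive approach might look like this (a sketch using System.Numerics.BigInteger for the arbitrary-precision part; method and class names are mine):

```csharp
using System;
using System.Numerics;

public static class Powers
{
    // Naive search: keep doubling until the decimal digits contain the target.
    // BigInteger handles the powers beyond the range of primitive types.
    public static int FirstPowerContaining(string target)
    {
        BigInteger power = 1;
        for (int exponent = 0; ; exponent++)
        {
            if (power.ToString().Contains(target))
                return exponent;
            power *= 2;
        }
    }
}

// Powers.FirstPowerContaining("42") returns 19, since 2^19 = 524288.
```

The repeated ToString() on ever-larger numbers is what makes this approach slow, motivating the cache below.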

I improved on the naive approach by caching the number sequences for each exponent. For the cache I used a type of suffix tree:
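A sketch of that digit trie (illustrative; my actual code differs in detail):

```csharp
// Each node has one child slot per decimal digit, and records the smallest
// exponent whose power of two contains the digit sequence ending here.
class DigitTrieNode
{
    public DigitTrieNode[] Children = new DigitTrieNode[10];
    public int SmallestExponent = int.MaxValue;
}

class DigitTrie
{
    private readonly DigitTrieNode root = new DigitTrieNode();

    // Insert every suffix of the power's digit string, annotating nodes with
    // the smallest exponent seen (insert powers in increasing exponent order).
    public void AddNumber(string digits, int exponent)
    {
        for (int start = 0; start < digits.Length; start++)
        {
            var node = root;
            for (int i = start; i < digits.Length; i++)
            {
                int d = digits[i] - '0';
                if (node.Children[d] == null) node.Children[d] = new DigitTrieNode();
                node = node.Children[d];
                if (exponent < node.SmallestExponent) node.SmallestExponent = exponent;
            }
        }
    }

    // Smallest cached exponent whose digits contain target, or -1 if absent.
    public int Lookup(string target)
    {
        var node = root;
        foreach (char c in target)
        {
            node = node.Children[c - '0'];
            if (node == null) return -1;
        }
        return node.SmallestExponent;
    }
}
```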

Unlike a traditional suffix tree, this one does not compress string sequences. It’s really just a trie, where the digit at each position is stored implicitly as the index into Children (i.e. these arrays are of length 10). However, the data structure is populated and accessed like a suffix tree: each suffix of a number sequence is inserted into the tree. Each node in the tree is annotated with the smallest exponent that the sequence can be found in.

Today at work I broke some of my team project’s unit tests with a seemingly harmless code change (C#). I simply changed a protected member into an auto-property. Unfortunately the change was bundled with other changes, which any innocent coder would have assumed were to blame. But it was the small, innocent (almost cosmetic) change, carried out with a single click in ReSharper. The tests blew up because there was an important piece of code (somewhere) that used reflection to search the class for non-public instance member variables and collate the members that implemented a specific interface. What is this!? Some silent protocol?? How fragile!
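To make the failure mode concrete, here is a sketch of the kind of scan that might have been involved (the real code differed; all names here are mine):

```csharp
using System;
using System.Linq;
using System.Reflection;

public interface ITracked { }

public class TrackedState : ITracked { }

public class Widget
{
    // A protected field: found by the scan below.
    protected ITracked state = new TrackedState();

    // Had this been an auto-property instead, its compiler-generated backing
    // field would be private, and a protected-fields-only scan would miss it.
}

public static class Scanner
{
    // Collect protected instance fields whose type implements ITracked —
    // a silent protocol that breaks when a field becomes an auto-property.
    public static ITracked[] FindTracked(object target) =>
        target.GetType()
              .GetFields(BindingFlags.Instance | BindingFlags.NonPublic)
              .Where(f => f.IsFamily && typeof(ITracked).IsAssignableFrom(f.FieldType))
              .Select(f => (ITracked)f.GetValue(target))
              .ToArray();
}
```

Nothing in the type system connects Widget to Scanner, which is exactly why a refactoring tool cannot warn you.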

Reflection is a powerful tool that has blessed us with many awesome, easy-to-use APIs. But clearly it is not suitable for solving all problems. So when should we use reflection? And when should we avoid it? Here are a few common pros and cons that tend to crop up around the topic of reflection (this is by no means a comprehensive list!):

Good reflection 🙂

Dependency injection frameworks. Reflection has given us killer IoC tools like Windsor and Unity to solve our dependency problems. Clearly reflection is a key enabler of the technology, as dependency discovery and instantiation are all achievable via binary metadata analysis.

Plugin frameworks. Plugin frameworks commonly use reflection to dynamically load 3rd-party plugins: the host can load additional libraries at runtime and discover the plugin types they contain, without knowing about them at compile time.
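A minimal sketch of that discovery step (the IPlugin interface and loader are illustrative, not any particular framework's API):

```csharp
using System;
using System.Linq;
using System.Reflection;

public interface IPlugin { void Run(); }

public class HelloPlugin : IPlugin
{
    public void Run() => Console.WriteLine("hello from a plugin");
}

public static class PluginLoader
{
    // Scan an assembly (typically obtained via Assembly.LoadFrom(path) at
    // runtime) and instantiate every concrete type implementing IPlugin.
    public static IPlugin[] LoadPlugins(Assembly assembly) =>
        assembly.GetTypes()
                .Where(t => typeof(IPlugin).IsAssignableFrom(t)
                            && !t.IsAbstract && !t.IsInterface)
                .Select(t => (IPlugin)Activator.CreateInstance(t))
                .ToArray();
}
```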

Bad reflection 😦

Refactoring tools and code analysis tools are incompatible. The opening example of this post shows that refactoring tools cannot cover what reflection can do: it can make your code brittle. It’s much better to be explicit in your code design; avoid establishing implicit protocols in your code base that your reflection code requires in order to work correctly. Note that static code analysis features, such as refactoring tools or discovering method usage (e.g. with ReSharper), are rendered useless by code that uses reflection. This is a dangerous place to be.

Adds a super-generic layer of indirection. Indirection is a double-edged sword: it can improve the design (and yield a number of benefits), but at the cost of added complexity. The problem with reflection is that it adds a higher degree of indirection than non-reflective code, because it hides static detail such as class names, method names, and property names. So heavy use of reflection makes the program near impossible to follow in a static code walkthrough. It can also be very difficult to debug.

Run-time errors instead of compile-time errors. This argument can be made against all sorts of mechanisms (such as dynamic typing), but it is a good point to make. If you have the option of a design that doesn’t require reflection, at least you have a chance your compiler will complain when code changes have broken something. A design using reflection is subject to runtime errors, which in the worst case may not be detected until a release cycle (or in production!).

Invocation via reflection is much slower. Generally the performance hit from reflection is negligible, but in sections of code where reflection is used heavily, performance will degrade. Reflection is slower because the binary’s metadata must be inspected at runtime during invocation (rather than being resolved at compile time).

Conclusion

Avoid reflection.

If you think you need to solve a problem with reflection, rework the design (don’t be lazy!). Also, don’t use reflection to get at protected data (e.g. non-public members); violating standard language conventions will get you into all sorts of trouble further down the line. Only use reflection where it is absolutely the only way to meet your needs – where there would be no way, at least without enforcing a difficult or cumbersome protocol, to implement a solution. So be wary of the drawbacks of reflection before you go crazy with it, and always strive for a solid design!