Thursday, May 24, 2012

Today I shipped my first ever open source library. It's so small and tiny that I am not sure it qualifies to be called a library; it doesn't even have a build script yet, but it's definitely open and it's definitely shipped. It's out there on my github page. Meet SAXDVR. In my defense, it has a documentation page and it has unit tests. And it has a purpose.

A question at StackOverflow made me sit down and play with SAX filters. A couple of hours into it I had it working, so I thought why not take one more step and put it out there. It felt great! Don't get me wrong, it will hardly be used by anybody other than as an illustration to my answer, but it did feel like a big deal to me. Ship, no matter how small or large, and you will likely feel the same.

I realize that it's a happy path with no error checks, but it works. I figured I was relying on the commutative property of multiplication, so I decided to reverse the collected arguments:

A friend of mine didn't accept the solution and asked me to keep working on it to get rid of the loop. Iteration can easily be converted into recursion, and the result didn't take long to produce. My friend, however, didn't accept that either and asked me to get rid of the arrays and of working with arguments (which isn't really an array, but that's another story). The map function, he said, has an arity of one, so implement it as such; don't let the arguments collect behind the scenes.
Well, I have to admit I was stuck. I knew I needed to somehow stack the closures and return functions of functions but the solution was out of my reach. I gave up.
Here's his pure functional solution that he gladly shared with me after I admitted I couldn't get it:
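As a rough rendition of the idea in Ruby (my own sketch for illustration, not his actual JavaScript): an arity-one map takes just the function and returns another single-argument function that recurses over the list, so no loops and no argument collections anywhere.

```ruby
# An arity-one map built from functions returning functions.
# (My illustration of the concept, not the friend's solution.)
map = lambda do |f|
  mapper = lambda do |list|
    list.empty? ? [] : [f.call(list.first)] + mapper.call(list.drop(1))
  end
  mapper
end

double = map.call(->(x) { x * 2 })
double.call([1, 2, 3])  # => [2, 4, 6]
```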

I can see my school math in it. And to really get the functional solution to the make challenge, I should have seen the recursion in it. My friend recommended the following reading to up my familiarity with those pure functional concepts:

Monday, March 19, 2012

I needed to get the files out of a file copy of a Git repo. For some reason cloning from the remote repo didn't work; the remote machine kept hanging up on me. But I was able to scp into the machine and just grab the repo files. All I really needed was a quick look into someone else's work, and figuring out how to properly clone the remote repo would have taken me longer, so I figured what the heck.

Here's my recipe:

Copy the repo to some local location. Let it be <REPO>

Create a folder where you want your "true" local clone to reside. Let it be <LOCAL>

Monday, March 05, 2012

I am working on this small home project of my own, exploring various aspects of the Ruby language while building something useful for myself. I will tell more about what it is once (if :) it's ready.

One of the techniques I am trying is TDD. Like I heard the other day on the Ruby Rogues podcast, the right response to "testing is hard" is not "let's not test", it's "let's get better at it". I also wanted to see if I really can put my mind in a position to think "backwards" when designing an API. When you start with a test, you first define how the API is used and only then go build it. In theory this should lead to a cleaner, easier-to-use API and help with the whole YAGNI thing: instead of building all those extra behaviors the users might never need, you focus on the critical things that make it useful.

I like my TDD experience so far. I find that it disciplines me to focus on things that matter. Without writing the test first I am more likely to think wild and do some yak shaving, you know. It's much less apparent that you're building an unneeded feature or doing a premature optimization than it is that you're writing a meaningless test. Starting with the test makes me think of what I need, and then I am obligated to build that thing or my test suite fails. The whole red-green-refactor cycle. It's fun. You should try it.
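Here's the kind of first "red" step I mean, using Ruby's bundled minitest (Slugifier is a made-up example class, not something from my project):

```ruby
require 'minitest/autorun'

# Red: this test is written first and describes the API we wish we had.
class SlugifierTest < Minitest::Test
  def test_downcases_and_dashes
    assert_equal 'hello-ruby-world', Slugifier.new.slugify('Hello Ruby World')
  end
end

# Green: the minimal implementation that makes the test pass.
class Slugifier
  def slugify(title)
    title.downcase.strip.gsub(/\s+/, '-')
  end
end
```

Running the file reports the failure first if the class is missing or wrong, which is exactly the discipline part: no code appears without a test demanding it.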

Having done it for a little while I thought I would measure my test coverage. Enough was said about how it doesn't necessarily mean a lot to have your code "covered", how striving for coverage alone you get your code exercised but not necessarily tested, how there are different levels of coverage, etc. Still I wanted to measure.

Ruby 1.9 comes with a built-in experimental coverage feature that instruments code at runtime and collects the metrics. There are simple gems out there that leverage it to extract the data and build coverage reports, and cover_me is one of them. You basically require 'cover_me' in your test suite and off you go. Almost.

If you do just that you will have the coverage.data file but no reports. Here's their github issue #20 that talks about just that. Running CoverMe.complete! from irb or pry is one of the recommended solutions, as is running rake cover_me:complete. I liked neither, as I wanted it all to work "out of the box", even when I simply do ruby test/test_all.rb. And here comes the beauty of Ruby being a dynamic, interpreted language. Any external library in your gems can be opened up and looked into. A few minutes (or hours, or days :) of looking through the code and you know what it does and what it needs to do what you want. I first posted a dirty way of doing it and then came up with a much cleaner solution that doesn't require tapping into the cover_me sources. Maybe someone will find it useful. I guess the best way would be to wrap it all into a rake task and run my test suite along with the coverage steps as one rake command. Next time.
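For reference, the built-in Coverage API that cover_me leverages is tiny; here is roughly all there is to it (the demo file below is made up for illustration):

```ruby
require 'coverage'
require 'tempfile'

# A throwaway file to measure; it stands in for real project code.
src = Tempfile.new(['demo', '.rb'])
src.write("def double(x)\n  x * 2\nend\ndouble(21)\n")
src.close

Coverage.start            # must be enabled before the code is loaded
load src.path
result = Coverage.result  # e.g. {"/tmp/demo123.rb" => [1, 1, nil, 1]}
```

Each entry maps a file to per-line execution counts, with nil for lines that aren't executable (like a bare end); report gems just pretty-print this hash.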

Tuesday, February 28, 2012

Just yesterday I started the "what I learned today" series, and I'm already thinking it should have been a weekly digest. Not that I didn't learn anything new (I did, in fact, learn a few things that I will share in a second); I just figured it would be too much noise to publish a digest daily. So weekly it is.

Today I instead want to talk about metaprogramming in Ruby. Actually, it's not so much that I want to talk about it as that I want to leave a summary for my future self: a collection of links that finally made me get it. The links are in no particular order:

A great episode of the Ruby Rogues podcast about what metaprogramming is, how it differs from monkey patching, when and why and how you would do it, and beyond. Great stuff. I would highly recommend this podcast even to folks not doing Ruby coding on a daily basis (I am not, for example). I picked it up recently and my commute just got so much better. I haven't yet caught up on all of the episodes.

I am sure there are more decent blog posts out there on what metaprogramming in Ruby is about, how to do it right, and when it is right to do it, so please add them to my collection.

UPDATE. As I was wrapping up this one I came across two more very interesting and very relevant posts:

An awesome writeup on include and included. It goes into showing off bizarre examples of massive metaprogramming in ActiveRecord, as well as some conscious evolution of Rails toward doing it more right than before.

A response to the post I just mentioned that brings up an interesting point about namespace pollution. While I am ok protecting the global namespace in JavaScript with the "closure everywhere" (function(...){})(); pattern, I am not sure I see the point in giving up syntax convenience to protect my class namespace in the same way.
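To pin the include / included mechanics down with something runnable, here is the canonical hook pattern in miniature (Trackable and Widget are made-up names, nothing from ActiveRecord):

```ruby
# When Trackable is mixed into a class, Ruby calls Trackable.included
# with that class, which we use to also add class-level methods.
module Trackable
  def self.included(base)
    base.extend(ClassMethods)
  end

  def track!
    self.class.tracked << self
  end

  module ClassMethods
    def tracked
      @tracked ||= []
    end
  end
end

class Widget
  include Trackable
end

w = Widget.new
w.track!
Widget.tracked.size  # => 1
```

One include statement, and Widget grows both instance methods and class methods; that's the trick the ActiveRecord writeup shows at a much larger scale.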

There shouldn't be a day in your life when you don't learn something new.

I believe we all learn every day and just don't always think about it. I thought I would start a daily digest of things I learned "today" to help me better remember them, and to share them with you, as some may find them useful.

Some of it comes from working on a problem in the office, some comes from reading or listening to other people, and some comes from proofreading my son's homework. There are clearly different depths to the whole notion of "I learned it", so I by no means claim that I am now an expert at the things I learned.

Too much "learned" already, so with no further ado, today I have learned:

Groovy's way of grabbing Maven dependencies via Grape and @Grab annotations doesn't search your local .m2 repo. You need an explicit configuration in your .groovy/grapeConfig.xml to make it take a trip "home".
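If memory serves, the fix is an extra resolver entry along these lines; double-check against the Grape documentation before copying, as I am reconstructing this from memory:

```xml
<!-- ~/.groovy/grapeConfig.xml : teach Grape to look in the local .m2 -->
<ivysettings>
  <settings defaultResolver="downloadGrapes"/>
  <resolvers>
    <chain name="downloadGrapes">
      <!-- the local Maven repository, checked first -->
      <ibiblio name="localm2" root="file:${user.home}/.m2/repository/"
               checkmodified="true" changingPattern=".*"
               changingMatcher="regexp" m2compatible="true"/>
      <!-- fall back to Maven Central -->
      <ibiblio name="ibiblio" m2compatible="true"/>
    </chain>
  </resolvers>
</ivysettings>
```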

Cassandra's secondary index is a hash index, so all you can ask of it is an equality query. You can't do a range query on it, and bitmap secondary indexes are not yet on the roadmap.

The way you bulk load data into Cassandra is by building SSTable files directly and pushing them to the cluster. Nothing new really; all "databases" do direct datafile writes for fast bulk inserts, but here you literally create datafile "increments".

Ruby has a defined? operator that you can use to find out if super is defined for the method you're in, so you can do things like super if defined?(super). Pretty neat.
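A quick self-contained demo of that trick (class names made up):

```ruby
class Base
  def greet
    "hi"
  end
end

class Child < Base
  def greet
    # Base#greet exists, so defined?(super) is truthy here
    defined?(super) ? "#{super}!" : "hi!"
  end

  def solo
    # there is no Base#solo, so the guarded call is skipped instead of raising
    super if defined?(super)
    "solo done"
  end
end

Child.new.greet  # => "hi!"
Child.new.solo   # => "solo done"
```

Without the guard, calling super in solo would raise NoMethodError.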

Friday, February 17, 2012

Last night's coding session, which I had kind of waited for the whole evening while putting my kids to bed, quickly turned into a "what the heck is going on" session.

I have this small home project where I am building a Ruby library wrapper around a public JSON API. It's a half-fun-half-learning exercise, so I am trying to stay close to bare-metal Ruby without introducing gems that I don't really need. So I am using net/http to interact with the API endpoint, and it had all been fine until yesterday.

My code fires up the request, sticks the response right into JSON.parse(), and off it goes. As I was building a new class to represent a new entity, my tests started to fail with:

JSON::ParserError: 743: unexpected token at '▼'

The saying goes that there are two categories of developers when it comes to debugging. Faced with an error, one would fire up the debugger and dive right in, while the other would sprinkle puts statements and follow the trails. I am a member of the second group. If you think that's not right, I would encourage you to listen to the Debugging in Ruby episode of the Ruby Rogues podcast. There are cases when the debugger is the way to go, but I am convinced that when working on your own code with good test coverage (like it should be!) you're better off finding the problem with your brain and some help from the puts method.

As I was inspecting the response object that the JSON parser choked on, I got to learn the Ruby 1.9 String#encoding and some other very interesting things. Somehow I remembered the early 2000s when I was actively writing Java for the web, and dealing with encodings to support multilingual content was a black-belt art. The database, the IO streams, the XSLT transformations, HTTP headers and HTML meta tags, and what have you. Déjà vu. Only it wasn't.

Well, even if I had moved to Mechanize it wouldn't have helped me. It wasn't an encoding issue. It turned out to be exactly what my code was asking for, and all I had to do was teach my code not to ask for what it couldn't chew on.

Sending JSON requests via net/http, I was pretending to be a browser. Not that I needed to, but since I had to send some state in cookie headers anyway, I figured I would just pretend I am a full-blown browser client. Among other harmless things that I carried over from the Firebug Net log was the following rather important "expectation":

Accept-Encoding: gzip, deflate

I was telling the server that I was ready to process zipped content, and I wasn't. All the previous JSON responses were small enough for the server not to bother zipping them, so I had no issues. And then I asked for something big enough while saying I was ok with it zipped. The server was built by some good guys who knew their stuff, and they did just what I asked. Removing that single line, and thus no longer telling the server I was ok with zipped content, did the trick... Then I thought how silly it was of me not to get puzzled by the byte stream I saw when I first dumped the response object to STDOUT. It must have been too late already :)
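The whole failure mode is easy to reproduce offline with Zlib standing in for the server (no real HTTP involved here, just the bytes):

```ruby
require 'json'
require 'zlib'

payload = JSON.generate('ok' => true)

# What the server sends back once you advertise Accept-Encoding: gzip
gzipped = Zlib.gzip(payload)

# JSON.parse(gzipped) chokes on the raw gzip bytes, just like my tests did.
# Either stop advertising gzip, or actually inflate the body first:
body = Zlib.gunzip(gzipped)
JSON.parse(body)  # => {"ok"=>true}
```

The fix I went with was the first option: only advertise what you can actually handle.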

Even though I didn't get to code the way I had planned, I had a terrific time troubleshooting. It's through exercises like this that you learn to become a better software engineer; I got exposed to so many new things thanks to this little trap I got myself into. It's basically bugs-driven learning :)

Well, if you're tired of wrapping all your anonymous functions into extra ( ), here's a shortcut for you:

!function($) {
...
}(jQuery);

Getting a boolean out of virtually anything

Say something is passed to your function, or there's something you receive from another function, and all you need to know is that it's "there". You basically need true if it exists and false otherwise. Here's a neat shortcut:

myFlag = !!varToVerify;

The double negation yields false for any falsy value (false, 0, NaN, null, undefined, or an empty string) and true for everything else; Infinity, for one, yields true. Note that if the variable was never declared at all, the lookup itself throws a ReferenceError. You can test it in a JavaScript console:

Tuesday, January 24, 2012

I like music and rarely work without a nice track playing in the background. Pandora and MOG are always with me, on my laptop, on my iPhone, on my Kindle. I believe my TV has a Pandora app, and I saw one running on my neighbor's fridge (fancy that). Last week I figured I would check out Spotify just to make sure I wasn't missing out on something even better. Facebooking is not something I do, so that's how I missed all the buzz, I guess. Anyway, long story short, I didn't choose Spotify over MOG, but I did find myself a puzzle to spare a few night hours with.

These guys posted a few puzzles for candidates to try before applying for an engineering job at Spotify. As of this writing there are three, each labeled with a music genre: reggae, funk, and heavy metal. I would almost always prefer heavy metal over reggae and funk, so I knew which one I would tickle my brain with.

The Bilateral Projects puzzle is about finding an optimal solution to what sounds like an easy problem. You have project teams of exactly two individuals, one from each of two different offices, and you need to invite the smallest number of people to a company conference while having all projects represented, to save on expenses. The same people can work on more than one project, hence the potential for savings. Should the problem have more than one optimal solution, you pick one that invites your friend; otherwise you don't really care. Oh, and a very important detail, as we will soon learn: the input may ask you to optimize for as many as 10,000 project teams.

Java

First, let's do a careful, defensive-style, object-oriented brute-force solution in Java. We'll find all possible solutions, reduce the list to only the optimal ones, and finally pick one that ideally has our friend in it. Clearly not a solution for large data sets, but a good way to test the waters and see what we've got here.

Note: if you're here for a solution, you should scroll down to the Dominating Set chapter; the Java part won't be of much interest to you. O(2^n) isn't really a solution for anything larger than a handful of teams.

The full code is on github; here are the important pieces. Input data is parsed into a collection of Team objects, each referencing two Employee objects (we could clearly work with primitive arrays, but that's something I left for Ruby). It's now time to build a list of all possible solutions to the puzzle. There are algorithms to build all combinations and permutations, but we need a slightly different thing that can be expressed in a much simpler way. We only need one of the two team members from each team, which means 2^n choices that we can visualize as a binary tree: a tree as deep as the number of teams, branching out and doubling the nodes on each level to end up with 2^n leaves on the bottom level. Take a look.

Each line in the dataset on the left represents a team of two employees referenced by their IDs. The full solutions tree would look like the one on the right:

And the code does exactly that. It loops through all teams, accumulating a List<Set<Employee>> (a path from the root down to a leaf is the set of nodes you walk through, and the outer list is the collection of all possible ways down), creating a new solution branch on each step by adding employees from the team as leaf "nodes":

// SPEC: If possible (subject to the set of people being smallest possible),
// the list of invitees should include your friend
final Employee myFriend = new Employee(MY_FRIEND_ID);
// reduce to those having our "friend"
final List<Set<Employee>> bestSolutions = new ArrayList<Set<Employee>>();
for (Set<Employee> s : optimalSolutions) {
    if (s.contains(myFriend)) {
        bestSolutions.add(s);
    }
}
final List<Set<Employee>> solutions = !bestSolutions.isEmpty() ? bestSolutions : optimalSolutions;
final Set<Employee> solution = solutions.get(0);

Given the data set we used to illustrate the solutions tree, the code finds a five-person solution: 2002, 2021, 2038, 2024, 2031.
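For a compact view of the same 2^n enumeration, here is the tree walk in a few lines of Ruby, over a made-up three-team input rather than the pictured dataset:

```ruby
require 'set'

# Each team doubles the set of partial solutions: one branch per member.
# (Toy input for illustration, not the blog's dataset.)
teams = [[1, 2], [2, 3], [3, 4]]

solutions = teams.reduce([Set.new]) do |branches, (a, b)|
  branches.flat_map { |s| [s | [a], s | [b]] }
end

best = solutions.min_by(&:size)
# a smallest invite list here has two people
```

Eight leaves for three teams, exactly as the binary tree predicts, and the same exponential blow-up for anything bigger.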

Dominating Set

Before we move on to other languages and potentially better ways to solve the heavy metal puzzle, let's see what we're dealing with. If we plot employees on a chart and cross-link people who work together, we'll end up with a graph (or a set of graphs) with employees as vertices and projects as edges. Here's one for the dataset we've already looked at:

I used orange coloring for the employee nodes that represent the optimal solution. You see how all the uncolored nodes are connected to at least one colored node by an edge? It makes sense, right? We want to pick employees to represent all projects, so all edges should be "covered" by the highlighted set. In graph theory a set like this is said to be a dominating set. And there's a domination number that represents the number of vertices in a smallest dominating set... As wikipedia says, the dominating set problem [..] is a classical NP-complete decision problem [..] therefore it is believed that there is no efficient algorithm that finds a smallest dominating set for a given graph.

Right. Why would otherwise folks over at Spotify have called this one a "heavy metal" puzzle? :)

Ruby

My original idea was to illustrate that one would use far fewer words to express the same thing in Ruby, but rewriting a brute-force solution in Ruby is no fun, so let's optimize it a bit. Instead of building a list of all potential solutions, let's try to go straight to the optimal one.

The idea is to:

Try to figure out who is most likely to make it to the final list of attendees

Add that likely candidate to the attendees list and reduce the list of projects, removing every project the selected candidate works on

As we reduce the projects list, keep notes of the other people who participate in the projects being removed. This way we know we don't need them unless they represent other projects

For the teams that don't have a clear "winner", take either one, giving preference to our friend (if he or she is one of the two); otherwise try to balance between the two offices
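The steps above can be sketched as a simple greedy loop (a simplification of the real Solver; pick_attendees is a made-up name):

```ruby
# Greedy sketch: repeatedly invite whoever sits on the most remaining
# projects, breaking ties toward the friend, then drop covered projects.
def pick_attendees(projects, friend = nil)
  attendees = []
  until projects.empty?
    counts = Hash.new(0)
    projects.each { |a, b| counts[a] += 1; counts[b] += 1 }
    # highest project count wins; the friend wins ties
    best = counts.keys.max_by { |e| [counts[e], e == friend ? 1 : 0] }
    attendees << best
    projects = projects.reject { |team| team.include?(best) }
  end
  attendees
end
```

With ties broken toward the friend, an equally good selection containing him or her wins automatically, which covers the "if possible, invite your friend" clause.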

Our Solver keeps track of projects, employees, and selected attendees (oh, and the full listing is of course available over on github):

class Solver
  attr_accessor :projects, :employees, :attendees
end

When the solver initializes itself from the input data, it builds a list of projects (a project is a simple two-element array, and the list of projects is an array of such arrays) and also calculates how many projects each employee participates in. The latter we keep in a hash with the employee ID as the key and the number of projects as the value:

That's it. A bit of "side effect" programming and we're done. It generates a slightly different though still optimal solution: 2002, 2021, 2024, 1003, 2031.

Note: the project list is sorted in place using the bang version of sort_by to keep subsequent sorts close to the best-case O(n). There's still room for improvement: we could stop recursing and re-sorting once we know the rest of the list consists of teams with no "shared" people on them. We could also sort more intelligently and, once no "unbalanced" teams remain, push forward the teams whose members are equally shared across multiple projects. I am jumping ahead of myself a little, but my tests didn't show any noticeable difference from those improvements, so I voted for smaller and cleaner code, not code that pretends to be "smarter" when it's not.

Prolog

I have to admit I spent a few evenings "learning" it before I could express a solution to the puzzle in a declarative way. I thought the experience I had with inference systems in the comfortable environment of imperative languages (JBoss Drools, for example) would be enough to just "get" it, and boy, was I wrong. It's predicates, not functions; it's unification, not assignment; it's terms and atoms, not variables; and on and on it goes. If you have never programmed the logical way, I dare you to try! As Bruce Tate wrote about Prolog in his Seven Languages in Seven Weeks: "if this is your first exposure, I guarantee either you will change the way you think or you'll fail". A good primer on predicate calculus might also help.

Prolog's approach to solving any kind of problem you throw at it is all about permutations. It's optimized to search for proofs and backtrack through the decision tree, so we know the puzzle will be solved with the brute force's O(2^n) "complexity". Let's see:

First, we define the attendees rule. It starts with a terminating empty goal to make sure the recursion has somewhere to stop. Then we say that either one of the two employees is a good solution for a one-team puzzle, and then we recursively solve it for a list of teams. attendees works on a list of tuples and in this form would report the entire solutions tree (just like the one we built in Java). The sort predicate at the end makes sure the solution set is free of duplicate employees (those shared across more than one project team and thus listed more than once).

It's now time to find the optimal solution that ideally has our best friend in it. Let's collect the list of all solutions, sort it by length (assuming the lists are free of duplicates), push lists with our friend to the front among lists of the same length, and then just pick the first one. We are going to need a custom sorting predicate:

optimize compares by length when two lists differ in size and, when two lists represent equally optimal solutions, prefers the one with our friend in it (if found) or else the first one. Here's the solve goal:

It collects all permutations as delivered by the attendees goal, sorts them with optimize, picks the head of that list, and prints its length. When solver.pl is loaded you can ask Prolog a question like this:

You can find projects.pl on github. Note: Prolog dialects vary. I used SWI-Prolog; when I then tried to run it in GNU Prolog, it didn't work. The GNU version didn't like predsort/3, which I believe is an SWI-Prolog extension.

Let's Rock!

There aren't many examples on the Spotify website, so I had to generate a few data sets to see how well we cope with the puzzle, at what point each solution breaks, and whether it finds the optimal solution. The generator I quickly put together can build various setups based on a few parameters: the number of teams/projects, how many employees from each office to consider (basically a density factor and a way to model balanced and unbalanced participation), and finally whether to "pad" it with our friend a few extra times. It produces data files with names like dataset_A_B_C_D, where A is how many teams we asked it to generate, B and C stand for how many Stockholm and London employees it randomized project participation across, and D is up to how many times it tried to squeeze our friend into the project teams. You can find the generator on github as well.

Let's start with a few simple tests. 20_7_4_0 is a slightly unbalanced setup, but it's a good test. We know right off the bat that the optimal solution can't require more than four employees. It could be fewer if rand(4) didn't spit out all the options, though that's unlikely over 20 runs. So let's see: the Java solution said it's 4 very quickly, so did the Ruby version, and so did Prolog.

30_10_10_2 is already large enough to keep Java spinning and eventually run out of heap space. I gave it some more and it ate it all up, still not satisfied. That was to be expected with exponential complexity and all those Set objects. Prolog ran out of local stack recursing into the solutions tree. Poor guy. The Ruby version was very fast, telling me it needed only 9 employees. I would expect at most 10, so 9 is believable. 100_30_20_0 reports back that it needs 20. 100_80_60_0 says 51 (your mileage may vary, as the data sets are essentially randomized). 100_500_500_0 says 85. Again, I would expect the number to be less than 100. So far so good.

500_50_50_0 reports... 53. 5000_500_500_0 reports... 563. And the edge case of 40000_999_999_0 took a few seconds to generate and about 20 seconds to solve. It reported... 1014. Well, now we know we didn't crack it completely... We'll get back to this one in a minute.

To wrap up the algorithmic part: the Java and Prolog solutions have at least O(2^n) complexity to collect the solutions tree, plus n*log(n) sorting (both do variations of merge sort) and some more overhead. The Ruby version in the worst case loops n times, each time doing n*log(n) of sorting and another n iterations to "reject" the processed project(s). That gives us O(n^2*log(n)) if I am not mistaken. No surprise it can get to the bottom of a 40,000-team set. The Windows calculator in scientific view reported "overflow" when I tried to look at 2^40,000, while n^2*log(n) for 40,000 is a little shy of 10 billion, which isn't really a big deal to iterate through, especially with tail recursion keeping it all "flat". I have to admit, though, that I liked the way the Prolog solution "sounded".
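The back-of-the-envelope numbers behind that claim, in irb terms (using a base-10 log for the estimate; the log base only shifts a constant factor):

```ruby
n = 40_000

# 2**n overflows any calculator: its decimal form alone has ~12,000 digits.
digits = (2**n).to_s.length   # => 12042

# n^2 * log(n) lands just under 10 billion: big, but perfectly walkable.
ops = n**2 * Math.log10(n)    # ~7.4e9
```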

So what's up with those large sets producing suboptimal results? Well, I wouldn't expect to find a fast algorithm for what folks with PhDs in computer science believe to be a classical NP-complete problem :) Here comes the last optimization.

Common Sense

If the "optimized" answer after 20 seconds of crunching the numbers is larger than the number of employees from one of the two offices participating in all the projects, then do the following. If you're still up for some savings and can afford having one of the two offices miss the Barbados "thing" and still be there when you come back, then just take the smaller office with you. If you don't think your talent will stick around when you dump them like that, then I see another option. Understand that people are the most important asset of your business, and if you truly can't take everybody to Barbados, then take them all to either London or Stockholm and have some real fun together :)