Tropical Software Observations

I recently finished some work in the Go programming language, and I'd like to share a few things I learned while implementing web crawlers in this cool language.

Our Current Implementation and its Weak Points

Our goal was to improve the efficiency and performance of some web crawler scripts that have been running in production for over two years with increasing performance problems, primarily due to CPU usage. These crawlers grab data from various web sites. They were written in Ruby, and the method used is straightforward: get a page URL or account ID, parse the page content or invoke an API call, persist the data to the database, and repeat. The benefit of this approach is its simplicity; anyone can easily understand the structure. However, when the task queue grows very large, weaknesses emerge: the queued jobs get backed up waiting for responses, which in turn extends the execution time of a batch significantly.

Built-in Concurrency Support

There are two possible paths to improving our code: one uses multiple threads, the other non-blocking IO and callbacks. After reading some sample code, I chose the threaded style, since I find such code more readable and maintainable.

goroutines

Golang has a built-in concurrency construct called a goroutine, based on the CSP model (strictly speaking, the model comes from the Newsqueak-Alef-Limbo lineage, which differs somewhat from the original CSP; please refer to this slide for more information). Here is a brief introduction to goroutines, from the official slides.

An independently executing function, launched by a go statement.

Has its own call stack, which grows and shrinks as required.

It's very cheap. It's practical to have thousands, even hundreds of thousands of goroutines.

Not a thread.

There might be only one thread in a program with thousands of goroutines.

Instead, goroutines are multiplexed dynamically onto threads as needed to keep all the goroutines running.

There are two loops in the code, and each invokes the function foo() three times. The difference is that the function is called directly in the first loop, while it is launched via a go statement in the other one. The output of the first loop is in the normal order, but the second loop's output is out of order. In fact, the output of the second loop has an unpredictable order, which you can verify by launching more goroutines. By the way, the sleep on Line 19 is there to guarantee all goroutines have finished executing, since the main goroutine does not wait for the others. There are, of course, better synchronisation mechanisms available. The full source can be found here.
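The original gist isn't embedded here, but a minimal reconstruction of the idea (the function name and sleep duration are my own) looks something like this:

package main

import (
    "fmt"
    "time"
)

func foo(n int) {
    fmt.Println("foo", n)
}

func main() {
    // first loop: foo runs synchronously, so the output is in order
    for i := 0; i < 3; i++ {
        foo(i)
    }

    // second loop: each call runs in its own goroutine, so the output
    // order is unpredictable
    for i := 0; i < 3; i++ {
        go foo(i)
    }

    // crude synchronisation: give the goroutines time to finish,
    // since main does not wait for them
    time.Sleep(time.Second)
}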

channels

The goroutine shown above doesn't return data, so it's not a fully functional component. Now let's talk about Go's channel construct. A channel is something like a pipe: data is pumped in from one side and fetched from the other. Let's modify the code above to show how this works. I have removed the normal loop to make the code shorter.

As we can see, there is no sleep() call at the end; instead, a variable foo_channel pipes data from the goroutines back to the main goroutine. The send operation blocks if there is already data in the channel, and the receive operation blocks if there is no data in the channel. This is what controls the flow of execution. The full source can be found here.
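Again, the full source lives in the linked gist; a sketch of the pattern (the names mirror the description above) might look like this:

package main

import "fmt"

func foo(n int, foo_channel chan string) {
    // send the result back to the main goroutine through the channel
    foo_channel <- fmt.Sprintf("foo %d done", n)
}

func main() {
    foo_channel := make(chan string, 1)

    for i := 0; i < 3; i++ {
        go foo(i, foo_channel)
    }

    // receive exactly as many results as goroutines launched; each
    // receive blocks until some goroutine has sent its value
    for i := 0; i < 3; i++ {
        fmt.Println(<-foo_channel)
    }
}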

The size of this channel is 1, which makes it behave like a semaphore. But we can specify a larger positive integer when creating a channel to get a buffered channel. When sending data to a buffered channel, the operation only blocks if the buffer is full; on the other hand, the receive operation only blocks if the buffer is empty. This feature can be used to control the concurrency level of our web crawlers. Here is another example:

This program creates two channels, of which concurrency_chan is a buffered channel of size 2. At the beginning of each goroutine we put a true value into the channel, and we receive it back when the goroutine finishes. With these two lines, we can guarantee that only 2 goroutines are ever running at the same time. Adding a sleep() shows this more clearly. Again, the source code can be found here.
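A sketch of that pattern (the worker count and sleep are arbitrary, but the two channels match the description above):

package main

import (
    "fmt"
    "time"
)

func worker(n int, concurrency_chan chan bool, done_chan chan bool) {
    // take a slot: this blocks if two goroutines already hold one
    concurrency_chan <- true
    fmt.Println("worker", n, "running")
    time.Sleep(time.Second) // simulate some work
    // release the slot so another goroutine can proceed
    <-concurrency_chan
    done_chan <- true
}

func main() {
    concurrency_chan := make(chan bool, 2) // at most 2 workers at a time
    done_chan := make(chan bool)

    for i := 0; i < 6; i++ {
        go worker(i, concurrency_chan, done_chan)
    }
    for i := 0; i < 6; i++ {
        <-done_chan
    }
}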

Web Crawlers in Golang

With goroutines, we can implement concurrent web crawlers more easily than with traditional threads. As in our original architecture, the new approach has three basic steps: 1) fetch a task from the queue server, 2) parse the URL or account ID, then 3) save to the database. The difference is in the second and third steps, where goroutines are now used to inexpensively multiplex our calls to external services. Since a goroutine is just a regular function launched with a go statement, the changes to our original logic are minor, even trivial. However, the improvement is great: during testing, we found CPU utilisation on hosts running our crawlers was reduced by up to 90%. That is an upper bound, though, and multiple hosts are still needed for redundancy and to segregate API calls to the various services.
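To give a feel for the shape of the final code, here is a simplified sketch (the queue, parsing, and database calls are stubbed out; this is not our production crawler):

package main

import (
    "fmt"
    "sync"
)

// crawl handles one task end to end: fetch and parse, then persist.
func crawl(task string, slots chan bool, wg *sync.WaitGroup) {
    defer wg.Done()
    slots <- true              // take a concurrency slot
    defer func() { <-slots }() // release it when done

    data := fmt.Sprintf("parsed(%s)", task) // stand-in for the HTTP/API call and parsing
    fmt.Println("saving", data)             // stand-in for the database write
}

func main() {
    tasks := []string{"url-1", "url-2", "account-42"} // stand-in for the queue server

    slots := make(chan bool, 10) // cap the number of in-flight crawls
    var wg sync.WaitGroup

    for _, t := range tasks {
        wg.Add(1)
        go crawl(t, slots, &wg)
    }
    wg.Wait()
}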

After a few days of trying out Polymer, here are some of my first impressions.

What is Polymer?

According to its official website, Polymer is a new type of library for the web, built on top of Web Components, and designed to leverage the evolving web platform.

Solving the new <table /> issue

Although we have come a long way from <table /> layouts to the Semantic Web, it's still obvious that we are merely using <div /> or other elements in place of tables. With Polymer, which is built on Web Components, we can finally get closer to a truly semantic web by encapsulating the structural details of components in separate files, so that our documents are more readable.
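The original example isn't reproduced here; as a rough, hypothetical illustration (the element and property names are made up), a Polymer element with declarative {{ }} data binding looks something like this:

<!-- hypothetical custom element; {{name}} is bound to the element's
     "name" property, much like AngularJS template interpolation -->
<polymer-element name="user-greeting" attributes="name">
  <template>
    <p>Hello, {{name}}!</p>
  </template>
  <script>
    Polymer('user-greeting', { name: 'world' });
  </script>
</polymer-element>

<!-- used in a page -->
<user-greeting name="Polymer"></user-greeting>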

You can see in the example above that there isn't much difference between the data binding strategies of AngularJS and Polymer.

Conclusion and first impressions

At the moment Polymer is still quite new to me and I'm still trying to grasp how it works. In fact, the examples I used above might not even be how the Google Developers Group intended it to be used, but I am sure it lays a good foundation for other upcoming web frameworks.

Currently, the official toolchain contains the Clang front end, which can only be used to compile C/C++ source to PNaCl applications.
But a PNaCl application is essentially a subset of LLVM bitcode, so I tried compiling some other languages with LLVM front ends to see if they would work. So far there hasn't been much progress, but I'm still working on a solution, and this post is just to document my attempts in the hope that others might learn from my experiences.

Toolchain Installation

The PNaCl toolchain can be downloaded via Google's Native Client SDK. Follow the instructions for downloading and installing.
Here's a recap of the commands I used:
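(The original list of commands isn't reproduced here; roughly, and assuming the pepper_canary bundle referenced later in this post, it goes like this:)

$ unzip nacl_sdk.zip          # downloaded from the Native Client SDK page
$ cd nacl_sdk
$ ./naclsdk update pepper_canary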

Chrome also needed some additional setup which is also described in the SDK page.
Notice that the #enable-nacl-debug flag must not be enabled, or the PNaCl application will not be able to load (ref). Samples are also available in the Chrome Web Store, and the link can likewise be found on that page.

Golang and llgo

The first language I tried was Golang. There is a Golang front end for LLVM called llgo,
and I tried using it to generate LLVM bitcode from my Golang program.

go get is the recommended way to install the llgo package. The CGO_CFLAGS and CGO_LDFLAGS environment variables should be set before executing go get (see the sketch after the list below).
I followed the steps from the author's blog to install:

Install Go 1.1

Get the PNaCl SDK (contains Clang and LLVM)

Run go get github.com/axw/llgo/cmd/llgo-dist

Run llgo-dist to build and install the runtime
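Roughly, and assuming llvm-config from your LLVM build is on the PATH (the exact flags depend on your LLVM version), steps 3 and 4 look like this:

$ export CGO_CFLAGS="$(llvm-config --cflags)"
$ export CGO_LDFLAGS="$(llvm-config --ldflags) -L$(llvm-config --libdir)"
$ go get github.com/axw/llgo/cmd/llgo-dist
$ llgo-dist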

I tried this on OSX 10.7.5 but got a "No matching .ll file for asm_darwin_amd64.s" error message.
I filed an issue on Github and learned from the response that I needed to set the target flag '-triple' to x86_64-apple-macosx10.7.0 or i386-apple-macosx10.7.0. I then got another error while compiling the runtime with the llgo-dist command, which seems to be due to the upgrade of Golang to 1.1.
So I have set llgo aside for now and anxiously await new updates from the project.

Ruby

My next attempt was Ruby. Since Ruby 2.0.0, NaCl support has been integrated, and it is concisely documented here.
During compilation of the Ruby source I got an rdoc error, so I added --disable-install-rdoc to the configure command.

I also made symbolic links to all the directories in the SDK's include directory from the Ruby source tree to solve a 'header file not found' error:

$ ln -s /home/vagrant/source/nacl_sdk/pepper_canary/include/* ./

If you are running this command on a 64-bit OS, make sure the 32-bit glibc library is installed, or some binaries may fail with errors. I successfully compiled the native executable for Ruby, but for some reason, which I'm still investigating, it can't be executed by the PNaCl binary.

Next steps

I will keep trying to write Portable Native Client code in other languages.
I just saw that the author of llgo has pushed some changes on Github, so I plan to pull them. Stay tuned for another post on my progress. Not giving up yet!

I've had the opportunity to use Meteor for a small project. For those of you who may not know, Meteor is a Javascript web framework built around node.js. It's not ideal (or even intended) for many purposes. Meteor doesn't follow the conventions of other popular frameworks like Rails and Django. Instead, it builds a complete platform with highly integrated client and server side architecture.

Why Meteor?

It's true that Meteor can be used to make just about any type of app, but the real gains come from the magic under the hood, which makes it an absolute pleasure to develop real-time single-page web apps. The more real-time interactivity an app requires, the more likely Meteor is a fit; game-like apps are a great example. Because it is so easy to deploy to the free meteor.com service, it's also great for proofs of concept, demos, and internal tools. It's also fairly easy to deploy to Heroku after converting the project to a Node bundle.
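For example (the app name is made up), deploying to the free service and producing a Node bundle are each a single command:

$ meteor deploy myapp.meteor.com
$ meteor bundle myapp.tar.gz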

Yet Another Web Server?

No, it's nothing like Express, Sinatra, Rails, or any of the many other frameworks you've used. This framework is very opinionated about what you should be doing and how you should be doing it. Meteor stays true to its stated principles. Here are a few that I think are worth noting:

You write your entire app in one language, defining models and code that can be used on the client or server without repeating yourself.

Templates are done with Handlebars. Smart, live updates are pushed to clients automatically as the database changes.

You write client code that gets data through Javascript calls to mongodb; no more fetching data from REST API endpoints (see the sketch after this list).

The framework uses pre-emptive success on database writes (Meteor calls this latency compensation). This gives the appearance of a zero-latency connection: things begin to happen immediately and are reverted only if the change to the data is rejected.
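A tiny sketch of what this looks like in practice (the collection and field names are made up):

// shared by client and server
Posts = new Meteor.Collection("posts");

if (Meteor.isClient) {
  // appears to succeed instantly on the client and is rolled back
  // only if the server rejects the write (latency compensation)
  Posts.insert({ title: "Hello", createdAt: new Date() });

  // a reactive cursor: templates using it re-render automatically
  var recent = Posts.find({}, { sort: { createdAt: -1 } });
}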

Limitations

Meteor is a very new framework: much of the API is changing, and much is yet to be written. Some tasks that are simple in other frameworks, like Ruby on Rails, are a nightmare or even impossible in Meteor. One example of something quite difficult is defining a REST API to expose data to 3rd-party apps. Even something as simple as using the index of an each loop in the template engine is a real challenge because of the dynamic re-rendering. And because it's new software, the occasional bug will be found.

Stay tuned for part 2, where I discuss using Meteor add-ons, creating your own, and the community bundler Meteorite (mrt).

Background

The standard procedure for building an app with a selectable language option is to have a set of files (a pack) for each language, in which each string in the target language is referenced by a key. The keys are unique within each language pack and identical across packs.

These language packs could, for example, be in YAML files:

en.yml

hi: Hello, world!
bye: Goodbye, world.

es.yml

hi: ¡Hola, mundo!
bye: Adiós, mundo.

Ruby on Rails includes the i18n gem by default, giving us a very easy way to produce any snippet of language we need. For example, I18n.t "hi" would return the string "Hello, world!" if our environment is English, and "¡Hola, mundo!" if our environment is Spanish.

If we ask for something that is not defined in our language pack, we get a message to that effect, e.g. "translation missing: en.seeyoulater". Note that the message includes the missing key ("seeyoulater"), so we can easily go in and add the necessary content.

The Story

In a recent project, I needed to create an admin page which would allow us to identify content that exists in our main language (English) but not in our secondary language (Spanish). To do this, I needed a way to prevent i18n from generating the translation missing message and instead return something obvious, such as nil.

Fortunately, I18n#t provides an option to raise an exception when content is not found:

I18n.t "seeyoulater", :raise => true

which can then be caught to give us our desired nil.

I18n.t "seeyoulater", :raise => true rescue nil

Problem solved, job done.

Oops

Obviously I wouldn't be writing this post if that were the whole story.

My admin page worked perfectly on the development server, but on production it failed. The two systems are identical; what could possibly be the cause?

A little bit of detective work uncovered two points:

1. There's an i18n option, turned on by default in a production environment, that suppresses "translation missing" messages. Instead, when looking up a translation of something that exists in the primary language, i18n will automatically and silently use the original string (e.g. English) in place of the missing translation (the relevant setting is shown after this list).

2. This automatic and silent substitution has the unwanted side effect of pre-empting the :raise => true option above. Now our missing-translation detector is broken.
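In a Rails app, the setting in question typically lives in config/environments/production.rb (whether it is enabled out of the box depends on the Rails version):

config.i18n.fallbacks = true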

The symptom of the failure on production was that no translations were reported missing, and much of the content that claimed to be in Spanish was actually in English.

Conclusion

In the end, all is well; as it turned out, we didn't need this functionality.

But I'm bothered by the fact that the default behavior of i18n is to treat exceptions as things to be avoided at all cost. Obviously we don't want uncaught exceptions making their way into our error log (or worse yet, to the browser), but there are valid reasons for throwing and catching exceptions, even on a production server, and this is one of them.

I'm bothered that I've explicitly written into my code that I want an exception, and i18n is pulling rank on me and saying no, you can't have one.

In the end, though, the logical solution is that any translations that need to be managed from the web should ultimately be stored in a database and managed along with the other dynamic content. Translation data that is stored locally, e.g. in YAML files, should be managed and tested on the development system.

From time to time while developing software, especially on a large project, you find a regression some time after it was introduced. Maybe it was introduced by someone else on the team. Maybe it was 50 or 100 commits ago. For example, perhaps Internet Explorer is buggy and one of our stylesheet changes broke rendering there. We just don't know when the regression was introduced, or exactly what code is causing it. In an ideal world we would have had adequate tests in place to reach that coveted '100% test coverage.' But back in the real world we often have to work with software that doesn't have tests in place, or is minimally tested, or just can't be easily tested, like the browser example.

One thing we could do is use git blame, which tells us who added what to which file, and in which commit.

git blame shows you which commits contributed to the final outcome of a file.

This is indeed useful in many situations, but in the course of development we sometimes find that we can observe the behavior changing without knowing what code caused the change. In that case git blame will do us little good.

In more advanced cases, specifically when these two conditions are met:

You can identify the functionality that is broken.

You cannot identify the code change in the past that broke the functionality.

What we could do is blindly check out earlier and earlier commits, one by one, not knowing how far back we need to go, until we find the point at which the regression disappears.
But consider this: in large projects where commits are small you will have many of them, not a handful; you may have hundreds or even thousands of commits to sift through! Your database has to be migrated to match each commit you test, and there may be other setup required, like old services and external dependencies that your software relied on. You want to reduce that O(n) search, and the best we're going to get here is a binary search: O(log n).

If we implemented the search ourselves, we would divide the search space in half, then divide each half again, and repeat until only a single commit remains, following the left or right half at each step depending on whether we can see the functionality change. For example, we go to commit #50 of 100. We run our app and see that the functionality still works, so we know the offending commit comes after #50. We continue to #75 and test again; the functionality is buggy there, so we now know the culprit lies between #50 and #75. We repeat until we find the exact commit that introduced the problem. Continue reading for a better solution.

Fortunately, this is exactly what git bisect is for. It automates the process by performing a binary search on your commit history, checking out specific commits for you to test. To use git bisect, first run git bisect start:

$ git bisect start

Next, specify the most recent commit where the functionality was bad with git bisect bad:

$ git bisect bad

git bisect bad should generally be used on the head of the branch the issue is on

Now mark the earliest commit known to not have the software defect with git bisect good [SHA]

You can also use tags: if you've been tagging your software correctly, you can specify a version number or milestone without needing to look up the SHA.
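For example, with a hypothetical release tag:

$ git bisect good v2.1.0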

git bisect tells us how many more steps it estimates the binary search will need, and which commit is currently checked out for testing. At each step you'll need to issue either git bisect good or git bisect bad to indicate whether the problem is resolved at this commit or still present. After issuing this command for the last time you'll see this:

$ git bisect good
d0fd1a64162b4bcc49f5f8d0e81f73a4da3b736f is the first bad commit

That's all there is to it. We now know exactly which commit caused the issue, and we can examine the code there to find the change that introduced the defect. Oh, and you'll need to run git bisect reset after you're finished to return to where you started.