Playing with Go: Embarrassingly Parallel Scripts

I recently needed to take a list of domains and find which ones point to a specific IP address. For a small list, say fewer than 10, manually running dig in the console would work great, but this list had almost 800 domains, so I needed a script. As a domain lookup is a network request and thus very slow, running the lookups in parallel made sense. I could easily do this in Ruby, my language du jour, but I’ve done this type of thread work before and frankly it can be tedious to set up and fragile, and it still won’t have access to all of my system’s resources due to the GVL. I’ve been keeping an eye on Google’s Go for some time now and decided to see how it handled this problem.

I’ve been intrigued by Go since it was originally announced about three years ago. Here was a compiled, fast, light-weight, low level language with many of the features we take for granted these days, such as garbage collection, while also adding on a very sophisticated concurrency model similar to what’s found in Erlang: very lightweight internal processes managed by the runtime. Sounds like a perfect fit for my requirements.

The code I ended up with is here: https://gist.github.com/4170926. For the sake of comparisons I built a sequential version of the script as well as the parallel version and added timings for running both scripts against the full list of domains.

Running these scripts for yourself is a one-liner: go run [script.go]. The input file domains.txt needs to be a newline-delimited list of domains. I’ll go over the more confusing parts of the two scripts to help with understanding what’s really going on here.

Objects?

Go’s object model is very close to C’s: structs with data and methods that operate on said structs. Both scripts only use a small, two-element struct, DomainMap, to keep track of the IP address found for a given domain. I use the short form to initialize new instances of the DomainMap structure. The order of values maps directly to the order of the defined fields at the top of the scripts.
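To make the short form concrete, here is a small sketch. The field names Domain and IPAddress are assumed for illustration and may not match the gist; the address comes from a range reserved for documentation.

```go
package main

import "fmt"

// DomainMap pairs a domain with the IP address found for it.
// The field names are assumptions made for this example.
type DomainMap struct {
	Domain    string
	IPAddress string
}

func main() {
	// Short form: values are assigned in field declaration order.
	short := DomainMap{"example.com", "192.0.2.1"}

	// Keyed form: the same value, with the fields named explicitly.
	keyed := DomainMap{Domain: "example.com", IPAddress: "192.0.2.1"}

	fmt.Println(short == keyed) // prints "true"
}
```

The short form is compact, but it silently changes meaning if the fields are ever reordered, which is why the keyed form is often preferred outside of small scripts.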

Error handling

Go does error handling by returning multiple values from a function, where the last return value is, by convention, a value of type error. You can discard it by assigning it to the blank identifier _.

rawIpAddresses, _ := net.LookupIP(domain)

Parallelism

The parallel version of the script has some new concepts that need explaining, particularly goroutines, channels, and channel communication.

A goroutine is a very lightweight process, sort of like a Ruby Fiber. Creating one is simple:

go domainLookup(responseChannel, domain)

Go will grab the function call after the go keyword and execute it in parallel. However, given that we’re no longer in the main process, we can’t just return values from the function. We now need a different way to get the return value. This is where channels come in.

responseChannel := make(chan DomainMap)

As Go is a statically typed language, we need to define the type of channel being created. Channels can only accept data of the same type as the channel. Communication through channels is done with the reverse-stabby operator <-, which should be read as “the data on the right side is flowing to the left side”:

// Write into a channel
returnChannel <- DomainMap{domain, ipAddress}

// Read from the channel
domainMap := <-responseChannel
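Putting the pieces together, the overall shape might look like the sketch below. This is not the code from the gist: fakeLookup is a made-up stand-in for the DNS query so the example runs offline, lookupAll is a hypothetical wrapper, and the DomainMap field names are assumed.

```go
package main

import "fmt"

// DomainMap pairs a domain with its IP address (field names assumed).
type DomainMap struct {
	Domain    string
	IPAddress string
}

// fakeLookup is a made-up stand-in for a real DNS query.
func fakeLookup(domain string) string {
	return fmt.Sprintf("192.0.2.%d", len(domain))
}

// domainLookup sends its result on the channel instead of returning it.
func domainLookup(responseChannel chan<- DomainMap, domain string) {
	responseChannel <- DomainMap{domain, fakeLookup(domain)}
}

// lookupAll starts one goroutine per domain, then reads exactly
// one result per domain back off the channel.
func lookupAll(domains []string) []DomainMap {
	responseChannel := make(chan DomainMap)
	for _, domain := range domains {
		go domainLookup(responseChannel, domain)
	}

	results := make([]DomainMap, 0, len(domains))
	for range domains {
		// Each receive blocks until some goroutine has sent a value.
		results = append(results, <-responseChannel)
	}
	return results
}

func main() {
	// Results arrive in whatever order the goroutines finish.
	for _, m := range lookupAll([]string{"a.test", "bb.test", "ccc.test"}) {
		fmt.Println(m.Domain, m.IPAddress)
	}
}
```

Counting receives against the number of goroutines started is what lets the main function know when every lookup has reported back, without any explicit locking.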

And that’s all the special syntax. The only real difference between the parallel and sequential scripts is the map-reduce-esque setup to wait for all the goroutines to finish. I didn’t need to worry about thread pooling, system capabilities, or thread safety. Go makes it so easy to write truly parallel code that there’s no excuse not to anymore. I was able to run almost 800 goroutines (one per domain) all throwing out DNS queries and coming back in less than 10 seconds, in a script that doesn’t even look like it’s running in parallel.

Now that Go 1.0 stable is out, it’s a great time to get familiar with this language. I highly recommend checking out the Tour of Go for basic introductions into every major feature of the language, and there’s a ton of documentation on the main website golang.org. For the little bit of time I’ve played with Go now, I see a very bright future for this language.

@Carlos: As this is a one-off script, re-running the script is good enough error handling for me. This would definitely be far different if it was a module run inside of a bigger application.

Marcalc, December 03, 2012

I know you wanted to use GO for this article, but you know you could have considered JRuby for this work, right?

jnml, December 03, 2012

Please use gofmt whenever publishing Go source code.

kikito, December 03, 2012

Thanks for your blog post, it was very instructive.

One question: what does this mean?

domainMapping = append(domainMapping, on that line.

kikito, December 03, 2012

It seems that the code was mangled by an HTML strip-tags filter, but never mind, I think I figured it out. When you have a LEFTARROW channel in a param, you are just using whatever that channel returns next as the param. I assume that this is a blocking call.

Append is a built-in function to work on the slice data type, and it always returns the modified slice because this call might resize the one you passed in or a new slice might be allocated depending on the capacity of said slice.

Also yes [left arrow] is a blocking call.

Job van der Zwan, December 03, 2012

Another nitpick: it’s not really parallel – it’s concurrent. For now, Go is single-core unless you explicitly tell it to use multiple cores. As far as I can tell, your code is running single-core. Which actually makes this a nice example of how concurrency can be faster regardless!

Here’s a list of domains I found https://raw.github.com/tarr11/Webmail-Domains/master/domains.txt

john, December 03, 2012

I tried this under Windows 7 and both scripts run the same… I have a Core 2 Duo. I also: set GOMAXPROCS=2

thx!

Nico, December 04, 2012

Go’s approach to parallelism reminds me of something… ah! Unix and its shells. It’s very easy to parallelize shell scripts too… And go channels look remarkably like pipes. Of course, the shells are kinda sucky and outdated, so yes, Go is better.

Anthony, December 04, 2012

Echoing a previous comment – If you were itching to give Go a try, that’s one thing, but saying that you couldn’t do it in Ruby because of the GVL is fallacious and misleading.
You could easily have used JRuby and gotten an industrial-strength Ruby implementation without a GVL.

@Anthony: I never said I couldn’t do it in Ruby. What I said was that Ruby’s GVL ensures that you won’t get full use of your system when trying to build concurrent systems. Yes you can switch to JRuby but then you’re not using Ruby, you’re using JRuby, and I wanted to branch out and try something completely outside of the Ruby ecosystem.

@Job van der Zwan: Right, thanks for pointing that out! Had only glanced at some of that previously, I’ll be sure to remember that setting in the future.

@Jason: In MRI 1.9, threads blocked by IO will run in parallel. You don’t need JRuby to run a bunch of network requests on all your cores. I understand you just wanted to use Go, but please understand that the GVL doesn’t necessarily block ruby threads from running in parallel.

I wrote an example showing Ruby 1.9.3 on a Macbook Pro with a Core i7 resolving 800 random-ish hostnames: https://gist.github.com/ec353d84522531fe2bfa

As you can see, it takes about 16 seconds, but the point is that the requests run in parallel on MRI with nothing but Thread.new.

Don’t get me wrong, I think it’s great that you found a simple but practical example to introduce Go’s concurrency primitives, and I appreciate the time you took to write up this blog post. Kudos! I just find that there is a lot of confusion about concurrency when it comes to MRI’s thread implementation, and I think it’s a shame that Rubyists don’t realize they can parallelize IO-bound tasks.

@benolee: If anything I shouldn’t have mentioned Ruby’s GVL at all, as that ended up distracting from the point I was trying to make, which was to show my playing with concurrency in Go. Doing anything IO bound is of course a very easily parallelizable task for any language, which puts us back in the realm of how hard it is to put together a good example. I never meant to say “Ruby sucks. What does this better?”, but “I’ve done this in Ruby and I want to try another language now!” and talking about my experiments.