beanstalkd, PHP workers, fires

99designs pushes four million background jobs through beanstalkd each day.
beanstalkd is a fantastic job queue which we’ve used for more than five
years, via the pheanstalk client which I wrote in 2008.

Each beanstalkd job has a TTR: a timer which counts down during job processing. If TTR seconds elapse before the worker finishes the job, beanstalkd assumes the worker is dead and releases the job. Another one of our workers takes the job, despite the original worker still churning away. Each iteration of this results in greater load and less chance of this or any other job finishing. Eventually all the worker processes are stuck, and everything catches fire.

That’s what happened when we began pushing ImageMagick and GhostScript jobs
to rasterize graphics. Some pathological EPS files took longer than the 600
second TTR, causing worker resource starvation.

Increasing the TTR would mitigate the issue, but these EPS files seem subject
to the halting problem. That leaves workers vulnerable to slow job
saturation.

Interrupting the image operation when the job hits its TTR would be a better
solution. But workers need concurrency to watch the TTR during the job. PHP
doesn’t do threads, except via an extension that I’m disinclined
to use. Using fork() would introduce IPC / signal handling
complexity, and prevent processes sharing the beanstalkd connection. PHP feels
like the wrong language to attack the problem.

cmdstalk

Lachlan and I decided we could kill N birds with a single stone. One:
solve the queue fires. Two: move another piece of our production
infrastructure to Go. Three: provide a beanstalkd layer which our PHP, Ruby
and Go apps could all use.

cmdstalk set out to harness the beanstalkd semantics we like on one end, and
talk standard unix processes on the other. This allows us to write workers in
any language. Here’s the basic model:

Connect to a beanstalkd server, watch one
or more tubes.

Pipe each job payload to the command specified by the cmdstalk --cmd=… argument.

If the subprocess exits 0, delete the job; done.

If the subprocess exits non-zero, release the job for retry (with backoff).

If TTR elapses, kill the subprocess and bury the bad job.

Anything that can read stdin and exit(int) can be a cmdstalk worker — no
need for beanstalkd knowledge.
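
For example, a minimal worker in Python might look like this, with process() standing in for your real job logic:

#!/usr/bin/env python
import sys

def process(payload):
    # stand-in for real work, e.g. rasterizing an image
    pass

if __name__ == '__main__':
    payload = sys.stdin.read()  # cmdstalk pipes the job payload to stdin
    try:
        process(payload)
    except Exception:
        sys.exit(1)  # non-zero exit: cmdstalk releases the job for retry
    sys.exit(0)      # zero exit: cmdstalk deletes the job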

Go

Go has become the go-to language at 99designs for
infrastructure components. My only previous Go experience comes from
writing go6502, an 8-bit computer emulator. Fascinating, but
different to writing concurrent network applications. Despite that, building
cmdstalk with Go was a pleasure.

Starting from the cmdstalk entrypoint you’ll see broker and
cli packages loaded. cli/options.go demonstrates Go’s flag library for
argument parsing. broker_dispatcher.go
coordinates broker concurrency across tubes, and broker.go
is where the action happens. Broker.Run() is a clear candidate for
refactoring, but when workers are burning, software’s better shipped than
perfect.

Commit ade6f6b0 introduces a simple -all flag to watch all
tubes at start-up. 431ac5fc evolves it to poll for new tubes as
they’re created. The latter illustrates how well timers and concurrency come
together in Go. Together they show that it’s simple to add functionality that
would be complex in other languages.

Tests live alongside the code they’re testing, such as
broker_test.go alongside
broker.go. They’re regular Go code using if to make
assertions, but richer assertion libraries do exist.

Conclusion

cmdstalk applies a unix-process abstraction layer to beanstalkd job processing.
Like any abstraction it needs to make itself worthwhile.

If you’re running Rails, you might want to look at Sidekiq or
Resque, or maybe even delayed_job. If you’re 100% python, you
could wire together some solid libraries for job processing.

But if you need to process background jobs using several languages, some of
them poorly suited to long-running daemons and concurrency, cmdstalk may be for
you. Give it a try; feedback and pull requests are welcome.

In this series of guest blog posts, 99designs intern Daniel Williams takes us through how he has applied his knowledge of Machine Learning to the challenge of classifying Swiftly tasks based on what the customer requests.

The challenge

Swiftly is an online service from 99designs that lets customers get small graphic design jobs done quickly and affordably. It’s powered by a global network of professional designers who tackle things like business card updates and photo retouching in 30 minutes or less – an amazing turnaround time for a service with real people in the loop!

With time vital to the service's value, any delay in allocating a task to a designer with experience in its specific requirements can hurt the customer's experience.

Our ultimate aim is complete and accurate automation of job-to-designer matching, with the customer simply saying in their own terms what they need. To that end, we decided to apply machine learning to further develop Swiftly's "Intelligent Matching System".

This is part two of a three-part blog series. In part one we tried to determine the types of tasks. In this post, we use machine learning to classify tasks into these task categories. A future post will discuss using our predictions for task allocation.

Categories to predict

To set up a machine learning problem, we need to first decide on what we want the answers to be.
After the last post’s experimentation, I decided to split the classification into two parts: what type of document is to be edited or created, and what type of work is needed on the document.

This gives us 7 document types:

Logo

Business Card

Icon

Template (ppt / pdf / word etc)

Header / Banner / Ad / Poster

Social Media

Other Image

and 9 types of graphic design work appropriate for small tasks:

Vectorisation

Transparency

Holidays edit

Creative Update

Resize

Reformat

General Edit

Colour Change

Text Change

For example, one task might be Vectorisation on a Logo, another might be Text Change on a Business Card. In total, 63 different combinations of document and work type exist. This is what we’re trying to predict.

Obtaining training data

In my last post, I used unsupervised techniques that don't need training data. Now that we have a specific outcome we'd like to predict, supervised methods are more appropriate. They use training data to find patterns associated with each category, patterns that might be hard for humans to spot. For us, that training data will be a bunch of historical tasks and the correct categories for them.

However, obtaining good training data is a large problem in itself, especially given how many combinations of categories there are!

Mechanical Turk

Knowing how much work was involved, my first instinct was to outsource it to Amazon’s Mechanical Turk service. Mechanical Turk is named after an elaborate 18th century hoax that was exhibited across Europe, in which an automaton could play a strong game of chess against a human opponent. It was a hoax because it was not an automaton at all: there was a human chess player concealed inside the machine, secretly operating it.

Amazon calls its service Artificial Artificial Intelligence, and it is a form of ‘fake’ machine learning. We use software to submit tasks for classification, but real people all over the world get paid a little money to do the categorising for us.

Manual Classification

Unfortunately, the results I achieved from Mechanical Turk were poor. Even humans incorrectly classified many tasks, and this data, if fed into my machine learning classifier, would lead it to poor conclusions and low accuracy. The Turkers may have lacked some specialised knowledge about graphic design, or I may not have set up the Mechanical Turk task sufficiently well. (I wish I had read this post before diving into Mechanical Turk!)

Ultimately, having an accurate training set is perhaps the most important part of developing a good classifier. I rolled up my sleeves, and manually inspected and classified approximately 1200 Swiftly design briefs myself. This was slow and monotonous, but it meant that I knew I had an excellent quality training set.

Pre-processing Pipeline

Our classifier doesn't accept raw text; we must first turn design briefs into features it can make decisions on. Human language is complicated, so there are many steps to go from text to features. Any good natural language system has such a pipeline. In ours, we:

Tokenise: split the text up into individual ‘words’

Remove punctuation and casing

Remove stop words (common words with no predictive power such as ‘a’, ‘the’)

Stem: reduce each word to its root form

Lemmatise: replace variable tokens such as URLs with shared placeholders

Extract bigrams: add pairs of adjacent words as extra features

We covered the first four steps in the last post; let's go over steps 5 and 6 here.

Lemmatisation

Lemmatisation is similar to stemming. It’s the process of grouping related words together by replacing several variations with a common shared symbol. For example, Swiftly task descriptions often contain URLs. Lemmatisation of URLs would mean replacing every URL with a common placeholder (for example “$URL”). So the following brief:

On this business card, please change “www.coolguynumber1.com” to “www.greatestdude.org”

becomes:

On this business card, please change “$URL” to “$URL”

We do this because pre-processing generates a list of all the words that appear in the training dataset, and words that appear only once are removed because they add noise. The number of distinct words in the data set is large, but many occur only once or twice, and nearly every URL we see in a brief is unique. Our machine learner can only say something useful about words which are shared between different tasks, so without lemmatisation all these unique words and URLs are wasted, and we lose any information gained from the presence of URLs in a brief. With lemmatisation, we instead get the symbol “$URL” many times. If a URL in a task description turns out to be a discriminating feature, this should increase classification accuracy. For example, lemmatisation transforms this longer brief:

Please change the email on this business card from coolguy99@99designs.com to koolguy99@99designs.com. Can you also include a link to my website www.coolestguyuknow.net on the bottom? Please also change all the fonts to #CC3399 and the circle to #4C3F99. I want a few different business card sizes, namely: 400 x 400, 30 x 45 and 5600 by 3320. Thanks!

to:

Please change the email on this business card from $EMAIL to $EMAIL. Can you also include a link to my website $URL on the bottom? Please also change all the fonts to $CHEX and the circle to $CHEX. I want a few different business card sizes, namely: $DIM, $DIM and $DIM. Thanks!

Now URLs, email addresses, dimensions and so on can all take many different forms. The easiest way to match as many as possible is with regular expressions. I used a set of patterns for Python's re module to perform my lemmatisation; you might find them useful too.
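
A sketch of the idea, with illustrative patterns rather than the exact ones we used:

import re

LEMMAS = [
    (re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+"), "$EMAIL"),
    (re.compile(r"(https?://|www\.)\S+", re.I), "$URL"),
    (re.compile(r"#[0-9a-f]{3,6}\b", re.I), "$CHEX"),
    (re.compile(r"\b\d+\s*(x|by)\s*\d+\b", re.I), "$DIM"),
]

def lemmatise(text):
    # replace each matching span with its shared placeholder symbol
    for pattern, symbol in LEMMAS:
        text = pattern.sub(symbol, text)
    return text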

Bigrams

Previously I had worked with each word in the text individually (“unigrams”), but this often means words have no context. So, for example, “business card” was broken into “business” and “card”, and the importance of those words appearing together was lost. Bigrams are simply pairs of words that appear next to each other. So, if we include both unigrams and bigrams, the text “business card” would provide us the features “business”, “card” and “business card”. This captures more of the context of certain phrases. In our data, the top bigrams after stemming were:

bigram                 frequency
would like             72
logo add               49
take exist             47
fun creativ            47
add fun                44
exist logo             44
busi card              33
transpar background    28
$URL $URL              24
creativ festiv         22
festiv element         22
bat pumpkin            21
spooki element         21
pumpkin skeleton       21
busi name              20
name logo              20
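
Extracting bigrams from a token list is nearly a one-liner; here's a minimal sketch:

def ngrams(tokens, n=2):
    # slide a window of size n across the token list
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

tokens = ["busi", "card", "need", "updat"]
features = tokens + ngrams(tokens)
# ['busi', 'card', 'need', 'updat', 'busi card', 'card need', 'need updat']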

The pipeline in action

Let’s do a worked example using the sentence below:

Please change the email on this business card from coolguy99@gmail.com to koolguy99@gmail.com. Thanks!

Our pipeline first tokenises the sentence into words; each word is then lowercased, stripped of punctuation, dropped if it is a stop word, stemmed, and finally lemmatised.

Vectorisation

As discussed in the last post, we need to convert text into a numerical format. I used a simple model known as the bag-of-words vector space model. This model represents each document as a vector: a count of how many times each different word occurred in it. The vector will have n dimensions, where n is the total number of terms in the whole collection of documents. In the training dataset, there are 9186 tokens. Each brief's vector is sparse: the vast majority of terms will have a count of 0.

Once the data set has been converted into vectors, it can be used to train a supervised learning algorithm.

Supervised Learning: Training the Classifier

Now that our data's in the desired format, we can finally develop a system that learns to tell the difference between the various categories. This is called building a classifier model. Once the model has been built, new briefs can be fed into it and it will predict their category (called their label).

What we’ve discussed so far is getting labels and extracting features using our pipeline. But what algorithm should we use?

Multinomial Naive Bayes

I chose to use the Multinomial Naive Bayes (“MNB”) classifier for this task. The Naive Bayes Wikipedia page does a good job of explaining the mathematics behind the classifier in detail. Suffice to say that it is simple, computationally efficient and has been shown to work surprisingly well in the field of document classification.

A (simplified) worked Example

A simplified way of thinking about how the algorithm works in the context of document classification is:

For each token in the total training dataset, what is the probability of that token being associated with each class?

For a particular brief, add up those probabilities across its tokens, per class

Pick the class with the highest probability.

So, say we have the following probabilities (after Laplace smoothing and normalisation) for the tokens from our earlier example occurring in each category type:

Token Name   Other Image   Header/Banner/Ad/Poster/Flier   Logo      Business Card   Template (ppt/pdf/word etc)   Icon      Social Media
card         0.00019       0.00322                         0.00257   0.0155          0.00021                       0.00055   9e-05
busi         0.00038       0.00154                         0.00325   0.00915         0.00021                       0.00048   0.00037
busi card    6e-05         0.00055                         0.00174   0.00904         0.00021                       0.00048   9e-05
chang        0.00275       0.00445                         0.00416   0.00525         0.00064                       0.00159   0.00028
file         0.00596       0.00395                         0.00649   0.00525         0.00245                       0.00408   0.00241
logo         0.00096       0.0054                          0.0266    0.00513         0.00075                       0.00512   0.00408
need         0.00832       0.00672                         0.00717   0.00478         0.00139                       0.00623   0.00232
attach       0.00467       0.00622                         0.00364   0.00414         0.0017                        0.00484   0.0012
updat        0.00013       0.00104                         0.00079   0.00391         0.00032                       0.00042   9e-05
$EMAIL       0.00019       0.00073                         0.0002    0.00373         0.00032                       0.00014   0.00028

Given the brief:

update the logo on my business card

We would match up each token with its probabilities in the table above, giving us the following table. Adding up each column then gives us a score for that class.

Token name   Other Image   Header/Banner/Ad/Poster/Flier   Logo      Business Card   Template (ppt/pdf/word etc)   Icon      Social Media
card         0.00019       0.00322                         0.00257   0.0155          0.00021                       0.00055   9e-05
busi         0.00038       0.00154                         0.00325   0.00915         0.00021                       0.00048   0.00037
busi card    6e-05         0.00055                         0.00174   0.00904         0.00021                       0.00048   9e-05
logo         0.00096       0.0054                          0.0266    0.00513         0.00075                       0.00512   0.00408
updat        0.00013       0.00104                         0.00079   0.00391         0.00032                       0.00042   9e-05
sum:         0.00172       0.0118                          0.035     0.0427          0.0017                        0.00705   0.00472

Business card has the highest score, and so that is our prediction. Simple! The mathematics is a little more sophisticated than this, but the intuition behind it is the same.
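
To make that concrete, here's a sketch of the simplified scoring in Python, seeded with just the Logo and Business Card columns from the table above:

# per-token probabilities, trimmed to two classes for brevity
token_probs = {
    "card":      {"Logo": 0.00257, "Business Card": 0.0155},
    "busi":      {"Logo": 0.00325, "Business Card": 0.00915},
    "busi card": {"Logo": 0.00174, "Business Card": 0.00904},
    "logo":      {"Logo": 0.0266,  "Business Card": 0.00513},
    "updat":     {"Logo": 0.00079, "Business Card": 0.00391},
}

def classify(tokens):
    scores = {}
    for tok in tokens:
        for cls, p in token_probs.get(tok, {}).items():
            scores[cls] = scores.get(cls, 0.0) + p
    return max(scores, key=scores.get)

classify(["updat", "logo", "busi", "card", "busi card"])  # -> 'Business Card'

The real classifier multiplies the probabilities (in practice, sums their logarithms) rather than adding them, but the shape of the computation is the same.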

Classifier Structure

Now, we have two types of classes to predict: document type and task type. I decided to build the classifier structure to reflect this. A top-level classifier predicts the document type (logo, business card, etc), trained using the full dataset. Then a separate specialised classifier for each document type predicts the task category. So we have a classifier just for working out the task type of business card cases, trained only on those cases.
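
As a sketch of this structure in scikit-learn terms (my choice of library for illustration; assume X is the bag-of-words matrix with parallel arrays of document and task labels):

import numpy as np
from sklearn.naive_bayes import MultinomialNB

# first tier: document type, trained on everything
top = MultinomialNB().fit(X, doc_labels)

# second tier: one specialised task classifier per document type
subs = {
    doc: MultinomialNB().fit(X[doc_labels == doc], task_labels[doc_labels == doc])
    for doc in np.unique(doc_labels)
}

def predict(brief_vector):
    doc = top.predict(brief_vector)[0]
    task = subs[doc].predict(brief_vector)[0]
    return doc, task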

The training and classification processes are summarised in two diagrams: “Classifier Training” and “Classification”.

Results

Are we getting good predictions?

To see whether our algorithm is, in fact, learning with experience, we can plot a learning curve. This tells us both how the classifier is doing, and how helpful more data would be. To test this, I plotted the 10-fold cross-validated accuracy of the top-layer classifier as the training set size increased.
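
As a sketch, here's one way to produce such a curve with scikit-learn, assuming a bag-of-words matrix X and document-type labels y:

import numpy as np
import matplotlib.pyplot as plt
from sklearn.model_selection import learning_curve
from sklearn.naive_bayes import MultinomialNB

sizes, train_scores, cv_scores = learning_curve(
    MultinomialNB(), X, y, cv=10, train_sizes=np.linspace(0.1, 1.0, 10))

plt.plot(sizes, cv_scores.mean(axis=1))
plt.xlabel("training set size")
plt.ylabel("10-fold cross-validated accuracy")
plt.show()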

It looks like our machine is learning! The more data it sees, the better it gets at picking out the correct category. It looks as though accuracy may flatten off at about 80%. This suggests that to do better, we’d need to find new features instead of just collecting more cases. The sub-classifiers, as a result of the classifier structure, have less data to work with in the training set. However, they appeared to follow a similar learning curve.

Accuracy of various implementations

Over the course of my experiments, I tested the accuracy of a variety of implementations and algorithms. For those interested in the details, accuracy figures are below.

Classifier Type / Algorithm Type    MNB       NB        Baseline

Specialised Sub-Classifier:
  Top Level Classifier              78.62 %   60.17 %   36.33 %
  Sub-Classifier                    69.46 %   61.54 %   32.97 %
  Combined accuracy                 54.61 %   37.03 %   11.97 %

Generalised Sub-Classifier:
  Top Level Classifier              78.62 %   60.17 %   36.33 %
  Sub-Classifier                    59.97 %   50.95 %   24.13 %
  Combined accuracy                 47.15 %   30.66 %   8.77 %

Single Classifier:
  Accuracy                          45.58 %   39.12 %   11.43 %

The “Specialised Sub-Classifier” is the implementation we discussed above, whereas the “Generalised Sub-Classifier” used a single classifier to look at task type, rather than one per document type. The “Single Classifier” tries to hit both targets at once, classifying against the full set of 63 category combinations. I also compared Multinomial Naive Bayes against plain Naive Bayes (NB) and a simple Zero-R baseline.

Wrapping up

The two-tier classifier approach worked the best, picking the document type correctly nearly 80% of the time, but getting both document and task type right only 55% of the time. The Multinomial Naive Bayes also did better than Naive Bayes on this task, as expected.

Next Time

Next time, I will discuss how this system can be applied to assist with the next stage of the customer-to-designer matching process. How do we figure out which categories a particular designer may be good at? And how do we make sure that designer gets those tasks?

About Daniel

Daniel Williams is a Bachelor of Science (Computing and Software Science) student at the University of Melbourne and Research Assistant at the Centre for Neural Engineering where he applies Machine Learning techniques to the search for genetic indicators of Schizophrenia. He also serves as a tutor at the Department of Computing and Information Systems. Daniel was one of four students selected to take part in the inaugural round of Tin Alley Beta summer internships and he now works part-time at 99designs. Daniel is an avid eurogamer, follower of “the cricket”, and hearty enjoyer of the pub.

by John Barton

Here at 99designs we're what you'd call a polyglot shop – we've got a mix of PHP, Ruby, Python, and Go in production. When we say production, we mean at serious scale. Our mission is to connect the world with great graphic designers wherever they are, something which we do quite a bit of.

Right now we’re on a hunt for a developer who can Help Us Out™. Usually we advertise for a generalist “web developer” and then find the right place for them internally based on their strengths. This time we’re trying to hire a very specific skill set for a very specific project. The skills are Ruby and Rails, and the project is building out our new payments service.

Company wide we’re transitioning to having small, decentralised teams with their own product lines and the attendant SOA/Platform to support that goal. Last year we had great success with creating our single sign-on system in Go, and this year we’re rounding out the platform with a shared payments system in Rails*.

This new service will enable us to spin up new product lines or move into new international markets quickly. Between the iterative approach we’re taking to replace our old payments system and the UX for both the customers using the service and the developers integrating it there are some exciting and interesting problems to solve on this project.

The existing team on the project are very strong developers with good knowledge of the problem space but not a lot of Rails experience. We need a mid to senior developer to come in and help “set the tone” of the codebase. That role had been filled within the team by me (John Barton, internet famous as “angry webscale ruby guy”), but I've since been promoted to manage the engineering team as a whole, and between all the meetings and spreadsheets it's hard to keep up the pace of contribution that this project deserves.

You’ll need to be the diesel engine of the team: churn through the backlog turning features into idiomatic and reliable Rails code at a steady cadence. There are opportunities to coach within the team, but even just creating a sizeable body of code to be an example of “this is how we do it” (cue Montell Jordan playing https://www.youtube.com/watch?v=0hiUuL5uTKc) will keep this project on track.

The quality of the codebase after 3 months of progress is high. We don’t believe in magic make-believe numbers here, but right now we’re sitting on a code climate GPA of 4.0. If you’re a fan of Sandi Metz’s Practical Object Oriented Design in Ruby or Avdi Grimm’s Objects on Rails you will feel right at home in this codebase.

If this is something you're interested in and think you can help us out with, check out the job ad.

*You may be wondering “why not Go?” for this system. The short answer is that there's enough complexity in the business rules that the expressiveness of Ruby is very useful; and since this is a financial project, moving numbers around in a database correctly is very important, and ActiveRecord is more mature than any of the ORMs available in Go right now. I'm happy to elaborate on our line of thinking during your interview ;-)

At 99designs we heavily (ab)use Varnish to make our app super fast, but also to
do common, simple tasks without having to invoke our heavy-by-contrast PHP
stack. As a result, our Varnish config is pretty involved, containing more than
1000 lines of VCL, and a non-trivial amount of embedded C.

When we started seeing regular segfaults, it was a pretty safe assumption that one of
us had goofed writing C code. So how do you track down a transient segfault in a system like Varnish? Join us down the rabbit hole…

Get a core dump

The first step is to modify your production environment to provide you with
useful core dumps. There are a few steps in this:

First of all, configure the kernel to provide core dumps by setting a few sysctls: create a place to store cores (on AWS's ephemeral storage if, like us, you're on EC2), and tell the kernel to write core files out there.

With this done, and no known way to trigger the bug, play the waiting game.

When varnish explodes, it's show time. Copy the core file, along with the shared object that varnish emits from compiling the VCL (located in /var/lib/varnish/$HOSTNAME), over to a development instance and let the debugging begin.

Locate the crash point

If you have access to the excellent LLDB from the LLVM project, use that. In our case, getting it to
work on Ubuntu 12.04 involves upgrading half the system, resulting in an
environment too dissimilar to production.

If you spend a lot of time in a debugger, you'll probably want to use a helper like fG!'s gdbinit or voltron to make your life easier. I use voltron, but because of some of the clumsiness in gdb's API, immediately ran into some bugs.

Finally, debugging environment working, it’s time to dig into the crash. Your situation is going to be different to ours, but here’s how we went about debugging a problem like this recently:

Debugging the core dump with voltron

As you can see in the [code] pane, the faulting instruction is mov
0x0(%rbp),%r14, trying to load the value pointed to by RBP into r14.
Looking in the register view we see that RBP is NULL.

Inspecting the source, we see that the faulting routine is inlined, and that the compiler has hijacked RBP (the base pointer for the current stack frame) to use as argument storage for the inline routine.

We now know that the fault is caused by a mapelm struct with a bits member
set to zero; but why are we getting passed this broken struct with garbage in
it?

Digging in deeper

Since this function is declared inline, it's actually folded into the calling frame. The only reason it appears in the backtrace at all is because the callsite is present in the DWARF debugging data.

We can poke at the value by inferring its location from the upstream assembly, but it's easier to jump into the next upstream frame and inspect that.

The code is trying to generate a pointer to this arena run structure, using the
number of bits in the mapelm struct, AND against the inverse pagesize_mask to
locate the start of a page. Because bits is zero, this is the start of the
zero page; a NULL pointer.

This is enough to see how it’s crashing, but doesn’t give us much insight for why. Let’s go digging.

Looking back at the code snippet, we see an assertion that the arena_run_t structure's magic member is correct, so with that known we can go looking for other structures in memory. A quick grep turns up:

./lib/libjemalloc/malloc.c:#define ARENA_RUN_MAGIC 0x384adf93

pagesize_mask is just the page size - 1, meaning that any address bitwise ANDed against the inverse of the pagesize_mask will give you the address at the beginning of that page.

We can therefore just search every writable page in memory for the magic number
at the correct offset.
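
As a naive sketch of the idea in Python (a real core file is an ELF image whose segments need mapping first, so treat this as illustration only):

import struct

ARENA_RUN_MAGIC = 0x384adf93
PAGE_SIZE = 4096

# scan a flat memory dump, checking the first 4 bytes of each page
with open("core.dump", "rb") as f:
    offset = 0
    while True:
        page = f.read(PAGE_SIZE)
        if not page:
            break
        if len(page) >= 4 and struct.unpack("<I", page[:4])[0] == ARENA_RUN_MAGIC:
            print("candidate arena run at offset 0x%x" % offset)
        offset += PAGE_SIZE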

The magic number and magic member of the struct (conveniently located as the first 4 bytes of each page) only exist if we've got a debug build.

Aside: can we abuse LD_PRELOAD for profit?

At this point all signs point to either a double free in varnish’s thread pool implementation, leading to an empty bucket (bits == 0), or a bug in its memory allocation library jemalloc.

In theory, it should be pretty easy to rule out jemalloc, by swapping in another malloc library implementation. We could do that by putting, say, tcmalloc in front of its symbol resolution path using LD_PRELOAD:

We’ll add:

export LD_PRELOAD=/usr/lib/libtcmalloc_minimal.so.0

to /etc/varnish/default and bounce varnish. Then move all the old core files out of the way, and wait (and benchmark!).

However, there’s a flaw in our plan. Older versions of varnish
(remember that we’re on an LTS distribution of Ubuntu) vendor in a copy of
jemalloc and statically link it, meaning that the symbols free and malloc
are resolved at compile time, not runtime. This means no easy preload hacks for us.

Rebuilding Varnish

The easy solution won’t work, so let’s do the awkward one: rebuild varnish!

apt-get source varnish

Grab a copy of the varnish source and link it against tcmalloc. Before building, I deleted lib/libjemalloc and used grep to remove every reference to jemalloc from the codebase (which was basically just some changes to the configure script and makefiles), then added -ltcmalloc_minimal to CFLAGS. As an aside, the ubuntu packages for tcmalloc ship /usr/lib/libtcmalloc_minimal.so.0 but not /usr/lib/libtcmalloc_minimal.so, which means the linker can't find them; I had to manually create a symlink.

With this new varnish in production, we haven't yet seen the same crash, so it appears that it was a bug in jemalloc, probably a nasty interaction between libpthread and libjemalloc (the crash was consistently inside thread initialization).

Try it yourself?

Let’s hope not. But if you do a lot of Varnish hacking with custom extensions, occasional C bugs are to be expected. This post walked you through a tricky Varnish bug, giving you an idea of the tools and tricks around debugging similar hairy segfaults.

If you’re messing around with voltron, you might find my voltron config and the tmux script I use to setup my environment a useful starting point.

In this series of guest blog posts, 99designs intern Daniel Williams takes us through how he has applied his knowledge of Machine Learning to the problem of classifying Swiftly tasks.

Introduction

Swiftly is an online service from 99designs that lets customers get small graphic design jobs done quickly and affordably. It’s powered by a global network of professional designers who tackle things like business card updates and photo retouching in 30 minutes or less – an amazing turnaround time for a service with real people in the loop!

Given that we have a pool of designers waiting for customer work, how can we best allocate them tasks? Currently we take a naive but fair approach: assign each new task to the designer that has been waiting in the queue the longest. But there’s room for improvement: designers excel at different types of tasks, so ideally we’d match tasks to designers based on expertise. To do this we need to be able to categorise tasks by the skills they require.

In today’s approach, we’ll try to solve the problem with machine learning. The first step is to find a way to automatically categorise a design brief, with categories forming our “areas of expertise”. The next will be figuring out what categories a particular designer is good at. If we can build solid methods for both these two steps, we can begin matching designers to tasks.

In this post, I’ll introduce the problem and walk through some attempts at applying unsupervised techniques for discovering task categories. Follow along, and you may recognise a similar situation of your own that you can apply these methods to.

Swiftly tasks

Swiftly tasks are meant to be quick to fire off and highly flexible. The customer fills in a short text box saying what they want done, uploads an image or two, and then waits for the result. This type of description, plain text and raw images, is highly unstructured. Since image recognition and indexing is its own hard problem, we’ll skip the images for now and focus on the text.

Here are a couple of examples:

Task A

Remove the man’s glasses.

Make the man’s face MORE HANDSOME.

Task B

In my logo, there is a “virtual” flight path of an airplane. I have had comments that the virtual flight path goes into the middle of the Pacific Ocean for no reason - not a logical graphic. I want you to “straighten” out the flight path - as shown on the Blue lines in the attached PDF titled “Modified_Logo.PDF.” I still want the flight path lines to be in white, with black triangles separating the segments. I just want the segments to be straighter and not go over the ocean as in the original. Please contact me for any clarification. I am uploading the EPS and AI files as well to make the change. Thank you!

How might a human classify these tasks? I would probably classify the first as “image manipulation” and the second as “logo refresh,” although the second could just as easily be “image manipulation” as well. Already you can see that classifying these sorts of tasks into concrete categories is perhaps going to be more art than science.

Figuring out the categories

The first major problem is deciding on a sensible set of categories. This has turned out to be more difficult than I first imagined. Customers use Swiftly for a wide range of tasks. Plus, there’s quite a bit of overlap — one Swiftly task is sometimes a combination of multiple small tasks. My initial approach, just to get a feel for the data, was to eyeball 100 task briefs and attempt to invent categories and classify them manually. The result of this process:

Category                   Number of Tasks
Logo Refresh (Holidays)    34
Logo Refresh               11
Copy Change                11
Vectorise                  13
Resize/Reformat            17
Transparency               1
Image Manipulation         10
Too hard to classify       3

A large number of the instances were hard to classify, even for a human! I was not 100% happy with the categories that I came up with, with many tasks not fitting comfortably in the buckets. I decided to apply some unsupervised machine learning techniques in an attempt to cluster design briefs into logical groups. Can a machine do better?

Unsupervised clustering

I explored software called gensim, an unsupervised natural language processing and topic modelling library for Python. Gensim comes equipped with various powerful topic modeling algorithms, which are capable of extracting a pre-specified number of topics and associating words with those topics. It also helps with converting a corpus of documents into various formats (e.g. vector space model). The main algorithm that I made use of is called Latent Dirichlet Allocation. The first step is converting the text corpus into a model that allows for the application of mathematical operations.

The vector space model

To apply mathematical algorithms to natural language, we need to convert language into a mathematical format. I used a simple model known as the bag-of-words vector space model. This model represents each document as a vector, where each dimension of the vector corresponds to a different word. The value of a word in a particular document is just the number of times it appeared in that particular document. The vector will have n dimensions, where n is the total number of terms in the whole collection of documents. Let's try an example.

Say we have the following collection of documents:

The monster looked like a very large bird.

The large bird laid very large eggs.

The monster’s name was “eggs.”

After finding all the unique words (“the,” “monster”, etc.) and assigning them an index in the vector, we can count those words in each document to turn each document into a word frequency vector:

(1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0)

(1, 0, 0, 0, 0, 1, 2, 1, 1, 1, 0, 0)

(1, 1, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1)
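
A short sketch that reproduces those vectors, with the tokens already normalised (punctuation dropped and “monster's” reduced to “monster”):

docs = [
    ["the", "monster", "looked", "like", "a", "very", "large", "bird"],
    ["the", "large", "bird", "laid", "very", "large", "eggs"],
    ["the", "monster", "name", "was", "eggs"],
]

# assign each unique word an index in first-seen order
vocab = []
for doc in docs:
    for word in doc:
        if word not in vocab:
            vocab.append(word)

vectors = [[doc.count(word) for word in vocab] for doc in docs]
# vectors[1] -> [1, 0, 0, 0, 0, 1, 2, 1, 1, 1, 0, 0]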

Corpus pre-processing

If you just split your text into words on whitespace and apply this naively, the results can be messy. On the one hand text contains punctuation we want to ignore. On the other, this is going to work best when we have lots of words in common between the documents. Do we really want to treat “Egg”, “egg” and “eggs” as different words? To get the best results, you deal with these kinds of problems in a pre-processing step.

Stemming is the process where words are reduced to their “stem” or root format, basically chopping any variation off their end. For example, the words “stemmer,” “stemming” and “stemmed” would all be reduced to just “stem”. I used the nltk implementation of the snowball stemmer to perform this step. All of these steps can be performed very easily in Python:
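
A sketch of the pre-processing, using nltk's snowball stemmer and stop word list (the shape of what I did, rather than the exact code):

import re
from nltk.corpus import stopwords  # requires the nltk stopwords data
from nltk.stem.snowball import SnowballStemmer

stemmer = SnowballStemmer("english")
stop = set(stopwords.words("english"))

def preprocess(text):
    # lowercase, keep word characters only, drop stop words, stem the rest
    tokens = re.findall(r"[a-z]+", text.lower())
    return [stemmer.stem(tok) for tok in tokens if tok not in stop]

preprocess("The monster looked like a very large bird.")
# -> ['monster', 'look', 'like', 'larg', 'bird']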

This process reduces the noise in the vector space model, because words that mean the same thing are mapped to the same token (through stemming and punctuation and caps normalisation) and words that probably do not add any meaning are removed (through stop word removal). Eventually, I expect the pre-processing steps to become much more involved, but for now this should get us started.

Latent Dirichlet Allocation (LDA)

LDA is an algorithm developed to automatically discover topics contained within a text corpus. Gensim uses an “online” implementation of LDA, which means that it breaks the documents into chunks and regularly updates the LDA model (as opposed to batch, which processes the whole corpus at once). It is a generative probabilistic model that uses Bayesian probabilities to assign probabilities that each document in the corpus belongs to a topic. Importantly, the number of topics must be supplied in advance. Since I did not know how many topics might exist, I decided to apply LDA with varying numbers of topics. For example, if we did an LDA with six topics, the result for a single document might look like this:

[(0, 0.0208), (1, 0.549), (2, 0.0208), (3, 0.366), (4, 0.0208), (5, 0.0208)]

Which means LDA places that document 2% in topic 0, 55% in topic 1, 2% in topic 2, 37% in topic 3, and so on. For the simple analysis I am doing, I just want the best-guess topic. We can convert the result from probabilistic to deterministic by just picking the best guess:

max(x, key=lambda lda_result: lda_result[1])

Much of my approach in the following segments is based on the Gensim author's LDA guides.

Pre-processing for LDA

I extracted ~4400 job descriptions from the Swiftly database. I removed the formatting of each, and applied the pre-processing steps described above (tokenisation, stemming, stop word removal etc.). The result was a plain text file, with each pre-processed Swiftly job on a new line.

I then used the gensim tools to create the vector model required for LDA. On the recommendation of the gensim authors, I also removed all tokens that only appeared once. The doc2bow function used in the MyCorpus class below converts the document into the vector space format discussed above.

from gensim import corpora, models, similarities

# pre-processed swiftly jobs, each job on a newline
CORPUS = "StemmedStoppedCorpus.txt"

class MyCorpus(object):
    def __iter__(self):
        for line in open(CORPUS):
            yield dictionary.doc2bow(line.split())

# create dictionary mapping between text and ids
dictionary = corpora.Dictionary(line.split() for line in open(CORPUS))

# find words that only appear once in the entire doc set
once_ids = [tokenid for tokenid, docfreq in dictionary.dfs.iteritems() if docfreq == 1]

# remove once words
dictionary.filter_tokens(once_ids)

# "compactify" - removes gaps in ID mapping created by removing the once words
dictionary.compactify()

# save dictionary to file for future use
dictionary.save("swiftly_corpus.dict")

# create a corpus object
swiftly_corpus = MyCorpus()

# store to disk, for later use
corpora.MmCorpus.serialize("swiftly_corpus.mm", swiftly_corpus)

Regarding the above code, the MM file is a file format known as Matrix Market format, which represents a matrix of sparse vectors. The dictionary file above simply maps the word_id integers that are used in the MM format to the actual word each id represents.

Applying LDA

Now that the corpus has been stored as a matrix of vectors, we can apply the LDA model and start clustering the Swiftly jobs. This takes only a few lines of code; we can generate different models by changing the num_topics argument to the ldamodel.LdaModel() function.
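
A sketch of those lines, reloading the corpus serialised earlier (the exact parameters here are my guesses):

from gensim import corpora, models

dictionary = corpora.Dictionary.load("swiftly_corpus.dict")
mm = corpora.MmCorpus("swiftly_corpus.mm")

N_TOPICS = 6  # vary this and eyeball the resulting topics
lda = models.ldamodel.LdaModel(corpus=mm, id2word=dictionary, num_topics=N_TOPICS)

# show the most significant words for each topic
print(lda.print_topics())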

The number before each token represents how discriminating that token is for the category. Ideally, by eyeballing the discriminating tokens for a topic we could understand and identify it, giving it a useful name. As you can see, this proved to be difficult. I suspected that there are probably more than six unique categories of tasks on Swiftly, so I ran LDA with N_TOPICS set to different numbers. With 15 (this time just the top 10 words per topic, without numbers, formatted one topic per line for easier comprehension), the results are:

TOPIC 1: imag, file, pictur, like, line, high, resolut, photoshop, layer, hand
TOPIC 2: need, imag, attach, size, file, word, 2, make, logo, 1
TOPIC 3: element, exist, etc, logo, icon, halloween, app, add, like, theme
TOPIC 4: tree, snow, santa, thanksgiv, leav, gold, outlin, make, fall, turkey
TOPIC 5: yellow, use, view, new, servic, replac, team, super, feel, color
TOPIC 6: creativ, take, add, fun, logo, pumpkin, spooki, bat, skeleton, offer
TOPIC 7: celebr, logo, decor, make, etc, word, possibl, add, text, bit
TOPIC 8: file, background, logo, holiday, need, vector, transpar, white, png, ai
TOPIC 9: like, look, snowflak, logo, would, color, want, make, someth, font
TOPIC 10: chang, color, blue, code, red, font, dark, green, match, panton
TOPIC 11: festiv, pdf, send, file, need, back, page, digit, psd, version
TOPIC 12: logo, card, christma, use, font, attach, like, creat, file, busi
TOPIC 13: need, attach, page, imag, text, websit, px, pictur, use, photo
TOPIC 14: name, follow, busi, logo, chang, incorpor, compani, card, line, replac
TOPIC 15: x, cover, photo, like, would, look, suppli, 73, websit, templat

At this point, I realised that more pre-processing would be required to get this right. For instance, it seemed strange that in topic 15 the most discriminating word is ‘x’. Looking closer, I realised that this is because topic 15 represents a resize / reformatting job brief. The ‘x’ gets picked out because a large number of customers are specifying dimensions (e.g. 200px x 500px). I was also surprised to find out that ‘73’ was so discriminating, but a little bit of digging revealed that a twitter profile picture is 73x73 pixels. To address this problem, I plan to use a preprocessing step called Lemmatisation.

Lemmatisation is useful for grouping things like numbers, colours, URLs, email addresses and image dimensions together so that different values are treated equally. For example, if there is a specific colour mentioned in a brief, we don’t really care what the specific colour is—we just care that the brief mentions a colour. In our case, we believe that a brief containing a colour (e.g. #FF00FF) or image dimensions (e.g. 400x300) might give us clues about what type of task it is so we convert anything that looks like these to the tokens $COLOUR and $DIM.

Despite the shortcomings of my pre-processing, this clustering task has picked out some interesting topics! Some, as is probably inevitable, are “junk topics”. Further, seasonal words seem to appear in lots of topics, which is a strange result. Despite this, many of the topics are classifiable. Topic 5 was interesting, where ‘yellow’ was such a discriminating term. A very quick (and non-scientific) review of the data suggests that people often do not like the colour yellow (I agree with them!) and want it changed. An attempt to name the topics from the table above:

Topic 1: Change an image so it’s in higher resolution

Topic 3: Change or create a logo or icon, perhaps for a smartphone app

Topic 4: Edits of a seasonal nature (Christmas, Thanksgiving)

Topic 5: Replace yellow (?!)

Topic 6: Halloween edits

Topic 8: Vectorisation task, e.g. “take this png file, turn it into a vector on a transparent background”

Topic 10: Change a colour in some way, often a font. “Panton” is a stemmed form of “pantone”, a popular colour chart

Topic 14: Change copy or update information on a business card

Topic 15: Resize or reformat a photo, often for social media purposes

Having to provide the number of topics to LDA, before you even know what’s reasonable, feels like a chicken-and-egg problem. It’s possible to try different numbers of topics and eyeball the results, but at times it felt a bit too much like guesswork. Nevertheless, I view these results as a decent “proof of concept”. It’s reassuring that a computer can find categories like this, and suggests that with more tweaking and a nicely labelled dataset, the job of automatically classifying Swiftly task briefs is entirely possible!

Next time…

That wraps up my experiments with unsupervised classification for this post. Next time, I plan to discuss my efforts after I settle on the Swiftly categories. I'd like to develop a nice labelled training data set (most likely using Amazon's Mechanical Turk service), and then experiment with supervised machine learning techniques. I will also detail my efforts at developing a more sophisticated pre-processing procedure. Tune in!

About Daniel

Daniel Williams is a Bachelor of Science (Computing and Software Science) student at the University of Melbourne and Research Assistant at the Centre for Neural Engineering where he applies Machine Learning techniques to the search for genetic indicators of Schizophrenia. He also serves as a tutor at the Department of Computing and Information Systems. Daniel was one of four students selected to take part in the inaugural round of Tin Alley Beta summer internships. Daniel is an avid eurogamer, follower of “the cricket”, and hearty enjoyer of the pub.

We recently replaced most of our image resizing code with Thumbor, an
open-source thumbnailing server. This post describes how and why we migrated to
a standalone thumbnailing architecture, and addresses some of the challenges we
faced along the way.

Background

Historically, 99designs has largely been powered by a monolithic PHP
application. Maintaining this application has become increasingly difficult as
our team and codebase grow. One cause of this difficulty is that the application
contains a lot of incidental functionality—supporting code that isn’t the
core purpose of the application, but which is necessary for its operation.

As such, we set ourselves a technical goal in 2013 to migrate to a more
service-oriented architecture. This means breaking big masses of functionality
into discrete services and libraries that do one thing well. Such a design tends
to yield smaller, more cohesive services, and provides natural lines along which
our team can subdivide.

Image thumbnailing is a generic function required by many graphics-intensive
websites, and a prime candidate for extraction into a standalone service.

Thumbnails at 99designs

Our 230,000+ strong designer community uploads a new image to 99designs every ~6
seconds. We serve several thumbnail variations of these images across the site.

Our thumbnailing solution needs to scale to serve our production traffic load.
The approach we’ve used until recently has been to generate thumbnails
ahead-of-time using asynchronous task queues. Every time a designer
uploads an image, we kick off a task that generates thumbnails of that image and
stores them in S3.

If a thumbnail request arrives while the task is generating the thumbnail, we
serve a placeholder image.

Once the thumbnailing task finishes, we can serve the resized images.

This architecture has served us pretty well. It keeps response times low and
scales nicely, but it has a few shortcomings:

We’ve intertwined the image resizing logic with our PHP application. Other
apps in our stack have to implement their own resizing.

It’s not the simplest solution. There’s quite a bit of complexity: deduping
resize tasks, using client-side polling to check if a resize operation has
completed, etc.

We can only serve thumbnails at predefined sizes. If we decided to introduce
a new thumbnail size, we’d have to generate that thumbnail for tens of
millions of existing images.

A better solution is to create a separate, simpler thumbnailing service that
any application in our stack can use.

Thumbor overview

Enter Thumbor. Thumbor is an open-source thumbnail server developed
by the clever people behind globo.com. Thumbor resizes
images on-demand using specially constructed URLs that contain the URL of the
original image and the desired thumbnail dimensions, e.g.:

http://thumbor.example.com/unsafe/320x240/http://images.example.com/llamas.jpg

In this example, the Thumbor server at thumbor.example.com fetches
llamas.jpg from images.example.com over HTTP, resizes it to 320x240 pixels,
and streams the thumbnail image data directly to the client.
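
Generating these URLs is trivial in any language; a Python sketch for the unsafe (unsigned) mode shown above:

def thumbor_url(server, width, height, original):
    # "unsafe" skips HMAC URL signing; production setups should sign URLs
    return "%s/unsafe/%dx%d/%s" % (server, width, height, original)

thumbor_url("http://thumbor.example.com", 320, 240,
            "http://images.example.com/llamas.jpg")
# -> http://thumbor.example.com/unsafe/320x240/http://images.example.com/llamas.jpg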

At face value this seems less scalable than our previous task-based solution, but some careful use of caching ensures we only do the resize work once per thumbnail.

New architecture

The high-level thumbnailing architecture now looks like this:

Our applications generate URLs that point to a Thumbor server (via a CDN). The
first request for a particular thumbnail blocks while Thumbor fetches the
original image and produces the resized version. We set long cache expiry times
on the resulting images, so they’re effectively cached forever. The CDN serves
all subsequent thumbnail requests.

We put a cluster of Thumbor servers behind an elastic load balancer to cope with
production traffic. This also gives redundancy when one of the servers dies.

The resulting architecture is very simple, and our image-resizing capability is
neatly encapsulated as a standalone service. This means we avoid the need to
re-implement thumbnailing in each of our applications—all that’s needed is
a small client library to produce Thumbor URLs.

Usage example

We created Phumbor to generate Thumbor URLs in PHP applications, typically wrapped in a small view helper for use in templates.

Implementation strategy

We used a couple of complementary techniques to test Thumbor’s capabilities
before committing to its use in production.

Firstly, we used feature-flipping to selectively enable Thumbor URLs
for certain users. Initially we used this to let developers click around the
site and check that Thumbor was generating thumbnails correctly.

Secondly, we used asynchronous tasks to simulate a production traffic load on
the Thumbor service. Every time an app server handled a thumbnail request, we
enqueued a task that requested that same thumbnail from the new Thumbor service.
This allowed us to check performance of the service without risking a disruption
to our users.

Finally, we used our feature-flipping system to incrementally roll out Thumbor
thumbnails to all our users. This worked better than immediately pointing all
traffic at the Thumbor service, which tended to cause a spike in response times.

Thumbor configuration

Some of our Thumbor configuration settings differ from the recommended defaults.
We tweaked our configuration in response to our performance measurements.

Thumbor ships with a number of imaging backends; the default and recommended
backend is PIL. Our testing shows that the OpenCV backend is much faster (i.e.
3-4x faster) than PIL. Unfortunately, OpenCV can’t resize GIFs or images with
alpha transparency. As a result, we implemented a simple multiplexing backend
that delegates to OpenCV wherever possible and falls back to PIL in the
degenerate case.

Generally we’ve found that Thumbor is quite stable, and expect it to further
mature as more people use it and make improvements.

Conclusion

Our Thumbor service now serves all design entry thumbnails for our main PHP
application. The resulting architecture is much simpler and the service is
usable by other applications in our stack. We’ll continue to use Thumbor
in future apps we develop, and look for more opportunities to simplify our
codebase by progressively adopting a more service-oriented architecture.

Two years ago, 99designs had localized sites for a handful of English-speaking countries, and our dev team had little experience in multilingual web development. But we felt that translating our site was an important step, removing yet another barrier for designers and customers all over the world to work together. Today we serve localized content to customers in 18 countries, across six languages. Here's how we got there, and some of the roadblocks we ran into.

Starting local

The most difficult aspect to internationalizing is language, so we started with localization: everything but language. In particular, this means region-appropriate content and currency. A six-month development effort saw us refactor our core PHP codebase to support local domains for a large number of countries (e.g. 99designs.de), where customers could see local content and users could pay and receive payments in local currencies. At the end of this process, each time we launched a regional domain we began redirecting users to that domain from our Varnish layer, based on GeoIP lookups. The process has changed little since then, and continued to serve us well in our recent launch in Singapore.

Languages and translation

With localization working, it was time to make hard decisions about how we would go about removing the language barrier for non-English speakers (i.e. the majority of the world). There were a lot of questions for us to answer:

What languages will we offer users in a given region?

How will users choose their language?

How will we present translated strings to users?

How will strings be queued for translation?

Who will do the translation?

What languages to offer?

Rather than making region, language and currency all user selectable, we chose to restrict language and currency availability to a user’s region. This was a trade-off which made working with local content easier: if our German region doesn’t support Spanish, we avoid having to write Spanish marketing copy for it. Our one caveat was for all regions to support English as a valid language. As an international language of trade, this lessens any negative impact of region pinning.

Translating strings

There were two main approaches we considered for translation: use a traditional GNU gettext approach and begin escaping strings, or else try a translation proxy such as Smartling. gettext had several advantages: it has a long history, and is well supported by web frameworks; it’s easily embedded; and translations just become additional artifacts which can be easily version controlled. However, it would require a decent refactoring of our existing PHP codebase, and left open issues of how to source translations.

In Smartling's approach, a user's request is proxied through Smartling's servers, which in turn request the English version of our site and apply translations to the response before the user receives it. When a translation is missing, the English version is served and the string is added to a queue to be translated. Pulling this off would mean substantially reducing the amount of code to be changed, a great win. However, it risked us relying on a third party for our uptime and performance.

In the end, we went with Smartling for several reasons. They provided a source of translators, and expertise in internationalization which we were lacking. Uptime and performance risks were mitigated somewhat by two factors. Firstly, Smartling’s proxy would be served out of the US-East AWS region, the same region our entire stack is served from, increasing the likelihood that their stack and ours would sink or swim together. Secondly, since our English language domains would continue to be served normally, the bulk of our traffic would still bypass the proxy and be under our direct control.

Preparing our site

We set our course and got to work. There was substantially more to do than we first realized, mostly spread over three areas.

Escaping user-generated content

Strings on our site which contained user content quickly filled our translation queue (think “Logo design for Greg” vs “Logo design for Sarah”). Contest titles, descriptions, usernames, comments, you name it, anything sourced from a user had to be found and wrapped in a <span class="sl_notranslate"> tag. This amounted to a significant ongoing audit of the pages on our site, fixing them as we went.

Preparing Javascript for translation

Our Javascript similarly needed to be prepared for translation, with rich client-side pages the worst hit. All strings needed to be hoisted to a part of the JS file which could be marked up for translation. String concatenation was no longer ok, since it made flawed assumptions about the grammar of other languages. Strings served through a JSON API were likewise hidden from translation, meaning we had to find other ways to serve the same data.

Making our design more flexible

In our design and layout, we could no longer be pixel-perfect, since translated strings for common navigation elements were often much longer in the target language. Instead, it forced us to develop a more robust design which could accommodate the variation in string width. We stopped using CSS text-transform to vary the case of text stylistically, since other languages are more sensitive to case changes than English.

The wins snowball

After 9 months of hard work, we were proud to launch a German language version of our site, a huge milestone for us. With the hardest work now done, the following 9 months saw us launch French, Italian, Spanish and Dutch-language sites. Over time, the amount of new engineering work reduced with each launch, so that the non-technical aspects of marketing to, supporting and translating a new region now dominate the time to launch a new language.

The challenges

We also encountered several unexpected challenges.

Client-side templating

We mentioned earlier that the richer the client-side JS, the more work required to ensure smooth translation. The biggest barrier for us was our use of Mustache templates, which were initially untranslatable on the fly. To their credit, Smartling vastly improved their support for Mustache during our development, allowing us to clear this hurdle.

Translating non-web artifacts

It should be no surprise: translation by proxy is a strategy for web pages, but not a strong one for non-web artifacts. In particular, for a long time translating emails was a pain, and in the worst case consisted of engineers and country managers basically emailing templates for translation back and forth. After some time, we worked around this issue by using Smartling’s API in combination with gettext for email translation.
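The gettext half of that workaround is standard machinery: pull translations down via the Smartling API, compile them into catalogs, and load the right catalog when rendering an email. A rough sketch, with a hypothetical domain and directory layout:

    import gettext

    # Load the compiled catalog for the recipient's locale. The 'emails'
    # domain and 'locale' directory are illustrative; the .mo files are
    # built from translations fetched through the Smartling API.
    t = gettext.translation('emails', localedir='locale', languages=['de'])
    _ = t.gettext

    subject = _("You have received a new design entry")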

Combinatorial growth of translation strings

Over time, we repeatedly found our translation queue clogged with huge numbers of strings awaiting translation. Many of these cases were bugs where we hadn’t appropriately marked up user-generated content, but the most stubborn were due to our long-tail marketing efforts. Having a page for each combination of industry, product category and city led to an explosion of strings to translate. Tackling these properly would require a natural language generation engine with some understanding of each language’s grammar. For now we’ve simply excluded these pages from our translation efforts.

The future

This has been an overview of the engineering work involved in localizing and translating a site like ours to other languages. Ultimately, we feel that the translation proxy approach we took cut down our time to market significantly; we’d recommend it to other companies who are similarly expanding. Now that several sites are up and running, we’ll continue to use a mix of the proxy and gettext approaches, where each is most appropriate.

We’re proud to be able to ship our site in multiple languages, and keen to keep breaking down barriers between businesses and designers wherever they may be, enabling them to work together in the languages in which they’re most comfortable.

I recently found myself wanting the features of the rails asset pipeline in my golang project at work. Since there isn’t much in the way of asset pipelining for golang yet, I built it. Turns out, sprockets is really easy to integrate. Here’s how you can go about setting it up for your project.

Assets in development

First things first - let’s get it to the ‘it works on my machine’ stage. I’ve put together a sample repo using the asset pipeline, which you can use as a guide.

The setup for your app will be similar:

The assets folder contains your stylesheets, javascript, etc (this directory name is set in sprockets/environment.rb).

You’ll need a similar Rakefile to build assets (and maybe launch the server)

When your app starts up (in development), it should make a request to http://localhost:11111/assets/manifest.json, which provides a JSON hash linking asset names (e.g. “application.css”) to the relative URLs the compiled assets can be fetched from. To generate a link to an asset in your app, use the JSON hash you fetched to look up the URL. For example, the URL for “application.css” might look like http://localhost:11111/application-8e5bf6909b33895a72899ee43f5a9d53.css.
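To make that concrete, here’s a minimal sketch in Python (the pattern is language-agnostic, as noted below); it assumes the manifest is the flat name-to-URL hash described above:

    import json
    import urllib.request

    ASSET_HOST = 'http://localhost:11111'

    # In development, fetch the manifest from the sprockets server at startup.
    with urllib.request.urlopen(ASSET_HOST + '/assets/manifest.json') as resp:
        manifest = json.load(resp)

    def asset_url(name):
        # Map a logical name like 'application.css' to its fingerprinted URL.
        return ASSET_HOST + manifest[name]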

That should be all you need for development - you should be able to see SASS/Coffeescript assets compiled and loading normally. Hooray!

Assets in production

For production we want to pre-compile assets rather than regenerating them each time they change.

rake assets will create a ‘public’ folder containing ‘manifest.json’ (same format as before). Get this directory onto your production servers. git add -Af public/ will add it to source control if you deploy via git.

When generating a link to an asset, simply look up manifest.json from the filesystem rather than from HTTP.
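The production variant is the same lookup with a different source; a sketch, assuming the manifest lands at public/manifest.json as produced by the Rakefile:

    import json

    # In production, read the pre-compiled manifest straight from disk;
    # asset_url() stays exactly the same as in development.
    with open('public/manifest.json') as f:
        manifest = json.load(f)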

Fin

If you’ve followed these steps, you’ll have a fully functioning asset pipeline for your golang project. The whole thing, including deployment, took me well under a day to add to our app. The resulting assets are minified, concatenated, and gzipped (for size). They are also fingerprinted, so you can serve them with an unlimited cache lifetime and reap the benefits.

Although I set this up for golang, there’s nothing go-specific about it. The same technique works just as well for any language or framework without a mature asset pipeline. If you find yourself in need, just use this pattern and you can be up and running in no time.

At 99designs, we try to make sure we’re always fixing bugs as well as writing
code. It can be easy to neglect bugs when you’re busy churning out new features.

We use GitHub issues to track bugs in our various applications. GitHub issues
integrate well with our codebase, commits and pull requests, but the reporting
facilities are a bit limited.

As our team grows, it’s become increasingly important for us to be able to
answer key questions about bugs, including:

How many bugs are currently open?

Have we each remembered to spend time working on bug fixes this sprint?

Are we closing more bugs than we’re opening?

To help answer these questions, a few of our team spent a number of
hack days
implementing a bug dashboard named GitHub Survivor.

Unlike the similarly-named reality TV show, GitHub Survivor doesn’t feature
eliminations, gruelling physical challenges, or Jeff Probst. However, it does
pit developers against one another — in a light-hearted way.

We display GitHub Survivor on a big screen in the office, where all the team can
see it. We’ve found it helps keep our minds on bugs — it reminds us to
make a small effort every sprint, gradually bringing the bug count closer to
zero.

A bug leaderboard occupies the bulk of the screen. It shows who’s closed the
most bugs this sprint (may they be laden with Praise and Whisky!) and who’s
forgotten to spend some time fixing bugs (may they toil in the maintenance of a
thousand Malbolge programs!).

There are charts showing the number of bugs opened and closed in recent sprints,
the open bug count over time, and a big indicator showing the current open bug
count.
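The raw numbers behind those charts come straight from the GitHub API. The current open bug count, for instance, is a single search query; a hypothetical sketch (the repository and label are placeholders, not GitHub Survivor’s actual code):

    import requests

    resp = requests.get(
        'https://api.github.com/search/issues',
        params={'q': 'repo:OWNER/REPO is:issue is:open label:bug'},
    )
    print('Open bugs:', resp.json()['total_count'])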

The source is available for you
to inspect and adapt to your needs. Please try it out, make improvements and
contribute them back! We hope you find it useful.

We’re passionate about building high-quality software at 99designs, and this
is just one way we measure whether we’re doing a good job of that. If you’re
similarly interested in building cool things in an awesome environment,
check out our open positions!

Cool, so what else can we do with this? It’s trivial to define a method with a space in its name, and calling it isn’t terribly difficult:

    1.9.3-p286 :005 > self.class.send(:define_method, :"i have a space") do
    1.9.3-p286 :006 >   puts "I has a space"
    1.9.3-p286 :007?> end
     => #<Proc:0x007ff89c1e0b58@(irb):5 (lambda)>
    1.9.3-p286 :008 > send(:"i have a space")
    I has a space
     => nil
    1.9.3-p286 :009 >

But having created such a monstrosity, how do you call it from the repl? Or for that matter, from an actual Ruby program? This is obviously something you should be doing in production…

    self.instance_exec do
      def method_missing(sym, *args)
        # Splat args if passed in from a parent call
        if args.length == 1 && args[0].is_a?(Array) && args[0][0].class == NameError
          args = args[0]
        end
        method_names, arguments = args.partition { |a| a.class == NameError }
        method([sym.to_s, *method_names.map(&:name)].join(" ")).call(*arguments)
      rescue NameError => e
        return [e, *arguments]
      end
    end

Bam. You may be looking at this baffled (or if you’re reasonably tight with metaprogramming in Ruby, sharpening/setting fire to something with a view to causing me significant bodily harm).

Walking through this, we first of all act on whatever self is; in most cases this will be the local scope. If we didn’t do this, we’d be defining the method on Object, which can cause all kinds of headaches when you’re trying to debug.

Immediately after this, we unpack arguments if they look like they were created by an earlier instance of this method. This is unwieldy, but unfortunately Ruby’s single return values and the recursion we’re employing here make it necessary. We could definitely define a subclass of Array to make the test cleaner and the implementation more robust, but I preferred to keep this as short as possible and use the bare minimum number of Ruby primitives.

Once we’ve unpacked our arguments, we do the real magic. First off, we split our arguments into NameErrors, the container we’re using for our missing method names, and everything else (the legitimate arguments we were called with).

We try to find a method with the current name (as we’ll be building our method name right to left with recursive calls to method_missing), and failing that we pack up our current attempt with our arguments, and return it for the next pass.

There are enough issues with this (if you defined the methods foo bar baz and bar baz, a call to foo bar baz would call foo with bar baz’s return value) to make it unwieldy. On the other hand, if those bugs are the only thing stopping you from putting this into production, you’ve probably got larger issues.

If this large scale abuse of the language excites you, you might be interested to know that we’re hiring.

At this point you’re probably eager to know… does it work?

    1.9.3-p286 :001 > load "bare_words.rb"
    1.9.3-p286 :002 > self.class.send(:define_method, :"i has a space") do |name, greeting|
    1.9.3-p286 :003 >   puts "#{greeting}, #{name}!"
    1.9.3-p286 :004?> end
     => #<Proc:0x007fc6b41872c0@(irb):2 (lambda)>
    1.9.3-p286 :005 > i has a space "richo", "Hello"
    Hello, richo!
     => nil
    1.9.3-p286 :006 >


If you happened to be wandering around the 99designs office today, you would have heard hysterical laughter and cries of “yarrrr”.

Something we try to do is foster a good “DevOps” working culture. One of the critical components of DevOps is a collaborative way of working and a close relationship between those in development and operations, and other parts of the business for that matter. Company culture is tricky to improve if it’s not working, but immediately obvious when it is.

A key component of good company culture is good communication. We make extensive use of IRC as a communication medium. Naturally we use it for technical discussions like “How does this component of code work”, but just like companies such as GitHub and Etsy, we use IRC within the company as a very effective method of documenting what both dev and ops are doing day to day. We also have a bot that lives in IRC and does useful things for us. Our bot is called agent99. She makes our life easier and has the ability to do such things as deploy new versions of code to our production website.

She also has a bit of character. She can find memes to punctuate the moment, or fetch funny pictures of cats when asked.

We’ve actually open-sourced agent99 as well as many other pieces of code. We like to contribute code that we think might be useful to others back to the community.

So what was the source of hysterical laughter? Well today is Talk Like a Pirate Day. So one of our staff hacked together a plugin to make agent99 send a “yarr” over the office speakers. Hilarity ensued. It was quickly repatched, so that “yarr” only works one golden day a year.

This is a little example of where all are empowered to use technology to make it a more enjoyable workplace. And a work environment that is both fun and technically challenging is a competitive advantage that helps attract and retain motivated and happy employees!

Nerdy technical details

For those interested in the technical details, color explorer uses colorific to extract colors from logo designs (We’ve written more about colorific previously). It’s also using colordb, a side project of mine designed to efficiently index and search by color using some fancy perceptual nearest-neighbour algorithms.

Colordb turned out to be super easy to implement thanks to a few excellent Python libraries: Rtree, Flask and colormath.

Rome wasn’t built in a day—but the 99designs color explorer definitely was built in a day thanks to these great libraries.

R-trees

R-trees are a useful data structure for indexing spatial data such as the location of planets in the solar system or coffee shops in your local area. In this case, colordb uses an R-tree to index colors, since colors can be treated as points in a three-dimensional space (colordb uses the Lab color space, which has L, a, and b dimensions instead of the typical x, y and z). Normally you’d use an R-tree to help find nearby coffee shops but I’ve used it instead to find nearby colors.

The best part about R-trees is that they let you do nearest-neighbour searches efficiently. This means that you can search for a color and it will return results with colors that are similar—it means the search doesn’t have to be exact.

By indexing colors in the Lab color space, colordb is also able to give search results that are perceptually similar. Color differences in the Lab color space are perceptually uniform. This means that comparing colors in the Lab color space matches how the human eye perceives color.
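To make that concrete, here’s a minimal sketch of indexing Lab colors as points in a three-dimensional R-tree using the Rtree library (the IDs and coordinates are illustrative, not colordb’s actual code):

    from rtree import index

    # A 3D index: one dimension per Lab component.
    props = index.Property()
    props.dimension = 3
    idx = index.Index(properties=props)

    # Insert a color as a degenerate box: (L, a, b, L, a, b).
    lab = (53.2, 42.1, 20.7)
    idx.insert(1, lab + lab, obj='design-1')

    # Fetch the five indexed colors nearest to a query color.
    query = (50.0, 40.0, 18.0)
    nearest = [item.object for item in idx.nearest(query + query, 5, objects=True)]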

Flask

Flask is a micro-framework for making little web apps in Python. Flask made it dead-easy to add a simple HTTP interface to colordb in order to integrate colordb with the rest of the 99designs codebase.
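The interface really is tiny. A hypothetical sketch of what a colordb-style endpoint might look like in Flask (the route and response shape are assumptions, and find_similar stands in for the real search):

    from flask import Flask, jsonify, request

    app = Flask(__name__)

    @app.route('/search')
    def search():
        # Return designs whose indexed colors are near the requested one.
        color = request.args.get('color')             # e.g. 'bf1e2e'
        return jsonify(results=find_similar(color))   # find_similar: hypothetical

    if __name__ == '__main__':
        app.run()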

Putting it all together

So, there are a couple of things that make color explorer work:

We used colorific to extract the top two colors in a design and insert them into colordb.
At the time of writing, there are ~120,000 designs indexed in colordb.

The color explorer backend code connects to colordb over HTTP to search for a color. Colordb returns designs that contain a color similar to the one searched for. Color explorer is then able to serve up the HTML markup with the matching designs into what you see on the page.

That’s about all there is really. It’s quite simple.

Why?

So, what’s the point of all this? Well, mostly because I thought it’d be fun to make. :-)

But there’s another reason. I’m a big fan of using clever computer science to make software simpler. I love the idea of making users’ lives easier by using sophisticated algorithms to reduce complexity.

Color explorer is an experiment in taking an open ended problem such as “search designs by color” and using clever computer science to create a solution that is simple, powerful and easy to use.

Some other great examples of this principle in practice are Google’s search and Apple’s Siri.

I think it’s pretty amazing that Google is able to hide an incredibly sophisticated search engine behind a single text input. Likewise, Apple’s Siri is able to combine sophisticated machine learning, voice recognition and language analysis to enable people to perform tasks simply by speaking into their phone.

Isn’t that remarkable?

Acknowledgements

This experiment was built as part of a 99designs R&D day. Every second Friday we have a day to spend on an interesting work related side-project. So, a big thank you to 99designs for letting our dev and design team loose to make cool and crazy stuff.

Discuss

At 99designs we love great design, and a big part of good design is use of
color. We were interested to see how designers make use of color in their
designs, so we built an automatic color extractor to enable us to analyse
color usage at a massive scale.

Introduction

Think of a design you love. Part of the story it tells is in the colors it
uses, the contrast of light and shade, and the subtle emotions those colors
convey.

It’s pretty easy for a human to tell what colors are important in a design.
Take this logo below for example. Most people identifying the important colors would
come up with something like this:

#bf1e2e

#569cbe

Images are made up of pixels, and if you just count the colors of every pixel,
you don’t get anything like the list of colors above. This post is about our
journey towards automatically extracting an image’s color palette that comes
close to what a real person would pick.

Problem 1: Detecting background color

Quick quiz: What’s the primary color in this image?

Answer: White!

The problem here is that if you simply count the number of pixels of each
color, the background color nearly always dominates. We need to work out the
background color so we can exclude it.

We found a simple approach that works well in most cases: if the pixels in
the corners of the image are all the same color, that color is the background.

Let’s look again. In this case, the corner pixels are: #ffffff, #ffffff,
#ffffff, #ffffff – all white. We can thus safely exclude white as
a background color, leaving red as the most frequent color. Nice!
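Here’s a sketch of that heuristic using Pillow (the function is ours, not colorific’s exact implementation):

    from PIL import Image

    def background_color(path):
        # If all four corner pixels agree, treat that color as the background.
        im = Image.open(path).convert('RGB')
        w, h = im.size
        corners = {im.getpixel(xy)
                   for xy in [(0, 0), (w - 1, 0), (0, h - 1), (w - 1, h - 1)]}
        return corners.pop() if len(corners) == 1 else None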

Problem 2: Too many colors

Quiz time again, how many colors are in this image?

4 colors right? Not quite. Look closer at the G for example:

There are actually 255 different colors (an occult number for computer
scientists). If we take the top four, we get:

#01a514

#0048f1

#1d4cee

#ea2434

Aww, not that great. It has two very similar blues, and misses the yellow
entirely.

The problem is that, to a computer, colors that are not exactly the same are
entirely different. All the different shades and variants of colors mess up
the counts. Humans easily group sets of colors together though, and ideally
our program would do the same.

So how can we judge if two colors look the same to the human eye or not?
Fortunately, a bit of color theory comes in handy here.

Aside: Color theory

On a computer, colors are usually represented in the RGB color space. This
means that a color is made up of three components: Red, Green and Blue. To work
out the distance between two colors you can use the Euclidean distance of the
components.

Comparing colors in the RGB color space works ok, but it’s not
perfect. Differences in RGB don’t accurately match how the human eye perceives
color. For example, yellow often appears brighter to humans than a blue of the
same brightness. Also, humans can perceive smaller differences in green hues
than in pink.

A better way to compare colors is to convert to the Lab color space. Lab is
designed to allow comparison in a way that matches how the human eye perceives
color. A color in the Lab color space has three components: “L” represents lightness,
the “a” component ranges from green to magenta, the “b” component ranges from
blue to yellow. The distance between colors in the Lab color space is often
called the delta-E. A delta-E of less than 1.0 means that the human eye
cannot tell the difference between two colors.

We can take advantage of this to group similar colors together in a way that matches how the eye would do it naturally.
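In code, that grouping test is just a conversion plus a distance check. A sketch using the colormath library (the 1.0 threshold follows the rule of thumb above; colorific’s actual cutoff differs):

    from colormath.color_objects import LabColor, sRGBColor
    from colormath.color_conversions import convert_color
    from colormath.color_diff import delta_e_cie2000

    def to_lab(rgb):
        # rgb is an (r, g, b) tuple of 0-255 components.
        return convert_color(sRGBColor(*rgb, is_upscaled=True), LabColor)

    def indistinguishable(rgb1, rgb2, threshold=1.0):
        # Below a delta-E of ~1.0 the human eye can't tell the colors apart.
        return delta_e_cie2000(to_lab(rgb1), to_lab(rgb2)) < threshold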

Merging colors together

By combining visually similar colors, we’re able to address most of
the earlier problems. The color theory above tells us which colors to merge:
pairs with a low delta-E are visually similar, and can be safely combined.
On our noisy image, this leads to a clearer list of colors:

#0048f1

#ea2434

#f8a912

#01a514

Much better!

As it turns out, there are a whole lot of situations in which extra colors get
added: antialiasing on edges of shapes, image compression artifacts, textures
and gradients all add to the number of different colors that occur, even
if they don’t change the overall palette. This technique helps to deal with
these issues. It can leave behind very low counts of some noisy colors
– an additional threshold filter helps to clean up colors that don’t
occur very often.
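Putting those pieces together, the merge can be a simple greedy pass over the color histogram, with the threshold filter applied at the end. A sketch building on the to_lab helper above (both cutoffs are illustrative, not colorific’s real values):

    from colormath.color_diff import delta_e_cie2000

    def merge_palette(color_counts, delta_e_cutoff=10.0, min_count=50):
        # color_counts: {(r, g, b): pixel_count}, background already excluded.
        merged = []  # list of [rgb, count], most frequent first
        for rgb, count in sorted(color_counts.items(), key=lambda kv: -kv[1]):
            for entry in merged:
                if delta_e_cie2000(to_lab(entry[0]), to_lab(rgb)) < delta_e_cutoff:
                    entry[1] += count  # fold into the visually similar color
                    break
            else:
                merged.append([rgb, count])
        # Drop rare, noisy colors that survive the merge.
        return [(rgb, count) for rgb, count in merged if count >= min_count]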

Problem 3: “interesting” colors

Now we can work out what colors are used in a design, but it turns out that a
lot of the colors we find aren’t that visually interesting.

Which colors are the most interesting here?

When we ask people, they usually pick the brighter and more distinctive colors.
But the palette this image uses is much more like this:

#d0d0d0

#000000

#a0a0a0

#323232

In fact, it seems like grays and subdued shades are often used as fillers
to give highlights more impact. How could we isolate these distinctive colors?

Excellent question! Lab coats on!

After flicking through designs until our retinas got tired, we turned to color
theory again. It turns out color theory has a name for the concept of color
interestingness: “saturation”.

Cutting out colors with a low saturation gives you much more interesting
results.

#d40000

#520000
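Saturation falls out of a standard HSV conversion, so the filter is a couple of lines per color. A sketch using Python’s built-in colorsys (the cutoff is illustrative):

    import colorsys

    def is_interesting(rgb, min_saturation=0.25):
        # Convert 0-255 RGB to HSV and keep only sufficiently saturated colors.
        r, g, b = (c / 255.0 for c in rgb)
        _, s, _ = colorsys.rgb_to_hsv(r, g, b)
        return s >= min_saturation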

We can now automatically work out the palette for an image that
closely matches what a human would pick.

Introducing Colorific

At 99designs we love open source – so we’re releasing Colorific, our automatic color palette detector. Check it out on GitHub.

The 99designs Research and Development programme enables members of our
development team to spend a full day every two weeks (what we Australians call
a “fortnight,” much to the amusement of our American colleagues) to work on
projects we think are important. Read on to find out why 99designs does this,
how it works in practical terms, and the successes and challenges we’ve
encountered.

Why we do it

Everyone has a proverbial “itch to scratch” and our R&D programme allows
us to, well, scratch it. Companies allocating time for their best and
brightest to innovate is not a new school of thought. Many, technology based
and otherwise, have realised that the return on investment of allowing
employees to work autonomously on projects of their own choosing is
substantial.

99designs’ programme serves as an incubator for ideas, and it follows that one
tangible result is fantastic new products. Also, we learn new skills by
dabbling with brand new technologies and then applying the lessons learned in
our day-to-day tasks. And then, of course, there is the more difficult to
measure result: greater staff satisfaction.

We’re not alone

A good example of a company that has been doing this for a very long time is
the one that brought us the little yellow notes you’ll likely see if you
glance around you right now, scribbled with clever quips or very insecure
password reminders and stuck to computer monitors. Bill Coyne, a former senior
vice president of R&D at 3M, put forward one of the best arguments I’ve
encountered about why it’s key for the company’s engineers to spend 15 percent
of their time pursuing their own ideas:

Most of the inventions that 3M depends upon today came out of that kind of
individual initiative… You don’t make a difference by just following orders.

The 15 percent rule was instituted after an employee at the then-struggling
company went a bit rogue at work and invented the world’s first masking tape
back in 1925. Post-It notes hit the market in 1980. The company now employs
7,350 researchers and has sales of $27 billion a year.

In the technology sector, Atlassian and Google are
two well-known evangelists of R&D time. Google’s “20 percent time” paved the
way for key products such as Gmail, Chrome and Google News. Atlassian credits
its own 20-percent programme with helping to keep its developers so happy that
it doesn’t need to pour a lot of money into recruiting new talent. Prospects
simply hear about its innovative culture and climb over one another to join
them.

How we do it

99designs’ R&D days are scheduled at the end of our fortnightly
development sprints, so that we have one every second Friday. The day gives us
a nice buffer in which to get creative after our projects and other business
tasks are wrapped up Thursday and before a new sprint starts Monday.

We like to hold an informal meeting the day before our R&D day to share
ideas about what we’d like to tackle. Not everyone has a killer idea every
time, and it’s not uncommon for a colleague to describe a project that sounds
a lot cooler than what you had in mind. The get-together allows teams to
self-organise and game plans to develop. Another benefit of holding a prep meeting
is that it gets us all amped up to hack on whatever we want all day the
following day.

The Monday after an R&D day, each team or individual gives a five-minute
presentation on what they achieved, what they learned and whether it’s
something they want to pursue further.

Successes so far

The 99designs R&D program was implemented six months ago and in those 10
or so days we’ve managed to come up with some useful tools and fun apps.

One of our very first projects to be adopted as a mainstream product feature
was an admin toolbar. This project came from a few developers who were
frustrated with how difficult it was to search for and manage contests, so
during our very first R&D day we decided to form a team and build
a solution. To the delight of our staff, the toolbar is now part of the
application and used on a daily basis.

Mapviz pans across the world, showing logins and
design entries

Another frustration among developers was the lack of visibility and subsequent
difficulty managing errors within our application. Yes, we have error logs.
But who in their right mind likes trawling through them to get to the bottom
of an issue? “Triage” is the aptly named error management system we developed
over eight R&D days. This open source, ongoing project will soon be rolled
out into our application and will make our lives a lot easier.

On the fun side, one of our developers created a nice little app called
Mapviz, which provides a real-time visualisation of all customer log-ins,
contest launches and design entries on a world map. Our marketing team was
blown away by it.

Challenges

One challenge we face is coming up with a way to involve all of our teams,
including support and marketing, in the R&D programme. Involving
non-technical team members from time to time could allow the development team
to get inspiration and ideas from a different angle. Then there’s the
challenge that comes with being a global company – the Pacific Ocean creates
quite a barrier between our Melbourne and San Francisco offices, making it
tough for us to consistently communicate ideas efficiently and effectively.

Another open question: is an R&D day twice a month the best approach for fostering
innovation? Our friends at Flippa have taken on what
they like to call “Triple Time,” in which the development team dedicates three
consecutive days every month to R&D. This approach might work well for us,
however, giving it a go will be a challenge when we are so used to our current
programme.

Another interesting approach is the “FedEx day” championed by
Atlassian, during which a team is given one day to build and demonstrate
a finished or near-complete product. Could introducing a deliver-in-a-day
element hinder or supplement our current programme?

Looking ahead

The evidence is clear that we’re off to a good start – we’re delivering some
great products, helping to foster an innovative company culture and learning
new technical skills. However, like everything at 99designs, the programme is
continually evolving and improving.

I’m curious to learn more about what other companies are doing to foster
innovation. Does your company have an R&D program? How is it structured
– as a percentage of time, designated days, another format? How else are
employees encouraged to innovate?

This post describes feature flipping, an approach to development that helps
solve some of the issues associated with risk management and quality assurance
when a fast moving development team expands.

Continuous deployment in large teams

Developers at 99designs use agile and lean startup methodologies, but as our
development team gets bigger, deployments happen more frequently. This volume
of change brings increased risk for the stability of our site. It can also be
quite a challenge to measure the success of a single new feature on a rapidly
changing website with multiple new features operating at any given time. In
our earlier days we’d demo new features on a staging server, but as 99designs
has grown, using several staging servers to demo all our new features at once
has become clunky. So how do we solve these issues? Feature flipping.

What is feature flipping?

Feature flipping lets us turn site features on or off on
a per-user basis. You’ve probably encountered it before with companies like
Google or Facebook when they’re rolling out major changes. A few examples
include the recent UI changes to Google Docs and Google Mail, and Facebook’s
new Timeline. Our approach was inspired by our friends at
Learnable, and
slots in well with the Lean Startup methodology
of releasing minimum viable products, measuring, and adapting through fast
feedback.

Rolling out features incrementally gives companies the ability to ensure the
appropriateness and stability of a feature. Some companies make this visible
to users via an opt in/out approach, but in our case we use this internally as
an improved form of A/B testing. Traditionally we would only A/B test on
landing pages and static content, but feature flipping also allows us to
experiment with pervasive site functionality.

How it works at 99

Before running through our approach, let’s first examine it from a developer’s
perspective and from a site admin’s perspective.

Developers

As a developer, you create a new feature simply by registering it in
a specific module in our codebase.

    features.add(name='raptors', description='Unleashes raptors over the site')

Now that we’ve defined our feature, it’s available for use in a conditional
statement, which you use to guard feature-specific code.

    if feature('raptors'):
        unleash_raptors()

Boom! Our raptors feature can now be toggled on or off for any user who
interacts with our site.

Do you like raptors? We like raptors!

To kick off an incremental roll out, a commit is required with a defined set
of eligibility criteria a user must meet before the feature is enabled for
them. For example, the user must be a designer and needs to have submitted
a minimum of 5 entries.

    feature('raptors').rollout(role='designer', min_entries=5)

Of course, we make sure to test the new behaviour in addition to the old.

Admin

99designs staff get a nice interface where they can turn on and off features
for themselves or a specific user. Here they can also view the progress of any
experiments hanging off this feature. Many of our stakeholders are located
remotely; by giving them the power to enable features themselves, they can try
these out in production and provide feedback. This removes the need for
a dedicated staging environment for demoing.

Enabling or disabling a feature is just a button
click

Under the hood

When we first considered how best to implement features, we recognized that
every feature we introduced would require us to perform an additional
user-feature check. This would clearly result in rapid growth of queries as we
began checking features for every user on every page of our site.

For logged-out users, we decided to go with a cookie-based solution.
However, for logged-in users we wanted our user features to be persistent. We
thought about using MySQL, but based on Redis benchmarks and its inbuilt
sets data structure, Redis seemed the ideal solution. We decided
to go with a combined cookie and Redis approach to minimise the number of
database lookups without impacting the performance of our site.

These two methods work well together. When a user logs into 99designs,
cookie-based features are synced with Redis by taking a simple union of the
enabled features and persisting the resulting set. This allows us to determine exactly
who has what features as well as when a user obtained those features.
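A sketch of that login-time sync using redis-py (the key naming is hypothetical):

    import redis

    r = redis.Redis()

    def sync_features_on_login(user_id, cookie_features):
        # Union the cookie-based features into the user's persistent set,
        # then read back the full set of enabled features.
        key = 'features:%s' % user_id
        if cookie_features:
            r.sadd(key, *cookie_features)
        return r.smembers(key)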

Challenges

Developing with a multi-version mindset is probably one of the most
challenging parts of feature flipping. What was once a simple deployment
requiring a migration now entails more thought to allow the old and new
versions to function independently of one another.

Deprecated code can start to accumulate when you start leaving code wrapped in
feature checks. We try to curtail this as much as possible by coming back to
clean up unused code paths once a feature has been fully rolled out. Also, we
generally have a grace period even for features we’ve decided on, so that we
can roll back if the need arises.

Unit and integration tests become more difficult when you’re working with an
exponential number of feature combinations at any given time. This can lead to
an outsized volume of test cases, a problem we’re still solving.

What now?

Feature flipping is our approach to solving several problems we face as
99designs expands. Managing features gives us fine-grained control over
exactly what our users see, and is paving the way for us to experiment and
adapt to our users much faster than was previously possible.

Big companies often execute this approach to development seamlessly, while
there’s more of a learning curve for small to mid-sized companies. In our
case, we’re still learning but we’re far enough along to be reaping the
benefits.

Where to go from here? We suggest you explore a few of the open source
solutions that deal with features.

Although our user-facing site is the most visible part of what we do, it’s
only half the story. The other half takes place in asynchronous queues, which
work overtime behind the scenes, and process hundreds of millions of tasks
each year. In this post I’ll explain a bit about why we use queueing at
99designs, and how it all works.

A Bit of Background

If you’ve never heard of asynchronous task queues before, the idea behind them
is pretty simple. Say you have a task you need to do, such as buying some
milk, but you don’t have the time to take care of it yourself, so instead you
leave a note for a friend/spouse/roommate asking them to do it for you when
they have a chance. Congratulations, you’ve just implemented an asynchronous
task queue.

Why We Use Queues

Now obviously our web apps aren’t busy ordering milk; more common uses for
a queue are things like talking to third-party APIs, sending emails, or
performing computationally expensive tasks like image resizing. But why do we
need a queue at all? Wouldn’t it be easier to just do the work a user
requires immediately? Well, there are a few reasons:

The first reason is speed: When we’re talking to a third party API we have to
face reality; unless that third party is physically located next to our
infrastructure, there’s going to be latency involved. All it would take is
the addition of a few API calls and we could easily end up doubling or
tripling our response time, leading to a sluggish site and unhappy users.
However if we push these API calls into our queue instead, we can return
a response to our users immediately while our queues take as long as they like
to talk to the API.

The second reason is reliability: We don’t live in a world of 100% uptime,
services do go down, and when they do it’s important that our users aren’t the
ones that suffer. If we were to make our API calls directly in the user’s
request, we wouldn’t have any good options in the event of a failure. We could
retry the call right away in the hope that it was just a momentary glitch, but
more than likely we’ll either have to show the user an error, or silently
discard whatever we were trying to do. Queues neatly get around this problem
since they can happily continue retrying over and over in the background, and
all the while our users never need to know anything is wrong.

The final reason to use a queue is for scalability. If we had a surge in
requests that involved something CPU intensive like resizing images, we might
have a problem if all of our apps were responsible for this. Not only would
the increased CPU load slow down other image resize requests, it could very
well slow down requests across the entire site. What we need to do is isolate
this workload from the user’s experience, so that it doesn’t matter if it
happens quickly or slowly. This is where queues shine. Even if our queues
become overloaded, the rest of the site will remain responsive.

How We Implement It At 99designs

Understanding why we use queues is one thing, actually implementing it is
quite another. In the case of 99designs we chose beanstalk as our queue; its
key features are performance, reliability and simplicity. Rather than
having just one centralized queue which our servers push tasks to, we instead
have a separate queue on every app server. Queuing a task locally is always
going to be very quick, and having multiple queues gives us a healthy dose of
redundancy in the event of a beanstalk failure — something I’ve yet to
see.

Of course on its own a queue doesn’t do anything; simply pushing tasks onto it
doesn’t magically make them happen. For that you need a worker. Since a lot of
what 99designs does is PHP-based we had to develop our own custom worker
daemon, as there weren’t any open source solutions for beanstalk. Each worker
node maintains a pool of worker processes to connect to each queue and listen
for new tasks. In the event that our queues start to back up, we can easily
launch a new worker instance thanks to AWS and have it processing tasks within
minutes.

Our one-queue-per-app model

The final part that ties all of it together is how we deal with failures. In
the event that a task fails it doesn’t simply disappear into the ether.
Instead we’ll release the task with a delay, essentially putting it back onto
the queue after a certain amount of time has passed. If a task continues to
fail we use an exponential backoff strategy to prevent failing tasks from
clogging up our queues. If after multiple releases the task still has not
finished successfully then we “bury” it. Buried tasks are then manually
inspected at a later date, and a decision is made to either fix the problem
(if possible), or delete the task if it is deemed unrecoverable.
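Our workers are PHP, but the delete/release/bury cycle is easy to sketch with the Python beanstalkc client; the retry limit and base delay here are illustrative, and handle() stands in for the real task dispatcher:

    import beanstalkc

    MAX_RELEASES = 10

    queue = beanstalkc.Connection(host='localhost', port=11300)

    while True:
        job = queue.reserve()
        try:
            handle(job.body)  # hypothetical task dispatcher
            job.delete()      # success: the task is done
        except Exception:
            releases = job.stats()['releases']
            if releases >= MAX_RELEASES:
                job.bury()    # park it for manual inspection later
            else:
                # Exponential backoff: 2, 4, 8, ... seconds between retries.
                job.release(delay=2 ** (releases + 1))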

Work in progress

There’s still more that we’d like to do with our queueing systems. If too many
tasks fail, and are buried, we get alerts which wake us in the night.
Sometimes, this happens because an external service we use goes down,
something we can’t do much about anyway. Unfortunately, our tasks don’t yet
coordinate their back off in any way; each assumes it’s the only one failing.
Ideally, we’d track failures against external APIs, and simply delay groups of
tasks for longer, giving third party services time to recover.

In closing

Asynchronous tasks are a crucial, and often overlooked, part of how any large
site operates. We hope this post gives some insight into why queues are
important, and how they can improve your site’s performance and reliability.
Whilst there are many different queuing architectures possible, we also hope
our recipe serves as a useful pattern for other sites.

99designs hosts design contests, and has been growing rapidly year on
year. This post gives an overview of the infrastructure which powers our site,
how we make sure it continues to hum smoothly, and the challenges we face as
we scale.

A little context

Since its humble origins in ad-hoc contests within Sitepoint’s forums,
99designs has turned out to be one of the success stories of the Melbourne
start-up scene. Although we now have offices in both Melbourne and San
Francisco, the development team is in the Melbourne office, close to where the
action all began. Our team here has 8 devs, 2 dev ops, 2 ux/designers, and is
expanding. Half the people here arrived within the past year, myself included,
amounting to a lot of growth and change over a limited amount of time.

Our site sees hundreds of thousands of unique visitors a month, generating
pageviews in the tens of millions. Since we deal with graphic design, many of
our pages are asset heavy — these pageviews fan out to some 40 times as
many requests. Whilst there are many larger sites on the net, we thought this
was enough to warrant sharing the way we do things.

Requests in layers

The easiest way to describe how we serve requests is to talk about it in
layers, each of which solves a set of problems we face. We’ll cover six pieces
of this puzzle: load balancing, acceleration, application, asynchronous
tasks, storage and transient data.

Load balancing

Let’s start at the beginning of a request. When a user visits 99designs.com,
the request firstly hits our Elastic Load Balancer (ELB). The load balancer is
a highly reliable service which ensures that requests are spread evenly
between the Varnish servers beneath it. It also performs active health checks,
so that requests only hit healthy servers, and SSL unwrapping, allowing us to
work with an unencrypted stack from there on down. On the SSL front, using
a separate ELB for each domain turns out to be a convenient way of running
multiple secure domains.

Acceleration

Our acceleration layer consists of several
Varnish servers, which allow us to serve
a large amount of media with only a limited app stack beneath them. We have
a long-tail of static media, so we run Varnish with a file-based rather than
in-memory storage backend. Varnish is fast, and incredibly configurable
through its inbuilt DSL. Furthermore, its command-line tools for inspecting
live traffic are second to none, and are incredibly useful in tracking down
odd site behaviour.

Application

Dynamic or otherwise uncached requests are served from our PHP application
layer, using Apache/mod_php. Our polyglot team takes inspiration from some of
the best frameworks in the Ruby, Python and Javascript worlds, and we’re not
above porting over what we need. We also open source what we can, for example
our lightweight web application framework
Ergo. Designs which users submit aren’t
stored on every app server, but are instead stored in S3. Since end-user
latency is so poor for S3, we serve designs through our app layer, and cache
them locally after each request.

The high-level components of our stack

Asynchronous tasks

We strive to have a responsive site where users aren’t kept waiting, but often
requests might need to do some extended work, or access an external API.
Integrated with our application layer is an asynchronous layer which tackles
this problem. We queue up tasks using
Beanstalk in-memory task queues on each of
our app servers, using the Pheanstalk
bindings. Beanstalk is known to be lightweight and performant,
with the trade-off that we lose some visibility into the immediate contents of
our queues. A pool of PHP workers listens to these queues, and takes care of
anything lengthy or requiring access to an external API. Tasks which need to
run at a particular time are stored instead in the database, and added to the
queues by cron when they fall due.
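The cron half is small: a periodic script pulls due tasks out of the database and puts them onto the local queue. A hypothetical sketch (fetch_due_tasks and mark_enqueued stand in for our real DB access):

    import json
    import beanstalkc

    queue = beanstalkc.Connection(host='localhost', port=11300)

    # Run from cron: enqueue every task whose scheduled time has passed.
    for task in fetch_due_tasks():             # hypothetical DB query
        queue.put(json.dumps({'type': task.type, 'args': task.args}))
        mark_enqueued(task)                    # hypothetical bookkeeping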

Storage

Our storage layer features Amazon’s managed MySQL service (RDS) as the
primary, authoritative and persistent store for our crucial data. An RDS
instance configured to use multiple availability zones provides master-master
replication, giving crucial redundancy for our DB layer. This feature has
already saved our bacon multiple times: the failover has been smooth enough
that by the time we realised anything was wrong, another master was correctly
serving requests. Its rolling backups provide a means of disaster recovery.
We load-balance reads across multiple slaves as a means of maintaining
performance as the load on our database increases. For media files and data
blobs, we use S3 for redundant and highly-available storage, with periodic
backups to Rackspace Cloudfiles for disaster recovery.

Transient data

Aside from our database proper, there are three services which we use
primarily for transient data: Memcached, MongoDB and Redis. Memcached runs
locally on every application server, with a peering arrangement between
servers, and helps us reduce our database queries dramatically. We log errors
and statistics to capped collections in MongoDB, providing us with more
insight into our system’s performance. Redis captures per-user information
about which features are enabled at any given time; it supports our
development strategy around dark launches, soft launches and incremental
feature rollouts.

Software as infrastructure

99designs strongly follows the “software as infrastructure” mantra. Like many
companies now, we don’t own any hardware ourselves, preferring to remain
flexible, and relying heavily on Amazon’s cloud offering. Growing as we have
has meant a lot of change in a limited period of time, and has built into our
culture a distrust for documentation and the dual-maintenance problem it
creates. Instead, we focus on automation of as much as possible.

We currently use Rightscale to manage our server configurations, which
basically amounts to using a managed form of Chef for provisioning new
servers. We make sure each server type has a recipe which allows us to spin up
replacements at a moment’s notice. This means we can treat servers as
disposable, and mean it.

The layers and services which make up our infrastructure amount to a fair
number of moving parts, so monitoring and keeping track of the distributed
application state is important. We do that through a number of services,
including a large number of custom monitoring pages, NewRelic, CloudWatch, Statsd
and others. Two large monitoring screens feature prominently in our office,
making sure we’re aware of changes to the site. Despite all this information,
we’re continuously working to get a better understanding of site behaviour
and performance.

Challenges

Whilst the team here has some pride in our accomplishments, there’s a lot we
still have to work on. Here’s some of our biggest challenges:

Scaling back infrastructure, rather than just scaling out. As our site
changes and our customer base grows, the load we place on our backend
systems can vary dramatically. One way to deal with this is to over-provision,
so as to meet such spikes without issue. A challenge for us is to automate and
stress-test even more of our infrastructure, so that we can bring up new
servers even faster and more reliably. This would allow us to confidently
reduce capacity when we have excess, rather than simply expanding.

Providing a strong experience for international customers. We have
a diverse and international customer-base, yet all the action is currently
served out of Amazon’s US-East data center. This leads to quite a disparity in
customer experience. We’re currently trialling CDNs in order to get static
media to our international customers faster, and likewise looking at other
ways we can improve performance.

Balancing feature growth with stability. Being responsive to our
customers means being able to push out new features quickly. In some
companies, this causes a tension between developers who need to get code out,
and ops who are woken in the night by the consequences of a hasty change.
We’re attacking this problem from multiple angles: stronger acceptance testing
should give us better sanity checks on new code that goes out; feature
flipping allows us to incrementally roll out new features to only a subset of
users; and finally, we’re working on further automation in order to allow our
developers to be more active in our infrastructure, meaning they can really
own a change from the moment it’s coded to the moment users see it in
production.

Watch this space

This post has given an overview of our current stack, and some of the broad
challenges we face, but we’ve got a lot more to say about our development
style, and the things which make our culture. We’ve benefited greatly from the
open source community and the expertise of those who share their experiences.
Now that we’ve grown, we’re keen to give back a little too, in the hope that
others can benefit from what we’ve learned.