How Slow Searches Killed the Marketplaces (and How We Fixed It)

On Thursday the 10th of January at around 3:25am AEDT (UTC+11), the Envato Marketplaces began
suffering a number of failures. These failures caused searches to stop working,
buy now purchases to be charged but not recorded, newly submitted items to go
unprocessed, our review queues to be blocked by error pages, and the site to become
generally unstable.

We’d like to take the time to explain what happened, why so many seemingly
unrelated areas of the app failed simultaneously, and the measures we’ve put in
place to try to prevent similar problems from occurring in the future.

Initial Symptoms

On Thursday at around 3:00am our Solr server
(the search platform used by most of the Marketplaces) began responding to
requests very slowly.

This presented itself in a number of different ways. Users performing searches experienced slow response times, and several other features of the site (seemingly unrelated to search) began to have issues:

Buy now purchases failed to actually record the purchase and give buyers
their items.

Our Delayed Job queues, which handle pushing authors’ files from our servers up to Amazon
S3 and sending out emails, blew out to huge lengths, leaving many jobs unprocessed.

Reviewers were unable to approve items.

Page load time rose to unacceptable levels, since all the app servers ended
up waiting for the search server to time out before being able to serve other
requests.

To top it all off, our support system crashed.

Initial Response

At around 3:00am, the on-call developer for the week was paged by an automated
trigger for excessively high CPU usage on our Solr server and immediately
started to look into what was happening. He quickly discovered that search
responses were extraordinarily slow, mixed in with many other seemingly
unrelated failures across the application.

His first action was to disable Solr. This made the rest of the site work, but with reduced functionality; we were handling errors that returned quickly, but not slow responses.

After an hour the second-in-line on-call developer came online to help. To isolate Solr from the application for diagnosis, they firewalled off Solr traffic. Under these
conditions the Solr server appeared to respond to locally generated searches
with normal response times, but the rest of the app continued to be broken in
weird and wonderful ways. We realised later that this was because the firewall was dropping the
Solr traffic instead of rejecting it: from the app’s point of view, connections simply hung
rather than failing fast, which led to the same behaviour as slow responses.

Eventually, we shut down the Solr server completely, which fixed the response
times, unblocked the review queues and allowed new buy now purchases to complete
correctly. This was possible because our code that communicates with Solr gives up right
away if the server is offline. What we hadn’t accounted for was slow
responses.
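
To make the distinction concrete, here is a hypothetical sketch (not our actual client code) of how the two failure modes look from the application’s side: a server that is down, or firewalled with a REJECT rule, refuses the connection immediately and the code can move on, while traffic that is silently dropped, or answered slowly, holds the caller until a timeout expires.

```ruby
require "net/http"

# Hypothetical illustration only; the host, port and ping path are placeholders.
def probe_solr(host, port)
  Net::HTTP.start(host, port, open_timeout: 2, read_timeout: 2) do |http|
    http.get("/solr/admin/ping")
  end
rescue Errno::ECONNREFUSED
  # Server process down (or firewall REJECT): we find out instantly and can
  # fall back to "search unavailable" straight away.
  :refused
rescue Net::OpenTimeout, Net::ReadTimeout
  # Packets silently dropped (firewall DROP) or a server answering slowly:
  # the caller blocks for the full timeout before giving up.
  :timed_out
end
```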

The problem now was that this left most of the Marketplaces with no
search, our Delayed Job queues were still growing, and support was still broken.

Another team in the business maintain our support server, and they were separately
alerted that it was down. They found it in a frozen state, probably caused by
all of the extra load from new support requests due to the other failures. Their
on-call developer forced the machine to reboot, which brought it back up, but its
database came back corrupted. They reloaded the latest database backup,
bringing the system back online and allowing people to tell us that everything
else was broken.

At about 8am developers started arriving in the office, allowing our pair of
on-call developers to hand over to some fresh minds and recuperate with some
well-deserved sleep. They wrote up a gist containing the night’s IM conversation history, a list of known problems, and what had been figured out to
date. That made it much simpler to bring each developer up to speed as they
arrived in the office.

We discussed the situation and organised pairs of developers to concurrently
diagnose and fix the many problems we were seeing.

Buy Now Purchase Problems

The investigation into our buy now purchase problems discovered that inside the
database transaction, just after the system completed a buy now purchase and
acknowledged it with PayPal, it incremented the sales count on the item.

This seemingly innocent action triggered a callback to re-index the item in
Solr. Solr responded very slowly, slower than our maximum allowed time for a web
response, causing the request to be dropped and the database transaction to be
rolled back.

The immediate fix for this was to push the Solr re-indexing into Delayed Job, so
that it no longer mattered how long indexing took: the purchases would
complete quickly either way.
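
In Rails terms, the change was conceptually along these lines (a simplified sketch; the model, method and SolrClient names are made up for illustration, not our real code): instead of re-indexing inline from the callback, the callback just enqueues a Delayed Job.

```ruby
class Item < ActiveRecord::Base
  # Hypothetical sketch of the shape of the fix, not our real model code.
  after_save :queue_solr_reindex

  def queue_solr_reindex
    # Before: SolrClient.index(self) ran right here, inside the purchase's
    # database transaction, so a slow Solr response timed out the web request
    # and rolled the sale back.
    #
    # After: the callback only enqueues a Delayed Job, so the purchase commits
    # regardless of how long indexing takes.
    delay.reindex_in_solr
  end

  def reindex_in_solr
    SolrClient.index(self) # hypothetical wrapper around our Solr client
  end
end
```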

The guys working on this problem then devised a way to find the buy now
purchases affected by this, re-verify with PayPal that the charge had been made,
and complete the purchase at our end, giving buyers access to their items.

Item Processing and Asset Uploads

We examined the contents of the Delayed Job queue and it immediately became
obvious why our asset upload jobs were not completing, and why submission
processing had stopped. Solr re-index jobs had been queued with the default
priority, which happens to be a higher priority than our item submission
processing, S3 uploads and so on.

Each Solr re-index job would tie up a worker until the Solr server
responded or the request timed out completely, so eventually every worker was stuck
trying to update Solr. This exacerbated the load on our Solr server,
and stopped our Delayed Job queues from carrying out any useful work.

We fixed this by temporarily lowering the priority on this type of job, so that
any other type of job was worked off before workers started getting blocked
trying to update Solr.
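
Concretely, Delayed Job works jobs in ascending priority order, with 0 as the default, so “lowering the priority” means giving the re-index jobs a larger number than everything else. A sketch (the value 20 and the names are arbitrary examples, not our real configuration):

```ruby
class Item < ActiveRecord::Base
  # Hypothetical sketch: a larger priority number means the job is worked
  # later, so submission processing and S3 uploads (default priority 0)
  # get a free worker before any Solr re-indexing is attempted.
  def queue_solr_reindex
    delay(priority: 20).reindex_in_solr
  end

  def reindex_in_solr
    SolrClient.index(self) # hypothetical Solr wrapper, as above
  end
end
```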

Review Queue Errors

Finally, we examined the review queues and noticed that, on approval of an item,
its de-normalised representation is updated or created. This triggers a
callback that indexes the item in Solr during the request, which again was taking so long
that the request would time out and the reviewer would get an error page.

Root Cause

The final piece of the puzzle, why the Solr server was slowing down so
dramatically, wasn’t solved in such a satisfying manner. Load looked normal, and
individual requests to the machine were responded to in normal time.

After poking around with it for the better part of the day, and still unable to explain
why it was behaving the way it was, we decided to reboot the server “just in case”.
After the reboot, we let traffic from a single web server through to Solr and
found that performance seemed normal again, so we slowly ramped up how many web
servers we allowed through to Solr until it was again serving full load with
normal response times.

Thus proving you should always try turning it off and on again.

Various theories floated around the office as to the root cause: was it the leap
second from 7 months earlier,
suddenly causing problems on a machine that had never been rebooted? Had we crossed
some invisible point above which load becomes unstable? Was it caused by cosmic
rays?

We’re sad to say we’re still not sure. Unexpected failures can and do happen no
matter how much engineering you do. This event highlights our need to do better
at handling those failures.

Preventing it from Happening Again

The take-home lesson for us is that we need to engineer the Marketplaces to be
better isolated from the services they depend on, even the ones that are on our
internal network and run by us. Solr may slow down on us again in the future (and
we’ll be much quicker to try rebooting the server if it does), but it
shouldn’t affect the Marketplaces.

This incident really highlighted how a slow response from a service can be
worse than no response at all. There are some great sections on why this is and
how to defend against it in “Release It!” by Michael
Nygard, specifically sections 4.9, 5.1 and 5.2.

We’ve since taken the time to prevent a repeat situation by wrapping the calls
that modify Solr’s index with much shorter timeouts, allowing web requests to
finish in a timely manner even if Solr is misbehaving, and with circuit
breakers so we
don’t waste time making calls to Solr when it’s already responding too slowly
to be useful.
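
As a rough sketch of the shape of that change (the names, timeout and thresholds below are hypothetical, not our production code), every call that touches Solr goes through a wrapper that enforces a short timeout and trips a breaker after repeated failures, so subsequent requests skip Solr entirely until it has had time to recover:

```ruby
require "timeout"

# Minimal circuit-breaker sketch (hypothetical names and thresholds, not our
# production implementation). After `failure_threshold` consecutive failures
# the breaker opens, and for `cooldown` seconds calls fail fast without
# touching Solr at all.
class SolrBreaker
  class OpenError < StandardError; end

  def initialize(timeout: 1.0, failure_threshold: 5, cooldown: 30)
    @timeout = timeout
    @failure_threshold = failure_threshold
    @cooldown = cooldown
    @failures = 0
    @opened_at = nil
  end

  def call
    raise OpenError, "Solr circuit open, skipping call" if open?

    begin
      result = Timeout.timeout(@timeout) { yield }
      @failures = 0 # a success closes the breaker again
      result
    rescue StandardError
      @failures += 1
      @opened_at = Time.now if @failures >= @failure_threshold
      raise
    end
  end

  private

  def open?
    return false unless @opened_at
    if Time.now - @opened_at > @cooldown
      # Cooldown elapsed: reset and let a trial request through.
      @opened_at = nil
      @failures = 0
      false
    else
      true
    end
  end
end

# Usage sketch: give Solr one second, then give up rather than holding a web
# request (or a Delayed Job worker) hostage.
#   SOLR_BREAKER = SolrBreaker.new
#   SOLR_BREAKER.call { SolrClient.index(item) }
```

The important property is that a sick Solr now costs each request at most the short timeout, and once the breaker is open it costs nothing at all.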

We’re taking this lesson and applying it to all of our services, to hopefully
prevent any similar failures in the future.