The biggest problem we were having was that when a lot of errors came in at
the same time, the database backed up and slowed way down. Even though
saving one particular record wasn’t slow at all, saving thousands at once was
problematic. Turns out that Rails was working against us here: the caching that
normally speeds everything up was causing a bottleneck. Each time an error came
in, we updated the counter cache (and some other caches) on the error’s group
record.

The problem was that each of those updates locks the group’s row, and since a
flood of one particular error wants to update the same row over and over, every
error notification had to wait for the one before it to finish.
If you’ve ever stood in line for the bathroom, you know how impatient queueing
can make you.

Thankfully, there’s a way to use the counter caches without actually using the
counter caches. You can have them in the database and even have ActiveRecord
respect them when calling `#size`, but at the same time not actually update the
columns. Simply pass `:counter_cache => false` as an option to the association.

Of course this doesn’t get you the actual caching you want. To do that, we now
have a rake task that we run every minute that updates these counter caches.
Fortunately it’s pretty speedy, so we don’t have to worry about it overloading
everything. The task counts up the errors that came in since the last time it
ran and updates the counter caches on the related error groups accordingly.
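
A minimal sketch of that idea, not the actual task: plain Ruby hashes stand in for database rows, and the names `recount_groups`, `group_id`, `created_at`, and `errors_count` are hypothetical, chosen only for illustration. The point it shows is that each group is updated once per run, no matter how many errors arrived for it.

```ruby
# Hypothetical in-memory stand-ins for the errors and error-group tables.
# Adds the errors created after `last_run` to each group's cached count,
# touching each group once regardless of how many errors it received.
def recount_groups(errors, groups, last_run)
  new_counts = Hash.new(0)
  errors.each do |error|
    new_counts[error[:group_id]] += 1 if error[:created_at] > last_run
  end

  new_counts.each do |group_id, count|
    groups[group_id][:errors_count] += count
  end
  groups
end

last_run = Time.at(1_000)
errors = [
  { :group_id => 1, :created_at => Time.at(900) },   # counted by an earlier run
  { :group_id => 1, :created_at => Time.at(1_010) },
  { :group_id => 1, :created_at => Time.at(1_020) },
  { :group_id => 2, :created_at => Time.at(1_030) }
]
groups = {
  1 => { :errors_count => 1 },
  2 => { :errors_count => 0 }
}

recount_groups(errors, groups, last_run)
```

In a real application the counting and the update would presumably be done in SQL rather than in Ruby, but the shape is the same: one write per group per run instead of one write per error.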

To assist with identifying performance issues, we’ve been using New Relic. When
you host with Engine Yard, you get a Bronze-level account free, but we’ve
upgraded our account to Silver in order to get transaction traces.

Transaction traces allow you to see the specific SQL that’s being called for a
slow action. As a result of working with New Relic, we were able to identify
that an `after_create` callback was causing an unnecessary extra save of each
error, and that a stray call to `current_user` could be removed. While neither
of these was a huge problem, fixing them gives us the breathing room we need
when traffic spikes.

Once we had made those changes, we saw a fairly dramatic decrease in the
latency of the error creation action (the green vertical lines are our
deploys of these changes).

Hoptoad groups duplicate notices so that you don’t get bombarded with e-mails
when a single issue causes hundreds of exceptions. When grouping notices,
Hoptoad identifies unique exceptions based on their error class, file, line
number, action, controller, and Rails environment. This isn’t particularly
complicated behavior, but it resulted in a search for all those properties every
time a new exception came in. Eventually, the indexes for this operation started
to get out of hand, and operations on notice groups started to get expensive.

In order to ease the congestion, we came up with a hashing mechanism to throw
out most of the notices right away when matching. Each notice now has a
fingerprint that results from its unique columns, so most notices can be ignored
as potential matches quickly by comparing fingerprints. This allowed us to
remove several indexes, speeding up inserts and other selects on that table.
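
To make the fingerprint idea concrete, here is a minimal sketch in plain Ruby. It is an illustration rather than Hoptoad’s actual implementation: the `fingerprint` method, the `FINGERPRINT_FIELDS` constant, the field names, and the choice of SHA-1 are all assumptions.

```ruby
require 'digest/sha1'

# Hypothetical list of the attributes that identify a duplicate notice.
FINGERPRINT_FIELDS = [:error_class, :file, :line_number,
                      :action, :controller, :environment]

# Collapses the identifying attributes into one fixed-length string, so a
# single indexed column can be compared instead of six.
def fingerprint(notice)
  values = FINGERPRINT_FIELDS.map { |field| notice[field].to_s }
  Digest::SHA1.hexdigest(values.join("\n"))
end

notice = {
  :error_class => "RuntimeError",
  :file        => "app/models/user.rb",
  :line_number => 42,
  :action      => "show",
  :controller  => "users",
  :environment => "production"
}

same_error  = notice.dup
other_error = notice.merge(:line_number => 43)

fingerprint(notice) == fingerprint(same_error)   # identical attributes match
fingerprint(notice) == fingerprint(other_error)  # a different line does not
```

With something like this, matching a new notice to a group becomes a lookup on one value, which is why a single index can take the place of several.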

Since we’ve made these changes, we have been able to successfully weather
several surges of error traffic without performance being adversely affected.

While things are looking good now, we’re keeping our eye on the ball and have
other changes in the works to make sure that Hoptoad’s performance keeps pace
with the number of errors you all are creating out there.