Amazon engineers knew about the Tōhoku earthquake before most of the western
world did. They knew because within seconds of the earthquake, their pagers
started going off (yes, almost everyone at Amazon has a pager). Their internal
monitoring detected a giant drop in the number of orders being processed in
Japan, and seconds later alarms were going off left and right.

I remember greatly enjoying the details when one of my good friends, a former
Amazonian and current Thinkfuser, Mark Golazeski,
told me that story. It reminded me of another that I heard when I was working
at Google.

Back in the day, Google was renting colocation space in a data center and, as
with all things at Google, they kept exhaustive logs recording every little detail. One day,
their monitoring triggered an alarm when one of the servers reported abnormally
high temperatures. As they watched the logs, they became confident that this
wasn't a glitch in their monitoring systems, because surrounding servers soon
began reporting abnormally high temperatures as well. Then some of the servers
started going offline. They watched the failure spread from server to server.
Was it a virus? No. They concluded that
their rack was on fire.

The Google engineers called the staff at the data center to let them know that
there was a fire and they were watching it spread in real-time.
The staff laughed and countered that they were sitting in
the data center and would surely know if it was on fire! They told Google that
their monitoring software was busted.
After a couple more seconds of back-and-forth, sure enough, the
data center's fire suppression system kicked in. The data center staff
hurriedly said 'Oh my god, sorry!' and hung up without another
word. That's the story of how Google engineers knew a data center was on fire
before the data center knew.

This philosophy of having intense monitoring around every system is something we
heavily embrace at my startup, Thinkfuse. It's
great for monitoring. It's great for debugging. But its biggest strength is
that it's great for turning horrible customer experiences into phenomenal ones.
Let me explain.

In our early days, we would constantly break features while working on new
ideas. We warned our users that they were using a prototype that might break
frequently, but they started using us for real work anyway and came to depend
on us.

We decided that we wanted to be very proactive about errors on the site, so we
set up a system to directly email us every time an error occurred. We then
personally responded to every single error that any user encountered, often
within minutes.
Our users loved these responses and it gave them a lot more confidence in
using Thinkfuse, despite it being a fairly buggy prototype at the time. They
also became much more forgiving of future problems, so much so that almost all
of those users are still using us today (2 years later!).
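The post doesn't say how that error-to-email pipeline was actually built, but a minimal sketch in Python might look like the following (the `notify_on_error` helper and the `send_email` callback are hypothetical names, not Thinkfuse's real code): a custom logging handler that forwards every error-level record to the team.

```python
import logging

def notify_on_error(send_email, logger_name="app"):
    """Attach a handler that emails every ERROR-level log record.

    `send_email` is any callable accepting (subject, body); in production
    it could wrap smtplib or a mail API, while in tests it can simply
    collect messages in a list.
    """
    class EmailHandler(logging.Handler):
        def emit(self, record):
            # Forward the formatted record (message plus traceback, if any).
            send_email(subject=f"Error: {record.getMessage()}",
                       body=self.format(record))

    logger = logging.getLogger(logger_name)
    logger.addHandler(EmailHandler(level=logging.ERROR))
    return logger

# Example: capture "sent" emails in a list instead of actually mailing them.
sent = []
logger = notify_on_error(lambda subject, body: sent.append((subject, body)),
                         logger_name="errdemo")
logger.error("database write failed")
```

Python's standard library also ships `logging.handlers.SMTPHandler`, which provides essentially this behavior out of the box.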

Great customer service buys you a lot of slack. I can't overstate that. The
reaction that a user has when you email them about an error they experienced,
when they never even contacted you about it, is amazing. It turns a situation
so bad that the user might walk away and never come back into one
where they're shouting from a mountaintop about how great you are. One of my
favorite tweets was written by Scott Kveton, the CEO of UrbanAirship, when we reached
out to him last year after we caught an error. Here is his reaction:

He still uses us to this day.

Our policy of reaching out to users has proven so invaluable that even today,
with hundreds of companies using us and 35% month-over-month growth, we still
respond personally to every single error we detect, as quickly as we can.

With a team of five, this policy has a secondary benefit of forcing us to be
better about automated testing and code quality. We couldn't do it if errors
scaled with our growth.

Businesses store critical data in our system and we try our best to make
sure that errors never happen. The product has also matured quite a bit in the past
two years. If a problem does slip by, though, we make sure
that the customer knows we're paying attention and on top of it. Reaching out
to them first is the best way to start building credibility and
turning a poor customer experience into a great one.

About the author
I'm Steve Krenzel, a software engineer and co-founder of Thinkfuse.
Contact me at steve@thinkfuse.com.