Tech from Ashfaq Afzal Chowdhury (FAHIM)

About Me

I'm from Dhaka, Bangladesh. I'm a simple boy from a simple family, and I'm a student. My favourite subject is physics, and my favourite games are football and cricket. I have a keen interest in computers and other technologies, and I really like video games. As a Muslim, I always believe in Allah.

aAc Fahim

Sunday, August 3, 2014

Yesterday afternoon Facebook experienced the worst outage that the
company has had “in over four years”, causing the site to go down for
most users for “approximately 2.5 hours”. One of the
company’s engineers followed up with a blog post, explaining exactly
what went wrong. The cause of the issue sounds relatively complicated,
but the upshot was that the company had to restart the entire
site.

According to Robert Johnson:

The key flaw that
caused this outage to be so severe was an unfortunate handling of an
error condition. An automated system for verifying configuration values
ended up causing much more damage than it fixed. The intent of the
automated system is to check for configuration values that are invalid
in the cache and replace them with updated values from the persistent
store. This works well for a transient problem with the cache, but it
doesn’t work when the persistent store is invalid. Today we made a
change to the persistent copy of a configuration value that was
interpreted as invalid. This meant that every single client saw the
invalid value and attempted to fix it. Because the fix involves making a
query to a cluster of databases, that cluster was quickly overwhelmed
by hundreds of thousands of queries a second. To make matters worse,
every time a client got an error attempting to query one of the
databases it interpreted it as an invalid value, and deleted the
corresponding cache key. This meant that even after the original problem
had been fixed, the stream of queries continued. As long as the
databases failed to service some of the requests, they were causing even
more requests to themselves.
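
To make the feedback loop in that explanation a little more concrete, here is a small toy simulation. It is only a sketch and not Facebook's actual code: the single shared cache key, the "tick" time steps, the `simulate` function, and all the numbers are assumptions invented for illustration. It compares a client that deletes the cache key whenever a database query fails (the behaviour described above) with one that leaves the cache alone on errors.

```python
def simulate(num_clients, db_capacity, fix_tick, ticks, delete_on_error):
    """Return the number of database queries issued in each tick.

    Toy model (all hypothetical): one shared cache key, a persistent
    store that holds a bad value until `fix_tick`, and a database that
    can only answer `db_capacity` queries per tick.
    """
    VALID, INVALID = "valid", "invalid"
    store = INVALID   # persistent copy starts out with the bad value
    cache = None      # shared cache entry; None means the key is missing
    history = []

    for tick in range(ticks):
        if tick == fix_tick:
            store = VALID  # the persistent copy gets repaired

        # Every client checks the cache at the start of the tick and
        # queries the database if the value is missing or invalid.
        queries = num_clients if cache != VALID else 0
        history.append(queries)

        for i in range(queries):
            if i < db_capacity:
                cache = store   # query served: refresh cache from the store
            elif delete_on_error:
                cache = None    # error treated as "invalid": key is deleted
            # otherwise: on error, leave the cached value untouched

    return history


# 1000 clients, database serves 100 queries per tick, store fixed at tick 2.
print(simulate(1000, 100, 2, 5, delete_on_error=True))
# -> [1000, 1000, 1000, 1000, 1000]
print(simulate(1000, 100, 2, 5, delete_on_error=False))
# -> [1000, 1000, 1000, 0, 0]
```

In the first run the overloaded database keeps failing some queries, each failure wipes the cache key, and so the flood never stops even after the stored value is fixed. In the second run the flood ends one tick after the fix, because failed queries no longer undo the work of the successful ones.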

Now, over to Fahim (Admin):

If you don’t understand what he’s talking about, that’s OK. Most people
probably won’t follow exactly what went wrong, but it sounds as though
the site went into one of those infinite loops of death. You don’t need
to be an advanced programmer to understand how bad infinite loops are,
but you definitely need some engineering know-how. The bottom line is
that it was one of the worst crashes the company has ever experienced,
and they are working to make sure it never happens again!