Posted by Soulskill on Friday November 09, 2012 @12:43PM
from the named-after-a-desire-to-launch-hadoop-into-the-sun dept.

Nerval's Lobster writes "Facebook's engineers face a considerable challenge when it comes to managing the tidal wave of data flowing through the company's infrastructure. Its data warehouse, which handles over half a petabyte of information each day, has expanded some 2500x in the past four years, and that growth isn't going to end anytime soon. Until early 2011, those engineers relied on a MapReduce implementation from Apache Hadoop as the foundation of Facebook's data infrastructure. Still, despite Hadoop MapReduce's ability to handle large datasets, Facebook's scheduling framework (in which a large number of task trackers handle duties assigned by a job tracker) began to reach its limits. So Facebook's engineers went to the whiteboard and designed a new scheduling framework named Corona."
Facebook is continuing development on Corona, but they've also open-sourced the version they currently use.
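
For readers unfamiliar with the arrangement the summary describes (many task trackers handling duties assigned by one job tracker), here is a toy sketch. The class names, the round-robin policy, and the job names are all assumptions made for illustration; this is not Hadoop's or Corona's actual code or API.

```python
from collections import deque


class TaskTracker:
    """Hypothetical worker that runs whatever task it is handed."""

    def __init__(self, name):
        self.name = name
        self.completed = []

    def run(self, task):
        # A real worker would execute the task; here we only record it.
        self.completed.append(task)


class JobTracker:
    """Hypothetical single coordinator: every assignment passes through it."""

    def __init__(self, trackers):
        self.trackers = trackers
        self.pending = deque()

    def submit(self, job_name, num_tasks):
        # Split a job into named tasks and queue them centrally.
        for i in range(num_tasks):
            self.pending.append(f"{job_name}-task{i}")

    def schedule(self):
        # Round-robin assignment (an assumed policy). One loop in one
        # process touches every task of every job.
        i = 0
        while self.pending:
            tracker = self.trackers[i % len(self.trackers)]
            tracker.run(self.pending.popleft())
            i += 1


trackers = [TaskTracker(f"tt{n}") for n in range(3)]
jt = JobTracker(trackers)
jt.submit("jobA", 4)
jt.submit("jobB", 2)
jt.schedule()
```

In this sketch all jobs share one queue and one scheduling loop, which gives a rough feel for why a single coordinator can become the limiting factor as the number of jobs and workers grows.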

Seriously, the Job Tracker just didn't scale well and applications had to worry about it - that's a broken architecture, not a broken application or deployment. Blaming the application or deployment for serious fundamental architectural flaws of the platform is much like blaming an application programmer in 1980 for using a=a+1 which a compiler happened to implement less efficiently than a++ or even a+=1 (or, for you old timers, a=+1 not to be

But between you and 1000 other people who care about slightly different sets, much of it is stuff that someone cares about.

This. 99.9% (at least) of the entire internet is junk that any one person doesn't care about. But every bit has someone who cares about it (or did at one time) or it wouldn't be there.

Well. I opened the story expecting some reflexive Facebook-bashing, and I wasn't disappointed. When are people going to realize that FB's just another internet company with a reasonably successful business model, and worthy of neither adulation nor hatred?

This. 99.9% (at least) of the entire internet is junk that any one person doesn't care about.

I've done a crawl of a few billion pages.

No person at all cares about 99% of the content available on the internet. In fact, nearly that much is completely unreadable, machine-generated gibberish (real words, not sentences) created in an attempt to fool Google and other search engines.

There are a few servers which host millions of subdomains with millions of manufactured pages under each subdomain.

When are people going to realize that FB's just another internet company with a reasonably successful business model, and worthy of neither adulation nor hatred?

Wrong. FB is worthy of hatred because what they do is inherently evil. They spy on people, and sell off that information.

The "it's just a job/business" excuse doesn't work when the job/business is evil.
For example, when the local Mafia goons come to collect protection money, it's "just a job" for them right? Nothing personal. They're just

What you say is sort of true, but I disagree that it is inherently evil. Evil implies a malicious intent. At worst, it's simply sociopathic. Facebook is doing what it's doing so that it can make money, and its methods aren't even remotely secret. They would have no power at all if it weren't handed to them gleefully by people.

Further, it's disingenuous to compare them to the mafia and similar, for one simple reason. The mafia does what it does against people who are unwilling participants. Facebook on th

Hadoop: massive data storage system framework... "Apache Hadoop is an open-source software framework that supports data-intensive distributed applications"

MapReduce: a way of managing distributed clusters of data sets... "MapReduce is a programming model for processing large data sets, and the name of an implementation of the model by Google. MapReduce is typically used to do distributed computing on clusters of computers"

Scheduling framework: a framework for providing optimal scheduling of something such t
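
To make the MapReduce definition quoted above a little more concrete, here is a minimal single-machine sketch of the programming model: a map step emits key/value pairs, a shuffle step groups them by key, and a reduce step combines each group. The function names and the word-count task are assumptions chosen for illustration; this is not Hadoop's or Google's API, and nothing here is distributed.

```python
from collections import defaultdict


def map_phase(document):
    # Emit a (word, 1) pair for every word in one input document.
    for word in document.lower().split():
        yield word, 1


def shuffle(pairs):
    # Group all emitted values by key, as a framework would do between phases.
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    # Combine all values for one key into a single result.
    return key, sum(values)


def word_count(documents):
    pairs = (pair for doc in documents for pair in map_phase(doc))
    return dict(reduce_phase(k, v) for k, v in shuffle(pairs).items())


counts = word_count(["big data big cluster", "data flows"])
```

On a real cluster the map and reduce calls would run on many machines at once, and deciding which machine runs which call is the scheduling problem the story is about.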

No snark intended... no sarcasm given. The terms describe things that are technical. If you want something more generic, I could go as far as "Database management architecture" and "database communication architecture" but that dumbs things down to the point where it adds nothing to the discussion. If you don't understand what a database is and how it works (and that we're talking about database management here), you're going to find this entire article over your head, not just the industry buzzwords.

after paging through the code a bit, i found it interesting that they use java in their implementation (not just corona, but hadoop as well). i was wondering why, and after some googling found this link [nabble.com] which helped explain the situation a bit clearer.

pretty interesting stuff. but i'd be willing to bet google's MapReduce is written in c/c++

Until you process petabytes of data and suddenly it makes a difference of a couple of hours per run. All the cool dynamic web technology is really nice and empowering, but once you start hitting real traffic, it makes sense to invest in more efficient core systems. See G-WAN [gwan.com] for how to do it right.

I have to admit, while I hate using Facebook, and hate most of their business practices, I like how they're not just writing new infrastructure software, but are open-sourcing it all. I don't think it quite makes up for everything else, but it helps.

They could start by actually deleting deleted content. Seems simple to me. Let's hope their shortsightedness continues when everyone jumps ship for the next social fad, and continuing this rat race becomes far too costly.

They could, but why should they put themselves at a disadvantage over Google, every other corporation that buys such data and the NSA, who all most certainly do not delete stuff in the way you'd like them to?