Elephants, war rooms and 30 petabytes – this is what it takes to keep Facebook’s data in check, according to one of its most recent blog posts. After building out its new data center in Prineville, Oregon, which opened in April this year, the Internet giant then had to find a way of migrating what is said to be the largest Hadoop HDFS cluster in the world into it. Facebook uses Hadoop – a framework named after a toy elephant belonging to its creator’s son – to mine and manipulate its data sets and to distribute its storage across switches and stacks. As Hadoop use grows, however, so too does the power and infrastructure required to accommodate it.

Facebook engineer Paul Yang recently highlighted the challenges Facebook faced when moving the Hadoop application stack to its newest data center. “We considered a couple of different migration strategies,” Yang said. “One was a physical move of the machines to a new data center – we could have moved all the machines within a few days with enough hands at the job.” Such a move, however, would have required Hadoop to be down for a period that was likely too long for Facebook’s analysts and other users.

The solution was a replication system set up to mirror changes from the old cluster to the new, larger one, so that at switchover time traffic could simply be redirected. “This approach is more complex as the source is a live file system, with files being created and deleted continuously,” Yang said. The migration was done in two steps. First, a bulk copy transferred most of the data from the source cluster to the destination using DistCp, a Hadoop tool for copying data between clusters. “Our Hadoop engineers made code and configuration changes to handle special cases with Facebook’s datasets, including the ability for multiple mappers to copy a single large file and for the proper handling of directories with many small files,” Yang said. A custom Hive plug-in was then used to detect file changes and record them in an audit log.
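Yang’s post does not include code, but the two-step flow can be sketched roughly in Python. In this hypothetical sketch, the bulk copy shells out to DistCp and the catch-up pass replays an audit log of creates and deletes; the cluster URIs, log format and paths are illustrative assumptions rather than Facebook’s actual implementation.

```python
import subprocess

SRC = "hdfs://old-cluster:8020"   # hypothetical source namenode URI
DST = "hdfs://new-cluster:8020"   # hypothetical destination namenode URI

def bulk_copy(path="/warehouse"):
    """Step 1: one-off bulk copy of the existing data with DistCp."""
    # -update skips files already present at the destination, so the copy
    # can be re-run safely if it is interrupted partway through.
    subprocess.run(["hadoop", "distcp", "-update", SRC + path, DST + path],
                   check=True)

def replay_audit_log(log_path):
    """Step 2: mirror the changes recorded since the bulk copy started.

    Assumes an audit log with one 'CREATE <path>' or 'DELETE <path>' entry
    per line, roughly what a custom Hive hook might emit (illustrative only).
    """
    with open(log_path) as log:
        for entry in log:
            op, path = entry.strip().split(" ", 1)
            if op == "CREATE":
                subprocess.run(["hadoop", "distcp", "-update",
                                SRC + path, DST + path], check=True)
            elif op == "DELETE":
                subprocess.run(["hadoop", "fs", "-rm", "-r", "-skipTrash",
                                DST + path], check=True)

if __name__ == "__main__":
    bulk_copy()
    replay_audit_log("/var/log/hive/replication_audit.log")  # hypothetical path
```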

“At the final migration switchover time, we set up a camp in a war room and shut down Hadoop JobTracker so that new files would not be created. Then the replication system was allowed to catch up,” Yang said. DNS entries were then changed so hostnames referenced by Hadoop jobs pointed to servers in the new cluster.
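Strung together, the switchover could look something like the sketch below. The JobTracker stop uses the stock hadoop-daemon.sh control script of that Hadoop generation, while the backlog check and the DNS update are stand-ins for Facebook’s internal tooling; they are assumptions, not the actual procedure.

```python
import os
import subprocess
import time

def replication_backlog(log_path, cursor_path):
    """Hypothetical lag check: count audit-log entries not yet replayed."""
    replayed = int(open(cursor_path).read()) if os.path.exists(cursor_path) else 0
    total = sum(1 for _ in open(log_path))
    return total - replayed

def update_dns(hostname, target):
    """Stand-in for the internal DNS change; real tooling would go here."""
    print(f"point {hostname} at {target}")

def switchover(log_path, cursor_path):
    # 1. Stop the JobTracker so no new files are created on the old cluster.
    subprocess.run(["hadoop-daemon.sh", "stop", "jobtracker"], check=True)

    # 2. Let the replication system drain the remaining backlog.
    while replication_backlog(log_path, cursor_path) > 0:
        time.sleep(30)

    # 3. Repoint DNS so hostnames referenced by Hadoop jobs resolve to the
    #    new cluster's servers.
    update_dns("namenode.hadoop.example.com", "new-cluster-vip")
```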

The project was not, however, without its challenges. Yang said that simply designing a replication system able to handle the sheer volume of data involved put the team to the test. Facebook also had to ensure that replication was fast enough to allow corrupt files to be recopied without eating into the team’s schedule. “If the replication process could just barely keep up with the workload, then any recopy could have resulted in missed deadlines.”
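The scheduling worry is, at bottom, a throughput-headroom calculation: if replication can only just match the rate at which new data arrives, a recopy of corrupt data adds a backlog that is never worked off. The numbers below are invented purely to illustrate the point; they are not Facebook’s figures.

```python
# Illustrative figures only -- not Facebook's actual rates.
new_data_per_day_tb = 100.0           # written to the source cluster per day
replication_barely_keeping_up = 100.0 # TB/day replicated: no headroom
replication_with_headroom = 150.0     # TB/day replicated: 50% headroom

recopy_tb = 50.0  # data that must be copied again after corruption is found

# With no headroom the surplus is zero, so the 50 TB backlog never shrinks
# and the switchover deadline slips indefinitely.
surplus_none = replication_barely_keeping_up - new_data_per_day_tb  # 0 TB/day

# With headroom the backlog is absorbed in a bounded time.
surplus = replication_with_headroom - new_data_per_day_tb           # 50 TB/day
days_to_recover = recopy_tb / surplus                               # 1.0 day

print(f"days needed to absorb the recopy with headroom: {days_to_recover:.1f}")
```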

The result has not only been a shift to the new data center, where infrastructure is built out more efficiently and space is available for further scaling out, but also a new way of building in redundancy, according to Yang. “The replication system also demonstrated a potential disaster-recovery solution for warehouses using Hive,” Yang said. “Unlike a traditional warehouse using SAN/NAS storage, HDFS-based warehouses lack built-in data-recovery functionality. The replication system could increase the appeal of using Hadoop for high-reliability enterprise applications.”

Moving IT operations to the cloud saves money compared with using on-site computing facilities. That premise is increasingly taken for granted, but whether it holds actually depends on a number of parameters. A group of academics from Pennsylvania State University looked at the economics of cloud computing and concluded that it currently makes the most economic sense for small businesses and applications with small workloads.


The researchers divided costs into the categories “quantifiable” and “less quantifiable”, and “direct” and “indirect”. From this division they constructed a matrix classifying the various costs.

The report considered the two options of pure in-house and pure cloud-based hosting, but it also considered combinations of the two approaches, which it termed “hybrid” options. Within the hybrid model the researchers identified vertical partitioning and horizontal partitioning. With vertical partitioning, an application might be split so that part of it resides on-premises and another part resides in the cloud. Horizontal partitioning replicates some parts of an application in the cloud so that, as load on the application increases, usage can burst or spill over onto servers running in the cloud.
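The economics behind these options can be illustrated with a toy cost model. Every price and workload figure below is invented for the illustration and is not taken from the Penn State study; the point is simply that a steady baseline is cheapest on owned hardware, while short bursts are cheapest on pay-per-hour cloud capacity.

```python
# Toy comparison of in-house, pure cloud and horizontally partitioned hosting.
# All figures are invented for illustration, not data from the study.

HOURS_PER_MONTH = 730

in_house_per_server_hour = 0.06  # amortized hardware, power and staff
cloud_per_server_hour = 0.15     # on-demand instance price

baseline_servers = 40      # capacity needed around the clock
peak_extra_servers = 60    # additional capacity needed only during peaks
peak_hours_per_month = 50  # how long those peaks last

# Pure in-house: must own enough servers to cover the peak all month long.
in_house = ((baseline_servers + peak_extra_servers)
            * HOURS_PER_MONTH * in_house_per_server_hour)

# Pure cloud: pay only for hours actually used, but at the higher rate.
cloud = (baseline_servers * HOURS_PER_MONTH
         + peak_extra_servers * peak_hours_per_month) * cloud_per_server_hour

# Horizontal partitioning: baseline runs in-house, peaks spill over to the cloud.
hybrid = (baseline_servers * HOURS_PER_MONTH * in_house_per_server_hour
          + peak_extra_servers * peak_hours_per_month * cloud_per_server_hour)

print(f"in-house ${in_house:,.0f}  cloud ${cloud:,.0f}  hybrid ${hybrid:,.0f}")
```

With these made-up numbers, pure cloud hosting ends up costing more than owning the hardware because the baseline runs around the clock, while the spill-over arrangement is the cheapest of the three.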

The report found that “(i) complete migration to today’s cloud is appealing only for small/stagnant businesses/organizations, (ii) vertical partitioning options are expensive due to high costs of data transfer, and (iii) horizontal partitioning options can offer the best of in-house and cloud deployment for certain applications.”