Blog

According to IBM 90% of the world’s data has been created in the last two years (note: Nicki Minaj got a twitter account 2 years ago. Coincidence? I think not). Of this ever-expanding sea of data, 80% of it is “unstructured” (well, I’ll say!). Sheer size alone makes analysis of petabytes of information difficult, but it's harder still due to thecomplexity of unstructured data. In the age of big data we’re faced with not only storage issues but also computational power restrictions. The easy solution would be to throw out some of Minaj’s less-erudite tweets, but companies and their data scientists have concluded that there is gold in them thar hills--they just need to devise ways to store and process it.

Today, many data scientists agree that best way to extract value from Nicki Minaj’s and the other 100 million users' tweets is by using Apache Hadoop (or one of its commercial variants). The result of a beautiful marriage between infrastructure and programming model, Hadoop is the solution to our big data woes! In 2004 Google published a paper entitled “MapReduce: Simplified Data Processing on Large Clusters” (great beach read, btw) which outlined the MapReduce programming model that allows users to process and generate data sets from distributed file systems. A MapReduce job has two phases, unexpectedly named "Map" and "Reduce." In the Map phase, a dataset to be queried (e.g. from Minaj's tweets, pull all strings of consonants longer than 8 consonants) is collected, chopped up, and assigned to the fleet of plebeian servers. In the Reduce phase, those little Minaj analyses are reunited as one big output. There are: 2557 strings of consonants longer than 8.

MapReduce paired with the Hadoop Distributed File System (aka HDFS, aka the files that run on fleet of plebeian servers) is the fastest most efficient way for people, companies, and governments to analyze and store data.

Little known companies like “Facebook” and “Yahoo” are now utilizing Hadoop to warehouse their data. Keeping in step with the ostensible value of data, one might be wise to an eye out for this “Facebook,” as they’re cranking out ½ a petabyte daily! With that much “valuable” data, chances are they’re making a killing for their shareholders!

In a typical Hadoop slave node, a node running as a datanode and tasktracker, it is typical to provide 75% of the disk for HDFS storage and the remaining 25% for MapReduce intermediate data. The MapReduce intermediate data is the data created after a map task has run over an input split, typically a HDFS block. Given a single disk, this is simple task, but with multiple slavenode disks the decision becomes more complicated. We want to choose the best disk configuration to maximize all available resources.

Assume we have 4x 1TB disks in our example slavenode.

The logical assumption would be to assign 3x 1TB disks for HDFS storage and 1x 1TB disk for MapReduce intermediate data storage. The problem with this approach is that we sacrifice potential HDFS throughout by assigning one full volume to only store MapReduce intermediate data.

The better approach is to store both HDFS and MapReduce intermediate data on each disk on the slave node. This can be accomplished a few different ways. One way would be to use separate partitions, but using this method would leave you stuck if you ever needed to change the percentage split (e.g. allocated more storage for HDFS or MapReduce intermidate data). Another way is to use the dfs.datanode.du.reserve configuration property in hdfs-site.xml to control the split by specifying separate directories for each on the same volume. This would allow you to modify the capacity available to HFDS on the fly after a namenode restart.

Another inserting solution, I heard a classmate say, is to use the native Linux file system user disk quota system. This should work because the MapReduce daemon and HDFS daemon both run under separate user accounts, but I’m not sure how thoroughly it has been tested.

I recently attended Cloudera’s Hadoop Training for Administrators course in Columbia, MD. You can read my recap of the first and second days here and here.On the third day, we covered cluster maintenance, monitoring, and benchmarking, job logging, and data importing. Cluster Maintenance

HDFS clusters can become unbalanced when new nodes added to the cluster leading to performance issues.

Clusters can be rebalanced using the balancer command, which adjusts block placements on nodes within a set threshold value. The balancer command should only be used after adding new nodes to a cluster

Namenode backup

Single point of failure (at this point in time).

If namenode metadata is lost then the cluster is lost.

The fsimage and edits file are the two primary files that write metadata on disk. The fsimge file doesn’t write every change to HDFS file metadata onto disk; rather it appends changes incrementally to the edits log.

At startup, the namenode loads the fsimage into ram then replays all entries from the edits log.The two logs are merged at set intervals on the Checkpoint node (aka secondary name node). This node copies both files loads them into RAM, merges them, then copies the results back to the Namenode.

The checkpoint node does store a copy of these two files but depending on the last merge the data could be stale. It’s not meant to be a failover node, but more of a housekeeper.

Wrigley recommends writing to two local directories on difference physical volumes and to a NFS directory. You can also retrieve copies of the namenode meta data over HTTP:

fsimage: http://<namenode>:50070/getimage?getimage=1

edits: http://<namenode>:50070/getimage?getedit=1

Cluster monitoring and Troubleshooting

Use a general system monitoring tool like Nagios, Cacti, etc.. to monitor your cluster.

.log is the first port of call for diagnosis issues. It uses log4j. The Configuration for log4j is stored at conf/log4j.properties.

Logs are rotated daily.

Old logs are not deleted.

.out is the combination of stdout and stderr, and doesn’t usually contain much output.

Appenders are the destination for log messages

The one that ships with Hadoop, DailyrollingFileAppender, is limited. It rotates the log file daily, but doesn’t limit size of log or number of files—the admin has to provide scripts to manage this.

CDH ships an alternate appender called RollingFileAppender that addresses the limitation of the default appender.

Job logs created by Hadoop

When a job runs 2 files are created, the Job XML configuration file and the job status file.These files are stored in multiple places on local disk and in HDFS:

Hadoop_log_dir/<job_id>_conf.xml (default is 1 day)

Hadoop_log_dir/history (default is 30 days)

<job_output_dir_in_HDFS)/_logs/history (default is forever)

Jobtracker also keeps them in memory for a limited time

Developer logs are stored in the Hadoop_log_dir/userlogs (location is hardcoded). You should be wary of large developer logs files, as they can cause salve nodes to run out of space. By default, dev logs are deleted every 24 hours.

You should test clusters before and after adding nodes to establish a baseline. It’s also good to do before and after upgrades.

Cloudera is working on a benchmarking guide.

Populating HDFS from External Resources

Flume is a distributed, reliable, available service for moving large amounts of data as it is produced.It wascreated at Cloudera as a spinoff of Facebook’s Scribe.

Flume is ideally suited for gathering logs from multiple systems as they are generated.

It’s configurable through a web browser or CLI., and can be extended by adding connectors to existing storage layers or data platforms.

General sources already provided include data form files, syslog, and stdout from a process.

Wrigley said there were some latency issues with Flume that are being fixed in the next minor version.

SQOOP is the SQL-to-Hadoop database import tool. It was developed at Cloudera, is open-source, and is included as port of CDH (It’s about to become a top level Apache project).

Sqoop uses JDBC to connect to RDBMS.

It examines each table and automatically generates a Java class to import into HDFS then creates and runs a Map-only MR job to import the data. (Aside: per Mike Olson, you would have to be crack-pipe crazy to run MapReduce 2 in production.)

By default, four mappers connect to the RDBMs, and each imports a quarter of the data.

Sqoop features:

Imports a single table, or all tables in a database

Can specify which rows to import with a WHERE clause

Can specify columns to import

Can provide an arbitrary SELECT statement

Can automatically create a Hive table based on imported data

Supports incremental imports of data

Can export data from HDFS back to a database table

Cloudera has partnered with third parties (Oracle, MicroStrategy, and Netezza) to create native Sqoop connectors that are free but not open source.

MicroStrategy has their own version of SQOOP for SQL server derived from SQOOP open source.

Best practices for importing data

Import data to an intermediate directory in HDFS; then once it’s completely uploaded in HDFS, move it to the final destination. This prevents other clients from believing the file is there until it is completely there and ready to be processed.

Installing and managing other Hadoop projects

Hive metabase should be stored in RDBMS such as MySQL. This is a simple configuration:

Create a user and database in RDBMs

Modify hive-site.xml on each user’s machine to point to the shared Metastore

I recently attended Cloudera’s Hadoop Training for Administrators course in Columbia, MD. You can read my recap of the first day here.In the second day, we learned how to deploy a cluster, install and configure Hadoop on multiple machines, and manage and schedule MapReduce jobs.

1. Deploying your cluster:

There are 3 different deployment types:

Local (dev)

No daemons, data is stored on the local disk no HDFS, everything runs in a single JVM

CDH is avalaible in multiple formats (packages and tarballs). Packages are recommended because they include some features not in the tarball.

Startup scripts can also install the hadoop native libraries. They provide better performance for some Hadoop components.

Hadoop includes scripts called start-all.sh and stop-all.sh. These connect to TTs and DNs to start and stop daemons. Don’t use these scripts, as they require passwordless SSH access to slave nodes.

Hadoop does not use SSH for any internal communications.

You can verify installation with sample jobs shipped with Hadoop

Use CD Manager tool for easy deployment and configuration

Hadoop Configuration Files:

Each machine in the cluster has its own configuration files that reside in /etc/hadoop/conf

Primary files are written in XML

.20 onwards configurations have been separated out based on functionality

Core properties: core-site.xml

HDFS properties: hdfs-site.xml

MapReduce properties: mapred-site.xml

Hadoop-env.sh sets some environment variables.

A lot of the default values are performance stale because they are based on a couple year old hardware specs.

Best configuration practices are still in the works b/c of the age of Hadoop. Cloudera hopes to have a guide out sometime this year.

Rack awareness:

Distributes HDFS blocks based on a host’s location

Important for cluster reliability, scalability, and performance.

The default setting places all nodes in a single rack /default-rack

Script can use a flat file, database, etc.

A common scenario is the name your hosts back on rack location so a script can simple deconstruct the name to find the host’s location.

You can use IP addresses of names to indentify nodes in Hadoop’s configuration file.

Most people use names rather than IPs

Cluster daemons generally need to be restarted to read in configuration file changes

Configuration Management tools:

Use configuration management software to manage machine configuration changes at once

Start early; retrofitting these changes is always a pain.

There are alternatives to using Cloudera Manager like Puppet and Chef

3. Managing and Scheduling jobs

You can control MapReduce jobs at the command line or through the web interface. We went over two job schedulers:

FIFO is the default scheduler shipped with Hadoop. With FIFO, Jobs run in order submitted, but this is generally a bad idea when multiple people are using the cluster. The only exception to this is when the developer specifies the priority, but if all jobs are given the same priority then you get the idea.

Fair Scheduler is shipped with CDH. Fair scheduler is designed for multiple users to share the cluster simultaneously. MapReduce tasks are split up into different pools and are allocated based on pool configuration parameters.