Hadoop is an open-source software framework for the distributed storage and distributed processing of very large datasets. All of Hadoop's modules are designed with the assumption that hardware failures are common and should be handled automatically by the framework.

At the core of Apache Hadoop are a storage component, the Hadoop Distributed File System (HDFS), and a processing component, MapReduce. Hadoop splits files into large blocks and distributes them across the nodes in a cluster. Because the data is spread across many nodes, it can be processed in parallel, which significantly reduces processing time.

In this article, we’ll walk through the process of integrating Hadoop and Python by moving Hadoop data into a Python program.

HDFS and YARN

Let’s start by defining the terms.

HDFS

The Hadoop Distributed File System (HDFS) is a distributed, scalable, and portable file system written in Java for the Hadoop framework. It is the primary storage system used by Hadoop applications.

YARN

YARN is a resource-management platform responsible for managing computing resources in clusters and scheduling users' applications on them. Its fundamental objective is to split the functions of resource management and job scheduling/monitoring into separate daemons: a global ResourceManager (RM) and a per-application ApplicationMaster (AM).

The ResourceManager and the NodeManager form the data computation framework. The ResourceManager is the ultimate authority that arbitrates resources among all the applications in the system. The NodeManager is the per-machine framework agent responsible for containers, monitoring their resource usage (CPU, memory, disk, network), and reporting the same to the ResourceManager/Scheduler.

The YARN ResourceManager also offers a web interface for monitoring cluster status and running applications.

MRJob and Real-World Applications

While Hadoop streaming is a simple way to run MapReduce tasks, it is not very user-friendly: when jobs fail, debugging our code is cumbersome. In addition, if we wanted to join two different sets of data, handling both with a single mapper would be complicated.
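For reference, here is what a classic Hadoop-streaming word count looks like. A real streaming job would split the mapper and reducer into separate scripts (e.g. `mapper.py` and `reducer.py`, names illustrative) wired together on the `hadoop jar ... -mapper ... -reducer ...` command line; this sketch combines both halves into one file to show the plumbing:

```python
import sys


def map_words(lines):
    # Mapper: emit a (word, 1) pair for every word.
    # Hadoop streaming feeds the input text to the mapper on stdin.
    for line in lines:
        for word in line.strip().split():
            yield word.lower(), 1


def reduce_counts(pairs):
    # Reducer: sum the counts for each word.
    # Hadoop streaming guarantees pairs arrive at the reducer sorted by key,
    # so we can total a word's counts as soon as the key changes.
    current_word, total = None, 0
    for word, count in pairs:
        if word != current_word:
            if current_word is not None:
                yield current_word, total
            current_word, total = word, 0
        total += count
    if current_word is not None:
        yield current_word, total


if __name__ == "__main__":
    # Simulate the shuffle/sort phase locally by sorting the mapper output.
    pairs = sorted(map_words(sys.stdin))
    for word, total in reduce_counts(pairs):
        print(f"{word}\t{total}")
```

Notice the bookkeeping the reducer needs just to detect key boundaries; MRJob handles all of this for us.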

MRJob is a library written and maintained by Yelp that allows us to write MapReduce jobs in Python. It has extensive documentation and lets you run and test your code locally without a Hadoop cluster.

Here's the word count MapReduce (a commonly used example program for demonstrating MapReduce logic), rewritten using MRJob:

What if we wanted to perform a calculation that involves multiple steps? For example, what if we wanted to count the words in documents stored in our database and then find the most common word being used? Roughly, this would involve the following steps:

1. Map each word in a document to a (word, 1) pair.
2. Reduce those pairs into a total count for each word.
3. Reduce again over all the (count, word) pairs to find the word with the highest count.

Hadoop and MRJob are versatile enough to help you answer numerous questions in your dev environment. There are other Python MapReduce libraries you can use, but MRJob's documentation, logging support, and ability to run your Python code without translation make it an ideal option. See the documentation below for inspiration on how you can work these two tools into your data science arsenal.