@magnusljadas

Search for Snow with Hadoop Hive

I don’t know how I ended up becoming the head of our local community association. Anyhow, I’m now responsible for laying out next year’s budget. Most of our expenses seem to be fixed from one year to another, but then there’s the expense for the snow removal service. This year, no snow. Last year, most snow on record in 30 years! How do you budget for something as volatile as snow? I need more data!

Instead of just googling the answer, we’re going to fetch some raw data and feed it into Hadoop Hive.

A quick primer if you're unfamiliar with Hadoop and Hive. Hadoop is an ecosystem of tools for storing and analyzing Big Data. At its core Hadoop has its own distributed filesystem called HDFS that can span hundreds of nodes and thousands of terabytes, even petabytes, of data. One layer above HDFS lives MapReduce – a simple yet effective programming model introduced by Google for distributing computations over data. Hive sits on top of HDFS and MapReduce and brings SQL capabilities to Hadoop.

Think of your ordinary RDBMS as a sports car – a fast vehicle often built on fancy hardware. An RDBMS can answer rather complex queries within milliseconds, at least if you keep your data sets below a couple of million rows. Hadoop is a big yellow elephant. It has traded speed for scalability and brute force – it was conceived to move BIG chunks of data around, and it lives happily on commodity hardware. For the sake of brevity we're going to use some rather small data sets – about 1 megabyte each. That won't even fill a single file block in HDFS (64 megabytes by default). A more realistic example of Hadoop's capabilities would be something like querying 100 billion tweets. An RDBMS can't do that.

You can run Hadoop on your local machine – like me, on an old MacBook Pro using VMware. Just download the latest image from Cloudera.

The Swedish national weather service SMHI provides the data we need: daily temperature and precipitation data from 1961 to 1997, gathered at a weather station about 60 km from where I live.

Log on to your Hadoop instance and open a terminal to download the data:
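The commands below are a sketch of that step – the download URL and station id are placeholders, not SMHI's actual endpoint (check smhi.se for the real link), and the HDFS path assumes Cloudera's default `cloudera` user:

```shell
# Fetch the daily observation file from SMHI
# (hypothetical URL -- substitute the real station export link)
wget -O smhi_data.csv "https://example.smhi.se/climate/station-12345.csv"

# Create a directory in HDFS and copy the file there so Hive can reach it
hdfs dfs -mkdir -p /user/cloudera/weather
hdfs dfs -put smhi_data.csv /user/cloudera/weather/
```

On older Cloudera images the `hdfs dfs` command may instead be spelled `hadoop fs`; the subcommands are the same.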

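With the raw file in HDFS we can map a Hive table onto it and run a first query. The schema below is a sketch – the actual SMHI export's column order, names, and delimiter are assumptions, so adjust them to match the file:

```sql
-- External table laid over the raw file; schema is an assumed SMHI layout
CREATE EXTERNAL TABLE weather (
  obs_date      STRING,   -- observation date, e.g. 1961-01-01
  temperature   DOUBLE,   -- daily mean temperature in degrees Celsius
  precipitation DOUBLE    -- daily precipitation in millimetres
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ';'
LOCATION '/user/cloudera/weather';

-- Count "snow days" per year: sub-zero days with precipitation
SELECT substr(obs_date, 1, 4) AS year,
       COUNT(*)               AS snow_days
FROM   weather
WHERE  temperature < 0 AND precipitation > 0
GROUP  BY substr(obs_date, 1, 4);
```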
Notice how Hive transforms the SQL query into MapReduce jobs. We could of course write those jobs ourselves in Java, but we'd be swamped in code. Hive hides the underlying complexity of MapReduce behind the more convenient and mainstream SQL.

Hive also supports subqueries, so let's calculate the average number of snow days per year:
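A sketch of that calculation, wrapping the per-year counts in a subquery (the `weather` table layout and its column names are assumptions about the SMHI file, and Hive requires the subquery to carry an alias):

```sql
-- Average snow days per year: aggregate over the per-year counts
SELECT AVG(snow_days) AS avg_snow_days
FROM (
  SELECT substr(obs_date, 1, 4) AS year,
         COUNT(*)               AS snow_days
  FROM   weather
  WHERE  temperature < 0 AND precipitation > 0
  GROUP  BY substr(obs_date, 1, 4)
) yearly;
```

That single number is a far better basis for a snow-removal budget line than one snowless year followed by a record winter.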