When you want to test some “real” stuff, it can be useful to have a Hive table with some “big” data in it.

Let’s build a table based on Wikipedia traffic stats. It will contain page counts for every page since 2008, which also gives us a natural way to partition the data.

These stats are available at http://dumps.wikimedia.org/other/pagecounts-raw/, so let’s grab those files first.
There are a lot of files; here we’re only getting the first eight days of 2008/01, each file weighing between 20 and 30MB.

I installed axel, a download accelerator, available in the EPEL repo or as an RPM:

[vagrant@gw ~]$ sudo rpm -ivh http://pkgs.repoforge.org/axel/axel-2.4-1.el6.rf.x86_64.rpm
[vagrant@gw pagecounts]$ for YEAR in {2008..2008}; do
>   for MONTH in {01..01}; do
>     for DAY in {01..08}; do
>       for HOUR in {00..23}; do
>         axel -a http://dumps.wikimedia.org/other/pagecounts-raw/${YEAR}/${YEAR}-${MONTH}/pagecounts-${YEAR}${MONTH}${DAY}-${HOUR}0000.gz
>       done
>     done
>   done
> done
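If axel isn’t available, the same list of URLs can be generated in a few lines of Python and fed to any downloader (e.g. `wget -i urls.txt`). This is a minimal sketch of the URL pattern used by the loop above; the `pagecount_urls` helper is just a name I made up for illustration:

```python
from itertools import product

# Base URL of the pagecounts-raw dumps (same as in the axel loop above)
BASE = "http://dumps.wikimedia.org/other/pagecounts-raw"

def pagecount_urls(year, month, days, hours=range(24)):
    """Build the hourly pagecounts-raw URLs for the given days and hours."""
    urls = []
    for day, hour in product(days, hours):
        urls.append(
            f"{BASE}/{year}/{year}-{month:02d}/"
            f"pagecounts-{year}{month:02d}{day:02d}-{hour:02d}0000.gz"
        )
    return urls

# The first eight days of 2008/01, 24 hourly files per day
urls = pagecount_urls(2008, 1, days=range(1, 9))
print(len(urls))  # 8 days x 24 hours = 192 files
print(urls[0])
```

Writing the list to a file and running `wget -i` on it downloads the same set of files, just without axel’s parallel connections.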