Tools and insights for a data scientist

Large-scale data analysis has been around for years, but only recently has industry started to recognize it as a valuable way to improve a range of business processes. In this publication we have collected some of the numerous open resources available to everybody working in the field of data mining.

Part 1, by Ryan Swanstrom, lists the most important papers in the field of data science.

Part 3, by Greg Reda, tells how data analysis can be done using basic Unix commands.

Part 1. 7 Important Data Science Papers

It is back-to-school time, and here are some papers to keep you busy this school year. All the papers are free. This list is far from exhaustive, but these are some important papers in data science and big data.

Google Search

PageRank – This is the paper that explains the algorithm behind Google search.

Hadoop

MapReduce – This paper explains a programming model for processing large datasets. In particular, it is the programming model used in Hadoop.

Google File System – Part of Hadoop is HDFS. HDFS is an open-source version of the distributed file system explained in this paper.

NoSQL

These are 2 of the papers that drove/started the NoSQL debate. Each paper describes a different type of storage system intended to be massively scalable.

Bonus Paper

About the author: Ryan Swanstrom lives in South Dakota, USA. He is a full-time web developer who builds data products and blogs about learning data science.

Part 2. 50+ Open Source Tools for Big Data

It was not easy to select a few out of many Open Source projects. My objective was to choose the ones that fit Big Data’s needs most. What has changed in the world of Open Source is that the big players have become stakeholders: IBM’s alliance with Cloud Foundry, Microsoft providing a development platform for Hadoop, Dell’s OpenStack-Powered Cloud Solution, VMware and EMC partnering on Cloud, and Oracle releasing its NoSQL database as Open Source.

“If you can’t beat them, join them”. History has vindicated the Open Source visionaries and advocates.

Part 3. Useful Unix commands for data science

Imagine you have a 4.2GB CSV file. It has over 12 million records and 50 columns. All you need from this file is the sum of all values in one particular column.

How would you do it?

Writing a script in python/ruby/perl/whatever would probably take a few minutes, and then even more time for the script to actually complete. A database and SQL would be fairly quick, but then you’d have to load the data, which is kind of a pain.

Thankfully, the Unix utilities exist and they’re awesome.

To get the sum of a column in a huge text file, we can easily use awk. And we won’t even need to read the entire file into memory.

Let’s assume our data, which we’ll call data.csv, is pipe-delimited ( | ), and we want to sum the fourth column of the file.

1. Use the cat command to stream (print) the contents of the file to stdout.

2. Pipe the streaming contents from our cat command to the next one – awk.

3. With awk:

   a. Set the field separator to the pipe character (-F "|"). Note that this has nothing to do with our pipeline in point #2.

   b. Increment the variable sum with the value in the fourth column ($4). Since we used a pipeline in point #2, the contents of each line are being streamed to this statement.

   c. Once the stream is done, print out the value of sum, using printf to format the value with two decimal places.
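The steps above can be sketched as a single pipeline. This is a reconstruction from the description, not necessarily the exact command that was run on the 4.2GB file. The temporary directory and the three sample rows are made up so the sketch can run on its own.

```shell
# Hypothetical sample data: pipe-delimited, with the amount in the fourth column.
tmpdir=$(mktemp -d)
data="$tmpdir/data.csv"
printf '%s\n' \
  'a|b|c|10.50|e' \
  'f|g|h|20.25|j' \
  'k|l|m|5.00|o' > "$data"

# Stream the file, split each line on "|", add column 4 to sum,
# and print the total with two decimal places once the input ends.
total=$(cat "$data" | awk -F "|" '{ sum += $4 } END { printf "%.2f\n", sum }')
echo "$total"
```

awk can also read the file directly (awk -F "|" '...' data.csv), which saves the extra cat process. The pipeline form is kept here because it mirrors the steps as described.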

It took less than two minutes to run on the entire file – much faster than other options and written in a lot fewer characters.

Hilary Mason and Chris Wiggins wrote over at the dataists blog about the importance of any data scientist being familiar with the command line, and I couldn’t agree with them more. The command line is essential to my daily work, so I wanted to share some of the commands I’ve found most useful.

For those who are a bit newer to the command line than the rest of this post assumes, Hilary previously wrote a nice introduction to it.

Other commands

head & tail

Sometimes you just need to inspect the structure of a huge file. That’s where head and tail come in. Head prints the first ten lines of a file, while tail prints the last ten lines. Optionally, you can include the -n N parameter (for example, -n 3) to change the number of lines displayed.
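As a rough illustration, the sketch below peeks at both ends of a small file. The file contents are hypothetical, and the line count is passed with the -n spelling of the option.

```shell
tmpdir=$(mktemp -d)
data="$tmpdir/data.csv"

# Hypothetical file: a header row followed by 20 numbered records.
{
  echo 'id|name'
  i=1
  while [ "$i" -le 20 ]; do
    echo "$i|row$i"
    i=$((i + 1))
  done
} > "$data"

# First three lines: the header plus two records.
first_three=$(head -n 3 "$data")
# Last two lines of the file.
last_two=$(tail -n 2 "$data")

echo "$first_three"
echo "$last_two"
```

Looking at the first few lines is usually enough to confirm the delimiter and whether a header row is present before running anything heavier.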

wc (word count)

By default, wc will quickly tell you how many lines, words, and bytes are in a file. If you’re looking for just the line count, you can pass the -l parameter in.

I use it most often to verify record counts between files or database tables throughout an analysis.

wc data.csv
# 377 1697 17129 data.csv

wc -l data.csv
# 377 data.csv

grep

Grep allows you to search through plain text files using regular expressions. I tend to avoid regular expressions when possible, but still find grep to be invaluable when searching through log files for a particular event.

There’s an assortment of extra parameters you can use with grep, but the ones I tend to use the most are -i (ignore case), -r (recursively search directories), -B N (N lines before), and -A N (N lines after).
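A small sketch of those four parameters follows, run against a made-up log file. The file name app.log and its contents are assumptions for illustration only.

```shell
tmpdir=$(mktemp -d)
log="$tmpdir/app.log"

# Hypothetical log file.
printf '%s\n' \
  'INFO  starting job' \
  'INFO  loading data.csv' \
  'Error: could not parse row 42' \
  'INFO  retrying' \
  'INFO  job finished' > "$log"

# -i ignores case, so the pattern "error" also matches "Error".
matches=$(grep -i 'error' "$log")

# -B 1 and -A 1 add one line of context before and after each match.
context=$(grep -i -B 1 -A 1 'error' "$log")

# -r searches every file under a directory and prefixes each hit with its path.
recursive=$(grep -r -i 'error' "$tmpdir")

echo "$context"
```

The context options are handy with logs because the line before an error often says what the program was attempting at the time.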

sed

Sed is similar to grep and awk in many ways; however, I find that I most often use it when I need to do some find-and-replace magic on a very large file. The usual occurrence is when I’ve received a CSV file that was generated on Windows and my Mac isn’t able to handle the carriage returns properly.
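One way to handle the Windows case is sketched below. It assumes the file uses CRLF line endings and that stripping the trailing carriage return is the desired fix; the file names are hypothetical. The carriage return is built with printf because the \r escape is not understood by every sed implementation.

```shell
tmpdir=$(mktemp -d)
data="$tmpdir/windows.csv"

# Hypothetical file with Windows (CRLF) line endings.
printf 'id|amount\r\n1|10.50\r\n2|20.25\r\n' > "$data"

# Hold a literal carriage return in a variable so the pattern does not
# depend on sed understanding the \r escape.
cr=$(printf '\r')

# Delete the carriage return at the end of every line.
sed "s/${cr}\$//" "$data" > "$tmpdir/unix.csv"

# Count the carriage-return bytes left in the output; 0 means it is clean.
remaining=$(tr -cd '\r' < "$tmpdir/unix.csv" | wc -c | tr -d ' ')
echo "$remaining"
```

Writing to a new file rather than editing in place keeps the original around for comparison, which is a cautious default when the input is very large.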

sort & uniq

Sort outputs the lines of a file in order based on a column key using the -k parameter. If a key isn’t specified, sort compares each whole line as a single string, which in practice means the order is driven by the first column. The -n and -r parameters allow you to sort numerically and in reverse order, respectively.
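The difference between the default ordering and a keyed numeric sort can be sketched as follows. The sample records are made up, and the -t parameter (which sets the field delimiter) is not mentioned above but is needed here because the data is pipe-delimited.

```shell
tmpdir=$(mktemp -d)
data="$tmpdir/data.csv"

# Hypothetical pipe-delimited records: name|amount.
printf '%s\n' \
  'carol|25' \
  'alice|9' \
  'bob|100' > "$data"

# No key: whole lines are compared as text, so names decide the order.
by_line=$(sort "$data")

# -t sets the delimiter, -k2,2 restricts the key to column 2,
# -n compares numerically and -r reverses the order (largest first).
by_amount_desc=$(sort -t '|' -k2,2 -n -r "$data")

echo "$by_line"
echo "$by_amount_desc"
```

Without -n, the second column would be compared as text, and 100 would sort before 25 and 9.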

Sometimes you want to check for duplicate records in a large text file – that’s when uniq comes in handy. Because uniq only compares adjacent lines, the input should be sorted first. By using the -c parameter, uniq will output the count of occurrences along with the line. You can also use the -d and -u parameters to output only duplicated or unique records.
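A minimal sketch of the three parameters follows, using made-up records. The input is piped through sort first because uniq only compares neighbouring lines.

```shell
tmpdir=$(mktemp -d)
data="$tmpdir/data.csv"

# Hypothetical records with some repeats, deliberately out of order.
printf '%s\n' 'b|2' 'a|1' 'b|2' 'c|3' 'a|1' 'b|2' > "$data"

# -c prefixes each distinct line with the number of times it occurs.
counts=$(sort "$data" | uniq -c)

# -d keeps only lines that occur more than once.
dupes=$(sort "$data" | uniq -d)

# -u keeps only lines that occur exactly once.
singles=$(sort "$data" | uniq -u)

echo "$counts"
```

Piping the counts through sort -n -r is a common follow-up when the goal is to see the most frequent records first.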

While it’s sometimes difficult to remember all of the parameters for the Unix commands, getting familiar with them has been beneficial to my productivity and allowed me to avoid many headaches when working with large text files.