Wednesday, June 26, 2013

It seems like it should be a simple thing: create an empty DataFrame in the Pandas Python Data Analysis Library. But if you want to create a DataFrame that

- is empty (has no records)
- has data types
- has columns in a specific order

...i.e. the equivalent of SQL's CREATE TABLE, then it's not obvious how to do it in Pandas, and I wasn't able to find any one web page that laid it all out. The trick is to use an empty Numpy ndarray in the DataFrame constructor:
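A minimal sketch of that trick (the column names and dtypes here are illustrative, not from any particular schema):

```python
import numpy as np
import pandas as pd

# An empty structured ndarray gives the DataFrame zero records,
# columns in a fixed order, and a dtype for each column.
df = pd.DataFrame(np.empty(0, dtype=[("id", "i4"),
                                     ("name", "O"),
                                     ("price", "f8")]))
```

The structured dtype plays the role of SQL's column definitions: `df` has no rows, but `df.columns` and `df.dtypes` are fully specified.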

Tuesday, June 18, 2013

Sparklines are tiny graphs added inline to data tables or lists. Sparklines have no units or gridlines; the purpose is just to quickly convey the shape of the data, which is usually presented alongside.

A powerful variant on sparklines is the sparkline histogram, or "sparkgram" as I like to call it. Sparkgrams can quickly and compactly convey data distributions: outliers, right/left skew, normal vs. uniform distribution, etc. Because they are small, they can be included right inside a data table for every column or row. Such a presentation is useful either for completely known data when embarking upon data analysis, or for newly received data of a known data type in, for example, a manufacturing quality control scenario.

Below is an example of how it might look. The first column demonstrates an outlier, while the second column conveys a normal distribution.
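One way such a figure might be produced is sketched below using matplotlib; the data, sizes, and file name are illustrative assumptions:

```python
import matplotlib
matplotlib.use("Agg")  # render off-screen, no display needed
import matplotlib.pyplot as plt
import numpy as np

def sparkgram(ax, values, bins=10):
    # A tiny histogram with no axes, units, or gridlines --
    # just the shape of the distribution.
    ax.hist(values, bins=bins, color="gray")
    ax.set_axis_off()

rng = np.random.default_rng(0)
fig, axes = plt.subplots(1, 2, figsize=(2.0, 0.4))
sparkgram(axes[0], np.append(rng.normal(size=100), 8.0))  # outlier
sparkgram(axes[1], rng.normal(size=100))                  # normal
fig.savefig("sparkgrams.png", dpi=100)
```

Each cell of a data table would get one such miniature axes, small enough to sit beside the numbers it summarizes.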

Thursday, June 13, 2013

IPython Notebook and pandas together comprise a data analysis system that is essentially a clone of R. The advantage over R is that Python code can be more easily converted into production code and executed, for example, on a web server.

One limitation is that the Apache-recommended Python interface to Hive requires installing Hive locally, which can be problematic or inconvenient from, say, a laptop used for analysis. A nice alternative is remotely executing Hive on the Hadoop cluster (or an adjacent Linux server with access to the cluster). The alternative does require installing sshpass. At the end of the code below, the query results are stored in a pandas DataFrame with column names and automatic detection of data types.
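The remote-execution approach might be sketched as follows; the host, password handling, and helper names are illustrative assumptions, not a definitive implementation:

```python
import io
import subprocess
import pandas as pd

def parse_hive_output(text):
    # Hive's tab-separated output, with a header row, parses directly
    # into a DataFrame; pandas infers the column dtypes automatically.
    return pd.read_csv(io.StringIO(text), sep="\t")

def remote_hive_query(host, password, query):
    # Run Hive on a remote machine via sshpass; print.header=true makes
    # Hive emit column names as the first row of output.
    cmd = ["sshpass", "-p", password, "ssh", host,
           'hive -S --hiveconf hive.cli.print.header=true -e "%s"' % query]
    out = subprocess.check_output(cmd)
    return parse_hive_output(out.decode("utf-8"))
```

Passing the password on the command line is a convenience for analysis boxes, not a security practice; SSH keys would be the cleaner choice where available.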

UPDATE 2013-08-01: It's a bit cleaner if one uses Python's triple-quoting mechanism as below. This allows one to copy and paste queries between IPython Notebook and Hive or Hue without having to reformat them with quotes and backslashes.
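A minimal illustration of the triple-quoted form (the table and column names are hypothetical):

```python
# A triple-quoted string can span lines verbatim, so a query can be
# pasted to and from the Hive CLI or Hue without re-quoting or
# backslash line continuations.
query = """
SELECT product_id, COUNT(*) AS n_orders
FROM orders
GROUP BY product_id
"""
```

The same string can then be handed to whatever mechanism runs the query, such as the remote-execution code above.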