Tuesday, December 10, 2013

I have been refining how my Common Lisp library cl-data-frame prints summaries of data frames. For a quick look at the data before any processing or analysis, the following approach works best for me:

Real numbers should be summarized by their range and the three quartiles (25%, 50%, 75%). This provides enough information to assess the variation and the "typical" values of the data.
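As a rough illustration of the five-number summary described here, a minimal Python sketch follows. It is not the cl-data-frame implementation: the function name `summarize_reals` is hypothetical, and the choice of the "inclusive" interpolation method for quartiles is an assumption, since the post does not say how quartiles are computed.

```python
from statistics import quantiles


def summarize_reals(values):
    """Return (min, q25, median, q75, max) for a sequence of real numbers.

    Hypothetical helper for illustration only.
    """
    data = sorted(values)
    if len(data) < 2:
        raise ValueError("need at least two values to compute quartiles")
    # n=4 yields the three cut points between the four quarters.
    # method="inclusive" treats the data as the whole population (an assumption).
    q25, q50, q75 = quantiles(data, n=4, method="inclusive")
    return data[0], q25, q50, q75, data[-1]


summary = summarize_reals([7, 1, 3, 5, 9])
print(summary)
```

The minimum and maximum give the range, while the three quartiles show where the bulk of the values sit.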

All other values should be summarized by their count and frequency. This is ideal for categorical data (called "factor" in R), and also for various encodings of missing data.
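A count-and-frequency table of this kind could be sketched in Python as below. This is an illustration under assumptions, not the library's code: the name `summarize_categories` is hypothetical, "frequency" is read as relative frequency (share of the total), and `None` stands in for one possible missing-data encoding.

```python
from collections import Counter


def summarize_categories(values):
    """Return (value, count, relative frequency) rows, most common first.

    Hypothetical helper for illustration only.
    """
    counts = Counter(values)
    total = sum(counts.values())
    return [(value, count, count / total)
            for value, count in counts.most_common()]


table = summarize_categories(["a", "b", "a", None, "a", None])
for value, count, share in table:
    print(f"{value!r}: {count} ({share:.0%})")
```

Because missing-data markers are counted like any other value, they show up in the table with their own count and share.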

When a column contains both real numbers and other values, print both of the above. However, when the column has very few distinct numbers, skip the quartiles and just print the frequencies of the numbers as well.
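The dispatch rule for mixed columns might look like the following Python sketch. Everything named here is hypothetical, and in particular the cutoff `MIN_DISTINCT_FOR_QUARTILES = 5` is an assumed value: the post only says "very few" and gives no number. The quartile method is likewise an assumption.

```python
from collections import Counter
from numbers import Real
from statistics import quantiles

# Hypothetical cutoff; the post does not state what "very few" means.
MIN_DISTINCT_FOR_QUARTILES = 5


def is_real(x):
    # bool is a subclass of int in Python, so exclude it explicitly.
    return isinstance(x, Real) and not isinstance(x, bool)


def frequencies(values):
    """Return (value, count, relative frequency) rows, most common first."""
    total = len(values)
    return [(v, c, c / total) for v, c in Counter(values).most_common()]


def summarize_column(values, min_distinct=MIN_DISTINCT_FOR_QUARTILES):
    """Summarize a column that may mix real numbers and other values."""
    values = list(values)
    reals = sorted(v for v in values if is_real(v))
    others = [v for v in values if not is_real(v)]
    summary = {}
    if len(reals) >= 2 and len(set(reals)) >= min_distinct:
        # Enough distinct numbers: range plus quartiles for the reals,
        # and a separate frequency table for everything else.
        q25, q50, q75 = quantiles(reals, n=4, method="inclusive")
        summary["reals"] = (reals[0], q25, q50, q75, reals[-1])
        if others:
            summary["others"] = frequencies(others)
    elif values:
        # Very few distinct numbers: treat them like categories.
        summary["all"] = frequencies(values)
    return summary


mixed = summarize_column([1, 2, 3, 4, 5, "NA", "NA"])
few = summarize_column([0, 1, 0, 1, "NA"])
print(mixed)
print(few)
```

A column of 0/1 indicators with an occasional missing marker falls into the second branch, where a frequency table is more informative than quartiles.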