Aggregating & plotting time series in Python

by yhat

November 3, 2013

Time-based data can be a pain to work with: is it a date or a datetime? Are my dates in the right format? Luckily, Python and pandas provide some super helpful utilities for making this easier. In this post, we'll be using pandas and ggplot to analyze time series data.

Data set

For these examples, we'll be using the meat data set, which has been made available to us by the U.S. Dept. of Agriculture. It contains metrics on livestock, dairy, and poultry outlook and production.

You can find the data set in either the ggplot package or the pandasql package, both of which can be installed via pip.

Working with dates and times with pandas

pandas has some excellent out-of-the-box functionality for aggregating date- and time-based data.

ts.groupby(ts.index.year).sum().head(10)

       beef  veal   pork  lamb_and_mutton
1944   8801  1629  11502             1001
1945   9936  1552   8843             1030
1946   9010  1329   9220              946
1947  10096  1493   8811              779
1948   8766  1323   8486              728
1949   9142  1240   8875              587
1950   9248  1137   9397              581
1951   8549   972  10190              508
1952   9337  1080  10321              635
1953  12055  1451   8971              715

Since we indexed our data on a datetime column (date), we can group by the year and take the sum over the columns pretty easily.
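As a minimal sketch of that pattern, here is the same idea on a tiny made-up frame rather than the real meat data. The values, the two-year range, and the `df` name are assumptions for illustration only; `ts` and the `date` column mirror the names used above.

```python
import pandas as pd

# Hypothetical stand-in for the meat data: monthly figures for two years.
dates = pd.date_range("1944-01-01", periods=24, freq="MS")
df = pd.DataFrame({
    "date": dates,
    "beef": [100] * 12 + [200] * 12,
    "pork": [50] * 24,
})

# Index on the datetime column so the index exposes .year, .month, etc.
ts = df.set_index("date")

# Group rows by the year of each timestamp and sum every column.
yearly = ts.groupby(ts.index.year).sum()
print(yearly)
```

Because the grouping key is just an array of integers derived from the index, the same approach should work for `ts.index.month` or any other datetime attribute.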

But what if we're keen to look at the sums over the decades?

Grouping by decade

If you're only interested in one or more specific decades, you can use the date and time slicing functionality baked into pandas, for example to select the slice of the data corresponding to the 1940s.

Things are starting to make sense. Now how might we better inspect the trends we're seeing over time? Well, one way is to use the same bar chart as before, but stack the values for each type of livestock.

For all you ggplot2 fans wondering why we didn't do a stacked bar chart: don't worry! It's coming in a release in the not-so-distant future.
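Here is one way this might look, sketched on made-up yearly data (the values and the 1945 to 1954 range are assumptions, not the real meat figures). It shows both ideas: flooring each year to its decade to group on it, and using a partial date string to slice out a single decade.

```python
import pandas as pd

# Hypothetical yearly data spanning two decades (values are made up).
idx = pd.to_datetime([f"{year}-01-01" for year in range(1945, 1955)])
ts = pd.DataFrame({"beef": [1] * 10, "pork": [2] * 10}, index=idx)

# Floor each year to the start of its decade: 1945 -> 1940, 1953 -> 1950.
decade = (ts.index.year // 10) * 10
by_decade = ts.groupby(decade).sum()

# Partial string slicing keeps every row from 1940 through the end of 1949.
forties = ts.loc["1940":"1949"]

print(by_decade)
print(forties)
```

Integer division is only one option for building the decade key; it is used here because it keeps the group labels readable (1940, 1950, and so on).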

Trends over time

For our last plot we're going to jump back a little bit. Instead of looking at the data in aggregate, we're going to take another approach to making sense of our time series data. We're going to bring the original meat dataset back into the mix so we can take a look at all of our livestock varieties.
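To draw one line per livestock type, ggplot-style plotting generally wants the data in "long" format, with one row per date and variable. A sketch of that reshaping with `pd.melt` follows. The two-row frame and its numbers are made up for illustration, and the `wide` and `long_df` names are hypothetical.

```python
import pandas as pd

# Hypothetical wide-format slice: one column per livestock type.
wide = pd.DataFrame({
    "date": pd.to_datetime(["1944-01-01", "1944-02-01"]),
    "beef": [10, 20],
    "pork": [30, 40],
})

# Reshape to long format: one row per (date, variable) pair, so a plot
# can map the "variable" column to color.
long_df = pd.melt(wide, id_vars="date", var_name="variable", value_name="value")
print(long_df)
```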

OK, so this plot looks a bit cluttered. We've got way too much zigging and zagging. Sure, the colors are nice, but it's a bit overwhelming.
Instead of getting rid of our data, we're going to apply a smoothing function so that we'll see the trend instead of the noise.
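To get a feel for what smoothing does, here is a small sketch using a pandas rolling mean. This is a simpler stand-in, not what stat_smooth does internally. The series is made up: a steadily rising trend with an alternating wiggle added on top.

```python
import pandas as pd

# Hypothetical noisy series: trend value i, nudged up or down by 1.
idx = pd.date_range("1960-01-01", periods=12, freq="MS")
noisy = pd.Series(
    [i + (1 if i % 2 else -1) for i in range(12)],
    index=idx,
    dtype="float64",
)

# A 2-point rolling mean averages each value with its predecessor, which
# cancels the alternating wiggle and leaves only the steady upward trend.
smoothed = noisy.rolling(window=2).mean()
print(smoothed)
```

The first value of `smoothed` is NaN because the window is not yet full. Wider windows smooth more aggressively but lose more points at the start.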

Ahh, much better. This plot I can actually make sense of. You can see that chicken production has been growing quickly since the late 1950s, that sometime in the late 1970s or early 1980s it overtook pork production, and that a few years later it overtook beef production.

We're still working out some of the kinks in stat_smooth, but you can see that it's already an incredibly useful function. If you're interested in helping build ggplot for Python, drop us a note at info@yhathq.com! We'd love to hear from you.