Read and clean the data

pandas has a read_csv function that can read a CSV into the fundamental pandas
object, the DataFrame. Each DataFrame has an index, which can be thought of as a
special column that identifies the rows. It can be generated automatically
(e.g. as a sequence of integers beginning at zero), or you can tell read_csv
to use a field of the source data, which we do below (the 12th column of the
CSV is the unique MOMA ID of the item). We also tell read_csv to treat the
10th column as a datetime, which means it will parse the strings in that column
into Python datetimes.
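The load step can be sketched like this; the tiny inline CSV (and the assumed column name ObjectID for the unique MOMA ID) stands in for the real export, where you could equally pass the column positions mentioned above:

```python
import io
import pandas as pd

# Tiny inline CSV standing in for the real MOMA export; 'ObjectID' is
# assumed here to be the name of the unique MOMA ID column.
csv = io.StringIO(
    "Title,DateAcquired,ObjectID\n"
    "Untitled,1964-10-06,1\n"
    "Composition,1968-05-01,2\n"
)

# index_col makes the MOMA ID the DataFrame index; parse_dates makes
# read_csv parse the DateAcquired strings into datetimes.
moma = pd.read_csv(csv, index_col='ObjectID', parse_dates=['DateAcquired'])
```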

Most of the plots below depend on the DateAcquired field being valid, so
let’s use dropna to drop the 4428 records where this field is invalid or
missing.

moma = moma.dropna(subset=['DateAcquired'])

Classifications and departments

Having loaded the data, we can begin by examining the distribution of items in
the collection by classification.

The chain of pandas operations required to do this has a lot going on in it, so
let’s break it down:

First we use a pandas groupby to group the moma DataFrame by
Classification. This is analogous to a SQL GROUP BY operation.
moma.groupby('Classification') is a DataFrameGroupBy object, which can
be thought of as a list of pandas DataFrames each of which is made by
splitting up the original DataFrame according to the value of
Classification for each row.

You can iterate over this list, but it’s usually more useful to perform an
aggregation on it, i.e. to collapse each DataFrame in the DataFrameGroupBy
object into a single row. I just want to know how many items there are in
each class, so I use .size().

moma.groupby('Classification').size() is then a pandas
Series, which we can sort and plot as a horizontal bar graph
(kind='barh').
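The whole chain can be sketched on a toy DataFrame (the Classification values here are made up; the real data has many more):

```python
import pandas as pd

# Toy stand-in for the moma DataFrame
moma = pd.DataFrame({'Classification': ['Print', 'Print', 'Photograph',
                                        'Painting', 'Print', 'Photograph']})

# group by Classification -> collapse each group to its row count -> sort
counts = moma.groupby('Classification').size().sort_values()
print(counts)
# counts.plot(kind='barh')  # draws the horizontal bar chart (needs matplotlib)
```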

For obvious reasons, there are many more prints, photographs and books than any
other class of work. If you’re only interested in paintings, sculptures and
installations, then filtering for records where Department is Painting &
Sculpture provides a way to select those out.

Artists

Which artists have the most items in the MOMA collection?

We can do this with the same groupby(), size() and sort_values()
operations. The only difference here is that I add a tail() after the
sort_values(), which gives us a list of the top 20 artists (sort_values() is
ascending by default, so the largest counts end up at the tail).
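On a toy DataFrame (made-up artists, with tail(2) standing in for tail(20)) the pattern looks like:

```python
import pandas as pd

moma = pd.DataFrame({'Artist': ['Atget', 'Atget', 'Atget',
                                'Picasso', 'Picasso', 'Evans']})

# size() counts items per artist; sort_values() is ascending, so tail()
# picks off the artists with the most items.
top = moma.groupby('Artist').size().sort_values().tail(2)
print(top)
```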

Lots of photographers! What if we only look at the Painting & Sculpture
Department?

To do this, we need to filter the moma DataFrame before we operate on it.
Inside the square brackets is moma['Department'] == 'Painting & Sculpture'.
This is itself a Series, but its values are booleans (True and False). When
this object is used to index a DataFrame (or Series), rows where the boolean
Series is False are filtered out.
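A minimal sketch of this boolean indexing, with made-up rows:

```python
import pandas as pd

moma = pd.DataFrame({
    'Department': ['Painting & Sculpture', 'Photography', 'Painting & Sculpture'],
    'Artist': ['Picasso', 'Atget', 'Calder'],
})

mask = moma['Department'] == 'Painting & Sculpture'  # boolean Series
ps = moma[mask]  # rows where the mask is False are filtered out
print(ps['Artist'].tolist())  # ['Picasso', 'Calder']
```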

Overall trends with time

Looking at patterns with time (rather than by an unordered category like
Department) is tricky, but easier in pandas than it would otherwise be!

We can give groupby() a Grouper object to group into time intervals. The
constructor for this object takes:

a key keyword which tells the groupby operation which column contains the
datetime we’re grouping by, and

a freq keyword, which is usually a string denoting some
frequency.
In this case, 'A' denotes year end. We could have used 'AS' for year
start, or 'Q' for quarter end, or any of the other offset aliases defined
by pandas.
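A sketch of grouping into year-end bins with a Grouper, again on toy data (recent pandas versions spell this frequency 'YE' rather than 'A'):

```python
import pandas as pd

moma = pd.DataFrame({
    'DateAcquired': pd.to_datetime(['1964-10-06', '1964-11-02', '1968-03-15']),
})

# Each row lands in the calendar-year bin ending the following 31 December;
# empty intermediate years show up as zero-sized groups.
per_year = moma.groupby(pd.Grouper(key='DateAcquired', freq='A')).size()
print(per_year)
```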

Months of the year and days of the week are not intervals of time but rather
recurring bins of time, so we don’t use Grouper() objects for those.
Rather, we use the .dt accessor on the DateAcquired column to get at its
datetime components, and then .month or .weekday to pick out the month or
day of the week. We can then groupby that. (And do some tedious work to fix
the axis labels.)
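Sketched on toy dates (which happen to include two Tuesdays):

```python
import pandas as pd

moma = pd.DataFrame({
    'DateAcquired': pd.to_datetime(['1964-10-06', '1999-10-01', '2008-02-05']),
})

# .dt exposes datetime components of the column; .month is 1-12,
# .weekday is 0 (Monday) through 6 (Sunday)
by_month = moma.groupby(moma['DateAcquired'].dt.month).size()
by_weekday = moma.groupby(moma['DateAcquired'].dt.weekday).size()
print(by_month)    # two October acquisitions
print(by_weekday)  # two Tuesday (weekday 1) acquisitions
```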

Lots of acquisitions in 1964, 1968 and 2008. More acquisitions in October than
any other month. And Tuesdays are busy!

What happened in 1964? First let’s look at the year in detail using pandas
datetime slicing, which allows you to use simple strings to refer to datetimes
and construct a boolean Series with which to filter the DataFrame.
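A sketch of that string-based filtering, on toy rows:

```python
import pandas as pd

moma = pd.DataFrame({
    'DateAcquired': pd.to_datetime(['1964-10-06', '1964-03-01', '1999-05-02']),
    'Title': ['a', 'b', 'c'],
})

# Comparing a datetime column against plain strings yields boolean Series,
# which combine to filter the DataFrame down to 1964.
in_1964 = moma[(moma['DateAcquired'] >= '1964') & (moma['DateAcquired'] < '1965')]
print(len(in_1964))  # 2
```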

It turns out over 11,000 items were added to the catalog with an acquisition
date of 6 October, 1964. Please let me
know if you know the origin of this anomaly.

Artist trends with time

We looked above at the rate at which MOMA acquires items. Now, let’s examine
the rate at which it adds artists to its collection.

We can use drop_duplicates to eliminate all but the first record with a given
Artist, i.e. to remove all items except the first acquisition of an artist’s
work. We save this in a new DataFrame, and group and plot it as before.

# Sort by acquisition date so that drop_duplicates keeps each artist's earliest
# acquisition, then remove all their later items
firsts = moma.sort_values('DateAcquired').drop_duplicates('Artist')
fig, ax = plt.subplots(figsize=(14, 3))
(firsts.groupby(pd.Grouper(key='DateAcquired', freq='A'))
.size()
.plot())
ax.set_xlabel('');
ax.set_ylabel('Number of new artists');

Let’s look at trends in the acquisition of the top few artists in the
collection of the Painting & Sculpture department, i.e. the people who make
paintings, sculptures and installations. First we create a list of who these
people are.
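One way to sketch that step (column names as above, made-up rows, tail(2) in place of the real top-N cutoff): filter to the department, take the artists with the most items, then count their acquisitions per year.

```python
import pandas as pd

moma = pd.DataFrame({
    'Department': ['Painting & Sculpture'] * 6 + ['Photography'],
    'Artist': ['Calder', 'Calder', 'Calder', 'Picasso', 'Picasso',
               'Kawara', 'Atget'],
    'DateAcquired': pd.to_datetime(['1970-01-01', '1975-06-01', '1976-03-01',
                                    '1940-01-01', '1945-01-01', '1995-02-01',
                                    '1968-01-01']),
})

ps = moma[moma['Department'] == 'Painting & Sculpture']
top = ps.groupby('Artist').size().sort_values().tail(2).index  # the artist list

# One row per year-end bin, one column per top artist
trend = (ps[ps['Artist'].isin(top)]
         .groupby([pd.Grouper(key='DateAcquired', freq='A'), 'Artist'])
         .size()
         .unstack(fill_value=0))
print(trend)
# trend.plot() would draw one (bursty) line per artist
```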

This plot is a bit of a mess, since acquisitions by such famous artists are
inevitably infrequent and bursty. But clearly there were lots of Calder
acquisitions in the 70s and Kawara acquisitions in the 90s.

This is the end of the first post on the MOMA collection dataset. In the second
post, I’ll look at how the rate at which MOMA acquires work by women has varied
over time.