Tag Archives: 2012

Waiting for the break of day…oooOOOOO…25 or 6 to 4!
-Chicago (formerly The Chicago Transit Authority)

I was lucky to live in Chicago during the summer of 2012. The thing I miss most about Chicago is the transit system. Taking the ‘L’ to work every day was much more relaxing and interesting than having to drive in. No parking hassles or gas. It was great. Transit data is critical to making those systems more efficient. Fortunately, Rahm Emanuel is kind enough to release some of the transit data from the Chicago Transit Authority (CTA). The data only contains daily ridership for each station, so I am limited in the insight this analysis can produce.

Before diving into the results of the descriptive analytics, let’s look at how the ‘L’ is designed. There are eight different lines, each designated by a color. Nearly all of the lines go through the downtown area called ‘The Loop’, so named because the elevated track forms a huge loop around a huge block of the city. Two main lines also go underground: the Red and Blue Lines. These two lines run all night and carry the most passengers. When a Chicagoan rides the ‘L’, they swipe their pass at the station entrance, then board their desired train in either direction. Unlike transit systems like the Metro in DC, there is no exit swipe. So every data point in this post is a person swiping at a station to board a train, but we can’t determine their direction or destination.

There’s another problem: several stations service multiple lines. Clark/Lake has practically every line go through it. Without more resolution in how the data are measured, the most I can infer is which stations are the most popular on certain days. This rests on the assumption that a person who arrives at a station will also leave from that same station.

This visualization looks a lot like a CTA map. I don’t have a good way to automatically draw the lines between the stations, but I think the location data attached to the station names does a good job of recreating the CTA map everyone is used to seeing. I’ve labeled any ‘L’ station that services multiple lines by order of importance. The priority is Red, Blue, Orange, Brown, Green, Purple, Pink, Yellow, in that order. The reasoning is that the lines earlier in the list are the largest or most popular, so most of a shared station’s patrons are probably riding them. From this map, Clark/Lake, the station with the most train lines, is the most popular. Terminuses (termini?) of the lines also get a lot of use. This can help visualize where the transfer points and the most popular entry points or destinations are. The Red Line and Blue Line have the most stops and the most ridership. Admittedly, this analysis has problems separating Brown Line and Red Line customers, but ridership is higher at the non-transfer stations of the Red Line, which suggests that more customers are using the Red Line in general. I have a separate post for the chart of the ridership of every station. The chart is way too big to put in this post; you’d be scrolling for days. It’s worth checking out, though, to drill down into the details.
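The priority labeling described above can be sketched in a few lines of Python. This is only an illustration, not the code behind the map: the `primary_line` helper and the small `stations` mapping are made up for the example, and only the priority order itself comes from the post.

```python
# Priority order taken from the post: earlier lines win at shared stations.
LINE_PRIORITY = ["Red", "Blue", "Orange", "Brown", "Green", "Purple", "Pink", "Yellow"]


def primary_line(lines_served):
    """Return the highest-priority line among those serving a station."""
    for line in LINE_PRIORITY:
        if line in lines_served:
            return line
    raise ValueError(f"no known line in {lines_served!r}")


# Hypothetical station-to-lines mapping, for illustration only.
stations = {
    "Addison": {"Red"},
    "Belmont": {"Brown", "Purple", "Red"},
    "Clark/Lake": {"Blue", "Brown", "Green", "Orange", "Pink", "Purple"},
}

# Label each station with the single line used to color it on the map.
labels = {name: primary_line(lines) for name, lines in stations.items()}
print(labels)
```

With this rule a shared station such as Belmont is counted entirely as a Red Line station, which is exactly why the Brown/Red split is hard to read off the map.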

Ridership is rarely constant. In fact, the ridership of the ‘L’ falls into three predictable groups of days: weekdays, Saturdays, and Sundays. Mixing the groups would effectively ruin any time-series analysis, because the average values of the three groups differ so much that any trend would be hard to spot. The graph would look very erratic. To account for this, every time-series graph is split into those three groups.
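Here is a rough sketch of that split, assuming the data is available as `(date, rides)` pairs. The `day_group` and `split_by_day_group` names and the ride counts are invented for the example, and holidays (which probably behave more like Sundays) are ignored.

```python
from collections import defaultdict
from datetime import date


def day_group(d):
    """Classify a date as 'Weekday', 'Saturday', or 'Sunday'."""
    weekday = d.weekday()  # Monday is 0, Sunday is 6
    if weekday == 5:
        return "Saturday"
    if weekday == 6:
        return "Sunday"
    return "Weekday"


def split_by_day_group(records):
    """Split (date, rides) pairs into one list per day group."""
    groups = defaultdict(list)
    for d, rides in records:
        groups[day_group(d)].append((d, rides))
    return dict(groups)


# Invented ride counts for four consecutive days in June 2012.
records = [
    (date(2012, 6, 15), 9100),  # Friday
    (date(2012, 6, 16), 7400),  # Saturday
    (date(2012, 6, 17), 5200),  # Sunday
    (date(2012, 6, 18), 8800),  # Monday
]

for group, rows in split_by_day_group(records).items():
    print(group, [rides for _, rides in rows])
```

Each group can then be plotted as its own series, so a quiet Sunday is never compared against a busy Tuesday.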

I can’t write a post without tying baseball into it. This will be no exception, because in 2012 I spent a lot of time watching baseball games on the North Side and South Side of Chicago. Both ballparks are connected by the Red Line, and the stations are extremely close to the parks. So what would we expect to see? Baseball games have attendance ranging from 10k to 30k, so a home game should produce a spike in daily ridership. I graphed the daily ridership for the two stops adjacent to the ballparks and then labeled the days with a home or away game.

Any spike or dip not explained by baseball is labeled. St. Patrick’s Day has a large spike all over the CTA system, but the spike at Addison is particularly high because of all the bars in the neighborhood. The largest non-baseball spike in the Addison station’s ridership came during the gay pride parade, since this is an incredibly popular event in a neighborhood near the station.

There is one anomaly I forgot to point out on the graph: a spike on Sept 8th, when the Cubs were away. At first I thought this might have been Labor Day, but it turns out that The Boss was playing at Wrigley Friday night, and I remember walking by it late at night. So there’s a spike on the 7th (the day of the concert) and then a larger spike on the 8th, presumably because the concert ended near or after midnight or because people stayed afterward and drank in one of the fine establishments around the park. [I went to a Sox game early that day, so I accounted for a few data points that day.]
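Spotting spikes like these doesn’t have to be done by eye. Below is a minimal sketch, assuming one station’s daily counts are in a dict and the home-game days are known. The `unexplained_spikes` function, the 1.5x threshold, and all of the numbers are assumptions for illustration, not how the graph above was labeled.

```python
from statistics import median


def unexplained_spikes(daily, home_game_days, factor=1.5):
    """Return non-game days whose ridership exceeds factor x the non-game median."""
    non_game = [rides for day, rides in daily.items() if day not in home_game_days]
    baseline = median(non_game)
    return sorted(
        day
        for day, rides in daily.items()
        if day not in home_game_days and rides > factor * baseline
    )


# Invented counts for a made-up week; keys are day numbers, not real dates.
daily = {1: 4000, 2: 4200, 3: 11000, 4: 4100, 5: 9000, 6: 3900, 7: 4300}
home_game_days = {3}  # day 3 is explained by a (hypothetical) home game

print(unexplained_spikes(daily, home_game_days))  # [5]
```

The median is used as the baseline so that a single concert-sized day doesn’t drag the "normal" level upward. In practice the comparison would be run separately within the weekday, Saturday, and Sunday groups.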

I arrived on June 12, 2012, right in the middle of a Cubs-Tigers series. Those were three really crowded games in Wrigleyville. Ridership at the two ballpark ‘L’ stations peaks during the summer, especially at Wrigley. You wouldn’t have guessed that the Sox were in contention for a division title up until the last week of the season. The Cubs have a huge tourist draw, me included (I went to at least a dozen games while I lived there), and you can see a surge during the summer (vacation) months. I would leave Chicago on October 20, 2012, and start out on #SeanTrek about 11 days later.

You might remember #SeanTrek, the 46-day, 12,000-mile, 34-state excursion I took back at the very end of 2012. I didn’t know how I was going to use this at the time, but I geotagged just about everything I did on the trip. I checked in to every place on Foursquare and obtained over 700 points in Portland and San Francisco, which is insane; I checked in at just about every place I went. On top of Foursquare, I geotagged every tweet I sent and picture I took. This left me with thousands of data points carrying both timestamps and location data.

The above map is what happens when you put all of them together. It outlines my entire trip! The denser the marks, the longer I stayed in one place exploring it. Sparse points mean I was driving a lot. You’ll find a lot of marks around Pittsburgh, Portland, SF, LA, Austin, and New Orleans, because I spent the most time there and didn’t drive much in most of those cities. I have a rather nice record of a long trip that didn’t require me to painstakingly record exactly what I did.

This map only has geotag data and the type of media. I’m hoping to use the geotag data and the timestamps to get an average speed between consecutive points. I also want to geocode some tweets and photos that were not geotagged in 2012 by interpolating from their timestamps.
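Both ideas are simple enough to sketch. The code below assumes each point is a `(timestamp, latitude, longitude)` tuple. The function names are made up for the example, the distance is the straight-line (great-circle) distance rather than road miles, and the interpolation assumes constant speed along a straight line between the two known points, so the results are rough estimates at best.

```python
from datetime import datetime
from math import asin, cos, radians, sin, sqrt

EARTH_RADIUS_MILES = 3958.8  # mean Earth radius


def haversine_miles(lat1, lon1, lat2, lon2):
    """Great-circle distance in miles between two lat/lon points."""
    phi1, phi2 = radians(lat1), radians(lat2)
    dphi = phi2 - phi1
    dlam = radians(lon2 - lon1)
    a = sin(dphi / 2) ** 2 + cos(phi1) * cos(phi2) * sin(dlam / 2) ** 2
    return 2 * EARTH_RADIUS_MILES * asin(sqrt(a))


def average_speed_mph(p1, p2):
    """Average straight-line speed between two (time, lat, lon) points."""
    hours = (p2[0] - p1[0]).total_seconds() / 3600
    if hours <= 0:
        raise ValueError("second point must be later than the first")
    return haversine_miles(p1[1], p1[2], p2[1], p2[2]) / hours


def interpolate_position(p1, p2, when):
    """Estimate (lat, lon) at time `when`, linearly between p1 and p2."""
    span = (p2[0] - p1[0]).total_seconds()
    if span <= 0:
        raise ValueError("second point must be later than the first")
    frac = (when - p1[0]).total_seconds() / span
    if not 0 <= frac <= 1:
        raise ValueError("`when` must fall between the two points")
    return (p1[1] + frac * (p2[1] - p1[1]), p1[2] + frac * (p2[2] - p1[2]))


# Two invented geotagged points one hour apart, and an untagged tweet between them.
a = (datetime(2012, 11, 5, 12, 0), 0.0, 0.0)
b = (datetime(2012, 11, 5, 13, 0), 0.0, 1.0)
print(round(average_speed_mph(a, b), 1))
print(interpolate_position(a, b, datetime(2012, 11, 5, 12, 30)))
```

Because roads are never straight, the speed computed this way is a lower bound on how fast I was actually driving.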

Once I properly extract the data from the tweets, I can make hashtags or mentions searchable by frequency and location. I used #SeanTrek a lot more than any other hashtag on the trip, though curiously enough the first tweet mentioning #SeanTrek is not geotagged (technical glitch). Hopefully, I’ll get some more things mapped out in the future.
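As a starting point, here is a small sketch of that extraction. It assumes each tweet has already been reduced to a dict with a `text` field and a `coords` field (a `(lat, lon)` tuple, or `None` when the tweet wasn’t geotagged). That structure and the sample tweets are invented for the example and are not the format of any real Twitter export.

```python
import re
from collections import Counter, defaultdict

HASHTAG = re.compile(r"#(\w+)")
MENTION = re.compile(r"@(\w+)")


def tag_counts(tweets, pattern=HASHTAG):
    """Count how often each hashtag (or mention) appears, ignoring case."""
    counts = Counter()
    for tweet in tweets:
        counts.update(tag.lower() for tag in pattern.findall(tweet["text"]))
    return counts


def tag_locations(tweets, pattern=HASHTAG):
    """Map each hashtag (or mention) to the coordinates where it was used."""
    places = defaultdict(list)
    for tweet in tweets:
        if tweet["coords"] is None:
            continue  # skip tweets without a geotag
        for tag in pattern.findall(tweet["text"]):
            places[tag.lower()].append(tweet["coords"])
    return dict(places)


# Invented sample tweets.
tweets = [
    {"text": "Leaving town! #SeanTrek", "coords": None},
    {"text": "Great coffee #SeanTrek #Portland", "coords": (45.52, -122.68)},
    {"text": "Foggy bridge with @afriend #seantrek", "coords": (37.81, -122.48)},
]

print(tag_counts(tweets).most_common())
print(tag_locations(tweets)["seantrek"])
print(tag_counts(tweets, MENTION))
```

The counts include tweets without a geotag while the location lookup skips them, so the two numbers for a hashtag won’t always match.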