Let's say you live in San Francisco and work in Mountain View. You like to sleep in and leave at 7:30 AM. But what if you discovered that leaving 15 minutes earlier would—on average—shave 15 minutes off your commute? Would you still sleep in?

Over the course of two months, I researched common Bay Area commutes with a single goal: to predict travel times between two locations, given a time of departure. We can easily access real-time travel information with apps like Google Maps or Waze, but we don't have access to tools that can help us plan trips accurately in advance.

Google Maps does provide travel duration estimates for routes, expressed as ranges. However, in some cases, these ranges can span 30 minutes or more. While trip estimates offer valuable insight, I believe that we can do better, not just in accuracy but in accessing the full story behind the raw data. To do this, we must follow a simple formula: 1) visualize the facts and 2) tailor our assessment accordingly. But first, we must acquire a data set that fits our needs.

The Data

Unfortunately, historical traffic data is not freely available. With this in mind, I set out to create an original data set, aggregating real-time traffic data over six weeks. Due to API quota constraints, I had to limit the scope of the project to nine residential neighborhoods and three work districts in the San Francisco Bay Area (see below).

For the morning commutes, I collected trip durations every fifteen minutes for each of the 26 routes (residential to work). For the afternoon return trips, I repeated the process in reverse (work to residential). Between April 28th and June 8th, I acquired information encompassing 29 weekdays. To keep things simple, I disregarded weekends and holidays; I also grouped all 29 weekdays together, assuming that most commuters follow the same routine Monday through Friday.
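The weekday and holiday filtering described above can be sketched roughly as follows. This is only an illustration, not the collection code used for this project: the `Observation` record, the `keep_commute_days` helper, the route label, and the sample dates are all hypothetical.

```python
from dataclasses import dataclass
from datetime import date, datetime


@dataclass(frozen=True)
class Observation:
    route: str          # hypothetical label for a residential/work pair
    departed: datetime  # departure time, sampled every fifteen minutes
    minutes: float      # trip duration reported for that departure


def keep_commute_days(observations, holidays=frozenset()):
    """Drop observations that fall on weekends or on the given holidays."""
    return [
        obs for obs in observations
        if obs.departed.weekday() < 5            # Monday=0 ... Friday=4
        and obs.departed.date() not in holidays
    ]


# Illustrative values only, not the data collected for this post.
sample = [
    Observation("A to B", datetime(2024, 5, 24, 7, 30), 72.0),  # Friday
    Observation("A to B", datetime(2024, 5, 25, 7, 30), 48.0),  # Saturday
    Observation("A to B", datetime(2024, 5, 27, 7, 30), 50.0),  # Monday holiday
    Observation("A to B", datetime(2024, 5, 28, 7, 30), 75.0),  # Tuesday
]
kept = keep_commute_days(sample, holidays={date(2024, 5, 27)})
```

Passing the holidays in as a set keeps the helper independent of any particular calendar.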

Visualizing the Facts

Let's narrow our focus to the morning commute from Pacific Heights (in San Francisco) to Mountain View. The graph below shows trip durations (i.e., how long it takes to get from point A to point B) in fifteen-minute intervals, for each of the 29 days. At first glance, there seems to be a pattern, but there is quite a bit of noise between 7 and 10 AM.

To clear things up, the next graph shows the exact same data in a box-and-whisker format. For those unfamiliar with this, the vertical lines (whiskers) represent the range of the data excluding outliers, the boxes represent the middle 50%, the horizontal lines in the boxes represent the medians, and the points beyond the whiskers are outliers.
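For readers who want to see what goes into one box, here is a minimal sketch of the numbers behind it. It assumes the common 1.5 x IQR rule for whiskers and outliers, which may not be the exact convention used for the charts in this post; `box_plot_summary` is a hypothetical helper and the durations are made up.

```python
from statistics import quantiles


def box_plot_summary(durations):
    """Quartiles, whisker ends, and outliers for one departure time."""
    data = sorted(durations)
    q1, median, q3 = quantiles(data, n=4, method="inclusive")
    iqr = q3 - q1
    # Assumed convention: points beyond 1.5 * IQR from the box are outliers.
    low_fence, high_fence = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    inside = [d for d in data if low_fence <= d <= high_fence]
    return {
        "q1": q1, "median": median, "q3": q3,
        "whisker_low": inside[0], "whisker_high": inside[-1],
        "outliers": [d for d in data if d < low_fence or d > high_fence],
    }


# Illustrative trip durations in minutes, not the post's data.
summary = box_plot_summary([54, 56, 58, 60, 62, 64, 95])
```

In this made-up sample the 95-minute trip falls outside the upper fence, so it would be drawn as a separate point rather than stretching the whisker.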

Speaking of outliers, there are quite a few points that diverge from the rest of the data—the aforementioned noise between 7 and 10 AM. Despite these outliers, most of the data indicates a close relationship between trip duration and departure time. Between 6 and 8 AM, average trip duration increases rapidly from 54 minutes to 1 hour 20 minutes; there is also a dramatic increase in variability, which persists until about 10 AM. Intuitively, this is morning rush hour. As more people hit the road, traffic increases along with the likelihood of traffic incidents—hence the variability.

To offer another perspective, I've mapped each of the 29 days onto another box plot. However, something here doesn't add up. Five days deviate significantly between 7 and 10 AM, almost creating a separate distribution well below the other data.

As it turns out, my initial assumption that all weekdays could be treated equally was incorrect. For this particular route, Friday has significantly different traffic patterns. So if you live in San Francisco and commute to Mountain View every morning, you can probably sleep in on Fridays.
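One simple way to check an assumption like this is to average trip durations by day of the week for a single route and departure time. The sketch below is hypothetical (the helper name, the input shape, and the numbers are mine, not taken from the collected data), but it shows the idea.

```python
from collections import defaultdict
from datetime import date
from statistics import mean

WEEKDAY_NAMES = ["Mon", "Tue", "Wed", "Thu", "Fri", "Sat", "Sun"]


def mean_duration_by_weekday(trips):
    """trips: (date, minutes) pairs for one route and one departure time."""
    grouped = defaultdict(list)
    for day, minutes in trips:
        grouped[WEEKDAY_NAMES[day.weekday()]].append(minutes)
    return {name: mean(values) for name, values in grouped.items()}


# Illustrative values only, not the data collected for this post.
trips = [
    (date(2024, 5, 13), 78), (date(2024, 5, 20), 82),  # Mondays
    (date(2024, 5, 16), 80), (date(2024, 5, 23), 84),  # Thursdays
    (date(2024, 5, 17), 62), (date(2024, 5, 24), 66),  # Fridays
]
by_weekday = mean_duration_by_weekday(trips)
```

If one weekday's average sits well away from the others, as Friday does in this toy sample, that is a hint to analyze it separately.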

Tailoring the Analysis

The three charts below show the same data as before but with Fridays removed. Though it only includes 24 days (Mon-Thurs, between April 28th and June 8th), this new data set demonstrates an even stronger relationship between trip duration and departure time.

This new data also shows less variability during peak traffic hours, specifically between 7 and 10 AM: the data is distributed much closer to the average trip duration at each departure time. To illustrate this, the figure below shows the standard deviation of both the original data set (with Fridays) and the adjusted data set (without Fridays) at each departure time. For this route, excluding Fridays makes the average trip duration a noticeably more reliable predictor of typical trips on weekdays, Monday through Thursday.
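The with-and-without-Fridays comparison boils down to computing a standard deviation per departure time on two versions of the data. Here is a rough sketch, assuming the sample standard deviation and a hypothetical `(weekday, departure, minutes)` record layout; the numbers are illustrative, not measured.

```python
from collections import defaultdict
from statistics import stdev


def stdev_by_departure(records, exclude_weekdays=()):
    """records: (weekday, departure, minutes) tuples; weekday 0=Mon ... 4=Fri."""
    grouped = defaultdict(list)
    for weekday, departure, minutes in records:
        if weekday not in exclude_weekdays:
            grouped[departure].append(minutes)
    # The sample standard deviation needs at least two values per departure.
    return {dep: stdev(vals) for dep, vals in grouped.items() if len(vals) >= 2}


# Illustrative values only: four similar days plus one much faster Friday.
records = [
    (0, "08:00", 70), (1, "08:00", 72), (2, "08:00", 74), (3, "08:00", 76),
    (4, "08:00", 50),
]
with_fridays = stdev_by_departure(records)
without_fridays = stdev_by_departure(records, exclude_weekdays={4})
```

In this toy example a single fast Friday is enough to inflate the spread at 8:00 considerably, which is the same effect the figure shows on the real data.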

Conclusion

For the morning commute from San Francisco to Mountain View, average trip durations are actually a fairly accurate predictor. Just take a look at the updated box plot with only 24 days (Mon-Thurs). Even during peak traffic hours, the data stays well within a range of 15 minutes, which, for this route, is half the size of the estimates offered by other applications.

Of course, these results may not hold true outside the sample set, but they present a compelling case for looking at the full story behind the data. We accomplished this by visualizing the data and customizing our analysis according to key observations. To answer our original question, leaving 15 minutes earlier may not shave 15 minutes off your morning commute, but leaving 30 minutes earlier will, in some cases (see 7 - 7:30 AM below), add 15 minutes back to your day. That's 15 minutes you can use for yoga, coffee, blogging, buying auto insurance, etc.
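The arithmetic behind "leave earlier, drive less" is easy to sketch. The example below uses the two averages quoted earlier for the full 29-day data set (54 minutes and 1 hour 20 minutes), and assumes they correspond to the 6 AM and 8 AM departures; the helper names are hypothetical.

```python
from datetime import datetime, timedelta


def arrival(departure, minutes):
    """Arrival clock time for an 'HH:MM' departure and a duration in minutes."""
    start = datetime.strptime(departure, "%H:%M")
    return (start + timedelta(minutes=minutes)).strftime("%H:%M")


def minutes_saved(avg_minutes, earlier, later):
    """Driving time saved by leaving at `earlier` instead of `later`."""
    return avg_minutes[later] - avg_minutes[earlier]


# Averages quoted earlier in the post; mapping them to exactly 6 AM and
# 8 AM departures is an assumption.
avg_minutes = {"06:00": 54, "08:00": 80}
saved = minutes_saved(avg_minutes, "06:00", "08:00")
early_arrival = arrival("06:00", avg_minutes["06:00"])
late_arrival = arrival("08:00", avg_minutes["08:00"])
```

The same subtraction works for any pair of departure times once you have an average for each.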

Epilogue

We've discussed the morning commute from Pacific Heights to Mountain View. But what about the return trip in the afternoon? What about other commutes? Should we always remove Fridays?

For the afternoon drive from Mountain View to Pacific Heights, there is significantly more variability in the data, even with the exclusion of Fridays. However, there is a noteworthy relationship between trip duration and time of departure. As you can see, the worst time to leave work is typically around 5:30 PM, but average trip durations start to drop rapidly after 6 PM.

Ultimately, not all routes can be treated equally; the same goes for weekdays. Any given route must be analyzed independently to unearth patterns and the story behind the raw data. To succeed, one must visualize and customize.