Exploring Chicago rideshare data in R

Last month, the City of Chicago released detailed rideshare data from companies like Uber, Lyft and Via, making it the first city in the country to share anonymized data on ride-hailing companies.

Although the trip dataset covers only November and December of 2018, the passenger behavior it captures can reveal how people use ride-hailing platforms. That is of interest to journalists, data scientists and marketing teams, which is why we wrote this tutorial.

Import data

These files are huge, so we took a 10 percent sample. We also removed all missing data because some rides occurred outside Chicago. Read more about this in the data dictionary here.

rides %>%
  dplyr::sample_frac(size = .1)
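As a rough sketch of the two cleaning steps described above (dropping rows with missing values, then sampling 10 percent), the code below runs on a small toy data frame rather than the real files. The names rides_raw, trip_id, trip_seconds and pickup_community_area are illustrative assumptions, and tidyr::drop_na() is only one way to remove incomplete rows.

```r
library(dplyr)
library(tidyr)

set.seed(2018)  # make the random sample reproducible

# Toy stand-in for the full download (assumed column names)
rides_raw <- tibble(
  trip_id = 1:100,
  trip_seconds = rep(c(300, 600, 900, 1200), 25),
  # NA stands in for trips with no Chicago pickup area recorded
  pickup_community_area = c(rep(NA, 10), rep(8, 90))
)

rides <- rides_raw %>%
  drop_na() %>%                  # keep only complete rows (90 of 100 here)
  sample_frac(size = .1)         # then take a 10 percent random sample

nrow(rides)
```

Setting a seed first means the same 10 percent is drawn every time the script is rerun.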

Wrangle

The trip start and end times came into RStudio as factors, with AM and PM in 12-hour format. These needed to be converted to dates in a 24-hour format, in a local timezone. We also created ride hour, day of the week, week, and date variables for the trips.
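A minimal sketch of that conversion, assuming the timestamps look like "11/02/2018 05:15:00 PM" and live in a column we call trip_start_timestamp (both the format and the column names are assumptions, not taken from the data dictionary). It uses lubridate, which is one of several ways to do this.

```r
library(dplyr)
library(lubridate)

# Toy data: timestamps stored as factors in 12-hour AM/PM format
rides_toy <- tibble(
  trip_start_timestamp = factor(c("11/02/2018 05:15:00 PM",
                                  "11/03/2018 12:30:00 AM"))
)

rides_wrangled <- rides_toy %>%
  mutate(
    # factor -> character -> date-time in Chicago local time (24-hour clock)
    trip_start = mdy_hms(as.character(trip_start_timestamp),
                         tz = "America/Chicago"),
    ride_hour = hour(trip_start),                # 0-23
    ride_wday = wday(trip_start, label = TRUE),  # ordered factor, Sun ... Sat
    ride_week = week(trip_start),                # week of the year
    ride_date = as_date(trip_start)              # calendar date
  )

rides_wrangled
```

Converting the factor to character first matters: calling a date parser directly on a factor can end up parsing the underlying integer codes instead of the text.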

Visualize

We like to start by visualizing the entire dataset to ensure there isn’t any corruption, missing data, etc. The three packages that are great for this are skimr, visdat and inspectdf. All three packages come with a slew of functions for visualizing your data and underlying variable distributions.
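For a sense of what a first pass with these packages can look like, here is a small sketch on a toy data frame. The data and column names are made up for illustration, and only one or two functions from each package are shown.

```r
library(dplyr)
library(skimr)
library(visdat)
library(inspectdf)

# Toy data with one missing value (assumed column names)
rides_toy <- tibble(
  ride_hour = c(8, 9, 17, 23, NA),
  tip = c(0, 2, 0, 3, 0),
  ride_wday = factor(c("Mon", "Mon", "Fri", "Sat", "Sun"))
)

skim(rides_toy)       # summary statistics and missing counts per column
vis_dat(rides_toy)    # plot of column types, with missing cells highlighted
vis_miss(rides_toy)   # plot of where values are missing

na_summary <- inspect_na(rides_toy)  # one row per column with missing counts
show_plot(na_summary)                # bar chart of missingness by column
```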

The problem

Lyft can increase long-term value (LTV) and its share of passengers’ transportation budgets by targeting high-intent times, when passengers are most in need of rides. These might include commuting to and from work, or going out at night on the weekend.

Visualize the trips by hour of the day

We know we want to see trips across at least two levels (day of the week, and time of the day). The visualization below displays the number of trips taken per hour of the day across the days of the week.

Specifically, the rides.chicago data frame is piped (%>%) over to the ggplot2 functions to create histograms, and then faceted by the days of the week to show the rides-per-hour breakdown across each day.
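The pipeline described above can be sketched roughly as follows. Since the real rides.chicago data frame is not reproduced here, the example builds a random toy data frame, and the column names ride_hour and ride_wday are assumptions.

```r
library(dplyr)
library(ggplot2)

set.seed(2018)
day_levels <- c("Mon", "Tue", "Wed", "Thu", "Fri", "Sat", "Sun")

# Toy stand-in for rides.chicago: 500 random trips
rides_toy <- tibble(
  ride_hour = sample(0:23, 500, replace = TRUE),
  ride_wday = factor(sample(day_levels, 500, replace = TRUE),
                     levels = day_levels)
)

hourly_plot <- rides_toy %>%
  ggplot(aes(x = ride_hour)) +
  geom_histogram(binwidth = 1) +     # one bar per hour of the day
  facet_wrap(~ ride_wday, ncol = 7) + # one panel per day of the week
  labs(x = "Hour of day", y = "Number of trips")

hourly_plot
```

With random toy data the panels look flat; on the real data, this layout is what makes commute and night-time peaks easy to compare across days.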

Tips by ride duration

The plot below shows the tips given at different trip durations. We can sample our data using the dplyr::sample_frac() function for a more manageable dataset. We group these data by the two variables of interest (tipper and ride_category), then calculate the mean trip duration (mean_trip_mins) for a more interpretable visualization across these groups.
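A minimal sketch of the grouping step, using a six-row toy data frame. The variable names tipper, ride_category and mean_trip_mins come from the text above; the tip and trip_mins columns, the category labels, and the rule that any tip above zero marks a tipper are assumptions made for illustration. Sampling is skipped because the toy data is already tiny.

```r
library(dplyr)

# Toy data (assumed columns: tip in dollars, trip duration in minutes)
rides_toy <- tibble(
  tip = c(0, 2, 0, 3, 0, 1),
  trip_mins = c(10, 20, 12, 30, 8, 16),
  ride_category = c("commute", "commute", "night life",
                    "night life", "commute", "night life")
)

tip_summary <- rides_toy %>%
  # assumed rule: any tip above zero marks the passenger as a tipper
  mutate(tipper = if_else(tip > 0, "tipper", "non-tipper")) %>%
  group_by(tipper, ride_category) %>%
  summarise(mean_trip_mins = mean(trip_mins), .groups = "drop")

tip_summary
```

The result has one row per combination of tipper and ride_category, which is the shape a grouped bar chart expects.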

What did we learn?

Tips are another source of payment that benefits drivers, so motivating passengers to tip is worthwhile. Although tipping is clearly less common than not tipping, this is a place where learning more about the factors influencing tip behavior could be interesting.

Tipping and Ride Duration

For riders who do tip, we want to know what the relationship is between the amount of the tip and the duration of the ride. The graph below displays these two variables across time of day.
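One way to draw this kind of comparison is a scatterplot of tip amount against ride duration, restricted to rides with a tip and split into panels by time-of-day category. The sketch below uses toy data; the columns tip and trip_mins and the category labels are assumptions, and the original graph may have been built differently.

```r
library(dplyr)
library(ggplot2)

# Toy data (assumed columns and category labels)
rides_toy <- tibble(
  tip = c(0, 2, 0, 3, 0, 1),
  trip_mins = c(10, 20, 12, 30, 8, 16),
  ride_category = c("commute", "commute", "night life",
                    "night life", "commute", "night life")
)

# Keep only the rides where a tip was given
tippers <- rides_toy %>%
  filter(tip > 0)

tip_plot <- tippers %>%
  ggplot(aes(x = trip_mins, y = tip)) +
  geom_point(alpha = 0.5) +          # transparency helps with overplotting
  facet_wrap(~ ride_category) +      # one panel per time-of-day category
  labs(x = "Ride duration (minutes)", y = "Tip amount")

tip_plot
```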

We can see that ride duration has little predictable influence on tipping behavior, with the possible exception of “night life” rides. This is an interesting finding because it contrasts very different use cases (commuter vs. leisure time rides).

Longer ride durations will generate more revenue for both drivers and the ride share service. This may be a useful signal of LTV if these passengers have a regular use case for taking these longer rides.

Next steps

What have we observed?

Rideshare trips tend to be clustered around early morning commute hours and “night life” hours. The surge in “night life” hours is particularly pronounced on Fridays and Saturdays, not surprisingly, with a sharp decline Sunday evening.

In addition, we can see that there are behavioral gaps that influence passengers’ engagement with both the product and their drivers. One of those behaviors is tipping. Tipping is infrequent overall, but the time of day appears to influence a passenger’s willingness to tip more than the duration of the ride does. Longer rides tend to occur early in the week, which suggests a possible passenger scenario where an initial trip is required for the week (such as consultants who only travel early in the week to get to their clients).

These visualizations have helped us uncover some trends and relationships between time, frequency and behavior in the Chicago ride share data. The next step might be a static report, PowerPoint presentation, or PDF. Ideally, we would be able to come up with an intervention, design an experiment, and build a dashboard that shows real-time data and the ongoing results of our investigation.

Martin is a tidyverse/R trainer in Oakland, CA. Find him on Twitter.
Peter is an entrepreneurial-minded data scientist whose expertise in building analytic solutions and insights to tell data-driven stories has impacted organizations including Alibaba, Citrix and Lyft. Peter has led experimentation design and analytics projects focused on retention, user acquisition and channel optimization in the SaaS and rideshare spaces. The solutions he has produced include incrementality testing, segmentation, ML models and fraud detection. Peter is a passionate advocate for building analytics teams in cross-functional environments.

What is Storybench?

Storybench takes an “under the hood” look at the latest in digital storytelling, from data viz and investigative journalism to VR and digital humanities. In addition to in-depth interviews with industry practitioners, we offer hands-on tutorials that can be “downloaded” right into the classroom or newsroom.