We started off exploring the data by calculating means for different combinations of time, day of week, and month, and plotted these means to identify patterns. We also explored forecasting based on linear regressions.
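The exploratory step above can be sketched with pandas grouped means. The data and column names here are synthetic placeholders, not the actual competition data:

```python
import numpy as np
import pandas as pd

# Hypothetical travel-time observations; columns are illustrative only.
rng = np.random.default_rng(0)
df = pd.DataFrame({
    "hour": rng.integers(0, 24, 500),
    "day_of_week": rng.integers(0, 7, 500),
    "month": rng.integers(1, 13, 500),
    "travel_time": rng.normal(30, 5, 500),
})

# Mean travel time for each (day of week, hour) combination.
mean_by_dow_hour = df.groupby(["day_of_week", "hour"])["travel_time"].mean()

# Mean travel time by month, to look for seasonal patterns.
mean_by_month = df.groupby("month")["travel_time"].mean()
```

Plotting these grouped means (e.g. one line per day of week across hours) is a quick way to surface daily and weekly cycles before committing to a model.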

We realized that including the historical data led to less accurate predictions, so we tried weighted linear regressions in which proportionally higher weights were assigned to more recent data points.
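One way to implement such recency weighting is an exponential-decay scheme, which the sketch below assumes (the half-life is a tuning knob, not a value from the original work):

```python
import numpy as np

# Synthetic series: travel time drifting upward over a time index t.
rng = np.random.default_rng(1)
t = np.arange(100, dtype=float)
y = 0.5 * t + rng.normal(0, 2, 100)

# Assumed exponential-decay weights: the most recent observation gets
# weight 1, and weight halves every `half_life` steps into the past.
half_life = 20.0
weights = 0.5 ** ((t.max() - t) / half_life)

# np.polyfit accepts per-point weights via the `w` argument.
slope, intercept = np.polyfit(t, y, deg=1, w=weights)
next_forecast = slope * (t.max() + 1) + intercept
```

The fit is then dominated by recent behavior while still using older points as weak regularization.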

We also tried forecasting through autoregressions, where the preceding n observations (t-1, t-2, …, t-n) are used to make a prediction.
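A minimal autoregression of this form can be fit by ordinary least squares on lagged copies of the series; the helper names below are our own, not from the original solution:

```python
import numpy as np

def fit_ar(series, n):
    """Fit an AR(n) model by least squares; returns (coefficients, intercept)."""
    # Each row of X holds the n observations preceding the target value.
    X = np.column_stack([series[i:len(series) - n + i] for i in range(n)])
    y = series[n:]
    A = np.column_stack([X, np.ones(len(y))])  # append intercept column
    sol, *_ = np.linalg.lstsq(A, y, rcond=None)
    return sol[:-1], sol[-1]

def predict_next(series, coeffs, intercept):
    """One-step-ahead forecast from the last n observations."""
    n = len(coeffs)
    return series[-n:] @ coeffs + intercept

rng = np.random.default_rng(2)
series = np.cumsum(rng.normal(0, 1, 200))  # synthetic random-walk series
coeffs, intercept = fit_ar(series, n=3)
forecast = predict_next(series, coeffs, intercept)
```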

In the end, we decided to work only with the more recent dataset, and ignore the historical data provided by Kaggle.

What ended up working

We used a statistical technique called an Ensemble of Decision Trees (often called Random Forest™). One of the nice properties of this approach is that it doesn't assume a linear relationship between the explanatory variables and the predicted outcome.
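As an illustration, here is a Random Forest fit with scikit-learn's `RandomForestRegressor` on synthetic data with a deliberately non-linear (rush-hour) pattern; the features and target are made up for demonstration:

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(3)
n = 1000
X = np.column_stack([
    rng.uniform(0, 24, n),   # time of day
    rng.uniform(1, 13, n),   # date encoded as month + day/31
    rng.integers(0, 7, n),   # day of week
])
# Non-linear target: a morning rush-hour bump around 8:00 that a linear
# model would miss but a forest of trees can capture.
y = 30 + 10 * np.exp(-((X[:, 0] - 8) ** 2) / 2) + rng.normal(0, 1, n)

model = RandomForestRegressor(n_estimators=100, random_state=0)
model.fit(X, y)
pred = model.predict([[8.0, 6.5, 2]])  # predict at the peak of the bump
```

Because each tree partitions the feature space into regions, the ensemble recovers the bump without any linearity assumption.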

We used the following explanatory variables:

What time is it? (Hour + (Minute/60))

What is the date? (Month + (Day/31))

What day of the week is it? (Monday, Tuesday…)
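The three encodings above can be computed from a timestamp as follows (the function name is ours, for illustration):

```python
from datetime import datetime

def encode_features(ts: datetime):
    """Encode a timestamp into the three explanatory variables listed above."""
    time_of_day = ts.hour + ts.minute / 60   # What time is it?
    date_in_year = ts.month + ts.day / 31    # What is the date?
    day_of_week = ts.weekday()               # 0 = Monday, ..., 6 = Sunday
    return [time_of_day, date_in_year, day_of_week]

features = encode_features(datetime(2012, 11, 5, 14, 30))
```

Dividing minutes by 60 and days by 31 turns each pair into a single continuous feature, which trees can split on at any threshold.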

“All roads are equal, but some roads are more equal than others!”

Our final algorithm combines two methods. Both methods include time, date, and day of week as explanatory variables.

Method A:

Additionally encodes the most recent observation for 2 neighboring routes

Method B:

Additionally encodes the most recent observation for 4 neighboring routes, plus the recent trend in travel time

We tried both methods and discovered that Method A performed better on some route segments, while Method B performed better on others!

Our final algorithm uses:

Method A for route segments 40105-41160

Method B for route segments 40010-40100
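The per-segment selection could be expressed as a simple dispatch on the segment ID, using the ranges quoted above (the function is a hypothetical sketch, not the actual competition code):

```python
def choose_method(segment_id: int) -> str:
    """Pick the forecasting method for a route segment, per the ranges above."""
    if 40105 <= segment_id <= 41160:
        return "A"   # Method A: 2 neighboring routes
    if 40010 <= segment_id <= 40100:
        return "B"   # Method B: 4 neighboring routes + recent trend
    raise ValueError(f"unknown route segment {segment_id}")
```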

We prepared a PDF providing more details about Random Forests and our solution. Click here to read it!