Data Science Problem Statement

Can aircraft models or makers predict flights delays, in the local market?

Executive Summary

Every year nearly 6 millions domestic passenger flights made in United States. We are certainly living in the world with air-travels. Despite the shortest travel time recorded between point A to point B, traveling with airplane requires passengers to have additional time needed before and after you are actually sitting in the aircraft; traveling to the airport, handling baggages, having security check, shopping some souvenirs and lined up for the boarding. After all, you are expecting to have your flight takes off on time to your destination. From the most recent monthly report that Bureau of Transportation Statistics, on-time performance is 80.33% for the year-to-date through April 2018.(https://www.bts.gov/content/summary-airline-time-performance-year-date-through-april-2018) In other words, one out of five you may experience any kinds of delays.

There are many prediction model on flight delays already out in the market. Google flights already built the system by using AI to predict the delays. In my capstone project, what I tried to build is, I would like to explore a certain aircraft model or its manufacturer can play any role in forecasting delays, unlike to the many other studies. If it is discovered that there is a positive relationship between the variables with the prediction, I would like to expand my research to comparing with each airline's fleet possession. This may go further to one of the considering factor to airline company's strategic decision whether to have diverse aircrafts models or minimize their inventory.

Due to the limited time and resources, I am not yet to be reached or finalized verifying my initial inference: aircrafts can be a predictor of the flight delays. I will gear up with more features by adding cleaned aircraft information and try different modeling. This executive summary and my entire project will be updated as I will be exploring further. Until then, here is what I've been made far in my research.

I got my 29-millions-data-points-sized-file of the on-time performance for the domestic passenger flights between 2013 to 2017. After the glimpse on the dataset and decided to subset it as local market observation with holidays time period. It ended up with the 520 thousands of records for the computational purpose. I ran for the Logistic Regression and Random Forest modeling to try out.

Since the dataset itself already contains what's the cause of its delays in five different categories, I used my target as binary classification on the delay time over 15 mins in either departure or arrival scene. Unlikely to the common conception as extreme weather will be the primary cause, late-arriving aircraft is the biggest reason of the flight delays. Therefore, I would like to gear my model with aircrafts model, manufacturer and its age, as I mentioned above. Hence I set my features as sin/cos time, distance, dummified all categorical variables, such as: day of week, carrier, airports, aircraft type, manufacturer, etc.

The baseline accuracy were 0.7728 for not delayed, 0.2271 for delayed. In my first model with Logistic regression, it slightly defeat the baseline as train set: 0.7734, test set: 0.7731 . It will be improved with some more feature engineering on my dataset. The next model is Random Forest, it turned out train set: 0.9843, test set: 0.8498. With GridSearchCV on RandomForest, as max-depth = 15 and n-estimator = 10, it got train set: 0.7916, test set: 0.7901 beat the baseline.

Disclaimer: All of the large data files ignored by .gitignore, therefore notebook might not reproduce same results without the datasets. Also, flightradar24 log-in credential muted for the security purpose.

Data Science Tools and Methods Applied

ML (Machine Learning)

Tried several models to predict.

RandomForest

LogisticRegression

Webscrapping

To create automatic webscrapping for detailed information on the aircrafts

Selenium

Data Acquisition

(1) On-Time Performance data from the Bureau of Transportation Statistics

Pulled monthly data on domestic passenger flights from the agency's website between 2013 January to 2017 December and aggregate them into one dataset.

Acknowledgements and references

(1) Ben Shaver

General Assembly's local instructor, Ben, helped a lot on the code challenges I had. Also, he showed his interests in this project and put a lot of his time working on the progress. I would like to acknowledge his help and dedication.

(2) kaggle Tutorial

This state of the art tutorial on predicting delays with strong visualization and codes guided me to unlock all the possibilities on this project. Many codes and plots, I commented in the notebook, were captured from this work. I would like to extend my appreciation to Mr. Fabien Daniel