NYC Yellow Cab Data in Azure SQL Data Warehouse

Microsoft Azure provides a convenient feature when creating a new Azure SQL Data Warehouse database. You can create the database from a backup of the Microsoft sample database, AdventureWorksDW. This is super fast and easy. It also has huge tables like dbo.FactInternetSalesReason with 64,515 rows!

Oh, hold on, I think I typo’ed. No, actually, that is all. For the scale and performance of SQL Data Warehouse, 64,515 rows in a fact table is quite small.

Early attempts

On four separate occasions, I decided to build a larger SQL Data Warehouse for testing and demonstrations. My first attempt used the copy of StackOverflow’s database which Brent Ozar maintains as a BitTorrent, here. After spending all that time downloading the database and exporting it into a format that is easy to upload and load into SQL Data Warehouse, I realized that I still needed to scale up the tables. Also, it is not a data warehouse database: I wanted a star schema and data that looks more like the kinds of scenarios I wanted to demonstrate.

I then pivoted to the Azure-provided copy of AdventureWorksDW and set about extending the table sizes. I made it work, but I was disappointed by the amount of effort it would take to make the data look realistic while proportionally increasing all the dimensions along with the facts.

Calling a cab

The method that I liked the most, and finally settled on, was to use a public dataset. I wanted data that was skewed in real ways and did not require a lot of massaging. Microsoft has a great listing of public datasets here.

I decided to go with the NYC Taxi and Limousine Commission (TLC) Trip Record Data. Data is available for most taxi and limousine fares, with pickup/drop-off and distance information, between January 2009 and June 2018. This includes data for Yellow cabs, Green cabs, and for-hire vehicles. The Yellow cab data alone, from 01/2016 through 06/2018, is over 112,000,000 records (24 GB), and it downloads as easy-to-import comma-separated values (CSV) files.

Procedure for data import

Download

Downloading the files is a matter of calling a URL for each file, one per month. I did this manually because queueing up the file downloads did not take very long. However, if you wish to do something more scripted, you can simply call the same URL with the year and month number incremented.
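If you do want to script it, a loop like the sketch below works. Note that the S3 URL pattern here is an assumption based on how the TLC published the monthly files at the time, so verify it against the TLC trip record data page before relying on it.

```shell
# Assumed URL pattern for the TLC's public S3 bucket -- verify before use.
tlc_url() {  # tlc_url <year> <two-digit-month> -> prints the CSV URL
  echo "https://s3.amazonaws.com/nyc-tlc/trip+data/yellow_tripdata_$1-$2.csv"
}

for year in 2016 2017 2018; do
  for month in 01 02 03 04 05 06 07 08 09 10 11 12; do
    # The Yellow cab data used here ends with June 2018.
    if [ "$year" = "2018" ] && [ "$month" \> "06" ]; then break; fi
    url="$(tlc_url "$year" "$month")"
    echo "$url"
    # curl -fSL -O "$url"   # uncomment to actually download each file
  done
done
```

The `curl` line is commented out so you can first eyeball the 30 generated URLs before committing to a 24 GB download.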

The complete data dictionary is also available on the TLC website in PDF format.

Upload

Create an Azure blob storage account and create a container within it to hold your CSV files.
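You can do this in the Azure portal, or script it with the Azure CLI. A sketch is below; the resource group, storage account, and container names are placeholders, and the `az` commands are commented so you can review them before running (they assume you are already logged in with `az login`).

```shell
# Hypothetical names -- substitute your own.
RESOURCE_GROUP="taxi-demo-rg"      # assumption: an existing resource group
STORAGE_ACCOUNT="nyctaxistorage"   # must be globally unique, lowercase alphanumerics
CONTAINER="tripdata"

# Create the storage account, then a container inside it for the CSV files:
#   az storage account create --name "$STORAGE_ACCOUNT" \
#       --resource-group "$RESOURCE_GROUP" --sku Standard_LRS
#   az storage container create --name "$CONTAINER" \
#       --account-name "$STORAGE_ACCOUNT"

echo "Files will be uploaded to https://${STORAGE_ACCOUNT}.blob.core.windows.net/${CONTAINER}"
```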

Download and install Microsoft’s AzCopy utility. You can run it from Command Prompt or PowerShell, but the installer does not append its location to your PATH environment variable, so it is easiest to change directory straight to the utility.

Run a command similar to this one to upload your entire directory of files to blob storage.
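For example, a command along these lines, using the classic /Source and /Dest AzCopy syntax that was current at the time. The local path, storage account, container, and key are placeholders you would substitute with your own values.

```
AzCopy /Source:C:\NYCTaxiData ^
       /Dest:https://<storageaccount>.blob.core.windows.net/<container> ^
       /DestKey:<storage-account-key> /Pattern:*.csv
```

The /Pattern switch restricts the upload to the CSV files in the directory, so anything else sitting alongside them is ignored.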