Datasets for Exploratory Analysis

Exploratory analysis is your first step in most data science exercises. The best datasets for practicing exploratory analysis should be fun, interesting, and non-trivial (i.e. require you to dig a little to uncover all the insights).


Our picks:

Game of Thrones – Game of Thrones is a popular TV series based on George R.R. Martin’s A Song of Ice and Fire book series. With this dataset, you can explore its political landscape, characters, and battles.

World University Rankings – Ranking universities can be difficult and controversial. There are hundreds of ranking systems, and they rarely reach a consensus. This dataset contains three global university rankings.

IMDB 5000 Movie Dataset – This dataset explores the question of whether we can anticipate a movie’s popularity before it’s even released.

Aggregators:

Kaggle Datasets – Open datasets contributed by the Kaggle community. Here, you’ll find a grab bag of topics. Plus, you can learn from the short tutorials and scripts that accompany the datasets.

r/datasets – Open datasets contributed by the Reddit community. This is another source of interesting and quirky datasets, but the datasets tend to be less refined.

Datasets for General Machine Learning

In this context, “general” machine learning means regression, classification, and clustering with relational (i.e. table-format) data. These are the most common ML tasks.

UCI Machine Learning Repository – The UCI ML repository is an old and popular aggregator for machine learning datasets. Tip: Most of their datasets have linked academic papers that you can use for benchmarks.

Datasets for Deep Learning

While not appropriate for general-purpose machine learning, deep learning has been dominating certain niches, especially those that use image, text, or audio data. From our experience, the best way to get started with deep learning is to practice on image data because of the wealth of tutorials available.

Our picks:

MNIST – MNIST contains images for handwritten digit classification. It’s considered a great entry dataset for deep learning because it’s complex enough to warrant neural networks, while still being manageable on a single CPU. (We also have a tutorial.)

CIFAR – The next step up in difficulty is the CIFAR-10 dataset, which contains 60,000 images broken into 10 different classes. For a bigger challenge, you can try the CIFAR-100 dataset, which has 100 different classes.

ImageNet – ImageNet hosts a computer vision competition every year, and many consider it to be the benchmark for modern performance. The current image dataset has 1000 different classes.

YouTube 8M – Ready to tackle videos, but can’t spare terabytes of storage? This dataset contains millions of YouTube video IDs and billions of audio and visual features that were pre-extracted using the latest deep learning models.

Datasets for Natural Language Processing

Natural Language Processing (NLP) deals with text data. And for messy data like text, it's especially important for the datasets to have real-world applications so that you can perform easy sanity checks.

Our picks:

Enron Dataset - Email data from the senior management of Enron, organized into folders. This dataset was originally made public and posted to the web by the Federal Energy Regulatory Commission during its investigation.

Amazon Reviews - Contains ~35 million reviews from Amazon spanning 18 years. Data include product and user information, ratings, and the plaintext review.

Datasets for Cloud Machine Learning

Technically, any dataset can be used for cloud-based machine learning if you just upload it to the cloud. However, if you're just starting out and evaluating a platform, you may wish to skip all the data piping.

Fortunately, the major cloud computing services all provide public datasets that you can easily import. Their offerings are broadly comparable to one another.

Datasets for Streaming

Streaming datasets are used for building real-time applications, such as data visualization, trend tracking, or updatable (i.e. "online") machine learning models.
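
To make "updatable" concrete, here is a minimal sketch of an online model that learns from one observation at a time instead of retraining on a full dataset. Everything in it is an assumption for illustration: the `OnlineLinearModel` class, the learning rate, and the `simulated_stream` generator (a stand-in for a real feed, producing points on the line y = 2x + 1) do not come from any of the APIs listed here.

```python
from dataclasses import dataclass


@dataclass
class OnlineLinearModel:
    """One-feature linear model y ~ w * x + b, updated one observation at a time."""

    w: float = 0.0
    b: float = 0.0
    lr: float = 0.1  # learning rate; an illustrative value, not a tuned one

    def predict(self, x: float) -> float:
        return self.w * x + self.b

    def update(self, x: float, y: float) -> float:
        """Take one gradient step on squared error; return the pre-update error."""
        error = self.predict(x) - y
        self.w -= self.lr * error * x
        self.b -= self.lr * error
        return error


def simulated_stream(n: int):
    """Hypothetical stand-in for a live feed: yields (x, y) with y = 2x + 1."""
    for i in range(n):
        x = (i % 10) / 10.0
        yield x, 2.0 * x + 1.0


model = OnlineLinearModel()
for x, y in simulated_stream(5000):
    model.update(x, y)  # the model never sees more than one point at a time

print(round(model.w, 3), round(model.b, 3))
```

With this noiseless toy stream, `model.w` and `model.b` should end up close to 2 and 1. A real stream would be noisy and would call for a smaller or decaying learning rate.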

Our picks:

Twitter API - The Twitter API is a classic source for streaming data. You can track tweets, hashtags, and more.

StockTwits API - StockTwits is like a Twitter for traders and investors. You can expand this dataset in many interesting ways by joining it to time series datasets using the timestamp and ticker symbol.

Weather Underground - A reliable weather API with global coverage. Features a free tier and paid options for scaling up.
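
The timestamp-and-ticker join mentioned for StockTwits can be sketched roughly as follows. The record layout, field names, tickers, and prices below are all made up for illustration and are not the StockTwits schema. The sketch assumes your price series is stored as one-minute bars, so each message timestamp is truncated to the minute before the lookup.

```python
from datetime import datetime

# Hypothetical message records; the field names are illustrative only.
messages = [
    {"ticker": "AAA", "timestamp": "2024-01-02T09:31:45", "body": "breaking out"},
    {"ticker": "BBB", "timestamp": "2024-01-02T09:32:10", "body": "looks weak"},
    {"ticker": "AAA", "timestamp": "2024-01-02T09:40:05", "body": "taking profits"},
]

# Hypothetical one-minute price bars keyed by (ticker, minute); prices are made up.
prices = {
    ("AAA", "2024-01-02T09:31"): 10.50,
    ("BBB", "2024-01-02T09:32"): 20.25,
}


def minute_key(timestamp: str) -> str:
    """Truncate an ISO 8601 timestamp to minute resolution."""
    return datetime.fromisoformat(timestamp).strftime("%Y-%m-%dT%H:%M")


def join_messages_to_prices(messages, prices):
    """Left join: keep every message and attach the matching price, or None."""
    joined = []
    for msg in messages:
        key = (msg["ticker"], minute_key(msg["timestamp"]))
        joined.append({**msg, "price": prices.get(key)})
    return joined


joined = join_messages_to_prices(messages, prices)
for row in joined:
    print(row["ticker"], row["timestamp"], row["price"])
```

The third message has no matching bar, so its price comes back as `None`. A left join keeps such rows visible, which makes gaps in the price series easy to spot.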

Aggregators:

Satori - Satori is a platform that lets you connect to streaming live data at ultra-low latency (for free). They frequently add new datasets.

Datasets for Web Scraping

Web scraping is a common part of data science research, but you must be careful not to violate websites' terms of service. Fortunately, there's a whole site that's designed to be freely scraped.
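
For any other site, one small precaution is to check its robots.txt before fetching pages. The sketch below uses Python's standard `urllib.robotparser`. The robots.txt content, the `example-bot` user agent, and the example.com URLs are hypothetical, and the file is inlined so the code runs without network access. Keep in mind that robots.txt is a separate signal from a site's terms of service, so passing this check does not replace reading the terms.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, inlined so the sketch runs offline.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
Crawl-delay: 5
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())


def may_fetch(url: str, user_agent: str = "example-bot") -> bool:
    """Return True if the parsed robots.txt rules allow this agent to fetch the URL."""
    return parser.can_fetch(user_agent, url)


allowed = may_fetch("https://example.com/catalogue/page-1.html")
blocked = may_fetch("https://example.com/private/data.html")
delay = parser.crawl_delay("example-bot")  # seconds to wait between requests, if set

print(allowed, blocked, delay)
```

Against a live site, you would call `parser.set_url(...)` and `parser.read()` instead of parsing an inline string.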