
Data makes the world go round. The ability to pull, store, cleanse, train, move, mash, and extract data can mean success to a company. Those that can identify correlations and more importantly causations are steps above the rest. Compound this with ever increasing processing speeds, and analytics become even more accessible.

While analyzing internal data is important, there’s also a place for publicly available data, or open data. There are free datasets for everything from music and traffic signals to education, energy, and entertainment. With open data one can visualize climate trends, or discover how the world interacts through social and political perspectives. Open data is also useful to pull into internal datasets to find relevance in a larger context.
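As a rough illustration of pulling open data into an internal dataset, here is a minimal Python sketch. Everything in it is an assumption for the sake of the example: the city names, the column names, and all of the numbers are made up, and a real project would read actual files rather than inline strings. It joins an internal table of orders to an open table of populations so the internal numbers can be read in a larger context.

```python
import csv
import io

# Made-up internal data: orders per city (illustrative values only).
INTERNAL_CSV = """city,orders
City A,1200
City B,3100
City C,1800
"""

# Made-up stand-in for an open dataset: population per city (illustrative values only).
OPEN_CSV = """city,population
City A,1000000
City B,2000000
City D,700000
"""


def read_rows(text):
    """Parse CSV text into a list of dicts keyed by the header row."""
    return list(csv.DictReader(io.StringIO(text)))


def orders_per_1000_residents(internal_rows, open_rows):
    """Join the two tables on city and normalize orders by population."""
    population = {row["city"]: int(row["population"]) for row in open_rows}
    result = {}
    for row in internal_rows:
        city = row["city"]
        if city in population:  # inner join: skip cities missing from the open data
            result[city] = 1000 * int(row["orders"]) / population[city]
    return result


rates = orders_per_1000_residents(read_rows(INTERNAL_CSV), read_rows(OPEN_CSV))
for city, rate in sorted(rates.items()):
    print(f"{city}: {rate:.2f} orders per 1,000 residents")
```

Cities that appear in only one of the two tables are dropped here, which is a deliberate simplification. Depending on the question, keeping unmatched rows and flagging them may be the better choice.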

There are literally hundreds (nay, thousands?) of open datasets out there for the data enthusiast. But with so many to choose from, which ones are especially helpful? Have no fear, fellow reader! Here we present a curated list of some of the most interesting datasets out there along with a comprehensive list of free visualization tools to use in your next project.

Open Datasets & Helpful Resources

There are so many data sources, datasets, and data tools out there it can very quickly be overwhelming. Therefore we’ll spotlight a handful of unique data sources, takeaways, and helpful suggestions. These data sources are especially useful for anyone in the machine learning realm who doesn’t mind working with some serious amounts of data.

Say for instance you want real-time translation of the world’s news in 65 languages to be used for sentiment analysis in reference to geo-location. Or perhaps federal data for low-income college graduation rates overlaid with campaign funding amounts. Or perhaps one just has a unique and insatiable interest to know just how many dependencies there are for Kubernetes on GitHub. Well my friend, I might suggest some open datasets. Here are a few unique, free-to-use finds across the web:

Deeplearning4J offers powerful open source computations with neural nets and other machine learning goodies for Java. It sports an impressive archive of datasets for Natural Language Processing, facial recognition, and other image detection use cases. Warning: This gets technical quickly and is best used for machine-learning oriented work.

Libraries.io is another impressive open source project. They track “unique open source projects, 25m repositories and 85m interdependencies between them.” That’s right: dependencies between open source projects. It’s almost a 6GB download, so you’ve been warned. If you’re interested in developer communities or in maintaining open-source software specifically, this is a huge asset. Having this level of insight can allow for better discovery and use of various projects, and even improve the contributions and support they receive. That means this type of open data could be leveraged to track behaviors throughout the open source software community at large.
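To make the idea of dependency data concrete, here is a small Python sketch that counts how many projects depend on each package. It is only a toy under stated assumptions: the edge list is invented, and the `project` and `dependency` column names are hypothetical rather than the actual schema of the Libraries.io download.

```python
import csv
import io
from collections import Counter

# Toy edge list: each row says "project depends on dependency".
# Column names are hypothetical, not the real Libraries.io schema.
EDGES_CSV = """project,dependency
web-app,http-client
web-app,json-parser
cli-tool,json-parser
http-client,json-parser
cli-tool,arg-parser
"""


def count_dependents(csv_text):
    """Map each package to the number of distinct projects that depend on it."""
    seen = set()
    for row in csv.DictReader(io.StringIO(csv_text)):
        seen.add((row["project"], row["dependency"]))  # de-duplicate repeated edges
    return Counter(dep for _, dep in seen)


dependents = count_dependents(EDGES_CSV)
for name, n in dependents.most_common():
    print(f"{name}: {n} dependents")
```

On a multi-gigabyte file the same counting approach still applies, since rows are read one at a time, though the set of seen edges would need to fit in memory or be replaced with something more economical.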

The UCI Machine Learning Repository is another resource for machine learning use cases. UCI maintains 379 niche datasets on everything from wine quality and flowers to car detection and forest fires, among many other subject areas. Who knew? One potential downside is that some of this free data might be a bit stale: many datasets were donated in the 90s or mid 2000s.

DataWorld not only makes searching for datasets really easy but has a great social function, too. Developers can make their data internal, share amongst an internal or external team, or even make their data open to the world. Most compelling though is being able to manage the dataset over time and work with other developers to brainstorm or come up with neat mashups.

Data is Plural is a weekly newsletter sent out by Jeremy Singer-Vine. It’s basically a few data sources every week on a new, interesting topic. Check out the spreadsheet for the full list of the subjects, like music metadata, climate change, traffic data, and much more.

GDELT is another amazing project. It’s “one of the largest open-access spatio-temporal datasets in existence and pushing the boundaries of “big data” study of global human society.” GDELT is super interesting because it allows anyone to visualize things like protesters, populations of people displaced, or even the number of people killed due to things like natural disasters, diseases, or epidemics. There’s even a visualization tool and sample datasets to help get ideas started. They’ve also teamed up with BigQuery to make working with the data super easy.

The Expansive Open Dataset List

Here are some other massive dumps of datasets worth mentioning. If you can’t find it here it might not exist:

Enigma Public Datasets: Really beautiful site with an easy to use UI. Their data viewer tool makes checking out their broad collection of datasets very easy.

Kaggle: Kaggle is in the business of growing data scientists. They’ve got some fun competitions, educational tools, and plenty of free datasets to poke at.

Awesome Public Datasets on GitHub: Nifty repo containing a list of, you guessed it, awesome datasets. There’s some overlap with other repos here, but it’s worth including since it breaks things down into categories really nicely.

Microsoft R Network: These are sample datasets for those interested in statistical computing and machine learning. If you’re an R developer this is for you.

Open Data Network: This is made by Socrata, an organization that works with government data. If you’re interested in government data, this is an easy way to search for related datasets or data visualizations.

Google: Great way to play with data and quickly analyze it. Also helpful if you’d like to pull this into your own work.

AWS: If you’re already on AWS here are datasets for everything from GIS to machine learning.

Visualization Tools

Once a dataset is created, what’s the easiest way to create a visualization quickly, accurately, and for free? After all, scrolling through an endless table isn’t that helpful. Being able to very quickly grasp a concept is the name of the game, and visualization is a great way to get the point across. Generating accurate, consumable visualizations is necessary to tell a story, solve a problem, or learn about possible solutions.

While our datasets are aimed at those who aren’t afraid of the words “machine learning,” these visualization tools are far more usable. We’ve curated the following list of tools that fit the criteria of being able to:

Upload a CSV or link to a Google sheet quickly

Get a visualization without writing a line of code

Do so for free

There are so many times when a data analyst must quickly get a data file into a tool and grab a snippet of a graph to screenshot or reference. While other packages may be more verbose, no JS tools or plugins are found here. These free, open source visualization tools are great for quickly testing a dataset before sinking time into it.
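None of the tools below require code, but for readers who don’t mind a few lines, a quick sanity check before uploading a CSV can save a confusing chart later. The Python sketch below is just one possible approach, with a made-up sample and hypothetical column names; it reports the number of rows and how many cells are empty in each column.

```python
import csv
import io

# Tiny made-up sample standing in for a real file; column names are hypothetical.
SAMPLE_CSV = """team,year,home_runs
Team A,2001,150
Team B,2001,
Team A,2002,162
"""


def profile_csv(csv_text):
    """Return the row count and, per column, how many cells are empty."""
    reader = csv.DictReader(io.StringIO(csv_text))
    missing = {name: 0 for name in reader.fieldnames}
    row_count = 0
    for row in reader:
        row_count += 1
        for name in missing:
            # Treat absent or whitespace-only cells as empty.
            if not (row[name] or "").strip():
                missing[name] += 1
    return row_count, missing


rows, missing = profile_csv(SAMPLE_CSV)
print(f"{rows} rows")
for name, count in missing.items():
    print(f"{name}: {count} empty")
```

For a real file, the inline string would be swapped for an opened file handle passed to `csv.DictReader`.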

Literally paste, upload, link to any dataset, or pull from examples. Pick a graph type. Specify what you’d like on the X, Y, or Z axis and voila! Literally that easy. The visualizations aren’t the most fun but the information gets across well. Here’s a quick screenshot of a tree-map visualizing baseball team home runs from 1871 until present:

Tableau on its own is a very robust tool. Sometimes too robust for the average user. However, Tableau Public is surprisingly easy to use. It’s a download, which isn’t the greatest experience, but working with the data is really easy. The first 10GB is freeeeee.

A screenshot of the interface:

Tableau visualization

And another screenshot of the same MLB dataset analyzing teams to home runs. Slightly more interactive and nice to look at:

Fusion Tables is an experimental Google app but gets the job done super fast. It also integrates really well with Google Sheets and other datasets. Highly recommended. Here’s a screenshot of the interface and the same baseball data of home runs to teams:

And another way to visualize data. Making charts is super easy. It’s more difficult to control the types of charts but nonetheless it gets the job done:

Plot.ly is another free visualization tool, though the UI can be somewhat confusing at times. While the above tools are more usable, Plot.ly is a great way to quickly mash up data when working with smaller datasets.

About Ashley Hathaway

Ash works at Pivotal in NYC as a Senior Product Manager. Prior to Pivotal, she lived and worked in Austin on the IBM Watson Developer Cloud as a Senior PM and Dev Evangelist for their API product offering. As a former designer, she’s passionate about big systems thinking and user-centered activities. She loves The Astros but not as much as The Rockets.