PostgreSQL Sample Database

For a recent class on databases, we had to put together a database as a final graduate project. Rather than let my work go to waste, I figured that it would be fun to share it with the world.

spooky bois and spooky data

Getting the Sample Database

First off, this will only work on PostgreSQL. As best as I can tell, I didn’t use any fancy new features of Postgres, so this should work on anything that you’ve installed within recent memory. In my case, I’m using PostgreSQL 10.1 installed via homebrew on a Mac.

You can get the database by cloning the horror-movies-database reposi- tory and then executing psql -f sql/data.sql in the root of the project. You could just as easily open data.sql in DataGrip or pgAdmin and execute it from there.

What Will You Find?

The sample database contains horror movies, crawled from IMDB, from prompt cloud’s “Spooky Dataset for Halloween”. You’ll find a variety of movies in the horror genre from all over the planet. The schema is broken up into the following tables:

movies: Core data about movies.

genres and movie_genres: A many-to-many relationship. Movies can have more than one genre, after all. “Horror” is still included in here, take care when doing genre-based queries.

locations and movie_locations: A many-to-many relationship for filming locations. Unfortunately, the data only seems to have one shooting location for movies, but it’s more fun to plan ahead, right?

cast and movie_cast: The cast of the movie!

countries: The country of origin.

ratings: The movie rating (R, PG-13, etc).

You’ll also find that the data isn’t very clean. A number of movies are missing information, but there’s no real rhyme or reason to what’s missing. This does make querying more fun, though.

Sample Queries

Just to show a bit of interaction, here are two sample queries to get you started. The first query shows the highest budget movie by year and country of origin:

This next query shows the average IMDB rating by country of origin and release year:

SELECT
c.country_name,
m.release_year,
AVG(review) AS average_review_score
FROM movies m
JOIN countries c ON m.country_id = c.id
WHERE release_year IS NOT NULL
AND m.review IS NOT NULL
GROUP BY c.country_name, m.release_year
ORDER BY country_name, release_year;

Have Fun!

Remember to have fun! There’s a lot of interesting data out there that you can play around with. If you find something interesting, share it with the world.

Ask and ye shall receive! I came across this NYC taxi and Uber data set while searching for data sets for the project. It is around 220GB when fully loaded into Postgres and, according to the author, it takes about 3 days to download and import on some older hardware. I made it around 1/3 of the way through in ~12 hours, before realizing that I needed to download the files and import it, so it’d require about 500GB of space before I could free up the source files.