Sifting through the Nonsense
Using Math to Find Great Neighborhoods Anywhere

July 16, 2017 8 minute read

If you're just here for the results, feel free to jump down to them.

The Problem

It's hard to know where to stay when visiting a new city. First, there are multiple websites to find accommodations, each with its own list of bookings. Then you have to pick the neighborhood you want to stay in, cross-referencing to make sure it's close to the things you want to do. Finally, you have to pick an individual hotel, hostel, or roomshare based on prices & user reviews. The whole process can be pretty daunting, especially if it's somewhere you've never stayed before.

One of the easiest ways to find accommodations in a city is with an interactive map. A lot of booking sites allow you to search for places using one - but it can quickly lead to analysis paralysis as the map is often littered with too many choices and no guidance for which neighborhoods are the best.

This is an actual booking website's map, which refreshes every 5 seconds. I have no idea what's going on.

Here at TripHappy, we've built a tool to help make this process easier. We wanted to find the best places to stay in the best neighborhoods, close to the things that we want to do. We've analyzed 37 million reviews from three top booking sites - Hostelworld, Booking.com, and Airbnb - in order to find the best places to stay in any city in the world.

Our tool uses cluster analysis (a data science technique!) to group highly rated accommodations together into neighborhoods. We show you the best neighborhoods in a city and nearby popular things to do, so you can be confident you're staying somewhere great. From there, you can compare available bookings across Hostelworld, Booking.com, and Airbnb.

Much better! Less clutter!

Let's walk through how we built this tool using data supplied to us by our affiliate partners.

Getting the Data

We received accommodation data about hostels, hotels, and homestays from our affiliate partners covering 13,000 cities and over 1 million accommodations. We also received reviews for each of these accommodations, giving us just over 37 million reviews, or about 3,000 reviews per city on average. That's a lot to work with! Without filtering the data, there's just too much going on to make sense of anything. For example, here's a map of Paris with all of our accommodations.

Too many accommodations! We're no better than anyone else :(

Initial Sifting

Our next step was to sift through the data to find what we could work with. We would definitely need the coordinates of each accommodation in order to group & map them, but there were also multiple review data fields to choose from. For example, each accommodation has an overall rating plus sub-ratings that vary by provider, such as cleanliness, security, and value.

We decided on using an accommodation's location sub-rating as a proxy for how good the neighborhood is. We made an assumption that the best neighborhoods would have tight groupings of high location sub-ratings together. Even if one particularly bad place has a low score, the surrounding accommodations would make up for it in the best neighborhoods.

Histogram of average location ratings for 400 accommodations in Paris, with roughly 100,000 reviews.

The histogram of location ratings shows a heavily left-skewed distribution with a peak at 9: most accommodations score high, with a long tail of lower scores. Let's see what happens when we keep only the short upper tail, the accommodations with location ratings over 9.

Less clutter!

Much better! You can already start to see pockets of accommodations along the Champs-Élysées, Montmartre, and St Gervais. With this subset, we should be able to start the cluster analysis.
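As a rough illustration of this filtering step, here's a minimal sketch in Python. The `Accommodation` record, its field names, the 0-10 rating scale, and the sample values are all assumptions made up for this example; only the "location rating over 9" cutoff comes from the text above.

```python
from dataclasses import dataclass


@dataclass
class Accommodation:
    # Hypothetical record layout; the real provider data has more fields.
    name: str
    lat: float
    lon: float
    location_rating: float  # average location sub-rating, assumed 0-10


def highly_rated(accommodations, threshold=9.0):
    """Keep only accommodations whose location sub-rating is over the threshold."""
    return [a for a in accommodations if a.location_rating > threshold]


# Made-up sample data, for illustration only.
sample = [
    Accommodation("A", 48.8698, 2.3078, 9.4),
    Accommodation("B", 48.8867, 2.3431, 9.1),
    Accommodation("C", 48.8400, 2.3200, 7.8),
    Accommodation("D", 48.8566, 2.3560, 9.0),
]

subset = highly_rated(sample)
print([a.name for a in subset])
```

Note that the comparison is strict, so an accommodation rated exactly 9.0 is left out, matching "over 9".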

Selecting an Algorithm

There's a lot of great literature on the various types of clustering algorithms out there. For our location data, we decided to implement a modified version of the DBSCAN algorithm with a recursive heuristic to help parameterize the model appropriately for all cities, big and small.

The clustering algorithm takes in the coordinates of our highly rated accommodation subset and groups them into clusters based on geographic proximity. Heuristically parameterizing the model gives us some nice clusters around the Seine, one in Montmartre, and one by the Gare du Nord train station.
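To make the idea concrete, here's a small, self-contained sketch of plain DBSCAN over latitude/longitude pairs using great-circle (haversine) distance. It is not our production code: the `cluster_city` function is a hypothetical stand-in for the recursive heuristic (it simply widens the search radius until enough clusters appear), and the parameter values and sample coordinates are assumptions chosen for illustration.

```python
import math

EARTH_RADIUS_KM = 6371.0


def haversine_km(p, q):
    """Great-circle distance in km between two (lat, lon) points given in degrees."""
    lat1, lon1, lat2, lon2 = map(math.radians, (p[0], p[1], q[0], q[1]))
    a = (math.sin((lat2 - lat1) / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin((lon2 - lon1) / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


def dbscan(points, eps_km, min_samples):
    """Plain DBSCAN. Returns one label per point: 0, 1, ... for clusters, -1 for noise."""
    n = len(points)
    # Brute-force neighborhoods; each one includes the point itself.
    neighbors = [[j for j in range(n) if haversine_km(points[i], points[j]) <= eps_km]
                 for i in range(n)]
    labels = [None] * n
    cluster = -1
    for i in range(n):
        if labels[i] is not None:
            continue
        if len(neighbors[i]) < min_samples:
            labels[i] = -1  # noise for now; may be claimed as a border point later
            continue
        cluster += 1
        labels[i] = cluster
        queue = list(neighbors[i])
        while queue:
            j = queue.pop()
            if labels[j] == -1:
                labels[j] = cluster  # border point: joins the cluster, not expanded
            if labels[j] is not None:
                continue
            labels[j] = cluster
            if len(neighbors[j]) >= min_samples:
                queue.extend(neighbors[j])  # core point: keep growing the cluster
    return labels


def cluster_city(points, eps_km=0.5, min_samples=3, min_clusters=1, max_depth=6):
    """Hypothetical heuristic: recursively widen eps until enough clusters form."""
    labels = dbscan(points, eps_km, min_samples)
    n_clusters = len(set(labels) - {-1})
    if n_clusters >= min_clusters or max_depth == 0:
        return labels, eps_km
    return cluster_city(points, eps_km * 1.5, min_samples, min_clusters, max_depth - 1)


# Made-up coordinates: two tight groups in Paris plus one isolated point.
points = [
    (48.8867, 2.3431), (48.8870, 2.3435), (48.8864, 2.3428),
    (48.8566, 2.3522), (48.8569, 2.3526), (48.8563, 2.3519),
    (48.8400, 2.3200),
]
labels, eps_used = cluster_city(points)
print(labels, eps_used)
```

The brute-force neighbor search is quadratic in the number of points, which is fine for a few hundred accommodations per city; a spatial index would be the usual choice for larger inputs.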

Looks good!

Now that we have the back-end all sorted out to generate the clusters, we can focus on the front-end to display relevant information to the users. Let's limit our initial search results so as not to clutter the map, and add in some points of interest (gray markers).

Much better! Less clutter!

The four neighborhoods along the Seine are close to some great points of interest, as is the cluster in Montmartre. Now we can pick which neighborhood we want to stay in based on the things we want to do, and then find the perfect accommodation!
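One simple way to pair a neighborhood with nearby points of interest is to take the cluster's center and keep the closest few markers within some radius. The sketch below assumes exactly that; the function names, the 1 km radius, the limit of five results, and the sample coordinates (approximate) are all illustrative assumptions rather than a description of how our map is built.

```python
import math

EARTH_RADIUS_KM = 6371.0


def haversine_km(p, q):
    """Great-circle distance in km between two (lat, lon) points given in degrees."""
    lat1, lon1, lat2, lon2 = map(math.radians, (p[0], p[1], q[0], q[1]))
    a = (math.sin((lat2 - lat1) / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin((lon2 - lon1) / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


def centroid(points):
    """Mean of (lat, lon) pairs; a reasonable approximation at neighborhood scale."""
    lats, lons = zip(*points)
    return (sum(lats) / len(points), sum(lons) / len(points))


def nearby_pois(cluster_points, pois, radius_km=1.0, limit=5):
    """Return up to `limit` POI names within radius_km of the cluster center, closest first."""
    center = centroid(cluster_points)
    ranked = sorted((haversine_km(center, (lat, lon)), name) for name, lat, lon in pois)
    return [name for dist, name in ranked if dist <= radius_km][:limit]


# Illustrative data: one cluster in Montmartre and three well-known sights.
montmartre = [(48.8867, 2.3431), (48.8870, 2.3435), (48.8864, 2.3428)]
pois = [
    ("Sacré-Cœur", 48.8867, 2.3431),
    ("Louvre", 48.8606, 2.3376),
    ("Notre-Dame", 48.8530, 2.3499),
]
result = nearby_pois(montmartre, pois)
print(result)
```

Capping the list with `limit` serves the same goal as limiting the initial search results: fewer markers, less clutter.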

Results for Other Cities

Check out results for other cities below! You can also browse our detailed where-to-stay guides for select cities.
