2017 Reading List

February 13, 2017

I had a friend ask the other day on Facebook for some recommendations for good, short books to read in 2017.

I thought about books that I could suggest, but very quickly realized that the books that I read aren’t short, and, after I reflected a little, realized that this creates some problems for me — either I lose momentum and don’t finish, or I end up finishing after a seemingly Sisyphean effort that doesn’t feel satisfying.

So I decided to search for the best, shortest books I could read in 2017. I started by pulling data from Goodreads.

Goodreads

Goodreads is a service that lets users rate books and network with other readers. Each book is rated on a 5-point scale. Books can be compiled into reading lists and shared with other users.

There are a lot of books on Goodreads, so in an effort to draw from only the best, I limited my pool to books that appeared on 30 of the most popular “Best Of” lists on the site[1]. That gave me about 34,000 books to choose from.

Ratings and Rankings

Next, I had to rank the books. I wanted to avoid one of my biggest annoyances with rating data: ignoring the trade-off between rating and sample size. If I’m looking for pizza on Yelp, why should a restaurant with a single 5-star review rank higher than a restaurant with a 4.9 rating based on 1,000 reviews?

My method for overcoming this was to assume a Dirichlet prior over a Multinomial likelihood[2]. The posterior rating then gives an estimate that penalizes obscure books until they demonstrate enough evidence of their acclaim.

The prior I used was based on the aggregate ratings of a typical book: one-half of the median count for each possible rating, which resulted in Dirichlet(13.5, 42.5, 180.5, 293, 264)[3]. You can think of it as assuming that a book will get 13.5 one-star reviews, 42.5 two-star reviews, 180.5 three-star reviews, 293 four-star reviews, and 264 five-star reviews. Each rating observed in the data then builds on the prior pseudo-counts, so every book has to prove its rating relative to all the other books.
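As a minimal sketch of the calculation described above (the prior counts come from the post; the observed star counts are made up for illustration), the posterior mean rating is a count-weighted average over the prior pseudo-counts plus the observed ratings:

```python
# Prior pseudo-counts from the post: Dirichlet(13.5, 42.5, 180.5, 293, 264),
# i.e. assumed counts of 1-star through 5-star reviews for a typical book.
PRIOR = (13.5, 42.5, 180.5, 293.0, 264.0)

def posterior_rating(observed, prior=PRIOR):
    """Expected star rating under the Dirichlet posterior.

    Each observed rating count is added to the matching prior pseudo-count,
    and the posterior mean rating is the count-weighted average of the
    star values 1 through 5.
    """
    counts = [p + o for p, o in zip(prior, observed)]
    total = sum(counts)
    return sum(star * c for star, c in zip(range(1, 6), counts)) / total

# A single 5-star review barely moves a book off the prior mean...
one_review = posterior_rating((0, 0, 0, 0, 1))
# ...while thousands of strong ratings pull the estimate up decisively.
many_reviews = posterior_rating((50, 100, 500, 2000, 4000))
```

This is why an obscure book with a handful of perfect ratings can't leapfrog a widely loved one: the prior pseudo-counts dominate until a book accumulates real evidence.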

The Final List

So, once I had 34,000 books, their length and their rating, I could finally generate my reading list.

I had only one constraint on the books in my list: they had to be less than 200 pages. So I generated my reading list, and was surprised to see that more than half of it consisted of Calvin and Hobbes collections. (For whatever reason, Goodreads users really love Calvin and Hobbes: ratings were both numerous (hundreds of thousands of users) and very high.)

So after excluding “Sequential Art” as a genre (and then a few more genres I definitely wasn’t interested in[4]), I got the list below.
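The filtering and ranking step can be sketched like this; the records and field names (`pages`, `genre`, `posterior_rating`) are hypothetical stand-ins, not the actual schema of the scraped Goodreads data:

```python
# Hypothetical sketch of assembling the final list from scored book records.
EXCLUDED_GENRES = {"Sequential Art"}  # plus the other excluded genres

def reading_list(books, max_pages=200):
    """Keep short books outside the excluded genres, best-rated first."""
    keep = [
        b for b in books
        if b["pages"] < max_pages and b["genre"] not in EXCLUDED_GENRES
    ]
    # Rank by the Dirichlet-posterior rating computed earlier.
    return sorted(keep, key=lambda b: b["posterior_rating"], reverse=True)
```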

Conclusions

I was definitely surprised by the list this approach produced; it’s full of books I wouldn’t have read otherwise. I’m not sure if I’ll read them in order of ranking, but I’m excited to see what comes out of my reading this year.

Future Work

I’d like to make this data available interactively in a subsequent post so people can generate their own booklist with their own constraints. That’ll take more time than I have available tonight, but I hope to post it soon.

[2] This is a common approach in Bayesian statistics: we let prior information inform our starting assumptions, then let the data update them, with stronger signals in the data justifying stronger departures in our conclusions.

[3] That’s actually a pretty high prior, right? I was surprised to see books rated so highly overall. I mean, should the default assumption really be that a book is a 4-star book?