You’ll probably get more out of this post if you read this one first.
Continuing from where I left off, I now want to use a normal (i.e. Gaussian) distribution to model my cookie box distributions.
Unlike the Poisson distribution I used earlier, which has just the mean as a parameter, the normal distribution is
described by both a mean and a standard deviation.
Chocolate Chip Likelihood
The likelihood that a box was generated by a normal distribution with mean ($\mu$) and standard deviation ($\sigma$) is given by
$$ \prod_{i=1}^{10} \textrm{normal}(\mu, \sigma)\textrm{.pdf}(\textrm{cookie_box}(i))$$
where pdf is the probability density function and cookie_box(i) is the number of chocolate chips on cookie i.
# Likelihood for the heavenly_bites box at mu = 8 and mu = 16 with sigma = 1
np.prod(stats.norm(8, 1).pdf(heavenly_bites))
# => 5.35e-26
np.prod(stats.norm(16, 1).pdf(heavenly_bites))
# => 1.29e-213
Since my parameter space is 2 dimensional ($\mu$, $\sigma$), I get to use these neat (but not as clear) 3D graphs to show the likelihood plots for the different
cookie boxes.
Cookie Posteriors
Each step in this process is almost exactly the same as in the one-parameter Poisson case, except now with an additional dimension.
The cookie prior is now the plane pdf($\mu$, $\sigma$) = 1.
Calculating the posterior is far more computationally expensive in this case. To get the same point density as the Poisson model, I’d need to update the prior at
$ 20,000 \times 10,000 $, i.e. $200,000,000$ points.
What’s worse, most of those posterior values in this space are ~0.
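The post doesn’t show how the likelihood grid itself gets built, so here is a minimal sketch of one way to do it. Note the grid below uses 0.01 increments rather than the 0.001 increments quoted further down, purely to keep memory manageable, and heavenly_bites holds the chip counts from the earlier post.
import numpy as np
from scipy import stats

heavenly_bites = np.array([2, 4, 6, 11, 4, 10, 7, 6, 6, 10])

mus = np.arange(0, 20, 0.01)        # candidate values for mu
sigmas = np.arange(0.01, 10, 0.01)  # candidate values for sigma (must be positive)
mu_grid, sigma_grid = np.meshgrid(mus, sigmas, indexing="ij")

# Multiply the normal pdf values of all 10 cookies at every (mu, sigma) pair
heavenly_bites_likelihood = np.ones_like(mu_grid)
for chips in heavenly_bites:
    heavenly_bites_likelihood *= stats.norm(mu_grid, sigma_grid).pdf(chips)

# The flat prior, pdf(mu, sigma) = 1, over the whole grid
cookie_prior = np.ones_like(mu_grid)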
# Thanks to numpy arrays, this part of the code doesn't change at all.
# heavenly_bites_likelihood and cookie_prior are both numpy arrays with dimensions (20000, 10000).
# They contain y values for each function, for mu from [0, 20] and sigma [0, 10]
# in increments of 0.001
# The unnormalized posterior, the numerator in Bayes' Rule
heavenly_bites_un_posterior = np.multiply(heavenly_bites_likelihood, cookie_prior)
# Approximating the integral in the denominator
normalising_factor = np.sum(heavenly_bites_un_posterior)
# Applying Bayes' Rule
heavenly_bites_posterior = np.divide(heavenly_bites_un_posterior, normalising_factor)
Calculating the credible intervals is a little trickier here, but the core idea is the same (gist).
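The gist has the author’s version; as a rough sketch of the core idea, one can marginalise the posterior along each axis and scan for the shortest interval holding 0.95 of the mass, reusing the mus and sigmas axes and heavenly_bites_posterior from the code above.
def highest_density_interval(xs, probs, mass=0.95):
    # Shortest contiguous interval of xs holding at least `mass` of the probability;
    # for a unimodal marginal this is (approximately) the highest density interval.
    cdf = np.cumsum(probs)
    best = (xs[0], xs[-1])
    for i in range(len(xs)):
        start = cdf[i - 1] if i > 0 else 0.0
        j = np.searchsorted(cdf, start + mass)
        if j < len(xs) and xs[j] - xs[i] < best[1] - best[0]:
            best = (xs[i], xs[j])
    return best

mu_marginal = heavenly_bites_posterior.sum(axis=1)     # marginalise out sigma
sigma_marginal = heavenly_bites_posterior.sum(axis=0)  # marginalise out mu
highest_density_interval(mus, mu_marginal)
highest_density_interval(sigmas, sigma_marginal)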
| Cookie Box (0.95) | $\mu$ | $\sigma$ |
| --- | --- | --- |
| Heavenly Bites | (3.72, 9.56) | (1.72, 6.06) |
| Hobbes’ Nobs | (9.95, 17.19) | (2.12, 7.38) |
| Kant Touch This | (10.85, 18.69) | (2.42, 8.18) |
Results
A Poisson distribution with mean $\lambda$ can be approximated by a normal distribution with mean and variance both equal to $\lambda$. It’s only a rough approximation at small values of $\lambda$ like these, but it does help me test my results.
| Cookie Box (0.95) | actual $\lambda$ | expected $\mu$ | expected $\sigma$ |
| --- | --- | --- | --- |
| Heavenly Bites | 8 | 8 | 2.83 |
| Hobbes’ Nobs | 12 | 12 | 3.46 |
| Kant Touch This | 16 | 16 | 4.00 |
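As a quick check on the expected $\sigma$ column ($\sqrt{\lambda}$):
np.sqrt([8, 12, 16]).round(2)
# => array([2.83, 3.46, 4.  ])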
Excellent! The actual values are within the predicted ranges; the model works.
Issues
The main problem is the computational cost: every additional dimension, or an increase in the range, can make the model prohibitively expensive.
Fortunately for us, Monte Carlo methods like the Metropolis algorithm generate useful results far more efficiently.
More about these in a later post. In the meantime, take a look at Chris’ post on the subject.

I love chocolate chip cookies; the more chocolate chips the better.
I’ve managed to narrow down my favourites to 5 brands1: Heavenly Bites, Subjectively Scrumptious, Hobbes’ Nobs, The Happiness Mill and Kant Touch This.
I’d like to consistently buy the brand with the most chocolate chips per cookie and so, for science, I pick up a box of each with 10 cookies per box.
import numpy as np
from scipy import stats
np.random.seed(1) # for repeatability
# Each box of cookies will be generated by drawing from a poisson distribution with
# heavenly_bites having the lowest mean (8) and kant_touch_this, the highest (16)
heavenly_bites = stats.poisson(8).rvs(10)
subjectively_scrumptious = stats.poisson(10).rvs(10)
hobbes_nobs = stats.poisson(12).rvs(10)
the_happiness_mill = stats.poisson(14).rvs(10)
kant_touch_this = stats.poisson(16).rvs(10)
# each number in the array represents the number of chocolate chips on a cookie
heavenly_bites
# => [ 2 4 6 11 4 10 7 6 6 10]
Chocolate Chip Likelihood
Just like earlier, I need a way to measure the likelihood of seeing a particular set of cookies. For this
problem I’m going to assume that each bakery produces cookies following a Poisson distribution
with some mean.
The probability that a box was generated by a poisson distribution with mean, $\lambda$, is given by
$$ \prod_{i=1}^{10} \textrm{poisson}(\lambda)\textrm{.pmf}(\textrm{cookie_box}(i))$$
where pmf is the probability mass function and cookie_box(i) is the number of chocolate chips on cookie i.
# Likelihood for the heavenly_bites box at lambda = 8 and lambda = 16
np.prod(stats.poisson(8).pmf(heavenly_bites))
# => 6.36e-12
np.prod(stats.poisson(16).pmf(heavenly_bites))
# => 8.47e-27
These are the likelihood plots for Heavenly Bites, Hobbes’ Nobs and Kant Touch This (see how the peaks of the functions are near the true lambda values).
Cookie Posteriors
I’m going to assume that it’s impossible to have a cookie with more than 20 chips (or fewer than 0). I’m also choosing an initial prior that
assigns equal probability to all possible values of $\lambda$ in this space.
Unfortunately, there’s no clever hack I can use to easily calculate the posterior, so I’ll have to resort
to a numerical approximation of Bayes’ Rule.
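The post doesn’t show how the likelihood and prior arrays are built; continuing from the earlier code, one straightforward way is:
lambdas = np.linspace(0.001, 20, 20000)  # 20000 candidate values for lambda
heavenly_bites_likelihood = np.ones_like(lambdas)
for chips in heavenly_bites:
    heavenly_bites_likelihood *= stats.poisson(lambdas).pmf(chips)

cookie_prior = np.ones_like(lambdas)  # equal probability for every candidate lambda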
# heavenly_bites_likelihood and cookie_prior are both long numpy arrays
# with dimensions (20000, 1), containing y values for x in [0, 20]
# in increments of 0.001
# The unnormalized posterior, the numerator in Bayes' Rule
heavenly_bites_un_posterior = np.multiply(heavenly_bites_likelihood, cookie_prior)
# Approximating the integral in the denominator
normalising_factor = np.sum(heavenly_bites_un_posterior)
# Applying Bayes' Rule
heavenly_bites_posterior = np.divide(heavenly_bites_un_posterior, normalising_factor)
Heavenly Bites’ posterior is a lot more concentrated than the others, but that’s the way the cookie crumbles: it all depends on the amount of information each cookie box provides, and a different set of cookies could have led to a different result.
Here’s a gist of the function I used to figure out the intervals.
Sleight of Hand
The real $\lambda$ values are 8, 12 and 16, all within a 0.95 credible interval… so I have my answer, case closed?
Well, not really. I cheated: picking a Poisson model over any other guaranteed good results; after all, that’s what I used to generate the cookie boxes in the first place.
If I was really approaching this problem without this information, I might have chosen a gaussian model or something
else entirely.
I’m going to try that next.
Update: A Clever Hack
As Chris pointed out, a Gamma distribution forms a conjugate prior for a Poisson likelihood. A Gamma distribution has the form $\textrm{gamma}(a, b)$ where $a$ is the shape parameter and $b$ is the rate (inverse scale) parameter.
If I have a chocolate chip distribution prior of $\textrm{gamma}(a, b)$ and then examine $n$ cookies, my updated posterior would be
$$\textrm{gamma}(a + \sum\limits_{i=1}^n \textrm{cookie_box}(i), b + n)$$
An uninformative gamma prior looks like $\textrm{gamma}(a = \textrm{~0}, b = \textrm{~0})$ where ~0 is very small but still positive.
from scipy.stats import gamma
# Updating the gamma(~0, ~0) prior
shape = 0 + sum(heavenly_bites)
inverted_scale = 0 + len(heavenly_bites)
heavenly_bites_posterior = gamma(shape, scale=1.0/inverted_scale)
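A nice bonus of having the posterior in closed form: the 0.95 equal-tailed credible interval (close to, though not exactly, the highest density interval) can be read straight off the scipy distribution.
# 0.95 equal-tailed credible interval from the conjugate posterior
heavenly_bites_posterior.interval(0.95)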
The credible intervals are almost the same in this case; the biggest difference is in the values of $\textrm{pdf}(\lambda)$, which is an effect of the numerical approximation.
The relative shapes of the distributions seem identical.
This is an efficient way of solving the same problem, and it also verifies the numerical approach.
Named for the philosophers found in the excellent Socrates Jones flash game.↩

While Chris has done an excellent job explaining this concept, I’m having too much fun with my coin example to stop now.
A Maybe Biased Coin
Suppose I try my hand at casting my own coin. If my work is even half as good, my coin is going to end up being rather uneven
and most likely biased.
Unlike the double headed coin problem, I know very little about the coin’s bias. One (not particularly good) choice for a prior is to assign equal probability to all possibilities.
In my last post, $\theta$ could only be one of two values and the prior reflected that. The height of a bar represented the probability mass for each value of $\theta$ and the sum of all heights was 1.
In this case $\theta$ can take any value between 0 and 1. It doesn’t make sense to talk about the probability of a specific value of $\theta$; the probability at any one point is infinitesimal.
Here the prior is plotted as a probability density function; I can calculate the probability mass between any two points $(a, b)$ by integrating $\mathrm{pdf}(\theta)$ from $a$ to $b$.
Since this is a probability density function, $\int_{0}^{1} \mathrm{pdf}(\theta)\cdot d\theta = 1$
Continuous Likelihood
The binomial distribution gives the probability of an independent binary event (like a coin flip)
succeeding a given number of times in a number of attempts.
If a binary event has probability $\theta$ of succeeding and succeeds $k$ times out of $N$,
$$\mathrm{P}(\textrm{k successes, N attempts} | \theta) = \binom{N}{k}\cdot\theta^{k}\cdot(1-\theta)^{N-k}$$
Imagine I flip my coin 10 times and end up with only 2 heads.
For an unbiased coin, this is fairly unlikely
$$\mathrm{P}(\textrm{2 successes, 10 attempts} | \theta=0.5) $$
$$ = \binom{10}{2}\cdot(0.5)^{2}\cdot(1 - 0.5)^{10 - 2} $$
$$ \approx 0.044 $$
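The same number can be read off scipy’s binomial distribution:
from scipy import stats
# probability of exactly 2 heads in 10 flips of a fair coin
stats.binom(10, 0.5).pmf(2)
# => 0.0439...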
Just like last time, with a larger number of flips the likelihood for most values of $\theta$ approaches 0. With additional
flips, the likelihood spread gets narrower, peaking at $\frac{k}{N}$.
Continuous Posteriors
I can update the posterior using Bayes’ rule
$$ \mathrm{posterior}(\theta) = \frac{\mathrm{likelihood}(\theta) \cdot \mathrm{prior}(\theta)}{\int_{0}^{1} \mathrm{likelihood}(\theta) \cdot \mathrm{prior}(\theta)\cdot d\theta} $$
Usually I’d have to approximate a solution using numerical integration but there’s a simpler solution for this particular type of problem.
Conjugate Priors
If the prior and the posterior belong to the same family of functions, computing the posterior can become much simpler.
For example, if my prior is Gaussian and my likelihood is Gaussian then my posterior will also be Gaussian, i.e. Gaussian functions are conjugate to themselves.
In this case, since my likelihood function is a binomial distribution, its conjugate prior is a beta distribution.
Specifically, if my prior is of the form $\mathrm{beta}(a, b)$, where a and b are the parameters of the distribution, and the likelihood function is a binomial distribution with N attempts and k successes, then the posterior would be a beta distribution with the parameters a + k and b + N - k.
Updating the posterior reduces to simple addition.
Updating the Posterior
The uniform prior is the same as $\mathrm{beta}(1, 1)$.
For 2 heads out of 10 flips and the prior $\mathrm{beta}(1, 1)$:
$$ \mathrm{posterior}(\theta) $$
$$ = \mathrm{beta}(a + k, b + N - k) $$
$$ = \mathrm{beta}(1 + 2, 1 + 10 - 2) $$
$$ = \mathrm{beta}(3, 9) $$
Since pdf is a function for probability density and not probability mass, $\mathrm{pdf}(\theta)$ can be greater than 1 as long as the integral over (0, 1) is 1.
Credibility
It doesn’t make sense to ask for the probability of a particular $\theta$, but it’s useful to know the probability of a range. Knowing that $P(0.1\lt\theta\lt0.3) = 0.95$ is
far better than where I started, especially as the range gets smaller.
This is a credible interval and there are several ways to pick one. I like to pick the highest density interval, the shortest range (or set of ranges) that contains a certain amount of probability mass (0.95 for example).
For a unimodal distribution, like a beta distribution, there is only 1 highest density interval.
Making a Decision
There will always be uncertainty and not making a decision is often not an option. In this situation, I have a few choices. I could pick the mean, which is the expected value of $\theta$, or the mode, which is the most likely value of the distribution (the peak).
| a | b | mode | mean | variance |
| --- | --- | --- | --- | --- |
| 1 | 2 | 0 | 0.33 | 0.05 |
| 2 | 5 | 0.2 | 0.29 | 0.03 |
| 3 | 9 | 0.2 | 0.25 | 0.01 |
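These summaries are easy to reproduce with scipy; here is a small sketch, using the standard mode formula $(a-1)/(a+b-2)$ for the beta distribution:
from scipy import stats

def summarise_beta(a, b):
    dist = stats.beta(a, b)
    mode = (a - 1) / (a + b - 2)  # peak of the density; 0 for beta(1, 2)
    return mode, dist.mean(), dist.var()

for a, b in [(1, 2), (2, 5), (3, 9)]:
    print(a, b, summarise_beta(a, b))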
Either way, with additional flips the variance drops and both measures of central tendency predict $\theta$ more accurately.
Complications
But what if my $\theta$ changes over time, or I have events with multiple possible values and can’t use beta distributions or even numerical integration?
More on that next time.

Plotting Priors
I often find it useful to plot probability functions; it gives me a better idea of how my probabilities are being updated.
Take the double headed coin problem.
A jar has 1000 coins, of which 999 are fair and 1 is double
headed. Pick a coin at random, and toss it 10 times. Given
that you see 10 heads, what is the probability that the next
toss of that coin is also a head?
This plot represents my prior probability for $\theta$, the probability that the coin I pull from the jar will land heads.
Since most of the coins are unbiased (999 out of 1000), most of the mass is concentrated at 0.5. There’s
an almost invisible sliver at 1.0 for the double headed coin.
To calculate P(heads), I take the sum over $\theta$ weighted by $P(\theta)$
$$ = \sum\limits_{i=1}^N \theta_i \cdot P(\theta_i) $$
$$ = 0.5 \cdot \frac{999}{1000} + 1.0 \cdot \frac{1}{1000} $$
$$ P(\textrm{ heads }) = 0.5005 $$
Coin Flip Likelihood
After seeing new evidence, to compute a posterior I need both the prior and the likelihood of the evidence. The posterior
is the normalised product of the prior and the likelihood.
For an unbiased coin, the probability of $z$ heads in a row is $(0.5)^{z}$; for the double headed coin it’s $(1)^{z}$, which is $1$.
In general, the probability of landing $z$ heads in a row is $\theta^{z}$ for a coin with $P(\textrm{ heads }) = \theta$.
(Plotting the likelihood for several flips)
See how quickly the likelihood falls for lower values of $\theta$: 10 heads in a row is very unlikely, and the plot after 10 flips
reflects that.
While the likelihood at each individual $\theta$ is a probability, the likelihood plot isn’t a probability distribution and doesn’t need
to sum to 1.
Plotting Posteriors
I can update the posterior at every $\theta$ using Bayes’ rule
$$ \mathrm{posterior}(\theta) = \frac{\mathrm{likelihood}(\theta) \cdot \mathrm{prior}(\theta)}{\sum\limits_{i=1}^N \mathrm{likelihood}(\theta_i) \cdot \mathrm{prior}(\theta_i)} $$
The key takeaway here is that the posterior is simply the prior scaled by the likelihood and then renormalised.
The complicated-looking denominator is only there to ensure that the posterior sums to 1, making it a valid probability distribution.
The likelihood function is exponential in $z$. The first few flips barely change the distribution, but by $z=10$ either kind of coin is about equally likely.
Calculating $ P(\textrm{ heads }) $ for each $ z $
$ P(\textrm{ heads } | z=\space\space1) = 0.5 \cdot 0.998 + 1.0 \cdot 0.002 = 0.501 $
$ P(\textrm{ heads } | z=\space\space5) = 0.5 \cdot 0.969 + 1.0 \cdot 0.031 = 0.516 $
$ P(\textrm{ heads } | z=10) = 0.5 \cdot 0.494 + 1.0 \cdot 0.506 = 0.753 $
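A minimal sketch reproducing these numbers, treating the jar problem as a two-point discrete distribution:
import numpy as np

# theta is either 0.5 (fair) or 1.0 (double headed), with prior masses 999/1000 and 1/1000
thetas = np.array([0.5, 1.0])
prior = np.array([0.999, 0.001])

for z in (1, 5, 10):
    likelihood = thetas ** z              # probability of z heads in a row for each theta
    posterior = likelihood * prior
    posterior /= posterior.sum()          # normalise so the posterior sums to 1
    p_heads = np.sum(thetas * posterior)  # P(heads) under the updated distribution
    print(z, posterior.round(3), round(p_heads, 3))
# z=1:  [0.998 0.002]  0.501
# z=5:  [0.969 0.031]  0.516
# z=10: [0.494 0.506]  0.753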
Further Reading
Plotting posteriors might give you insights you would otherwise have missed, and it hints at even more.
Bayes’ Rule works just as well for continuous functions; sums become integrals but the core idea remains the same. There are
even neat hacks to make some of those calculations trivial.

In my last post I started building a fanfiction recommender using collaborative filtering. In this post I’ll work through my solution, which you can find here.
Author Similarity
The core of a recommender is the similarity metric: how similar are two authors, and how similar are you to them? Since I have sets of stories for each author (and a set of
my own favourites), I can use a set similarity measure.
The Jaccard Index is just the thing.
For any two finite sets, the Jaccard Index is the size of the intersection of the sets divided by the size of their union
$$ J(A, B) = \frac{| \space A \cap B \space |}{| \space A \cup B \space |} $$
The Jaccard Index lies between 0 and 1, inclusive: 1 when both sets are identical and 0 when they have no elements in common.
$$ A = \{ 1, 2, 3 \}, \space B = \{4, 5, 6\}, \space J(A, B) = \frac{0}{6} = 0 $$
$$ A = \{ 1, 2, 3 \}, \space B = \{2, 3, 4\}, \space J(A, B) = \frac{2}{4} = 0.5 $$
$$ A = \{ 1, 2, 3 \}, \space B = \{1, 2, 3\}, \space J(A, B) = \frac{3}{3} = 1 $$
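The notebook linked above has the author’s implementation; a minimal sketch of the metric itself could look like this:
def jaccard_index(a, b):
    """Size of the intersection divided by the size of the union."""
    a, b = set(a), set(b)
    if not a and not b:
        return 0.0  # convention when both sets are empty
    return len(a & b) / len(a | b)

jaccard_index({1, 2, 3}, {2, 3, 4})
# => 0.5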
For each author, I calculated the Jaccard Index between a set containing their own and favourite stories and the set of my favourite stories, to get something like the table
below.
Top 5 authors

| author | name | similarity |
| --- | --- | --- |
| 4976703 | alexanderwales | 0.384615 |
| 5118664 | daystar721 | 0.333333 |
| 3989854 | Sir Poley | 0.222222 |
| 4767519 | Scientist’s Thesis | 0.187500 |
| 3344060 | Velorien | 0.166667 |
Ranking Stories
Now that I’ve got a similarity score for authors, I still need to convert this into a story recommendation. A possible strategy is to calculate the average similarity across
all authors who have favourited or written a story and use that as the story preference.
For example, if alexanderwales (0.38) wrote a story and Sir Poley (0.22) is the only person who liked it, the preference score would be $ \frac{0.38 + 0.22}{2} = 0.3 $
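As a sketch of that averaging step (the variable names here are hypothetical, not the notebook’s):
def preference_scores(story_to_authors, author_similarity):
    # story_to_authors: story id -> ids of authors who wrote or favourited it
    # author_similarity: author id -> Jaccard similarity with my favourites
    scores = {}
    for story, authors in story_to_authors.items():
        sims = [author_similarity.get(a, 0.0) for a in authors]
        if sims:
            scores[story] = sum(sims) / len(sims)
    return scores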
This could work because:
Authors similar to me like similar stories, which get scored higher, while dissimilar authors contribute lower scores.
Outliers (stories I wouldn’t have liked but that are liked by one very similar author) get pulled down by the dissimilar authors who also liked them.
This may not work because:
If a story hasn’t been liked by enough authors, the average won’t be very representative.
Very popular stories, liked by a large group of people, would be ranked lower even if I’d have liked them.
Top 5 stories

| story | title | sim_total | sim_count | sim_score |
| --- | --- | --- | --- | --- |
| 10327510 | A Bluer Shade of White | 0.384615 | 1 | 0.384615 |
| 10023949 | Harry Potter and the Philosopher’s Zombie | 0.717949 | 2 | 0.358974 |
| 9676374 | Daystar’s Remix of Rationality | 0.333333 | 1 | 0.333333 |
| 9794740 | Pokemon: The Origin of Species | 0.967949 | 4 | 0.241987 |
| 9658524 | Branches on the Tree of Time | 0.469361 | 2 | 0.234681 |
Conclusions
A Bluer Shade of White and Daystar's Remix of Rationality both have only 1 data point, so they are probably mis-scored; on the other hand,
a story I do like, Methods of Rationality, has a score of 0.08 because of the popularity problem. Both of these problems are a result
of limited data (only 100 authors) and should go away with more.
I still see a bunch of promising stories I haven’t read and hadn’t heard of before, so this is tentatively successful; I’ll need to
read them to make sure.
I encourage you to try this yourself; I’ll be interested to see if it helps.
[update]
Recommending stories you aren’t interested in? Try out filtered.ipynb to
use only a subset of the data filtered by genre and category.

I’m currently working through Python for Data Analysis (you
can follow along here) using the useful IPython Notebook
and Pandas. While the book has been enlightening, I think it’s time to dive into a meaty problem to get some
experience.
Some of the stories on fanfiction.net are brilliant but they’re very difficult to find; there are just too many to look through. So
far I’ve relied on r/rational as a filter but I think I can do better.
My goal is to create a simple collaborative filtering recommender using the preferences
of the authors on fanfiction.net.
The Data
Authors on fanfiction.net have lists of stories they’ve written, favourite stories and favourite authors.
Each story has several attributes, like chapters and categories.
I’ve written a web scraper which collects author records using a
breadth-first search starting with a seed list of author ids.
In case you don’t feel like picking up Clojure or writing your own scraper, I’ve uploaded 100 author records
here.
// an author record
{"id":"2269863",
"name":"Less Wrong",
"author-stories":[{"id":"5782108",
"title":"Harry Potter and the Methods of Rationality",
"author":"2269863",
"date-submitted":"2010-02-28T08:12:39Z",
"date-updated":"2014-07-26T02:01:00Z",
"categories":["Harry Potter"],
"genres":["Drama", "Humor"],
"chapters":102,
"word-count":565501,
"follows":13290,
"favourites":14432,
"reviews":24766,
"language":"English",
"complete?":false,
"rating":"T"},
...],
"favourite-stories":[{"id":"5193644",
"title":"Time Braid",
... },
...],
"favourite-authors":["4976703",
... ]}
Wrangling
A step-by-step guide on how to parse these records can be found in wrangling.ipynb.
I’ve decided on 6 DataFrames:
authors - author information (only name for now) indexed by author id
stories - story information indexed by story id
favourite_authors - mapping of author id to favourite author ids
favourite_stories - mapping of author id to favourite story ids
genres - a relation DataFrame mapping story ids to genres
categories - a relation DataFrame mapping story ids to categories
For the full schema in nicely formatted markdown, check out the metafiction readme.
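wrangling.ipynb has the full walkthrough; as a rough sketch of the idea, assuming the scraped records sit in a file with one JSON object per line under the placeholder name authors.json:
import json
import pandas as pd

with open("authors.json") as f:
    records = [json.loads(line) for line in f]

# authors: name indexed by author id
authors = pd.DataFrame(
    [{"author": r["id"], "name": r["name"]} for r in records]
).set_index("author")

# favourite_stories: mapping of author id to favourite story ids
favourite_stories = pd.DataFrame(
    [{"author": r["id"], "story": s["id"]}
     for r in records for s in r["favourite-stories"]]
)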
My first plot
Matplotlib plays well with Pandas, and a one-liner reveals the most popular genres in the records I’ve got.
genres.sum().sort_values().plot(kind="barh")
Stories can be in multiple genres, and most stories have at least one romantic relationship, so this isn’t as damning as you might believe.
Although ‘angst’ and ‘hurt’ being quite high up does lend some credence to the image of the Tortured artist.
Next steps
I’ve been pointed to Programming Collective Intelligence
to help me build the recommender; any other suggestions are welcome.
Stay tuned for part II.
