Menu

Background

Backpropagation is a common method for training a neural network. There is no shortage of papers online that attempt to explain how backpropagation works, but few that include an example with actual numbers. This post is my attempt to explain how it works with a concrete example that folks can compare their own calculations to in order to ensure they understand backpropagation correctly.

If this kind of thing interests you, you should sign up for my newsletter where I post about AI-related projects that I’m working on.

Backpropagation in Python

You can play around with a Python script that I wrote that implements the backpropagation algorithm in this Github repo.

Overview

For this tutorial, we’re going to use a neural network with two inputs, two hidden neurons, two output neurons. Additionally, the hidden and output neurons will include a bias.

Here’s the basic structure:

In order to have some numbers to work with, here’s are the initial weights, the biases, and training inputs/outputs:

The goal of backpropagation is to optimize the weights so that the neural network can learn how to correctly map arbitrary inputs to outputs.

For the rest of this tutorial we’re going to work with a single training set: given inputs 0.05 and 0.10, we want the neural network to output 0.01 and 0.99.

The Forward Pass

To begin, lets see what the neural network currently predicts given the weights and biases above and inputs of 0.05 and 0.10. To do this we’ll feed those inputs forward though the network.

We figure out the total net input to each hidden layer neuron, squash the total net input using an activation function (here we use the logistic function), then repeat the process with the output layer neurons.

Total net input is also referred to as just net input by some sources.

Here’s how we calculate the total net input for :

We then squash it using the logistic function to get the output of :

Carrying out the same process for we get:

We repeat this process for the output layer neurons, using the output from the hidden layer neurons as inputs.

Here’s the output for :

And carrying out the same process for we get:

Calculating the Total Error

We can now calculate the error for each output neuron using the squared error function and sum them to get the total error:

Some sources refer to the target as the ideal and the output as the actual.

The is included so that exponent is cancelled when we differentiate later on. The result is eventually multiplied by a learning rate anyway so it doesn’t matter that we introduce a constant here [1].

For example, the target output for is 0.01 but the neural network output 0.75136507, therefore its error is:

Repeating this process for we get:

The total error for the neural network is the sum of these errors:

The Backwards Pass

Our goal with backpropagation is to update each of the weights in the network so that they cause the actual output to be closer the target output, thereby minimizing the error for each output neuron and the network as a whole.

Output Layer

Consider . We want to know how much a change in affects the total error, aka .

is read as “the partial derivative of with respect to “. You can also say “the gradient with respect to “.

We perform the actual updates in the neural network after we have the new weights leading into the hidden layer neurons (ie, we use the original weights, not the updated weights, when we continue the backpropagation algorithm below).

Hidden Layer

Next, we’ll continue the backwards pass by calculating new values for , , , and .

Big picture, here’s what we need to figure out:

Visually:

We’re going to use a similar process as we did for the output layer, but slightly different to account for the fact that the output of each hidden layer neuron contributes to the output (and therefore error) of multiple output neurons. We know that affects both and therefore the needs to take into consideration its effect on the both output neurons:

Starting with :

We can calculate using values we calculated earlier:

And is equal to :

Plugging them in:

Following the same process for , we get:

Therefore:

Now that we have , we need to figure out and then for each weight:

We calculate the partial derivative of the total net input to with respect to the same as we did for the output neuron:

Putting it all together:

You might also see this written as:

We can now update :

Repeating this for , , and

Finally, we’ve updated all of our weights! When we fed forward the 0.05 and 0.1 inputs originally, the error on the network was 0.298371109. After this first round of backpropagation, the total error is now down to 0.291027924. It might not seem like much, but after repeating this process 10,000 times, for example, the error plummets to 0.000035085. At this point, when we feed forward 0.05 and 0.1, the two outputs neurons generate 0.015912196 (vs 0.01 actual) and 0.984065734 (vs 0.99 actual).

If you’ve made it this far and found any errors in any of the above or can think of any ways to make it clearer for future readers, don’t hesitate to drop me a note. Thanks!

I spent some time recently working on a small side project that I’m excited to share with you all today. It’s an A/B test significance calculator and you can check it out at ABTestCalculator.com.

What’s an A/B test significance calculator, you ask? At a high level, A/B testing is a technique that allows you to improve your website by showing visitors one of several versions of something and then measuring the impact each version has on some other event. For example, you might A/B test the wording of a button on your homepage and find that it increases the number of people who sign up by 10%. An A/B test significance calculator helps you analyze the results of an A/B test to determine whether there is a statistically significant change that is not just the result of random chance.

Why build another? Three reasons: to learn the math, to get better at JavaScript, and to build a tool that makes the results of an A/B test easier to understand.

I think most of these other tools do users a disservice by not clearly explaining how to interpret the results. They tend to throw around the percentage improvement and significance figures without explaining what they mean which in the past has led me to make uninformed and often wrong decisions. Worse, most don’t tell you when you don’t have enough participants or conversions in your test and will happily apply statistical analysis to your results even when those methods don’t apply.

It is my hope with this tool that users leave with a clearer understand of how to interpret the results. A few features:

An executive summary that provides an overview in plain English about how to interpret the results

One graph showing where the true conversion rate for each variation falls (using something called a Wald approximation) and another showing the percentage change between those two distributions

It handles ties as well as tests where there aren’t enough participants or conversions to come to a conclusion

Results are significant when there is at least a 90% chance that one variation will lead to an improvement

The ability to copy a URL for the results to make them easier to share

If you have any suggestions on how to make it better please don’t hesitate to let me know.

On the coding site of things, most of the JavaScript I’ve done in the past (including Preceden and Lean Domain Search) has been with lots and lots of messy jQuery. A lot of new JavaScript technologies have come out in the last few years and I was put on a project at Automattic not too long ago that uses many of them. I fumbled around with it, getting stuff done but not really understanding the fundamentals.

I’m happy to say that this tool uses React for the view layer, NPM and Browserify for dependency management, Gulp for its build system, parts of ES6 for the code (courtesy of Babel), JSHint for code analysis, Mocha for testing, and Github Pages for hosting — all of which I had little to no experience with when I started this project. If you’re interested in checking it out, all of the code is open source (my first!) so you can view it on Github.

This project is the best JavaScript I know how to do so if you do check out the code, please let me know if you have any suggestions on how to improve it.

One last note in case you were wondering about the domain: the former owner had a simple A/B test calculator up on it, but wasn’t actively working on it so I found his email via WHOIS, offered him $200 for it, he countered with $700, I countered with $300 and we had a deal. Normally I wouldn’t pay someone for a domain (I heard there is this amazing service to help people find available domains…), but the price was right and I figured it was worth it for the credibility and SEO value it adds. When I showed him the new site recently all he responded with was “I’m pretty glad I sold the domain now!” which was nice :).

Today we’ll take it a step further, showing how those results can help us understand the distribution of a sample proportion.

Consider the following example:

Out of the last 250 visitors to your website, 40 signed up for an account.

The conversion rate for that group is 16%, but it’s possible (and likely!) that the true conversion rate differs from this. Statistics can help us determine a range (a confidence interval) for what the true conversion rate actually is.

Recall that in the last post we said that the mean of the binomial distribution can be approximated with a normal distribution with a mean and standard deviation calculated by:

For a proportion, we want to figure out the mean and standard deviation on a per-trial basis so we divide each formula by n, the number of trials:

With the mean and standard deviation of the sample proportion in hand, we can plot the distribution for this example:

As you can see, the most likely conversion rate is 16% (which is no surprise), but the true conversion rate can fall anywhere under that curve with it being less and less likely as you move farther away.

Where it gets really interesting is when you want to compare multiple proportions.

Let’s say we’re running an A/B test and the original had 40 conversions out of 250 like the example above and the experimental version had 60 conversions out of 270 participants. We can plot the distribution of both sample proportions with this code:

Here’s the result:

What can we determine from this? Is the experimental version better than the original? What if the true proportion for the original is 20% (towards the upper end of its distribution) and the true proportion for the experimental version is 16% (towards the lower end of its distribution)?

My handy Elementary Statistics textbook, which I’m using to get smart on the math behind A/B testing, states the following:

Normal Distribution as Approximation to Binomial Distribution

If and , then the binomial random variable has a probability distribution that can be approximated by a normal distribution with the mean and standard deviation given as:

In easier to understand terms, take the following example:

Each visitor to your website has a 30% chance of signing up for an account. Over the next 250 visitors, how many can you expect to sign up for an account?

The first formula lets us figure out the mean by simply multiplying the number of visitors by the probability of a successful conversion:

Simple enough and fairly easy to understand.

The second formula, the one to figure out the standard deviation, is less intuitive:

Why are we taking the square root of the product of these three values? The textbook doesn’t explain, noting that “the formal justification that allows us to use the normal distribution as an approximation to the binomial distribution results from more advanced mathematics”.

Because this standard deviation formula plays a big role in calculating the confidence intervals for sample proportions, I decided to simulate the scenario above to prove to myself that the standard deviation formula is accurate.

The R script below simulates 250 visitors coming to a website, each with a 30% chance of signing up. After each group of 250 visitors we track how many of them wound up converting. After all of the runs (the default is 1,000, though the higher the number the more accurate the distribution will be) we plot the probability distribution of the results in blue and a curve representing what we’d expect the distribution to look like if the standard deviation formula above is correct in red.

The distribution of results from this experiment paints a telling picture:

Not only is the mean what we expect (around 75), but the standard deviation formula (which said it would be 7.25) does predict the standard deviation from this experiment (7.25). Go figure :)

As we’ll see, we can use the fact that the normal distribution approximates a binomial distribution approximates to figure out the distribution of a sample proportion, which we can then compare to other sample proportion distributions to make conclusions about whether they differ and by how much (ie, how to analyze the results of an A/B test).

I’ve been tinkering around with R for learning more about the math behind A/B testing and figured I’d share some of the work as I go.

The website Stat Methods has an example showing how to plot a normal distribution for IQ scores, but as a beginner I found it hard to follow so I wound up re-writing it with comments, better variable names, and improved spacing. Also, instead of plotting IQ, I chose to plot men’s heights.

Here’s the final result:

And here’s the code:

# Men's heights are normally distributed with a population mean of 69.0 inches
# and a population standard deviation of 2.8 inches
population_mean <- 69
population_sd <- 2.8
sd_to_fill <- 1
lower_bound <- population_mean - population_sd * sd_to_fill
upper_bound <- population_mean + population_sd * sd_to_fill
# Generates equally spaced values within 4 standard deviations of the mean
# This is used to connect the points on the curve so the more points the better
x <- seq(-4, 4, length = 1000) * population_sd + population_mean
# Returns the height of the probably distribution at each of those points
y <- dnorm(x, population_mean, population_sd)
# Generate the plot, where:
# - type: the type of plot to be drawn where "n" means do not plot the points
# - xlab: the title of the x axis
# - ylab: the title of the y axis
# - main: the overall title for the plot
# - axes: when false it suppresses the axis automatically generated by the high level plotting function so that we can create custom axis
plot(x, y, type="n", xlab = "Height (inches)", ylab = "", main = "Distribution of Men's Heights", axes = FALSE)
# Connect all of the points with each other to form the bell curve
lines(x, y)
# Returns a vector of boolean values representing whether the x value is between the two bounds then
# filters the values so that only the ones within the bounds are returned
bounds_filter <- x >= lower_bound & x <= upper_bound
x_within_bounds <- x[bounds_filter]
y_within_bounds <- y[bounds_filter]
# We want the filled in area to extend all the way down to the y axis which is why these two lines are necessary
# It makes the first point in the polygon (lower_bound, 0) and the last point (upper_bound, 0)
x_polygon <- c(lower_bound, x_within_bounds, upper_bound)
y_polygon <- c(0, y_within_bounds, 0)
polygon(x_polygon, y_polygon, col = "red")
# Now determine the probability that someone falls between the two bounds so we can display it above the curve
# Remember that pnorm returns the probability that a normally distributed random number will be less than the given number
probability_within_bounds <- pnorm(upper_bound, population_mean, sd) - pnorm(lower_bound, population_mean, population_sd)
# Concatenate the various values so we can display it on the curve
text <- paste("p(", lower_bound, "< height <", upper_bound, ") =", signif(probability_within_bounds, digits = 3))
# Display the text on the plot. The default "side" parameter is 3, representing the top of the plot.
mtext(text)
# Add an axis to the current plot, where:
# - side: which side of the plot the axis should be drawn on where 1 represents the bottom
# - at: the points at which the tick-marks are to be drawn
# - pos: the coordinate at which the axis line is to be drawn
sd_axis_bounds = 5
axis_bounds <- seq(-sd_axis_bounds * population_sd + population_mean, sd_axis_bounds * population_sd + population_mean, by = population_sd)
axis(side = 1, at = axis_bounds, pos = 0)

If you’re experienced with R and have any suggestions on how to improve this code, just let me know.

Post navigation

About

I'm a developer at Automattic where I work on the Store Team at WordPress.com. I also built Lean Domain Search, Preceden and a number of other software products over the years. I love solving problems and helping others do the same. Drop me a note if I can help with anything.