Parallel programming with Julia using MPI

Julia has been around since 2012 and, after more than six years of development, its 1.0 version has finally been released. This is a major milestone, and one that has inspired me to write a new blogpost (after several months of silence). This time we are going to see how to do parallel programming in Julia using the Message Passing Interface (MPI) paradigm, through the open source library Open MPI. We will do this by solving a real physical problem: heat diffusion across a two-dimensional domain.

This is going to be a fairly advanced application of MPI, targeted at someone who has already had some basic exposure to parallel computing. Because of this, I am not going to go step by step but will rather focus on specific aspects that I feel are of interest (specifically, the use of ghost cells and message passing on a two-dimensional grid). As I have started doing in my recent blogposts, the code discussed here is only partial. It is accompanied by a fully featured solution, named Diffusion.jl, that you can find on Github.

Parallel computing has entered the “commercial world” over the last few years. It is a standard solution for ETL (Extract-Transform-Load) applications, where the problem at hand is embarrassingly parallel: each process runs independently from all the others and no network communication is needed (except, potentially, for a final “reduce” step, where each local solution is gathered into a global solution).

In many scientific applications, there is the need for information to be passed through the network of a cluster. These “non-embarrassingly parallel” problems are often numerical simulations that model problems ranging from astrophysics to weather modelling, biology, quantum systems and many more. In some cases, these simulations are run on tens to even millions of CPUs (Fig. 1) and the memory is distributed - not shared - among the different CPUs. Normally, the way these CPUs communicate in a supercomputer is through the Message Passing Interface (MPI) paradigm.

Anyone working in High Performance Computing should be familiar with MPI. It allows one to make use of the architecture of a cluster at a very low level. In theory, a researcher could assign to every single CPU its computational load. They could decide exactly when and what information should be passed among CPUs and whether this should happen synchronously or asynchronously.

And now, let’s go back to the contents of this blogpost, where we are going to see how to write the solution of a diffusion-type equation using MPI. We have already discussed the explicit scheme for a one-dimensional equation of this type. However, in this blogpost we will look at the two-dimensional solution.

The Julia code presented here is essentially a translation of the C/Fortran code explained in this excellent blogpost by Fabien Dournac.

In this blogpost I am not going to present a thorough analysis of scaling speed vs. number of processors. Mainly because I only have two CPUs that I can play with at home (Intel Core i7 processor on my MacBook Pro)… Nonetheless, I can still proudly say that the Julia code presented in this blogpost shows a significant speedup using two CPUs vs. one. Not only this: it is faster than the Fortran and C equivalent codes! (more on this later)

1. First impressions about Julia

I am actually still a newbie with Julia, hence the choice of including a section on my “first impressions”.

The main reason why I got interested in Julia is that it promises to be a general purpose framework with a performance comparable to the likes of C and Fortran, while keeping the flexibility and ease of use of a scripting language like Matlab or Python. In essence, it should be possible to write Data Science/High-Performance-Computing applications in Julia that run on a local machine, on the cloud or on institutional supercomputers.

One aspect I don’t like is the workflow, which seems sub-optimal for someone like me that uses IntelliJ and PyCharm on a daily basis (the IntelliJ Julia plugin is terrible). I have tried the Juno IDE as well, it is probably the best solution at the moment but I still need to get used to it.

One aspect that demonstrates how Julia has still not reached its “maturity” is how varied and outdated the documentation of many packages is. I still haven’t found a way to write a matrix of floating-point numbers to disk in a formatted way. Sure, you can write each element of the matrix to disk in a double-for loop, but there should be better solutions available. It is simply that information can be hard to find and the documentation is not necessarily exhaustive.

Another aspect that stands out when first using Julia is the choice of one-based indexing for arrays. While I find this slightly annoying from a practical perspective, it is surely not a deal breaker, also considering that this is not unique to Julia (Matlab and Fortran use one-based indexing, too).

Now, to the good and most important aspect: Julia can indeed be really fast. I was impressed to see how the Julia code that I wrote for this blogpost can perform better than the equivalent Fortran and C code, despite having essentially just translated it into Julia. Have a look at the performance section if you are curious.

2. Installing Open MPI

Open MPI is an open source Message Passing Interface library. Other famous libraries include MPICH and MVAPICH. MVAPICH, developed by the Ohio State University, seems to be the most advanced library at the moment as it can also support clusters of GPUs - something particularly useful for Deep Learning applications (there is indeed a close collaboration between NVIDIA and the MVAPICH team).

All these libraries are built on a common interface: the MPI API. So, it does not really matter whether you use one or the other library: the code you have written can stay the same.

The MPI.jl project on Github is a wrapper for MPI. Under the hood, it uses the C and Fortran installations of MPI. It works perfectly well, although it lacks some functionality that is available in those other languages.

In order to be able to run MPI in Julia you will need to separately install Open MPI on your machine. If you have a Mac, I found this guide to be very useful. Importantly, you will also need to have gcc (the GNU compiler) installed, as Open MPI requires the Fortran and C compilers. I have installed the 3.1.1 version of Open MPI, as running mpiexec --version in my Terminal confirms.

Once you have Open MPI installed on your machine, you should install cmake. Again, if you have a Mac this is as easy as typing brew install cmake on your Terminal.

At this point you are ready to install the MPI package in Julia. Open up the Julia REPL and type Pkg.add("MPI"). Normally, at this point you should be able to import the package using import MPI. However, I also had to build the package through Pkg.build("MPI") before everything worked.

3. The problem: diffusion in a two-dimensional domain

The diffusion equation is an example of a parabolic partial differential equation. It describes phenomena such as heat diffusion or concentration diffusion (Fick’s second law). In two spatial dimensions, the diffusion equation reads$$\frac{\partial u}{\partial t} = D\left(\frac{\partial^2 u}{\partial x^2} + \frac{\partial^2 u}{\partial y^2}\right)$$

The solution $u(x, y, t)$ represents how the temperature/concentration (depending on whether we are studying heat or concentration diffusion) varies in space and time. Indeed, the variables $x, y$ represent the spatial coordinates, while the time component is represented by the variable $t$. The quantity $D$ is the “diffusion coefficient” and determines how fast heat, for example, is going to diffuse through the physical domain. Similarly to what was discussed (in more detail) in a previous blogpost, the equation above can be discretized using a so-called “explicit scheme” of solution. I am not going to go through the details here (you can find them in said blogpost); it suffices to write down the numerical solution in the following form,$$\begin{equation}
\frac{u_{i,k}^{j+1} - u_{i,k}^{j}}{\Delta t} =
D\left(\frac{u_{i+1,k}^j-2u_{i,k}^j+u_{i-1,k}^j}{\Delta x^2}+
\frac{u_{i,k+1}^j-2u_{i,k}^j+u_{i,k-1}^j}{\Delta y^2}\right)
\label{eq:diffusion}
\end{equation}$$where the $i, k$ indices refer to the spatial grid while $j$ represents the time index. Assuming all the quantities at the time $j$ to be known, the only unknown is $u_{i,k}^{j+1}$ on the left hand side of eq. (\ref{eq:diffusion}). This quantity depends on the values of the solution at the previous time step $j$ in a cross-shaped stencil, as in the figure below (the red dots represent those grid points at the time step $j$ that are needed in order to find the solution $u_{i,k}^{j+1}$).
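As a serial illustration of the stencil update above (a sketch with hypothetical function and variable names, not the actual code from Diffusion.jl, and written in Python/NumPy rather than Julia for brevity):

```python
import numpy as np

def diffuse_step(u, D, dt, dx, dy):
    """One explicit time step of the 2D diffusion equation.

    u has shape (nx, ny); returns the solution at the next time step.
    Boundary values are left untouched (they act as fixed boundary conditions).
    """
    u_new = u.copy()
    # cross-shaped stencil: each interior point uses its four nearest neighbors
    u_new[1:-1, 1:-1] = u[1:-1, 1:-1] + D * dt * (
        (u[2:, 1:-1] - 2.0 * u[1:-1, 1:-1] + u[:-2, 1:-1]) / dx**2 +
        (u[1:-1, 2:] - 2.0 * u[1:-1, 1:-1] + u[1:-1, :-2]) / dy**2
    )
    return u_new
```

A uniform field is a fixed point of this update, which is a quick sanity check that the stencil is wired up correctly.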

Equation (\ref{eq:diffusion}) is really all that is needed in order to find the solution across the whole domain at each subsequent time step. It is rather easy to implement a code that does this sequentially, with one process (CPU). However, here we want to discuss a parallel implementation that uses multiple processes.

Each process will be responsible for finding the solution on a portion of the entire spatial domain. Problems like heat diffusion, which are not embarrassingly parallel, require exchanging information among the processes. To clarify this point, let’s have a look at Figure 3. It shows how processes #0 and #1 will need to communicate in order to evaluate the solution near the boundary. This is where MPI enters. In the next section, we are going to look at an efficient way of communicating.

4. Communication among processes: ghost cells

An important notion in computational fluid dynamics is that of ghost cells. This concept is useful whenever the spatial domain is decomposed into multiple sub-domains, each of which is solved by one process.

In order to understand what ghost cells are, let’s consider again the two neighboring regions depicted in Figure 3. Process #0 is responsible for finding the solution on the left hand side, whereas process #1 finds it on the right hand side of the spatial domain. However, because of the shape of the stencil (Fig. 2), near the boundary both processes will need to communicate with each other. Here is the problem: having process #0 and process #1 communicate each time they need a node from the neighboring process would result in an unacceptable communication overhead.

Instead, what is common practice is to surround the “real” sub-domains with extra cells called ghost cells, as in Figure 4 (right). These ghost cells represent copies of the solution at the boundaries of neighboring sub-domains. At each time step, the old boundary of each sub-domain is passed to its neighbors. This allows the new solution at the boundary of a sub-domain to be calculated with a significantly reduced communication overhead. The net effect is a speedup in the code.
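To make the idea concrete, here is a serial sketch of a ghost-cell exchange between two horizontally adjacent sub-domains (hypothetical layout and names: the first and last columns of each array are the ghost cells; in the actual MPI code these copies become send/receive pairs):

```python
import numpy as np

def exchange_ghost_columns(u_left, u_right):
    """Copy the boundary columns of each sub-domain into the
    neighboring sub-domain's ghost column (serial stand-in for
    the MPI Isend/Irecv exchange)."""
    # right ghost column of the left sub-domain <- first real column of the right one
    u_left[:, -1] = u_right[:, 1]
    # left ghost column of the right sub-domain <- last real column of the left one
    u_right[:, 0] = u_left[:, -2]
    return u_left, u_right
```

Once the ghost columns are filled, each process can apply the stencil to all of its real cells without any further communication during that time step.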

5. Using MPI

There are a lot of tutorials on MPI. Here, I just want to describe those commands - expressed in the language of the MPI.jl wrapper for Julia - that I have been using for the solution of the 2D diffusion problem. They are some basic commands that are used in virtually every MPI implementation.

MPI commands

MPI.init() - initializes the execution environment

MPI.COMM_WORLD - represents the communicator, i.e., all processes available through the MPI application (every communication must be linked to a communicator)

MPI.Comm_rank(MPI.COMM_WORLD) - determines the internal rank (id) of the process

MPI.Barrier(MPI.COMM_WORLD) - blocks execution until all processes have reached this routine

MPI.Bcast!(buf, n_buf, rank_root, MPI.COMM_WORLD) - broadcasts the buffer buf with size n_buf from the process with rank rank_root to all other processes in the communicator MPI.COMM_WORLD

MPI.Waitall!(reqs) - waits for all MPI requests to complete (a request is a handle, in other words a reference, to an asynchronous message transfer)

MPI.REQUEST_NULL - specifies that a request is not associated with any ongoing communication

MPI.Isend(buf, rank_dest, tag, MPI.COMM_WORLD) - the message buf is sent asynchronously from the current process to the rank_dest process, with the message tagged with the tag parameter

MPI.Irecv!(buf, rank_src, tag, MPI.COMM_WORLD) - receives a message tagged tag from the source process of rank rank_src to the local buffer buf

MPI.Finalize() - terminates the MPI execution environment

5.1 Finding the neighbors of a process

For our problem, we are going to decompose our two-dimensional domain into many rectangular sub-domains, similar to the Figure below

Note that the “x” and “y” axes are flipped with respect to the conventional usage, in order to associate the x-axis with the rows and the y-axis with the columns of the solution matrix.

In order to communicate between the various processes, each process needs to know what its neighbors are. There is a very convenient MPI command that does this automatically and is called MPI_Cart_create. Unfortunately, the Julia MPI wrapper does not include this “advanced” command (and it does not look trivial to add it), so instead I decided to build a function that accomplishes the same task. In order to make it more compact, I have made extensive use of the ternary operator. You can find this function below,

The inputs of this function are my_id, which is the rank (or id) of the process, the number of processes nproc, the number of divisions in the $x$ direction nx_domains and the number of divisions in the $y$ direction ny_domains.

Let’s now test this function. For example, looking again at Fig. 5, we can test the output for process of rank 4 and for process of rank 11. This is what we find on a Julia REPL:

As you can see, I am using the cardinal directions “N”, “S”, “E”, “W” to denote the location of a neighbor. For example, process #4 has process #3 as a neighbor located South of its position. You can check that the above results are all correct, given that “-1” in the second example means that no neighbors have been found on the “North” and “East” sides of process #11.
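For illustration, here is one possible way to implement such a neighbor lookup (a sketch in Python, not necessarily identical to the function in Diffusion.jl; it assumes ranks are laid out column by column with nx_domains ranks per column, and uses -1 for a missing neighbor, mirroring the convention above):

```python
def neighbors(my_id, nx_domains, ny_domains):
    """Return the N/S/E/W neighbor ranks of my_id on a Cartesian
    decomposition, or -1 where no neighbor exists."""
    x = my_id % nx_domains    # position along the South-North direction
    y = my_id // nx_domains   # position along the West-East direction
    return {
        "S": my_id - 1 if x > 0 else -1,
        "N": my_id + 1 if x < nx_domains - 1 else -1,
        "W": my_id - nx_domains if y > 0 else -1,
        "E": my_id + nx_domains if y < ny_domains - 1 else -1,
    }
```

With a 3x4 decomposition (12 processes), this reproduces the examples discussed above: process #4 has #3 to its South, while process #11 sits in a corner and has no neighbors to the North or East.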

5.2 Message passing

As we have seen earlier, at each iteration every process sends its boundaries to the neighboring processes. At the same time, every process receives data from its neighbors. These data are stored as “ghost cells” by each process and are used to compute the solution near the boundary of each sub-domain.

There is a very useful command in MPI called MPI_Sendrecv that allows sending and receiving messages at the same time, between two processes. Unfortunately MPI.jl does not provide this functionality; however, it is still possible to achieve the same result by using the MPI_Send and MPI_Recv functionalities separately.

This is what is done in the following updateBound! function, which updates the ghost cells at each iteration. The inputs of this function are the global 2D solution u, which includes the ghost cells, as well as all the information related to the specific process that is running the function (its rank, the coordinates of its sub-domain, and its neighbors). The function first sends its boundaries to its neighbors and then receives theirs. The receive part is finalized through the MPI.Waitall! command, which ensures that all the expected messages have been received before updating the ghost cells for the specific sub-domain of interest.

6. Visualizing the solution

The domain is initialized with a constant value $u=+10$ around the boundary, which can be interpreted as having a source of constant temperature at the border. The initial condition is $u=-10$ in the interior of the domain (Fig. 6, left). As time progresses, the value $u=10$ at the boundary diffuses towards the center of the domain. For example, at the time step j=15203, the solution looks as in Fig. 6, right.

As the time $t$ increases, the solution becomes more and more homogeneous, until theoretically for $t \rightarrow +\infty$ it becomes $u=+10$ across the entire domain.

7. Performance

I was very impressed when I tested the performance of the Julia implementation against Fortran and C: I found the Julia implementation to be the fastest one!

Before jumping into this comparison, let’s have a look at the MPI performance of the Julia code itself. Figure 7 shows the ratio of the runtime when running with 1 vs. 2 processes (CPUs). Ideally, you would like this number to be close to 2, i.e., running with 2 CPUs should be twice as fast as running with one CPU. What is observed instead is that for small problem sizes (a grid of 128x128 cells), the compilation time and communication overhead have a net negative effect on the overall runtime: the speed-up is smaller than one. It is only for larger problem sizes that the benefit of using multiple processes starts to be apparent.

And now, for the surprise plot: Fig. 8 demonstrates that the Julia implementation is faster than both Fortran and C, for both 256x256 and 512x512 problem sizes (the only ones I tested). Here I am only measuring the time needed in order to complete the main iteration loop. I believe this is a fair comparison, as for long running simulations this is going to represent the biggest contribution to the total runtime.

Conclusions

Before starting this blogpost I was fairly skeptical of Julia being able to compete against the speed of Fortran and C for scientific applications. The main reason was that I had previously translated an academic code of about 2000 lines from Fortran into Julia 0.6, and I observed a slowdown of about 3x.

But this time… I am very impressed. I have effectively just translated an existing MPI implementation written in Fortran and C, into Julia 1.0. The results shown in Fig. 8 speak for themselves: Julia appears to be the fastest by far. Note that I have not factored in the long compilation time taken by the Julia compiler, as for “real” applications that take hours to complete this would represent a negligible factor.

I should also add that my tests are surely not as exhaustive as they should be for a thorough comparison. In fact, I would be curious to see how the code performs with more than just 2 CPUs (I am limited by my home personal laptop) and with different hardware (feel free to check out Diffusion.jl!).

At any rate, this exercise has convinced me that it is worth investing more time in learning and using Julia for Data Science and scientific applications. Off to the next one!

Python implementation of Word2Vec

In this blogpost, I will show you how to implement word2vec using the standard Python library, NumPy and two utility functions from Keras. A more complete codebase can be found under my Github webpage, with a project named word2veclite. This codebase also contains a set of unit tests that compare the solution described in this blogpost against the one obtained using Tensorflow.

Our starting point is the theoretical discussion on word2vec that I presented in my previous blogpost. In particular, the objective here is to implement the vectorized expressions presented in that blogpost, since they are compact and computationally efficient. We will be simply coding up a few of those equations.

1. Requirements

For this tutorial we will be using Python 3.6. The packages that we will need are NumPy (I am using version 1.13.3) and Keras (version 2.0.9). Here Keras is only used because of a few useful NLP tools (Tokenizer, sequence and np_utils). At the end of the blogpost I am also going to add a brief discussion on how to implement word2vec in Tensorflow (version 1.4.0), so you may want to import that as well.

2. From corpus to center and context words

The first step in our implementation is to transform a text corpus into numbers. Specifically, into one-hot encoded vectors. Recall that in word2vec we scan through a text corpus and for each training example we define a center word with its surrounding context words. Depending on the algorithm of choice (Continuous Bag-of-Words or Skip-gram), the center and context words may work as inputs and labels, respectively, or vice versa.

Typically the context words are defined as a symmetric window of predefined length, on both the left and right hand sides of the center word. For example, suppose our corpus consists of the sentence “I like playing football with my friends”. Also, let’s say that we define our window to be symmetric around the center word and of length two. Then, our one-hot encoded context and center words can be visualized as follows,

Note that at the boundaries the context words are not symmetric around the center words.
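As a minimal sketch of this scan (a hypothetical helper of my own, operating on an already-tokenized sentence rather than raw text), the truncation at the boundaries comes out naturally from Python slicing:

```python
def center_context_pairs(tokens, window):
    """For each position in the sentence, return (center, context) with a
    symmetric window; near the boundaries the context is simply truncated."""
    pairs = []
    for i, center in enumerate(tokens):
        # words up to `window` positions to the left and right of the center
        context = tokens[max(0, i - window):i] + tokens[i + 1:i + 1 + window]
        pairs.append((center, context))
    return pairs
```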

We are essentially interested in writing a few lines of code that accomplish this mapping, from text to one-hot-encoded vectors, as displayed in the figure above. In order to do so, first we need to tokenize the corpus text.


```python
from keras.preprocessing.text import Tokenizer

def tokenize(corpus):
    """Tokenize the corpus text.

    :param corpus: list containing a string of text
        (example: ["I like playing football with my friends"])
    :return corpus_tokenized: indexed list of words in the corpus, in the
        same order as the original corpus (the example above would return
        [[1, 2, 3, 4, 5, 6, 7]])
    :return V: size of vocabulary
    """
    tokenizer = Tokenizer()
    tokenizer.fit_on_texts(corpus)
    corpus_tokenized = tokenizer.texts_to_sequences(corpus)
    V = len(tokenizer.word_index)
    return corpus_tokenized, V
```

The function above returns the corpus tokenized and the size $V$ of the vocabulary. The vocabulary is not sorted in alphabetical order (it is not necessary to do so) but it simply follows the order of appearance.

At this point we can proceed with the mapping from text to one-hot encoded context and center words using the function corpus2io which uses the auxiliary function to_categorical (copied from the Keras repository).

And that’s it! To show that the functions defined above do accomplish the required task, we can replicate the example presented in Fig. 1. First, we define the size of the window and the corpus text. Then, we tokenize the corpus and apply the corpus2io function defined above to find out the one-hot encoded arrays of context and center words:
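For reference, the one-hot encoding step on its own is tiny; a minimal stand-in for Keras' to_categorical (using the 1-based word indices returned by the tokenizer) could look like this:

```python
import numpy as np

def one_hot(index, V):
    """One-hot encode a 1-based word index into a vector of length V."""
    v = np.zeros(V)
    v[index - 1] = 1.0
    return v
```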

Given an array of real numbers (including negative ones), the softmax function essentially returns a probability distribution with sum of the entries equal to one.

Note that the implementation above is slightly more complex than the naïve implementation that would simply return np.exp(x) / np.sum(np.exp(x)). However, it has the advantage of being superior in terms of numerical stability.
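The numerically stable variant mentioned above subtracts the maximum entry before exponentiating, which leaves the result mathematically unchanged but prevents overflow for large inputs:

```python
import numpy as np

def softmax(x):
    """Numerically stable softmax: shifting by max(x) avoids overflow in
    np.exp while returning the same probability distribution."""
    e = np.exp(x - np.max(x))
    return e / e.sum()
```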

4. Python code for the Multi-Word CBOW model

Now that we can build training examples and labels from a text corpus, we are ready to implement our word2vec neural network. In this section we start with the Continuous Bag-of-Words model and then we will move to the Skip-gram model.

You should remember that in the CBOW model the input is represented by the context words and the labels (ground truth) by the center words. The full set of equations that we need to solve for the CBOW model are (see the section on the multi-word CBOW model in my previous blog post):
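The equations themselves are rendered in the original post; as a rough sketch of the forward pass they describe (my own variable names, not necessarily those used in word2veclite), the multi-word CBOW model averages the context one-hot vectors, projects through the two weight matrices, and applies a softmax:

```python
import numpy as np

def cbow_forward(context, W1, W2):
    """Forward pass of the multi-word CBOW model (sketch).

    context: one-hot context vectors, shape (C, V)
    W1: input-to-hidden weights, shape (V, N)
    W2: hidden-to-output weights, shape (N, V)
    Returns the predicted probability of each vocabulary word being the center word.
    """
    h = context.mean(axis=0) @ W1   # hidden layer: average of context embeddings
    u = h @ W2                      # scores over the vocabulary
    e = np.exp(u - u.max())         # numerically stable softmax
    return e / e.sum()
```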

As a further note, the output shown above represents a single epoch, i.e., a single pass through the training examples. In order to train a full model it is necessary to iterate across several epochs. We will see this in section 7.

The function skipgram defined above is used similarly to the cbow function, except that now the center words are no longer the “labels” but the inputs, while the labels are represented by the context words. Using the same user-defined parameters as in the CBOW case, we can run the same example with the following code,

As you can see, the implementation is fairly straightforward even though it did take me some time to get it right.

Although I am not going to show this here, in word2veclite I have added some unit tests that demonstrate how the Python code that we have seen in the previous sections gives virtually identical results to the Tensorflow implementation of this section.

7. Putting it all together

All the code in this blogpost can be put together as a general framework to train word2vec models. As you will know by now, I have already done this “aggregation” in the Python project word2veclite.

Using word2veclite is easy. Let’s go back to our example and suppose that the text corpus consists of the sentence “I like playing football with my friends”. Let’s also assume that we want to define the context words with a window of size 1 and a hidden layer of size 2. Finally, say that we want to train the model for 600 epochs.

and then we can plot loss_vs_epoch, which tells us how the loss function varies as a function of epoch number:

As you can see, the loss function keeps decreasing until almost zero. It does look like the model is working! I am not sure about you, but when I saw this plot I was really curious to see if the weights that the model had learnt were really good at predicting the center words given the context words.

I have added a prediction step in word2veclite that we can use for this purpose. Let’s use it and check, for example, that for the context word “like” - or [0, 1, 0, 0, 0, 0, 0] - the model predicts that the center word is “I” - or [1, 0, 0, 0, 0, 0, 0].

The backpropagation algorithm for Word2Vec

Since I have been really struggling to find an explanation of the backpropagation algorithm that I genuinely liked, I have decided to write this blogpost on the backpropagation algorithm for word2vec. My objective is to explain the essence of the backpropagation algorithm using a simple - yet nontrivial - neural network. Besides, word2vec has become so popular in the NLP community that it is quite useful to focus on it.

This blogpost is accompanied by another, more practical blogpost that you are invited to read. While this blogpost is going to be more theoretical (although there will be some examples), the other blogpost is focused on a Python implementation of everything discussed here.

Alright then, let’s start from what you need in order to really understand backpropagation. Aside from some basic concepts in machine learning (like knowing what a loss function is and how the gradient descent algorithm works), from a mathematical standpoint there are two main ingredients that you will need:

some linear algebra (in particular, matrix multiplication)

the chain rule for multivariate functions

If you know these concepts, what follows should be fairly easy. If you don’t master them yet, you should still be able to understand the reasoning behind the backpropagation algorithm. You may also want to skip directly to the Python implementation and come back to this blogpost when needed.

First of all, I would like to start with defining what the backpropagation algorithm really is. If this definition will not make much sense at the beginning, it should make much more sense later on.

1. What is the backpropagation algorithm?

Given a neural network, the parameters that the algorithm needs to learn in order to minimize the loss function are the weights of the network (N.B. I use the term weights in a loose sense, meaning the bias terms as well). These are the only variables of the model, which are tweaked at every iteration until we get close to the minimum of the loss function.

In this context, backpropagation is an efficient algorithm that is used to find the optimal weights of a neural network: those that minimize the loss function. The standard way of finding these values is by applying the gradient descent algorithm, which implies finding out the derivatives of the loss function with respect to the weights.

For trivial problems, those in which there are only two variables, it is easy to visualize how gradient descent works. If you look at the figure above, you will see in panel (a) a three-dimensional plot of the loss function as a function of the weights $w_1$ and $w_2$. At the beginning, we don’t know what their optimal values are, i.e., we don’t know which values of $w_1$ and $w_2$ minimize the loss function. Let’s say that we start from the red point. If we know how the loss function varies as we vary the value of the weights, i.e., if we know the derivatives $\partial\mathcal{L}/\partial w_1$ and $\partial\mathcal{L}/\partial w_2$, then we can move from the red point to a point closer to the minimum of the loss function, which is represented by a blue point in the figure. How far we move from the starting point is dictated by a parameter $\eta$, usually called the learning rate.
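In code, a single gradient-descent update is one line per weight (a generic sketch, not tied to any particular model):

```python
def gradient_descent_step(w, grad, eta):
    """One gradient-descent update: move each weight against its gradient,
    with the step size controlled by the learning rate eta."""
    return [wi - eta * gi for wi, gi in zip(w, grad)]
```

Repeating this step moves the weights from the red point toward the blue point in the figure, one small displacement at a time.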

A big part of the backpropagation algorithm requires evaluating the derivatives of the loss function with respect to the weights. It is particularly instructive to see this for shallow neural networks, which is the case of word2vec.

2. Word2Vec

The objective of word2vec is to find word embeddings, given a text corpus. In other words, this is a technique for finding low-dimensional representations of words. As a consequence, when we talk about word2vec we are typically talking about Natural Language Processing (NLP) applications.

For example, a word2vec model trained with a 3-dimensional hidden layer will result in 3-dimensional word embeddings. It means that, say, the word “apartment” will be represented by a three-dimensional vector of real numbers that will be close (think of it in terms of Euclidean distance) to a similar word such as “house”. Put another way, word2vec is a technique for mapping words to numbers. Amazing eh?

There are two main models that are used within the context of word2vec: the Continuous Bag-of-Words (CBOW) and the Skip-gram model. We will be looking first at the simplest model, which is the Continuous Bag-of-Words (CBOW) model with a one-word window. We will then move to the multi-word window case and finally introduce the Skip-gram model.

As we move along, I will present a few small examples with a text composed of only a few words. However, keep in mind that word2vec is typically trained with billions of words.

3. Continuous Bag Of Words (CBOW) single-word model in word2vec

In the CBOW model the objective is to find a target word $\hat{y}$, given a context of words (what this means will be clear soon). In the simplest case in which the word’s context is only represented by a single word, the neural network for the CBOW model looks like in the figure below,

As you can see, there is one input layer, one hidden layer and finally an output layer. The activation function for the hidden layer is the identity $a=\mathbb{1}$ (usually, although improperly, called linear activation function). The activation function for the output is a softmax, $a=\mathbb{S}\textrm{oftmax}$.

The input layer is represented by a one-hot encoded vector $\textbf{x}$ of dimension V, where V is the size of the vocabulary. The hidden layer is defined by a vector of dimension N. Finally, the output layer is a vector of dimension V.

Now, let’s talk about weights: the weights between the input and the hidden layer are represented by a matrix $W$, of dimension $V\times N$. Similarly, the weights between the hidden and the output layer are represented by a matrix $W’$, of dimension $N\times V$. For example, as in the figure, the relationship between an element $x_k$ of the input layer and an element $h_i$ of the hidden layer is represented by the weight $W_{ki}$. The connection between this node $h_i$ and an element $y_j$ of the output layer is represented by an element $W'_{ij}$.

The output vector $\textbf{y}$ will need to be compared against the expected target $\hat{\textbf{y}}$. The closer $\textbf{y}$ is to $\hat{\textbf{y}}$, the better the performance of the neural network (and the lower the loss function).

If some of what you read so far sounds confusing, that’s because it always is! Have a look at the example below, it might clear things up!

Example. Suppose we want to train word2vec with the following text corpus:$$\textrm{“I like playing football”}$$ We decide to train a simple CBOW model with one context word, just as in Figure 2 above.

Given the text we have, our vocabulary will only be made of 4 words (hence, $V=4$). Also, we decide to only have two nodes in our hidden layer (hence, $N=2$). Our neural network will look like the one in the figure below, with the vocabulary being (in order of appearance):$$\textrm{Vocabulary}=[\textrm{“I”}, \textrm{“like”}, \textrm{“playing”}, \textrm{“football”}]$$

Next, we define as “target word” the word that follows a given word in the text (which becomes our “context word”). We can now construct our training examples, scanning the text with a window composed of a context word + a target word. For example, the target word for the context word “like” will be the word “playing”. Here is how our training data now look:

In order to feed these data into an algorithm, we need to transform them into numbers, using a one-hot encoding. For example the word “I”, which appears first in the vocabulary, will be encoded as the vector $[1, 0, 0, 0]$. The word “like”, which appears second in the vocabulary, will be encoded as the vector $[0, 1, 0, 0]$. The same goes for the target words, so the overall set of context-target words will be encoded as in the following table.

At this point we want to train our model, finding the weights that minimize the loss function. This corresponds to finding the weights that, given a context vector, can predict with the highest accuracy the corresponding target word $\hat{y}$.
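To make this concrete, here is a small numpy sketch (variable names are mine, not from any reference implementation) that builds the one-hot encoded context-target pairs for this toy corpus:

```python
import numpy as np

corpus = ["I", "like", "playing", "football"]
vocab = list(dict.fromkeys(corpus))      # vocabulary, in order of appearance
V = len(vocab)                           # V = 4

def one_hot(word):
    """One-hot encoding of `word` as a length-V vector."""
    vec = np.zeros(V)
    vec[vocab.index(word)] = 1.0
    return vec

# (context, target) pairs: the target is the word that follows the context
training_pairs = [(one_hot(corpus[i]), one_hot(corpus[i + 1]))
                  for i in range(len(corpus) - 1)]

print(one_hot("I"))           # [1. 0. 0. 0.]
print(one_hot("like"))        # [0. 1. 0. 0.]
print(len(training_pairs))    # 3
```

Note how the whole training set for this four-word corpus is just three pairs of one-hot vectors.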

3.1 Loss function

Given the topology of the network in Figure 1, let’s write down how to find the values of the hidden layer and of the output layer, given the input data $\textbf{x}$:$$\begin{eqnarray}
\textbf{h} & = & W^T\textbf{x} \nonumber \\
\textbf{u} & = & W'^T\textbf{h} \nonumber \\
\textbf{y} & = & \mathbb{S}\textrm{oftmax}(\textbf{u}) \nonumber
\end{eqnarray}$$

where, following [1], $\textbf{u}$ is the value of the output before applying the $\mathbb{S}\textrm{oftmax}$ function.

Now, let’s say that we are training the model against the target-context word pair ($w_t, w_c$). The target word represents the ideal prediction from our neural network model, given the context word $w_c$. It is represented by a one-hot encoded vector. Say that it has value 1 at the position $j^*$ (and the value 0 for any other position).

The loss function will need to evaluate the output layer at the position $j^*$, or $y_{j^*}$ (the ideal value being equal to 1). Since the values in the softmax can be interpreted as conditional probabilities of the target word, given the context word $w_c$, we write the loss function as$$\begin{equation}
\mathcal{L} = -\log\mathbb{P}(w_t|w_c) = -u_{j^*} + \log\sum_{k=1}^V\exp{(u_k)}\ .
\label{eq:loss}
\end{equation}$$

The loss function (\ref{eq:loss}) is the quantity we want to minimize, given our training example, i.e., we want to maximize the probability that our model predicts the target word, given our context word.

Example. Let’s go back to our previous example of the sentence “I like playing football”. Suppose we are training the model against the first training data point: the context word is “I” and the target word is “like”. Ideally, the values of the weights should be such that when the input $\textbf{x}=(1, 0, 0, 0)$ - which corresponds to the word “I” - then the output will be close to $\hat{\textbf{y}}=(0, 1, 0, 0)$ - which corresponds to the word “like”.

As is standard for word2vec and many neural networks, we can initialize the weights $W$ and $W’$ from a standard normal distribution. For the sake of it, let’s say that the initial state of $W$, which is a $4\times 2$ matrix, is
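As an illustration of this topology, here is a minimal numpy sketch of the forward pass - identity activation on the hidden layer, softmax on the output - with both weight matrices drawn from a standard normal distribution (the seed and the dimensions are arbitrary choices of mine):

```python
import numpy as np

rng = np.random.default_rng(0)
V, N = 4, 2                              # vocabulary and hidden-layer sizes

# Both weight matrices drawn from a standard normal distribution
W = rng.standard_normal((V, N))          # input -> hidden, V x N
Wp = rng.standard_normal((N, V))         # hidden -> output, N x V

def softmax(u):
    """Numerically stable softmax."""
    e = np.exp(u - u.max())
    return e / e.sum()

def forward(x):
    """Forward pass: h = W^T x, u = W'^T h, y = softmax(u)."""
    h = W.T @ x
    u = Wp.T @ h
    return h, u, softmax(u)

x = np.array([1.0, 0.0, 0.0, 0.0])       # one-hot vector for "I"
h, u, y = forward(x)
print(y)                                 # a probability vector over the 4 words
```

Whatever the initial weights, the output $\textbf{y}$ is a valid probability distribution over the vocabulary, since the softmax normalizes it to sum to one.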

3.2 The backpropagation algorithm for the CBOW model

Now that we have an expression for the loss function, eq. (\ref{eq:loss}), we want to find the values of $W$ and $W’$ that minimize it. In the machine learning lingo, we want our model to “learn” the weights.

As we have seen in Section 1, in the neural networks world this optimization problem is usually tackled using gradient descent. Figure 1 shows how, in order to apply this method and update the weight matrices $W$ and $W’$, we need to find the derivatives $\partial \mathcal{L}/\partial{W}$ and $\partial \mathcal{L}/\partial{W’}$.

I believe the easiest way to understand how to do this is to simply write down the relationship between the loss function and $W$, $W’$. Looking again at eq. (\ref{eq:loss}), it is clear that the loss function depends on the weights $W$ and $W’$ through the variable $\textbf{u}=[u_1, u_2, \dots, u_V]$, or, by the chain rule,$$\begin{equation}
\frac{\partial\mathcal{L}}{\partial W'_{ij}} = \sum_{k=1}^V\frac{\partial\mathcal{L}}{\partial u_k}\frac{\partial u_k}{\partial W'_{ij}}
\label{eq:dLdWp}
\end{equation}$$$$\begin{equation}
\frac{\partial\mathcal{L}}{\partial W_{ij}} = \sum_{k=1}^V\frac{\partial\mathcal{L}}{\partial u_k}\frac{\partial u_k}{\partial W_{ij}}
\label{eq:dLdW}
\end{equation}$$

And that’s it, this is pretty much the backpropagation algorithm! At this point we just need to specify eqs. (\ref{eq:dLdWp}) and (\ref{eq:dLdW}) for our use case.

Let’s start from eq. (\ref{eq:dLdWp}). Note that the weight $W’_{ij}$, which is an element of the matrix $W’$ and connects a node $i$ of the hidden layer to a node $j$ of the output layer, only affects the output score $u_j$ (and also $y_j$), as seen in the panel (a) of the figure below,

Hence, among all the derivatives $\partial u_k/\partial W'_{ij}$, only the one where $k=j$ will be different from zero. In other words,$$\begin{equation}
\frac{\partial\mathcal{L}}{\partial W'_{ij}} = \frac{\partial\mathcal{L}}{\partial u_j}\frac{\partial u_j}
{\partial W'_{ij}}
\label{eq:derivative#1}
\end{equation}$$

Let’s now calculate $\partial \mathcal{L}/\partial u_j$. We have$$\begin{equation}
\frac{\partial\mathcal{L}}{\partial u_j} = -\delta_{jj^*} + y_j := e_j
\label{eq:term#1}
\end{equation}$$where $\delta_{jj^*}$ is a Kronecker delta: it is equal to 1 if $j=j^*$, otherwise it is equal to zero. In eq. (\ref{eq:term#1}) we have introduced the vector $\textbf{e} \in \mathbb{R}^V$, which is used to reduce the notational complexity. This vector represents the difference between the target (label) and the predicted output, i.e., it is the prediction error vector.For the second term on the right hand side of eq. (\ref{eq:derivative#1}), we have$$\begin{equation}
\frac{\partial u_j}{\partial W'_{ij}} = \sum_{k=1}^V W_{ki}x_k
\label{eq:term#2}
\end{equation}$$After inserting eqs. (\ref{eq:term#1}) and (\ref{eq:term#2}) into eq. (\ref{eq:derivative#1}), we obtain$$\begin{equation}
\bbox[white,5px,border:2px dotted red]{
\frac{\partial\mathcal{L}}{\partial W'_{ij}} = (-\delta_{jj^*} + y_j) \left(\sum_{k=1}^V W_{ki}x_k\right)
}
\label{eq:backprop1}
\end{equation}$$

We can go through a similar exercise for the derivative $\partial\mathcal{L}/\partial W_{ij}$, however this time we note that after fixing the input $x_k$, the output $y_j$ at node $j$ depends on all the elements of the matrix $W$ that are connected to the input, as seen in Figure 3b. Therefore, this time we have to retain all the elements in the sum. Before going into the evaluation of $\partial u_k/\partial W_{ij}$, it is useful to write down the expression for the element $u_k$ of the vector $\textbf{u}$ as$$\begin{equation*}
u_k = \sum_{m=1}^N\sum_{l=1}^VW'_{mk}W_{lm}x_l\ .
\end{equation*}$$From this equation it is then easy to write down $\partial u_k/\partial W_{ij}$, since the only term that survives from the derivation will be the one in which $l=i$ and $m=j$, or$$\begin{equation}
\frac{\partial u_k}{\partial W_{ij}} = W'_{jk}x_i\ .
\label{eq:term#3}
\end{equation}$$Finally, substituting eqs. (\ref{eq:term#1}) and (\ref{eq:term#3}) we get our well deserved result:$$\begin{equation}
\bbox[white,5px,border:2px dotted red]{
\frac{\partial \mathcal{L}}{\partial W_{ij}} = \sum_{k=1}^V (-\delta_{kj^*}+y_k)W'_{jk}x_i
}
\label{eq:backprop2}
\end{equation}$$

Vectorization. We can simplify the notation of eqs. (\ref{eq:backprop1}) and (\ref{eq:backprop2}) by using a vector notation. Doing so, we obtain for eq. (\ref{eq:backprop1})$$\begin{equation}
\bbox[white,5px,border:2px dotted red]{
\frac{\partial\mathcal{L}}{\partial W'} = (W^T\textbf{x}) \otimes \textbf{e}
}
\end{equation}$$and for eq. (\ref{eq:backprop2})$$\begin{equation}
\bbox[white,5px,border:2px dotted red]{
\frac{\partial\mathcal{L}}{\partial W} = \textbf{x} \otimes (W'\textbf{e})
}
\end{equation}$$where the symbol $\otimes$ denotes the outer product, which in Python can be obtained using the numpy.outer method.

3.3 Applying these results to gradient descent

Now that we have eqs. (\ref{eq:backprop1}) and (\ref{eq:backprop2}), we have all the ingredients needed to go through our first “learning” iteration, that consists in applying the gradient descent algorithm. This should have the effect of getting us a little closer to the minimum of the loss function. In order to apply gradient descent, after fixing a learning rate $\eta>0$ we can update the values of the weight vectors by using the equations$$\begin{eqnarray}
W_{\textrm{new}} & = W_{\textrm{old}} - \eta \frac{\partial \mathcal{L}}{\partial W} \nonumber \\
W'_{\textrm{new}} & = W'_{\textrm{old}} - \eta \frac{\partial \mathcal{L}}{\partial W'} \nonumber \\
\end{eqnarray}$$

3.4 Iterating the algorithm

What we have seen so far is only one small step of the entire optimization process. In particular, up to this point we have only trained the neural network with one single training example. In order to conclude the first pass, we have to go through all our training examples. Once we do this, we will have gone through a full epoch of optimization. At that point, most likely we will need to start again iterating through all our training data, until we reach a point in which we don’t observe big changes in the loss function. At that point, we can stop and declare that our neural network has been trained!
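Putting the pieces together, a full (toy) training loop over epochs might look like the sketch below. The learning rate and the number of epochs are arbitrary choices for this illustration, not values recommended by the original papers:

```python
import numpy as np

# Toy corpus and one-hot training pairs, as in the earlier example
corpus = ["I", "like", "playing", "football"]
vocab = list(dict.fromkeys(corpus))
V, N, eta = len(vocab), 2, 0.1           # eta is an arbitrary learning rate

def one_hot(word):
    v = np.zeros(V)
    v[vocab.index(word)] = 1.0
    return v

pairs = [(one_hot(corpus[i]), one_hot(corpus[i + 1]))
         for i in range(len(corpus) - 1)]

rng = np.random.default_rng(0)
W = rng.standard_normal((V, N))          # input -> hidden
Wp = rng.standard_normal((N, V))         # hidden -> output

def softmax(u):
    e = np.exp(u - u.max())
    return e / e.sum()

def epoch_loss():
    """Total -log P(target | context) over all training examples."""
    return sum(-np.log(softmax(Wp.T @ (W.T @ x)) @ t) for x, t in pairs)

loss_before = epoch_loss()
for _ in range(100):                     # iterate over several epochs
    for x, t in pairs:                   # one gradient-descent step per example
        h = W.T @ x
        e = softmax(Wp.T @ h) - t        # prediction error vector
        W -= eta * np.outer(x, Wp @ e)   # update with dL/dW
        Wp -= eta * np.outer(h, e)       # update with dL/dW'
print(loss_before, "->", epoch_loss())
```

After a few epochs the total loss over the training set drops, which is exactly the “big changes in the loss function” signal described above flattening out.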

4. The backpropagation algorithm for the multi-word CBOW model

We know at this point how the backpropagation algorithm works for the one-word word2vec model. It is time to add an extra complexity by including more context words. Figure 4 shows how the neural network now looks. The input is given by a series of one-hot encoded context words, their number $C$ depending on our choice (how many context words we want to use). The hidden layer becomes an average of the values obtained from each context word.

Note how these equations are identical to the ones for the single-word CBOW model, the only difference being that we have replaced a single input vector with an averaged one.
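Because the hidden layer is linear, averaging the $C$ context vectors before the projection is equivalent to averaging the individual hidden representations. A quick numpy check, with illustrative context vectors of my choosing:

```python
import numpy as np

rng = np.random.default_rng(0)
V, N = 4, 2
W = rng.standard_normal((V, N))

# C = 2 illustrative one-hot context vectors ("I" and "playing")
contexts = [np.array([1.0, 0.0, 0.0, 0.0]), np.array([0.0, 0.0, 1.0, 0.0])]

x_bar = np.mean(contexts, axis=0)        # averaged input vector
h = W.T @ x_bar                          # hidden layer of the multi-word model

# Same as averaging the single-word hidden representations:
h_avg = np.mean([W.T @ x for x in contexts], axis=0)
print(np.allclose(h, h_avg))             # True, by linearity
```

This is why the single-word backpropagation formulas carry over unchanged: only the input vector is replaced by its average.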

5. The backpropagation algorithm for the Skip-gram model

The Skip-gram model is essentially the inverse of the CBOW model. The input is a center word and the output predicts the context words, given the center word. The resulting neural network looks like the one in the figure below,

6. What’s Next

We have seen in detail how the backpropagation algorithm works for the word2vec use case. However, it turns out that the implementations we have seen are not computationally efficient for large text corpora. The original paper [2] introduced some “tricks” to overcome this difficulty (Hierarchical Softmax and Negative Sampling), but I am not going to cover them here. You can find a good explanation in [1].

Despite being computationally inefficient, the implementations discussed here contain everything that is needed in order to train word2vec neural networks. The next step would then be to implement these equations in your favourite programming language. If you like Python, I have already implemented these equations for you. I discuss them in <a href="/2018/01/07/backprop-word2vec-python/">my next blogpost</a>. Maybe I’ll see you there!

Bayesian A/B Testing: a step-by-step guide
http://www.claudiobellei.com/2017/11/02/bayesian-AB-testing/
2017-11-02

This article is aimed at anyone who is interested in understanding the details of A/B testing from a Bayesian perspective. It is accompanied by a Python project on Github, which I have named aByes (I know, I could have chosen something different from the anagram of Bayes…) and will give you access to a complete set of tools to do Bayesian A/B testing on conversion rate experiments.

The blog post builds from the works of John K. Kruschke, Chris Stucchio and Evan Miller. I believe it gives a comprehensive and critical view on the subject. More importantly, it gives some practical examples that make use of aByes. If you are in a hurry and are only interested in the tool, you can skip the boring part and go directly to the case study section. However, if some definitions are not clear I am afraid you will have to go through some of the boring sections. So get your cup of coffee and keep reading :)

Before starting, it is worth mentioning that the topic of A/B testing has seen increased interest over the past few years (so… well done for reading this blog post!). This is obvious from the figure below, showing how the popularity of the search query “AB Testing” in Google Trends has grown linearly for at least the past five years.

Truth is, every web-company with a big enough user base does A/B testing (or, at least, it should), starting from the big giants like Google, Amazon or Facebook. The general public learnt (and quickly forgot) about A/B testing when in 2014 Facebook released a paper showing a contagion effect of emotion on its News Feed [1], which generated a wave of indignation across the web (for example, see this article from The Guardian).

For anyone interested in introducing these optimization practices (i.e., A/B testing) in their company, there are commercial tools that can help. Some of the most popular ones are Optimizely and Virtual Website Optimizer (VWO). These two tools follow two different viewpoints for doing inferential statistics: Optimizely uses a Frequentist approach, while VWO uses a Bayesian approach.

Here I am not going to digress on the differences between Frequentism and Bayesianism (personally I don’t have a strong preference against one or the other). From now on, we will simply deep dive into the A/B testing world - as seen by a Bayesian.

The main steps needed for doing Bayesian A/B testing are three:

1. Collect the data for the experiment;
2. Compare the different variants by applying Bayes’ Theorem;
3. Decide whether or not the experiment has reached a statistically significant result and can be stopped.

These three steps are visualized in the flow chart below, which gives a more complete view of the different choices that a practitioner needs to make. The meaning of these options (analytic/MCMC solution and ROPE/Expected Loss decision rule) will become clear as you keep reading this blog post.

And now, let’s discuss each of these steps individually.

1. Collect the Data

This is the “engineering” step of the lot. As such, I am only going to briefly touch on it.

It is obvious that collecting data is the first thing that should be developed in the experimental pipeline. Users should be randomized in the “A” and “B” buckets (often called the “Control” and “Treatment” buckets). The client (the browser) should send events, typically click events, to a specific endpoint that should be accessible for our analysis. The endpoint could be a database hosted on the backend, or more advanced solutions that are possible with cloud services such as AWS.

If I were to start from scratch I would consider using a backend tool such as Facebook PlanOut for randomizing the users and serving the different variants to the client. I would then use a data warehouse such as Amazon Redshift for storing all the event logs.

It is important to make sure that users are always exposed to the same variant when they access the website during the experiment, otherwise the experiment itself would be invalidated. This requirement can be ensured by using cookies. Alternative solutions are possible if the users can be uniquely identified (for example if they are logged in on the website).
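One simple way to guarantee a sticky, deterministic assignment - an alternative (or complement) to cookies when users can be uniquely identified - is to hash the user id together with an experiment name. This is a generic sketch, not tied to PlanOut or any specific tool; all names are illustrative:

```python
import hashlib

def assign_bucket(user_id, experiment="exp-001", buckets=("A", "B")):
    """Deterministically assign a user to a bucket by hashing the user id.

    Salting with the experiment name gives independent randomizations
    across experiments, while the same (experiment, user) pair always
    maps to the same bucket.
    """
    digest = hashlib.md5(f"{experiment}:{user_id}".encode()).hexdigest()
    return buckets[int(digest, 16) % len(buckets)]

# The same user always lands in the same bucket during the experiment:
print(assign_bucket("user-42"), assign_bucket("user-42"))
```

Because the assignment is a pure function of the inputs, it can be recomputed on any server without shared state, which is what makes the experiment reproducible.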

2. Apply Bayes’ theorem

Once we have begun collecting the data and, for each click event, have logged at least the type of event (for example, whether it is a click on a “sign in” or on a “register” button), the unique id of the user and the variant in which he/she was bucketed (let’s say A or B), we can start our analysis.

Every time Bayesian methods are applied, it is always useful to write down Bayes’ theorem. After all, it is so simple that it only requires a minimal amount of effort to remember it.

Assume that $H$ is some hypothesis to be tested, typically the parameters of our model (for our purposes, the conversion rates of our A and B buckets), while $\textbf{d}$ is the data collected during our experiment. Then Bayes’ theorem reads$$\begin{equation}
\mathbb{P}(H|\textbf{d}) = \frac{\mathbb{P}(\textbf{d}|H)\mathbb{P}(H)}{\mathbb{P}(\textbf{d})}
\label{eq:Bayes}
\end{equation}$$

What do the different terms mean?

- $\mathbb{P}(H|\textbf{d})$: posterior distribution. It is the quantity that we are interested in quantifying. It is the probability that some specific hypothesis is true, after having collected the data $\textbf{d}$. In terms of our A/B test reasoning, $H$ will be the set of all the possible outcomes of the test. For example, that the conversion rate for A is better than for B by +0.1%, +0.2%, -0.1%, -0.2%, … and every point in between and beyond. As you can imagine, the set of all possible outcomes is infinite, which means that $\mathbb{P}(H|\textbf{d})$ will be represented by a continuous distribution.
- $\mathbb{P}(\textbf{d}|H)$: likelihood function. It represents how likely it is to observe the data we have collected, for a given hypothesis of the model. Again, because $H$ is an infinite set of possibilities, $\mathbb{P}(\textbf{d}|H)$ is a continuous distribution.
- $\mathbb{P}(H)$: prior distribution. It encodes our information on the model, before observing the data. For example, we may have prior information (through an AA test) that the average conversion rate for A should be centered around 10% with a standard deviation of 1%. Then, our prior for A would be represented by the appropriate normal distribution. Since we don’t have information on B, our best bet would be to assume that B behaves the same as A. In practical terms, this means that the Python project aByes initialises A and B with the same prior.
- $\mathbb{P}(\textbf{d})$: evidence. It is a measure of how well our model fits the data (how good our model is). For continuous distributions, the evidence can be evaluated as $$\mathbb{P}(\textbf{d})=\int \mathbb{P}(\textbf{d}|H)\mathbb{P}(H)\textrm{d}H.$$ The evidence often involves doing multi-dimensional integrals and is the most difficult term to evaluate.

Normally, it is very easy to calculate the posterior distribution $\mathbb{P}(H|\textbf{d})$ up to its normalizing constant. In other words, it is usually easy to calculate the terms in the numerator of Bayes’ theorem. What is difficult to do is to evaluate the evidence $\mathbb{P}(\textbf{d})$. While we don’t always care about it, in many cases we are actually interested in it. It is only by knowing its normalizing constant that we can make the posterior distribution an actual probability distribution (that integrates to one), which we can then use to calculate any other quantities of interest (usually called “moments” of the distribution function).

Unfortunately, for an arbitrary choice of the prior distribution $\mathbb{P}(H)$ it is normally only possible to calculate the posterior distribution - including its normalizing constant - through numerical calculations. In particular, it is common to use Markov Chain Monte Carlo methods. However, for specific types of models and for specific choices of the prior it turns out that the posterior distribution can be calculated analytically. Luckily, it is possible to do so for the analysis of A/B experiments.

To summarize, for A/B testing we have two possible choices for finding out the posterior distribution: 1. analytic method; 2. numerical method. Let’s discuss them in some detail.

2.1 Option 1: Analytic solution

It is often believed that Bayesian methods require more computational power than Frequentist methods. This is a fairly accurate statement, except that in some cases it is possible to solve the Bayesian problem analytically - making the Bayesian solution on par, in terms of speed, with the Frequentist approach.

In order to be able to solve this problem analytically, the prior distributions have to be chosen carefully depending on the form of the likelihood function. For a standard “conversion rate” A/B test, the likelihood function has a particularly simple form: a Bernoulli distribution.

Likelihood function

Most A/B tests involve comparing conversions for users in the Control bucket vs. users in the Treatment bucket. This means that the data will contain, for each user, whether he/she converted or not during the duration of the experiment, like so:

Now let’s call the two arrays of binary values $\textbf{d}_A$ for the Control group and $\textbf{d}_B$ for the Treatment group. Furthermore, let’s say that the Control group has a probability $p_A$ of converting, while the Treatment group has a probability $p_B$ of doing so. Then, the joint likelihood $\mathbb{P}(\textbf{d}|H)$ that enters into Bayes’ theorem can be evaluated as

$$\begin{equation}
\mathbb{P}(\textbf{d}|H)=\mathbb{P}(\textbf{d}_A,\textbf{d}_B|p_A,p_B)=\mathbb{P}(\textbf{d}_A|p_A)\mathbb{P}(\textbf{d}_B|p_B)=p_A^{c_A}(1-p_A)^{n_A-c_A}p_B^{c_B}(1-p_B)^{n_B-c_B}
\label{eq:likelihood}
\end{equation}$$where we have assumed that the results in the A and B buckets are independent, $c_A$ and $c_B$ denote the number of conversions, and $n_A$ and $n_B$ the number of users in the two buckets. The quantity in eq. (\ref{eq:likelihood}) represents the probability of observing the data $\textbf{d}_A$ and $\textbf{d}_B$, assuming that the probabilities of converting are $p_A$ and $p_B$.
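As a sanity check on eq. (\ref{eq:likelihood}), here is a small numpy sketch that evaluates the joint log-likelihood for two illustrative arrays of conversion outcomes (the data and names are mine):

```python
import numpy as np

def log_likelihood(d_A, d_B, p_A, p_B):
    """Log of the joint Bernoulli likelihood, assuming independent buckets."""
    c_A, n_A = np.sum(d_A), len(d_A)     # conversions / users in bucket A
    c_B, n_B = np.sum(d_B), len(d_B)     # conversions / users in bucket B
    return (c_A * np.log(p_A) + (n_A - c_A) * np.log(1 - p_A)
            + c_B * np.log(p_B) + (n_B - c_B) * np.log(1 - p_B))

d_A = np.array([1, 0, 0, 1, 0])          # 2 conversions out of 5
d_B = np.array([1, 1, 0, 1, 0])          # 3 conversions out of 5
print(log_likelihood(d_A, d_B, 0.4, 0.6))
```

Working in log space avoids underflow: with thousands of users the raw product of per-user probabilities would be far below machine precision.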

Choosing the priors

For specific distributions of the likelihood it is possible to find specific forms of the priors that make the subsequent evaluation of Bayes’ theorem very simple. For the purpose of A/B testing, given the expression (\ref{eq:likelihood}) this “specific form” of the prior is represented by the Beta distribution, a continuous distribution defined as a function of $0\leq x \leq 1$ with parameters $\alpha>0$ and $\beta>0$:$$\begin{equation}
f(x;\alpha,\beta) = \frac{\Gamma(\alpha+\beta)}{\Gamma(\alpha)\Gamma(\beta)}\,x^{\alpha-1}(1-x)^{\beta-1}
\label{eq:prior}
\end{equation}$$

This will sound rather obscure, but the point is that given our choice of the prior eq. (\ref{eq:prior}) it turns out that the posterior distribution is also a Beta distribution! Because the prior and posterior distributions belong to the same family of Beta distributions, the prior of eq. (\ref{eq:prior}) is called a conjugate prior of the likelihood function eq. (\ref{eq:likelihood}).

We will see just below the exact expression of the posterior distribution.
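If you want to play with the Beta distribution, scipy.stats provides it directly. Here is a quick sanity check on an illustrative Beta(3, 10) prior (parameter values of my choosing): the density integrates to one and its mean is $\alpha/(\alpha+\beta)$:

```python
from scipy.integrate import quad
from scipy.stats import beta

# Illustrative prior parameters (not from any particular experiment)
alpha_0, beta_0 = 3, 10
prior = beta(alpha_0, beta_0)

area, _ = quad(prior.pdf, 0, 1)          # the density integrates to one
print(area)
print(prior.mean())                      # alpha / (alpha + beta) = 3/13
```

A mean of about 0.23 with most mass below 0.5 is the kind of shape one would want as a prior for a low conversion rate.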

Finding the posterior distribution

Given our expressions for the likelihood eq. (\ref{eq:likelihood}) and for the priors eq. (\ref{eq:prior}), it is possible to show that the posterior distribution has the following form:

$$\begin{equation}
\bbox[lightblue,5px,border:2px solid red]{
\mathbb{P}(H|\textbf{d}) = f(x;\alpha + c_A, \beta + (n_A - c_A))f(x;\alpha + c_B, \beta + (n_B - c_B))
}
\label{eq:posterior_analytic}
\end{equation}$$We will see in a moment that this expression agrees perfectly well with the numerical solution (or, rather, the numerical solution agrees with the analytic one!).

2.2 Option 2: Numerical solution

I have talked fairly extensively about pyMC3 in my previous blog post on Bayesian changepoint detection. But fear not, dear reader: there is no need to go through that lengthy blog post to understand how to use pyMC3 for A/B testing. In fact, there is already a pretty good discussion on CrossValidated that has partially inspired this paragraph.

To optimize your learning time, the best thing to do is to look at a Python example. I invite you to copy-paste the code below in a Jupyter notebook and see what the solution looks like.

What is this plot telling us? We should first look at the right column. These are the results of the MCMC iterations. The first iterations have been filtered out, that’s where you would see that the algorithm has not yet converged. Instead, in all the plots on the right column you can see that the MCMC solution does not change much, it fluctuates around a mean value. In other words, we can confidently say that the algorithm has converged to its stationary solution (the posterior distribution) and is now randomly sampling from the posterior distribution.

Now, what about the left column? These are simply the histograms of the data in the right column. If, as discussed, we believe the data on the right are all representative of the posterior distribution (i.e., the MCMC algorithm has converged) then these histograms are the answers to our problem: the posterior distribution of the model parameters $p_A, p_B$ and $\Delta$.

Here is the problem that the Python code above solves: suppose that we do an A/B test in which we look at conversion rates. So it happens that the true probability of conversion for the A group is pA_true=0.04 and the one for the B group is pB_true=0.05.

We obviously don’t know what these true probabilities of conversion are, which is the reason for doing an A/B test in the first place… Okay, so let’s say that we collect data for 2500 unique users in the A group and 1750 in the B group. Let’s say that we are completely in the dark at the beginning, so we choose a uniform prior distribution for both groups.

As explained in the paragraph 2.1, the likelihoods are Bernoulli distributions and this is how I have modeled them in pyMC3.

Note that delta is the distribution of the difference in the mean between group A and group B. Line 22 in the code above is actually redundant, as delta could be simply calculated from the trace of p_A and p_B. To convince yourself that this is the case, you can plot two histograms with the commands plt.hist(trace['delta'], bins=100) and plt.hist(trace['p_B'] - trace['p_A'], bins=100): you will observe that the two histograms are identical.

Now… given the information we have how can we actually find out the joint posterior distribution $\mathbb{P}(H|\textbf{d})=\mathbb{P}(p_A,p_B|\textbf{d})$, which represents the left-hand-side in Bayes’ theorem? Although it is not really needed to calculate it, we can just make a two-dimensional histogram of the traces for pA and pB, like so:

with the result displayed in the following figure:

2.3 Analytic vs. Numerical Solution

The analytic and numerical solutions both have their advantages and disadvantages. The former is fast and stable. The latter is more flexible, in that it allows for any choice of the prior distribution. However, the algorithm is much slower and requires some extra care, such as making sure that the numerical solution has reached convergence.

Overall, I think it makes a lot of sense to use the analytic solution for A/B testing on conversion rates. Even in cases in which we would want to use a specific prior for various reasons (for example after an A/A test, the results of which could be used as priors for a subsequent A/B test), the beta distribution would be flexible enough to be able to “fit” most distributions.

Here I am going to show how the analytic expression for the posterior eq. (\ref{eq:posterior_analytic}) and the numerical MCMC solutions agree. This gives us confidence that either of these methods, as described in this blog post, can be chosen.

The case I am going to discuss is the following. Suppose we start an A/B experiment on conversion rates. The prior distributions for both groups are based on a previous AA experiment and are best fitted by a Beta distribution with $\alpha=3$ and $\beta=10$. We start the experiment and we gather data from 200 users in group A and 200 users also in group B. For this particular case, the data are sampled from a distribution with the (unknown) true conversion probabilities of 0.05 and 0.07 for groups A and B, respectively. The experimental data show 5 conversions out of 200 for group A and 17 conversions out of 200 for group B. Given these data, what do the analytic and numerical solutions look like?

First of all, in order to find the analytic solution we should apply eq. (\ref{eq:posterior_analytic}) with parameters $\alpha=3$, $\beta=10$, $c_A=5$, $n_A=200$, $c_B=17$, $n_B=200$.
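Thanks to conjugacy, applying the analytic result is a couple of lines of scipy. The sketch below plugs in these numbers and additionally draws Monte Carlo samples from the two posteriors to estimate $\mathbb{P}(p_B > p_A)$ (the sampling step is my addition, not part of the analytic formula):

```python
import numpy as np
from scipy.stats import beta

alpha_0, beta_0 = 3, 10                  # prior from the AA test
c_A, n_A, c_B, n_B = 5, 200, 17, 200     # observed conversions / users

# Marginal posteriors, each again a Beta distribution by conjugacy
post_A = beta(alpha_0 + c_A, beta_0 + n_A - c_A)   # Beta(8, 205)
post_B = beta(alpha_0 + c_B, beta_0 + n_B - c_B)   # Beta(20, 193)

# Monte Carlo estimate of P(p_B > p_A) from the analytic posteriors
samples_A = post_A.rvs(size=100_000, random_state=1)
samples_B = post_B.rvs(size=100_000, random_state=2)
print(post_A.mean(), post_B.mean())
print(np.mean(samples_B > samples_A))    # close to 1: B is very likely better
```

Note that the posterior means (about 0.038 and 0.094) are pulled away from the raw conversion rates by the shared prior, which is exactly what the prior is supposed to do with only 200 users per group.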

It turns out that the posterior distributions agree very well, as expected (and hoped for). The figure below shows the prior distribution of the conversion rate for the two groups as a dotted line (same prior for the two groups). The red and blue lines, both solid and dashed, show that the pyMC3 solution agrees very well with the theoretical one.

At this point we can assume that, either analytically or numerically, we have found our posterior distribution. Okay… now what? Well, we need to apply a decision rule that will tell us whether we have a winner or not. Aren’t you curious to see how this works? Me too!

3. Apply Decision Rule

The third step in our flowchart above consists in applying a decision rule to our analysis: is our experiment conclusive? If so, who is the winner?

To answer this question, it is worth pointing out that Bayesian statistics is much less standardized than Frequentist statistics. Not just because Bayesian statistics is still fairly “young” but also very much because of its richness: Bayesian statistics gives us full access to the distribution of the parameters of interest. This means that there are potentially many different ways of making inference from our data. Nevertheless, the methods will probably become more and more standardized over time.

In terms of A/B testing, there seem to be two main approaches for decision making. The first one is based on the paper by John Kruschke at Indiana University, called “Bayesian estimation supersedes the t test” [2]. It is often cited as the BEST paper (yes, that’s called good marketing strategy ;) ). The decision rule used in this paper is based on the concept of Region Of Practical Equivalence (ROPE) that I discuss in Section 3.1.

The other possible route is the one that makes use of the concept of an Expected Loss. It has been proposed by Chris Stucchio [3] and I discuss it in Section 3.2.

3.1 Option 1: Region Of Practical Equivalence (ROPE) Method

In order to understand this decision rule we need to define the concepts of High Posterior Density Interval (HPD) and Region of Practical Equivalence (ROPE).

3.1.1 High Posterior Density Interval (HPD)

First, let me define the HPD in a formal way before looking at what it means in practice. Here I am borrowing the definition from a nice presentation made available by David Hitchcock [4].

Definition of HPD.

A $100\alpha\%$ HPD region for the parameter $\theta$ is a subset $\mathcal{C}$ of all the possible values of $\theta$ defined by$$\mathcal{C} = \{\theta : \mathbb{P}(\theta|\textbf{d}) \geq k\}$$where $k$ is the largest number such that$$\int_{\theta\in \mathcal{C}} \mathbb{P}(\theta|\textbf{d}) \textrm{d}\theta = \alpha\ .$$

Example.

Assume we are looking for the 95% HPD, which means we take $\alpha=0.95$. To find the 95% HPD for the distribution in the figure below, we proceed as follows: we start with a horizontal line above the maximum of the distribution. We then lower it gradually, each time measuring the integral (the area) of $\mathbb{P}(\theta|d)$ over all the values of $\theta$ for which the distribution lies above the line. We eventually arrive at a value ($k=0.031$ in this case) for which the integral equals 0.95. That is the value $k$ which defines the HPD.
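The same construction can be carried out numerically. Below is a minimal sketch (not part of aByes; the function name and the standard-normal example are my own) that computes an HPD interval from posterior samples using the shortest-interval method, which for a unimodal distribution is equivalent to the horizontal-line construction above:

```python
import numpy as np

def hpd_interval(samples, alpha=0.95):
    """Shortest interval containing a fraction `alpha` of the samples.

    For a unimodal posterior this coincides with the HPD region
    obtained by lowering a horizontal line over the density."""
    sorted_s = np.sort(samples)
    n = len(sorted_s)
    n_in = int(np.ceil(alpha * n))              # samples inside the interval
    widths = sorted_s[n_in - 1:] - sorted_s[: n - n_in + 1]
    i = np.argmin(widths)                       # shortest such window
    return sorted_s[i], sorted_s[i + n_in - 1]

rng = np.random.default_rng(0)
samples = rng.normal(0.0, 1.0, size=100_000)
lo, hi = hpd_interval(samples, 0.95)
# for a standard normal the 95% HPD is approximately (-1.96, 1.96)
```

For a symmetric unimodal distribution the HPD coincides with the usual central credible interval; for skewed posteriors the two differ, which is why the shortest-interval construction is used.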

3.1.2 Region Of Practical Equivalence (ROPE)

In order to define the concept of ROPE I think it is best to start by quoting John Kruschke’s own words in his paper [2] (page 6, left column): “The researcher specifies a region of practical equivalence (ROPE) around the null value, which encloses those values of the parameter that are deemed to be negligibly different from the null value for practical purposes. The size of the ROPE will depend on the specifics of the application domain. As a generic example, because an effect size of 0.1 is conventionally deemed to be small (Cohen, 1988), a ROPE on effect size might extend from −0.1 to +0.1.”

In other words, the ROPE is something that the researcher/data analyst/data scientist should come up with before running the experiment, as an answer to the following question: “in what circumstances am I going to declare that there is effectively no difference between the two groups A and B”? This will happen when most of the “difference” between the A and B groups lies within the ROPE (don’t worry if this doesn’t make much sense now, it will make much more sense very soon). One possibility is to define the ROPE based on the effect size, as in the quoted text above, but another equally plausible choice could be defining a ROPE based on the difference in the mean values (this quantity is sometimes called lift).

As a reminder, the effect size ES is a measure of the difference in the mean between two groups, normalized with respect to the standard deviation,\begin{equation}\textrm{ES} = \frac{\mu_B - \mu_A}{\sigma}\ ,\label{eq:effect_size}\end{equation}where $\mu_A$ and $\mu_B$ are the sample means for groups A and B, respectively, and $\sigma$ is an unspecified value for the standard deviation. Typically, the standard deviation of the difference of the means is evaluated as\begin{equation}\sigma=\sqrt{\sigma_A^2+\sigma_B^2}\ .\nonumber\end{equation}If $\textrm{ES} = \pm 0.1$, this implies that the difference in mean values between groups A and B is only 10% of the value of the standard deviation. This is a rather small value, and one that could well warrant the label of “negligibly different” for the two groups. Whether this value is too much or too little is left to the researcher (yes, you!).
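As a quick numerical check of this formula, here is a minimal sketch (the helper function and the two artificial samples are mine):

```python
import numpy as np

def effect_size(a, b):
    """Effect size (mu_B - mu_A) / sigma, with sigma = sqrt(var_A + var_B)."""
    return (np.mean(b) - np.mean(a)) / np.sqrt(np.var(a) + np.var(b))

# two Bernoulli-like samples with means 0.4 and 0.5
a = np.array([1] * 40 + [0] * 60)
b = np.array([1] * 50 + [0] * 50)
es = effect_size(a, b)   # 0.1 / sqrt(0.24 + 0.25) = 0.1 / 0.7, about 0.143
```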

Example

In this section I am going to explain practically how to apply a Bayesian decision rule based on the ROPE. I will use the effect size as the decision variable (i.e., the variable that defines the ROPE). However, the argument would be exactly the same if we were to use the lift instead, or any other variable.

Below I am presenting three different scenarios. The cyan bars represent the distributions of the effect size defined in eq. (\ref{eq:effect_size}). The green vertical line represents the value ES=0, while the two vertical black broken lines define the ROPE as the locations where ES = $\pm 0.1$. In case (a), the 95% HPD is both inside and outside the ROPE and the experiment is considered inconclusive. In case (b), the 95% HPD is completely outside of the ROPE, on the right hand side of the green line. This means that the experiment is conclusive and group B is declared as the winner. In case (c), the 95% HPD is completely inside the ROPE and the experiment is also conclusive. In this case the “null value” is accepted and groups A and B are considered to be, effectively, as equivalent.
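The three scenarios (a), (b) and (c) translate directly into a small decision function. Here is a sketch (the function name and return strings are mine) that takes a precomputed HPD interval and a ROPE, both as `(low, high)` tuples on the chosen decision variable, and returns the decision:

```python
def rope_decision(hpd, rope):
    """Decision rule of the ROPE method.

    `hpd` and `rope` are (low, high) tuples on the decision variable
    (effect size, lift, ...)."""
    lo, hi = hpd
    if rope[0] <= lo and hi <= rope[1]:
        return "conclusive: effectively equivalent"   # case (c)
    if lo > rope[1]:
        return "conclusive: B wins"                   # case (b)
    if hi < rope[0]:
        return "conclusive: A wins"                   # mirror of case (b)
    return "inconclusive"                             # case (a)
```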

3.2 Option 2: Expected Loss Method

An alternative to the ROPE method is to calculate the Expected Loss that we would incur if we were to take the wrong decision. In other words, given the data observed in our experiment, what would be the cost associated with mistakenly declaring variant A (or variant B) the winner? If this value is smaller than a certain “threshold of caring”, then we can declare variant A (or variant B) to be the winner. After all, if we were wrong, the resulting loss would be so low that we would not… care. This method has been proposed by C. Stucchio [3].

The expected loss (due to making the wrong decision) is defined in the following way,$$\begin{equation}
\mathbb{E}(\mathcal{L}) = \min[\mathbb{E}(\mathcal{L}_A), \mathbb{E}(\mathcal{L}_B)]
\label{eq:loss}
\end{equation}$$where $\mathbb{E}(\mathcal{L}_A)$ and $\mathbb{E}(\mathcal{L}_B)$ represent the expected losses associated with choosing “A” or “B” as the winner. These quantities are defined as$$\begin{equation}
\mathbb{E}(\mathcal{L}_A) = \int_0^1 \int_0^1\max(\mu_B - \mu_A,0)\,\mathbb{P}_A(\mu_A|\textbf{d}_A)\mathbb{P}_B(\mu_B|\textbf{d}_B)\,\textrm{d}\mu_A\textrm{d}\mu_B
\label{eq:loss1}
\end{equation}$$and$$\begin{equation}
\mathbb{E}(\mathcal{L}_B) = \int_0^1 \int_0^1 \max(\mu_A - \mu_B,0)\,\mathbb{P}_A(\mu_A|\textbf{d}_A)\mathbb{P}_B(\mu_B|\textbf{d}_B)\,\textrm{d}\mu_A\textrm{d}\mu_B
\label{eq:loss2}
\end{equation}$$Note the integrand of $\mathbb{E}(\mathcal{L}_A)$: if we declare A the winner, we forgo whatever amount B would actually have outperformed A by, $\max(\mu_B - \mu_A, 0)$, and nothing otherwise. In the expressions above, we have used the fact that the joint posterior factorizes, $\mathbb{P}(\mu_A,\mu_B|\textbf{d}) = \mathbb{P}_A(\mu_A|\textbf{d}_A)\mathbb{P}_B(\mu_B|\textbf{d}_B)$, because the $\mu_A$ and $\mu_B$ distributions are independent. Also, note that the expected loss is always a non-negative number.

And here is our decision rule: if the expected loss, eq. (\ref{eq:loss}), is lower than our “threshold of caring”, we declare the experiment conclusive and pick the variant with the smallest expected loss. If both expected losses, eqs. (\ref{eq:loss1}) and (\ref{eq:loss2}), are smaller than the threshold of caring, we declare the experiment conclusive and the two variations to be effectively equivalent.

One way to gain more intuition about eqs. (\ref{eq:loss}), (\ref{eq:loss1}) and (\ref{eq:loss2}) is to reformulate the integrals in eqs. (\ref{eq:loss1}) and (\ref{eq:loss2}) in terms of the distribution of the lift, $\Delta \mu=\mu_B - \mu_A$. This allows us to transform those two-dimensional integrals into one-dimensional integrals over $\Delta \mu$.

In order to do so, we need to understand what the distribution of the difference, $\mu_B - \mu_A$, is, given that we only know the distributions of $\mu_A$ and $\mu_B$. Luckily, there are plenty of references on the distribution of the sum of two random variables, stating that for independent random variables X and Y, with distributions $f_X(X)$ and $f_Y(Y)$, the distribution of the sum Z = X + Y equals the convolution of $f_X(X)$ and $f_Y(Y)$. Similarly, the distribution of the difference Z = X - Y equals the convolution of $f_X(X)$ and $f_Y(-Y)$.

Applying these concepts to our case, we can substitute $f_X(X)$ and $f_Y(Y)$ with $\mathbb{P}_A(\mu_A|\textbf{d}_A)$ and $\mathbb{P}_B(\mu_B|\textbf{d}_B)$, i.e. with the posterior distributions of the mean values $\mu_A$ and $\mu_B$. Then, the distribution of the difference $\Delta \mu = \mu_B - \mu_A$ looks like$$\begin{equation}
\mathbb{P}(\Delta\mu|\textbf{d}) = \int_0^1 \mathbb{P}_B(\mu_B|\textbf{d}_B)\mathbb{P}_A(\mu_B-\Delta\mu|\textbf{d}_A)\textrm{d}\mu_B\ \ ,
\end{equation}$$which indeed represents the convolution of the distributions of $\mu_A$ and $\mu_B$.

As mentioned above, we can use this to reduce the expected loss $\mathbb{E}(\mathcal{L}_A)$ from a two-dimensional integral to a one-dimensional integral over the posterior of the lift $\Delta\mu$,$$\begin{equation}
\mathbb{E}(\mathcal{L}_A) = \int_0^1 \int_0^1 \max(\mu_B - \mu_A,0)\,\mathbb{P}_A(\mu_A|\textbf{d}_A)\mathbb{P}_B(\mu_B|\textbf{d}_B)\,\textrm{d}\mu_A\textrm{d}\mu_B = \\
\int \int_0^1 \max(\Delta\mu,0)\,\mathbb{P}_B(\mu_B|\textbf{d}_B)\mathbb{P}_A(\mu_B-\Delta\mu|\textbf{d}_A)\textrm{d}\mu_B\textrm{d}\Delta\mu = \int \max(\Delta\mu,0)\,\mathbb{P}(\Delta\mu|\textbf{d})\textrm{d}\Delta\mu
\label{eq:L1}
\end{equation}$$where the second equality comes from the substitution $\mu_A \rightarrow \mu_B - \Delta\mu$.

Similarly, for $\mathbb{E}(\mathcal{L}_B)$,$$\begin{equation}
\mathbb{E}(\mathcal{L}_B) = \int \max(-\Delta\mu,0)\,\mathbb{P}(\Delta\mu|\textbf{d})\textrm{d}\Delta\mu\ .
\label{eq:L2}
\end{equation}$$As promised, we have obtained our one-dimensional integrals, eqs. (\ref{eq:L1}) and (\ref{eq:L2})! We can now plot the integrands on simple Cartesian x-y axes. I hope this will help, as it has helped me understand these concepts more deeply.

The expressions (\ref{eq:L1}) and (\ref{eq:L2}) tell us that the expected losses $\mathbb{E}(\mathcal{L}_A)$ and $\mathbb{E}(\mathcal{L}_B)$ are nothing more than the expected values of $\max(\Delta\mu,0)$ and $\max(-\Delta\mu,0)$ under the posterior distribution of the lift $\Delta\mu$.

The figure below shows an example for a particular, arbitrary shape of $\mathbb{P}(\Delta\mu|d)$. Panel (a) shows the function $\mathbb{P}(\Delta\mu|d)$, with the broken lines marking the final values of the expected losses $\mathbb{E}(\mathcal{L}_A)$ and $\mathbb{E}(\mathcal{L}_B)$. The integrands of eqs. (\ref{eq:L1}) and (\ref{eq:L2}) are displayed in panels (b) and (c).
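The expected losses can also be estimated directly by Monte Carlo. The sketch below assumes conversion-rate data with Beta(1, 1) priors, so that the posteriors of $\mu_A$ and $\mu_B$ are Beta distributions; the function name and the counts are invented for illustration, and the loss of declaring a variant the winner is computed as the amount by which the other variant would have outperformed it:

```python
import numpy as np

def expected_losses(succ_a, n_a, succ_b, n_b, n_draws=200_000, seed=1):
    """Monte Carlo estimate of the expected losses of declaring A or B
    the winner, for Bernoulli data with Beta(1, 1) priors."""
    rng = np.random.default_rng(seed)
    mu_a = rng.beta(1 + succ_a, 1 + n_a - succ_a, n_draws)  # posterior of mu_A
    mu_b = rng.beta(1 + succ_b, 1 + n_b - succ_b, n_draws)  # posterior of mu_B
    loss_a = np.mean(np.maximum(mu_b - mu_a, 0.0))  # cost of declaring A winner
    loss_b = np.mean(np.maximum(mu_a - mu_b, 0.0))  # cost of declaring B winner
    return loss_a, loss_b

loss_a, loss_b = expected_losses(succ_a=400, n_a=1000, succ_b=500, n_b=1000)
# B clearly converts better, so declaring B carries a tiny expected loss,
# while declaring A would forfeit roughly the full ~0.1 lift
```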

3.3 Summary

We have seen how the ROPE and Expected Loss methods can be used as decision rules for our A/B experiments. Here I want to summarize how these methods should be applied.

ROPE Method

From the posterior distribution of the effect size (or lift $\Delta \mu$, or any other decision metric that we choose as our reference metric), calculate the 95% HPD.

If the 95% HPD is completely within the ROPE, declare the null value to be effectively true. If the 95% HPD is completely outside of the ROPE, declare the variation with the largest mean value of the decision metric (effect size, lift…) the winner. If the 95% HPD lies partly inside and partly outside the ROPE, declare the experiment inconclusive and continue gathering data (if possible).

Expected Loss Method

From the posterior distribution of the lift $\Delta \mu$, calculate the expected loss.

If the expected loss is smaller than the threshold of caring, declare the variation with the smallest expected loss the winner. If both expected losses are smaller than the threshold of caring, declare the variations effectively equivalent. If there is no conclusive result, keep gathering data, if possible.

At this point we have all the ingredients that are needed to understand and analyze an A/B experiment through the package aByes.

4. End-to-end Case Study

Let’s apply what we have learnt to a “real” scenario. More examples can be found in the aByes package, under the aByes/examples folder.

After having downloaded and installed the package, we import aByes using the command

import abyes as ab

Next, we generate some random data. For this example, we are going to assume that the B variant does better than the A variant by picking values from a Bernoulli distribution with mean values of 0.5 and 0.4, respectively:
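With numpy, a data list matching the description below can be generated like this (the seed and the variable name `data` are arbitrary choices of mine):

```python
import numpy as np

rng = np.random.default_rng(42)
# variant A: Bernoulli(0.4); variant B: Bernoulli(0.5)
data = [rng.binomial(1, 0.4, size=10000),
        rng.binomial(1, 0.5, size=10000)]
```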

The first element of this list is a numpy array of size 10000, each element being a sample drawn from a Bernoulli distribution with probability of success 0.4. Similarly, the second element is another numpy array of size 10000, but this time the probability of success is 0.5.

The data list represents our experimental data for the A and B buckets. Now let’s see how we can apply the different methods previously discussed to do a Bayesian analysis of these experimental results.

4.1 Analytic solution. Lift as decision variable, ROPE as decision rule. In this scenario, we decide to analyze the experiment by applying the analytic method, with a decision variable based on the lift ($=\mu_B-\mu_A$) and a decision rule based on the ROPE. We choose the ROPE to be between -0.01 and 0.01 and use the 95% HPD region. This means that if there is more than a 95% chance that the absolute difference in conversion rate between the two variants is less than 1%, we declare the two variants to be effectively the same.

Here is how we can apply our Bayesian analysis to this case, using aByes,
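Rather than reproduce the aByes API from memory (see the package README for the exact call), here is a self-contained sketch of the same analysis: Beta(1, 1) priors give analytic Beta posteriors, the lift is sampled from them, and the ROPE rule is applied to its 95% HPD (shortest interval). All names are mine:

```python
import numpy as np

def analyze_lift_rope(data_a, data_b, rope=(-0.01, 0.01), alpha=0.95,
                      n_draws=200_000, seed=0):
    """Lift as decision variable, ROPE as decision rule, analytic
    Beta posteriors (Beta(1, 1) priors on the conversion rates)."""
    rng = np.random.default_rng(seed)
    mu_a = rng.beta(1 + data_a.sum(), 1 + len(data_a) - data_a.sum(), n_draws)
    mu_b = rng.beta(1 + data_b.sum(), 1 + len(data_b) - data_b.sum(), n_draws)
    lift = np.sort(mu_b - mu_a)
    n_in = int(np.ceil(alpha * n_draws))      # shortest interval with 95% mass
    i = np.argmin(lift[n_in - 1:] - lift[:n_draws - n_in + 1])
    lo, hi = lift[i], lift[i + n_in - 1]
    if rope[0] <= lo and hi <= rope[1]:
        return "conclusive: effectively equivalent"
    if lo > rope[1]:
        return "conclusive: B wins"
    if hi < rope[0]:
        return "conclusive: A wins"
    return "inconclusive"

rng = np.random.default_rng(7)
data = [rng.binomial(1, 0.4, 10000), rng.binomial(1, 0.5, 10000)]
result = analyze_lift_rope(data[0], data[1])
```

With a true lift of 0.1 and 10000 samples per bucket, the HPD of the lift sits far above the ROPE, so the analysis declares B the winner.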

4.2 Analytic solution. Lift as decision variable, Expected Loss as decision rule. This time we decide to use the Expected Loss as our decision rule and we set the threshold of caring to be 0.01 (this is also the default value),

4.3 Numerical solution. Effect size as decision variable, ROPE as decision rule. This time we use the numerical (MCMC) solution for our inference (you will need to have pyMC3 installed in your environment) and decide to use the effect size as our decision variable. In terms of decision rule, we again use the ROPE and set $\alpha=0.95$.

4.4 Bonus test. Compare analytic vs. numerical solution. I have decided to add an extra option in aByes, which compares the analytic vs. numerical solutions. For standard conditions, the two solutions should actually be “identical”. This is an extra benchmark on the correctness of the code.

To see that the two methods agree to a very high degree, we can just type

The resulting plot will show that, indeed, the posterior distributions from the analytic and MCMC solutions agree very well,

5. Summary and Conclusion

In this lengthy blog post, I have presented a detailed overview of Bayesian A/B Testing: from the first step of gathering the data, to deciding whether to follow an analytic or numerical approach, to choosing the decision rule.

In the previous section, we have also seen some practical examples that make use of the Python package aByes.

Here are my suggestions for using aByes for conversion rate experiments:

Use the analytic solution. There is no point in using the numerical solution at this stage of the aByes project, as it will give “identical” results to the analytic solution. Feel free to use the numerical solution if you are curious about MCMC methods. If you want to use a prior that is not well fitted by a Beta distribution, then you should definitely switch to the numerical solution (in this case you should also make some small changes to the code to accommodate the distribution of your choice).

In terms of choosing the decision variable, “lift” (the difference in the mean conversion rates of the A and B variants) is the easiest to understand - so I would start from that. I believe “effect size” would be particularly useful for the analysis of revenue (rather than conversion rates), where the distributions can be skewed and it may be important to add information on the actual spread of the data away from the mean value.

In choosing a decision rule, I don’t have a strong preference in favour of the ROPE or of the Expected Loss. At the moment I am fairly agnostic about it, with a slight preference towards the ROPE method as it seems to me to be more robust with respect to skewed distributions.

Thanks for reading this post. And remember to keep using your t-tests and chi-square tests when needed! :)

This article is aimed at anyone who is interested in understanding the details of A/B testing from a Bayesian perspective. It is accompanied by a Python project on Github (https://github.com/cbellei/abyes), which I have named aByes (I know, I could have chosen something different from the anagram of Bayes…) and will give you access to a complete set of tools to do Bayesian A/B testing on conversion rate experiments.
The confusion over information retrieval metrics in Recommender Systems
http://www.claudiobellei.com/2017/06/18/information-retrieval/
2017-06-18T18:17:31.000Z (updated 2018-09-10T17:57:35.387Z)

Recently I have been reading a lot about evaluation metrics in information retrieval for Recommender Systems and I have discovered (with great surprise) that there is no general consensus over the definition of some of these metrics. I am obviously not the first one to notice this, as demonstrated by a full tutorial at ACM RecSys 2015 discussing this issue. In the past few weeks I have spent some time going through this maze of definitions. My hope is that you won’t have to do the same after reading this post.

Here I will present a summary of what I have discovered during this time. The material presented is inspired by several sources, which are detailed in the “References” section at the end of this post. I will cite these sources as we proceed, together with identifying edge scenarios, test cases and example Scala code.

We will be looking at six popular metrics: Precision, Recall, F1-measure, Average Precision (and its user average, the Mean Average Precision, MAP), Mean Reciprocal Rank (MRR) and Normalized Discounted Cumulative Gain (NDCG).

Before starting, it is useful to write down a few definitions. Let us first assume that there are $U$ users. Each user $i$ ($i=1,\dots,U$) receives a set $R_i$ of recommendations of dimension $\rho_i$ (predictions vector). Out of all these recommendations, only a fraction of them will be actually useful (relevant) to the user. The set of all relevant items represents the “ground truth” or labels vector. Here it will be represented by the vector $G_i$, with dimension $\gamma_i$. In summary,

Also, as usual I will denote with TP$_i$ the number of true positives for user $i$ (i.e., the number of recommended items that are indeed relevant) and with FP$_i$ the number of false positives (i.e., the number of recommended items that are not relevant).

At this point we have all the ingredients needed to properly define our metrics. For each metric, we will compute the results related to the user test cases shown below. These will be our “unit tests”, in that any code that implements these metrics (in the way defined here) should recover the same results.

User Test Cases

User#1.$$\begin{equation}
\textrm{Labels} = \{1,2,3,4,5,6\} \\
\textrm{Predictions} = \{1,6,8\}
\nonumber
\end{equation}$$ For this user, the dimension of the label vector is larger than the dimension of the prediction vector.

User#2.$$\begin{equation}
\textrm{Labels} = \{2,4,6\} \\
\textrm{Predictions} = \{1,2,3,4,5\}
\nonumber
\end{equation}$$ For this user, the dimension of the label vector is smaller than the dimension of the prediction vector.

1.1 Edge cases

1. There are fewer recommended documents than the value $k$ (in other words, $\rho_i<k$). The definition above takes care of this scenario by using the value $\min\{k,\rho_i\}$ in the summation.
2. There are no recommended documents. Expected behavior: $\textrm{P}_i@k$ should return zero.
3. The vector of labels is empty. Expected behavior: $\textrm{P}_i@k$ should return NaN.

1.2 Mean Precision@k

This is simply the user-averaged value of the Precision@k,$$\textrm{P}@k = \frac{\sum_{i=1}^U\textrm{P}_i@k}{U}$$Edge case. Some precision values are NaNs. Expected behaviour: these should be excluded from the average.

1.3 Sources of confusion

There is not much to say about the general definition of Precision@k; there is really not much room for misunderstanding here. However, the devil lies in the details, and in particular in the edge cases. When calculating the Precision@k for a user that does not have labels (case #3 in the “edge cases” subsection above), should the metric return zero or NaN? Currently the choice of Spark MLlib is to return zero and a warning, as shown in the method precisionAt in the Scala code here (commit f830bb9170f6b853565d9dd30ca7418b93a54fe3 of 29 Nov 2016). What this means is that users for which there are no labels available count as if the recommender did not retrieve any relevant document for them (precision=0). For this reason, I would personally rather use a NaN value, in the sense of just discarding cases where the label vectors are empty. In any case, this choice should not be critical, as hopefully only a negligible fraction of the label vectors will be empty.

1.4 Application to User Test Cases

Using the definition in eq. (\ref{eq:precision}), here are the precision values for the test cases above,
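The post’s companion code is in Scala; as an illustrative sketch, the same definition (with its edge cases) fits in a few lines of Python. The function name is mine; note that the denominator stays at $k$ even when fewer than $k$ items are recommended, which is what reproduces the test-case values:

```python
def precision_at_k(labels, predictions, k):
    """Precision@k with the edge cases above: NaN for an empty label
    vector, zero when there are no recommendations."""
    if not labels:
        return float("nan")
    if not predictions:
        return 0.0
    relevant = set(labels)
    return sum(1 for p in predictions[:k] if p in relevant) / k

# User test cases from above
p1 = precision_at_k([1, 2, 3, 4, 5, 6], [1, 6, 8], 3)   # user #1, k=3
p2 = precision_at_k([2, 4, 6], [1, 2, 3, 4, 5], 5)      # user #2, k=5
```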

2. Recall@k

Recall is the probability that a relevant item (i.e., an item belonging to the labels vector) is recommended,$$\begin{equation}
\textrm{R}_i@k = \frac{\textrm{TP}_i}{\textrm{TP}_i + \textrm{FN}_i} = \frac{\sum_{j=1}^{\min\{k,\rho_i\}}\textrm{rel}_{ij}}{\gamma_i}
\label{eq:recall}
\end{equation}$$where as before $\gamma_i$ is the total number of relevant documents (labels).

2.1 Edge cases

(These are the same as for the Precision@k metric.)

1. There are fewer recommended documents than the value $k$ (in other words, $\rho_i<k$). The definition above takes care of this scenario by using the value $\min\{k,\rho_i\}$ in the summation.
2. There are no recommended documents. Expected behavior: $\textrm{R}_i@k$ should return zero.
3. The vector of labels is empty. Expected behavior: $\textrm{R}_i@k$ should return NaN.

2.2 Mean Recall@k

This is simply the user-averaged value of the Recall@k,$$\textrm{R}@k = \frac{\sum_{i=1}^U\textrm{R}_i@k}{U}$$Edge case. Some recall values are NaNs. Expected behaviour: these should be excluded from the average.

2.3 Sources of confusion

As for the Precision@k metric discussed above, there is a potential confusion arising from the scenario in which the label vector is empty. Actually, MLlib does not seem to support recall using the same pattern as the method precisionAt in RankingMetrics.scala, as there is no equivalent recallAt method. A recall metric is defined in MulticlassMetrics.scala via the method recall, but this does not seem to take care of the case in which the label vector is empty.

2.4 Application to User Test Cases

Using the definition in eq. (\ref{eq:recall}), here are the recall values for the test cases above,
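The same sketch for recall (Python rather than the post’s Scala; the function name is mine):

```python
def recall_at_k(labels, predictions, k):
    """Recall@k: fraction of the relevant items that appear in the
    top-k recommendations; NaN when the label vector is empty."""
    if not labels:
        return float("nan")
    relevant = set(labels)
    return sum(1 for p in predictions[:k] if p in relevant) / len(labels)
```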

3. F1-score@k

The F1-score aims at capturing in a single metric the precision and recall metrics and is usually defined as the harmonic mean between them,$$\begin{equation}
\textrm{F1}_i@k = 2\frac{\textrm{P}_i@k \cdot \textrm{R}_i@k}{\textrm{P}_i@k + \textrm{R}_i@k}
\label{eq:F1-score}
\end{equation}$$

3.1 Edge cases

In addition to the edge cases inherited from the definitions of Precision and Recall, I would add this one: if $\textrm{P}_i@k=0$ and $\textrm{R}_i@k=0$, then $\textrm{F1}_i@k=0$.

3.2 Mean F1-score@k

This is simply the user-averaged value of the F1-score@k,$$\textrm{F1}@k = \frac{\sum_{i=1}^U\textrm{F1}_i@k}{U}$$

3.3 Sources of confusion

The F1-score really just inherits the “confusion” coming from the Precision and Recall metrics, so there is not much to add here.

3.4 Application to User Test Cases

Using the definition in eq. (\ref{eq:F1-score}), here are the F1-scores for the test cases above,

| User Test | F1@1 | F1@3 | F1@5 |
|---|---|---|---|
| #1 | (2/6)/(1+1/6) = 2/7 = 0.286 | (4/9)/(2/3+1/3) = 4/9 = 0.444 | (4/15)/(2/5+1/3) = 4/11 = 0.364 |
| #2 | 0 | (2/9)/(1/3+1/3) = 1/3 = 0.333 | (8/15)/(2/3+2/5) = 1/2 = 0.500 |
| #3 | 0 | 0 | 0 |
| #4 | NaN | NaN | NaN |
| #5 | NaN | NaN | NaN |
| Mean F1-score | (2/7+0+0)/3 = 2/21 = 0.095 | (4/9+1/3+0)/3 = 7/27 = 0.259 | (4/11+1/2+0)/3 = 19/66 = 0.288 |
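The F1 values above can be reproduced with a short Python sketch (the function name is mine; edge cases as defined earlier):

```python
def f1_at_k(labels, predictions, k):
    """F1@k: harmonic mean of Precision@k and Recall@k, with F1 = 0
    when both are zero and NaN when the label vector is empty."""
    if not labels:
        return float("nan")
    hits = sum(1 for p in predictions[:k] if p in set(labels))
    prec, rec = hits / k, hits / len(labels)
    return 0.0 if hits == 0 else 2 * prec * rec / (prec + rec)
```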

4. Average Precision@k

This metric represents the average of the precision values for the relevant items from 1 to k,$$\begin{equation}
\textrm{AP}_i@k = \frac{\sum_{j=1}^{\min\{k,\rho_i\}}\textrm{rel}_{ij} \textrm{P}_i@j}{\sum_{j=1}^{\min\{k,\rho_i\}}\textrm{rel}_{ij}}
\label{eq:ap}
\end{equation}$$

Compared to the Precision, the Average Precision biases the result towards the top ranks. It is designed to work for binary outcomes (in this sense, the NDCG defined at the end of this post is more general as it also takes into account non-binary relevance).

4.1 Edge cases

1. There are fewer recommended documents than the value $k$ (in other words, $\rho_i<k$). The definition above takes care of this scenario by using the value $\min\{k,\rho_i\}$ in the summation.
2. There are fewer labels available than the value $k$ (in other words, $\gamma_i<k$). The definition above takes care of this scenario by using the value $\min\{k,\gamma_i\}$ in the denominator.
3. There are no recommended documents. Expected behavior: $\textrm{AP}_i@k$ should return zero.
4. The vector of labels is empty. Expected behavior: $\textrm{AP}_i@k$ should return NaN.

4.2 Mean Average Precision@k (MAP@k)

This is simply the average of the Average Precision@k for all users,$$\textrm{MAP}@k = \frac{\sum_{i=1}^U\textrm{AP}_i@k}{U}$$

4.3 Sources of confusion

Here is complete chaos. The problem is that the denominator in eq. (\ref{eq:ap}) is not the only possible option. We shall see why. The usual formal way in which the average precision is defined is as the area under the precision-recall curve, or$$\begin{equation}
AP = \int P\textrm{d}R \simeq \sum_j P_{@j}\,(R_{@j}-R_{@j-1}) = \sum_j P_{@j}\Delta R_{@j}
\end{equation}$$where P and R stand for Precision and Recall. Note that $\Delta R_{@j} = 0$ when the recall does not change between two consecutive retrieved documents (i.e., when the second recommended document is not relevant). If the recall does change, it is because the next document is relevant and in this case $\Delta R_{@j} = 1/$#relevant documents. The expression above can thus be rewritten as a sum over only the relevant documents, divided by a generic “total number of relevant documents”,$$\begin{equation}
AP = \frac{\sum_j P_{@j}\textrm{rel}_j}{\textrm{number of relevant documents}}\ .
\end{equation}$$Here is the crux of the matter: what is the number of relevant documents (the ground truth)? There are a few options here:

1. The ground truth is only represented by the relevant items that are recommended up to the k-th document. This leads to the denominator used in equation (\ref{eq:ap}). The bottom line is that with this definition, the Average Precision@k represents the average of the precision at the relevant recommendations, up to the k-th recommendation. This definition is consistent with several sources, including papers such as this one and this one, the Coursera course from the University of Minnesota and this video from Victor Lavrenko.

2. The ground truth is represented by the number of items that are recommended, up to a maximum of $k$. This leads to the denominator being $\min\{k,\rho_i\}$, as for example used in this Kaggle definition.

3. The ground truth is represented by the actual ground truth vector $G_i$. This leads to the denominator being $\gamma_i$. It is the choice made by MLlib, as can be seen in the MLlib documentation as well as in the function meanAveragePrecision in the Scala source code (commit f830bb9170f6b853565d9dd30ca7418b93a54fe3 of 29 Nov 2016). I don’t agree with this definition, as in real Recommender Systems applications each user may have a largely different number of ground truth recommendations, so the Average Precision defined this way seems to be over-dependent on something other than the precision and ranking values.

4.4 Application to User Test Cases

Using the definition in eq. (\ref{eq:ap}), here are the Average Precision scores for the test cases above,
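A Python sketch of this definition, using the first denominator option (the number of relevant items among the top-k recommendations; function name mine):

```python
def average_precision_at_k(labels, predictions, k):
    """AP@k: average of P@j over the ranks j (within the top-k) where a
    relevant item appears; NaN for empty labels, zero if no hits."""
    if not labels:
        return float("nan")
    relevant, hits, total = set(labels), 0, 0.0
    for rank, p in enumerate(predictions[:k], start=1):
        if p in relevant:
            hits += 1
            total += hits / rank          # precision at this relevant rank
    return total / hits if hits else 0.0
```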

5. Reciprocal Rank@k

The reciprocal rank of a set of recommendations served to user $i$ is defined as the inverse position (rank) of the first relevant document among the first k ones,$$\begin{equation}
\textrm{RR}_i@k = \frac{1}{\textrm{rank}_{i,1}}
\nonumber
\end{equation}$$Because of this definition, this metric is mostly suited to cases in which it is the top-ranked result that matters the most.

5.1 Edge case

If there are no relevant documents within the set of recommendations, it is assumed that $\textrm{rank}_{i,1}=\infty$ and so $\textrm{RR}_i@k=0$.

5.2 Mean Reciprocal Rank@k (MRR@k)

This is simply the user-averaged value of the Reciprocal Rank@k,$$\begin{equation}
\textrm{MRR}@k = \frac{\sum_{i=1}^{U}\textrm{RR}_i@k}{U}
\label{eq:mrr}
\end{equation}$$

5.3 Sources of confusion

Here the main source of confusion comes from the fact that the typical definition of the MRR is for search queries (see for example Wikipedia),$$\begin{equation}
\textrm{MRR} = \frac{1}{Q}\sum_{i=1}^{Q}\frac{1}{\textrm{rank}_i}
\end{equation}$$where $Q$ is the total number of queries. The temptation (as I have seen for example in a project cloned from MLlib - not the official MLlib release) is therefore to replace queries with recommendations for a single user. In this wrong interpretation, the summation would be over all the relevant documents for a single user. The resulting quantity would then need to be averaged again over all the users to give the final result. Instead, the correct way of translating this formula to Recommender Systems is to replace queries with users. The summation becomes over the users, and each term simply represents the reciprocal rank of only the top relevant document for each user.5.4 Application to User Test Cases Using the definition in eq. (\ref{eq:mrr}), here are the Reciprocal Rank-scores for the test cases above,

6. Normalized Discounted Cumulative Gain@k

The Normalized Discounted Cumulative Gain (NDCG) is similar to MAP, except that NDCG also works for non-binary relevance. NDCG has a heavier tail at high ranks, which means that it does not discount lower ranks as much as MAP does. For this reason, some practitioners prefer MAP to NDCG (for binary outcomes).

To define NDCG, we first need to define the discounted cumulative gain (DCG)$$\begin{equation}
\textrm{DCG}_i@k = \sum_{j=1}^{k}\frac{2^{\textrm{rel}_{ij}}-1}{\ln (j+1)}
\label{eq:dcg}
\end{equation}$$and then the Ideal Discounted Cumulative Gain (IDCG)$$\begin{equation}
\textrm{IDCG}_i@k = \sum_{j=1}^{k}\frac{2^{\textrm{SORT(rel)}_{ij}}-1}{\ln (j+1)}\ .
\label{eq:idcg}
\end{equation}$$The IDCG represents the ideal scenario in which the given recommendations are ranked as well as possible; this is the origin of the SORT function in the definition above. Using eqs. (\ref{eq:dcg}) and (\ref{eq:idcg}), we finally arrive at the definition of the Normalized Discounted Cumulative Gain (NDCG) as$$\begin{equation}
\textrm{NDCG}_i@k = \frac{\textrm{DCG}_i@k}{\textrm{IDCG}_i@k}\ .
\label{eq:ndcg}
\end{equation}$$

6.1 Edge cases

1. If there are no relevant documents within the set of recommendations, it is assumed that NDCG$=0$.
2. If the vector of labels is empty, NDCG should return NaN.

6.2 Sources of confusion

Just as with MAP, I have found NDCG to be defined differently by different authors. The two main sources of confusion that I found are:

1. the denominator is sometimes computed with a natural logarithm, some other times with a log base 2;
2. the IDCG is sometimes calculated without using the SORT function on the recommended items, but rather assuming that an ideal recommender would recommend all relevant items in the best ranking order. In this case the numerator in eq. (\ref{eq:idcg}) would be positive for all items (and strictly equal to one for binary outcomes). This is how MLlib defines NDCG. In my research, I have found the definition in eq. (\ref{eq:idcg}) to be more common; see for example this paper (their Definition 1), (again) the Coursera course from the University of Minnesota, or this video from Victor Lavrenko.

6.3 Application to User Test Cases

Before going into the actual calculations, it is useful to elucidate what the SORT function in eq. (\ref{eq:idcg}) does. For our test example #2, we have a label vector = {2,4,6} and a prediction vector = {1,2,3,4,5}. The relevance vector is then rel={0,1,0,1,0} and the “ideal” relevance vector is SORT(rel)={1,1,0,0,0}. In other words, an ideal recommender would have ranked items 2 and 4 at the top of the recommendations list. We can now go ahead and use eq. (\ref{eq:ndcg}) for our test cases,
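A Python sketch of this NDCG variant (natural log, IDCG from the sorted relevance vector of the recommendations themselves; function name mine):

```python
import math

def ndcg_at_k(labels, predictions, k):
    """NDCG@k: DCG over the top-k recommendations, normalized by the
    DCG of the same relevance vector sorted into the ideal order."""
    if not labels:
        return float("nan")
    relevant = set(labels)
    rel = [1.0 if p in relevant else 0.0 for p in predictions[:k]]
    # rank j+1 at list index j, hence ln((j+1) + 1) = ln(j + 2)
    dcg = sum((2 ** r - 1) / math.log(j + 2) for j, r in enumerate(rel))
    idcg = sum((2 ** r - 1) / math.log(j + 2)
               for j, r in enumerate(sorted(rel, reverse=True)))
    return dcg / idcg if idcg > 0 else 0.0
```

For test case #1 the recommendations are already ideally ordered (rel = {1,1,0}), so NDCG = 1; for test case #2 the computation uses rel = {0,1,0,1,0} against SORT(rel) = {1,1,0,0,0} as described above.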

Visualizing time dependent networks with d3.js
http://www.claudiobellei.com/2017/02/04/viznetworks/
2017-02-04T13:10:02.000Z / 2017-02-04T13:03:12.000Z

For a change, here is a post about data visualization. The other day I was thinking about a way of visualizing a time-dependent network in d3.js and in this post I will show a prototype solution.

And there it is, the solution:

As you can see, there is a time slider that allows you to visualize the graph as a function of time. The width of each edge gives information about the strength of the connection between two nodes, while the size of each node represents how important that node is in a given year.

You can drag the nodes around for a little bit of fun and hovering over each node will tell you the id of the node.

]]>
<p>For a change, here is a post about data visualization. The other day I was thinking about a way of visualizing a time-dependent network in <a href="https://d3js.org/" target="_blank" rel="noopener">d3.js</a> and in this post I will show a prototype solution.</p>
Changepoint Detection. Part II - A Bayesian Approach
http://www.claudiobellei.com/2017/01/25/changepoint-bayesian/
2017-01-25T16:40:07.000Z / 2018-12-22T14:23:21.171Z

I have recently discussed the problem of changepoint detection from a frequentist point of view. In that framework, changepoints were inferred using a maximum likelihood estimation (MLE) approach. This gave us point estimates for the positions of the changepoints.

In this post I will present the solution to the same problem from a Bayesian perspective, using a mix of both theory and practice (using the $\small{\texttt{pymc3}}$ package). The frequentist and Bayesian approaches actually give very similar results, as the maximum a posteriori (MAP) value, which maximises the posterior distribution, coincides with the MLE for uniform priors. In general, despite the added complexity in the algorithm, the Bayesian results are rather intuitive to interpret.

Just as for the frequentist case, the Bayesian problem admits a solution that can be expressed in analytical form. In the first part of this blog post I will present this solution, which is based on the excellent chapter 5 of the book “Numerical Bayesian Methods Applied to Signal Processing” by Joseph O. Ruanaidh and William Fitzgerald.

In the second part of this post I will present numerical solutions based on the $\small{\texttt{pymc3}}$ package, for different types of problems including multiple changepoints.

1. Single changepoint: analytic Bayesian solution

Let us assume that we only have one changepoint and that the distributions before and after it are well modeled by stationary Gaussian distributions $\mathcal{N}(\mu,\sigma)$ with mean $\mu$ and variance $\sigma^2$.

We now introduce the Bayesian point of view. First of all, note that the only known quantities in the expressions i. and ii. are the observed data $d_i$ and we want to infer the unknowns $\mu_1,\mu_2,\sigma,\tau$ based on the observed data.

“Inferring the model parameters by looking at the observed data” immediately rings like Bayes’ theorem. Let’s write this theorem in all its glorious form, specifying it for our problem,
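Specialized to our problem, Bayes' theorem reads
$$\begin{equation}
\mathbb{P}(\mu_1,\mu_2,\sigma,\tau|\textbf{d})=\frac{\mathbb{P}(\textbf{d}|\mu_1,\mu_2,\sigma,\tau)\,\mathbb{P}(\mu_1,\mu_2,\sigma,\tau)}{\mathbb{P}(\textbf{d})}\ ,
\nonumber
\end{equation}$$
where the four terms are defined as follows.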

$\mathbb{P}(\mu_1,\mu_2,\sigma,\tau|\textbf{d})$: posterior distribution. It is the quantity that we are interested in (“inferring the model parameters by looking at the observed data”).

$\mathbb{P}(\textbf{d}|\mu_1,\mu_2,\sigma,\tau)$: likelihood function. It represents how likely it is to observe the data we have, for a given choice of the model parameters. Maximising the likelihood function is the basis for the maximum likelihood estimation (MLE) method that was used in the previous blog post.

$\mathbb{P}(\mu_1,\mu_2,\sigma,\tau)$: prior distribution. It encodes our information on the model parameters, before observing the data.

$\mathbb{P}(\textbf{d})$: evidence. It is a measure of how well our model fits the data (how good our model is). For our specific case it tells us whether our assumption of one changepoint and a Gaussian process is a good one. It is often difficult to evaluate this term analytically and for the purpose of this discussion, it can be disregarded as being just a multiplication factor.

where $k_1, k_2, k_3$ and $k_4$ are all unspecified constants. Note how the resulting priors are all improper priors, in the sense that they are not integrable (and in particular, they don’t integrate to 1). Note also how the prior for the variance is not uniform but goes like $1/\sigma$, consistent with Jeffreys’ principle for non-informative priors.

Finally… here comes the fun part! The objective of our analytic treatment is to find the posterior distribution of the changepoint $\tau$, or $\mathbb{P}(\tau|\textbf{d})$. This is obtained by marginalizing the posterior distribution $\mathbb{P}(\mu_1,\mu_2,\sigma,\tau|\textbf{d})$ with respect to the nuisance (i.e., “uninteresting”) parameters $\mu_1,\mu_2,\sigma$. In other words, we are not interested in the whole multi-dimensional posterior distribution, which would be impossible even to plot. Instead, we decide to have a lower-dimensional representation of the posterior distribution by looking only at one particular dimension ($\tau$) and integrating over the other dimensions.

Marginalizing with respect to $\mu_1,\mu_2,\sigma$ means nothing more than summing (for discrete distributions) or integrating (for continuous distributions) with respect to the nuisance parameters, like so:

Regarding integration with respect to $\mu_1$ and $\mu_2$, the following identity is useful:$$\begin{equation}
\int_{-\infty}^{\infty}\exp(-ax^2-bx-c)\,\textrm{d}x = \sqrt{\frac{\pi}{a}}\exp\left(\frac{b^2}{4a}-c\right)\ .
\label{eq:intIdentity1}
\end{equation}$$Regarding integration with respect to $\sigma$, this other identity is useful:$$\begin{equation}
\int_0^{\infty}x^{\alpha-1}\exp(-Qx)\,\textrm{d}x=\frac{\Gamma(\alpha)}{Q^\alpha}\ ,
\label{eq:intIdentity2}
\end{equation}$$in which $\Gamma(\alpha)=\int_0^{\infty}x^{\alpha-1}\exp(-x)\,\textrm{d}x$ is the Gamma function.
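As a sanity check, both identities can be verified numerically. Here is a quick numpy check of the Gaussian identity in eq. (\ref{eq:intIdentity1}), for illustrative values of $a$, $b$, $c$ of my own choosing:

```python
import numpy as np

# numerically verify: int exp(-a x^2 - b x - c) dx = sqrt(pi/a) exp(b^2/(4a) - c)
a, b, c = 2.0, 1.5, 0.3
x = np.linspace(-50.0, 50.0, 400_001)       # wide grid; the integrand is ~0 at the ends
dx = x[1] - x[0]
lhs = np.exp(-a * x**2 - b * x - c).sum() * dx   # simple quadrature
rhs = np.sqrt(np.pi / a) * np.exp(b**2 / (4 * a) - c)
```

Because the integrand decays to zero well inside the grid, even this simple Riemann sum matches the closed form to high precision.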

1.1 Calculate likelihood function

This is based on our model definition, in which we are assuming that all of our data points from $1$ to $\tau$ can be modeled by a normal distribution $\mathcal{N}(\mu_1,\sigma)$ and the remaining ones by another normal distribution $\mathcal{N}(\mu_2,\sigma)$,$$\begin{align}
\mathbb{P}(\textbf{d}|\mu_1,\mu_2,\sigma,\tau) & =\prod_{i=1}^\tau \mathbb{P}(d_i|\mu_1,\sigma)\prod_{j=\tau+1}^N \mathbb{P}(d_j|\mu_2,\sigma) \propto \nonumber \\ & (2\pi\sigma^2)^{-N/2}\exp{\left\{-\frac{1}{2\sigma^2}\left[\sum_{i=1}^\tau (d_i^2+\mu_1^2-2\mu_1d_i) + \sum_{j=\tau+1}^N(d_j^2+\mu_2^2-2\mu_2d_j)\right]\right\}}
\label{eq:likelihood}
\end{align}$$
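In code, the logarithm of eq. (\ref{eq:likelihood}) (up to the constant hidden by the proportionality sign) can be sketched as follows; the function is my own illustration, not the blog's code:

```python
import numpy as np

def log_likelihood(d, tau, mu1, mu2, sigma):
    """Log-likelihood of eq. (likelihood): Gaussian segments d[:tau] and d[tau:]."""
    n = len(d)
    ll = -0.5 * n * np.log(2 * np.pi * sigma**2)
    ll -= np.sum((d[:tau] - mu1)**2) / (2 * sigma**2)
    ll -= np.sum((d[tau:] - mu2)**2) / (2 * sigma**2)
    return ll

# when each segment mean matches its data exactly, only the constant term survives
ll = log_likelihood(np.array([0.0, 2.0]), 1, 0.0, 2.0, 1.0)   # equals -log(2*pi)
```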

Now please stay focused as we are going to compare numerical results obtained with $\small{\texttt{pymc3}}$ against this very expression (\ref{eq:bayesian_final}) and we will find a very impressive matching!

2. Numerical Bayesian solution

In this section we are going to see how to use the $\small{\texttt{pymc3}}$ package to tackle our changepoint detection problem. I found $\small{\texttt{pymc3}}$ to be rather easy to use, particularly after a quick introduction to Theano. The key is understanding that Theano is a framework for symbolic math: it essentially allows you to write abstract mathematical expressions in python. Bonus point: it is possible to run Keras on top of Theano (not just on top of Tensorflow).

Before diving into the technical details of the implementation in $\small{\texttt{pymc3}}$, here are my observations regarding solving a changepoint detection problem using a naïve Markov Chain Monte Carlo approach:

convergence can be made significantly faster using the right choice of priors (I know, it is sort of an obvious point…)

because of my previous point, it is difficult to write an efficient algorithm without having some kind of manual inspection of the data. This means that it is difficult to think of a very general solution, one that does not require some tweaking of the parameters

And now… let’s start playing.

2.1 Multiple changepoint detection using pymc3 - in a nutshell

The main idea behind solving a multiple changepoint detection problem in $\small{\texttt{pymc3}}$ is the following: using multiple Theano switch functions to model multiple changepoints. This is quite a simple idea that shows the versatility of Theano.

Let’s make an example from scratch to show how the logic works. In this example we are going to:

define two stochastic variables, $\mu$ and $\sigma$

define the log-likelihood function

initialize and then start the MCMC algorithm

Suppose we are expecting the data to contain two changepoints $\tau_1$ and $\tau_2$ such that we model the observed data points $d_i$ in the following way,

In this problem we have 6 unknown parameters: $\mu_1$, $\mu_2$, $\mu_3$, $\sigma$, $\tau_1$ and $\tau_2$. We want to use pymc3 to find posterior distributions for these parameters (so we are implicitly in a Bayesian framework).

Here is how we can do this in $\small{\texttt{pymc3}}$. First, we have to import the relevant packages (make sure you have installed $\small{\texttt{pymc3}}$ version 3.2 and $\small{\texttt{theano}}$),

At this point we define some synthetic data that we are going to use as our testbed,

np.random.seed(100)  # initialize random seed
size1 = 1000  # size of first part (number of points)
size2 = 1000  # size of second part (number of points)
size3 = 1000  # size of third part (number of points)
N = size1 + size2 + size3  # total number of points
scale = 30  # standard deviation of distribution function across all three parts
loc1 = 1000  # mean of normal for first part
loc2 = 1100  # mean of normal for second part
loc3 = 800  # mean of normal for third part
d1 = norm.rvs(loc=loc1, size=size1, scale=scale)  # randomly generates points for first part
d2 = norm.rvs(loc=loc2, size=size2, scale=scale)  # randomly generates points for second part
d3 = norm.rvs(loc=loc3, size=size3, scale=scale)  # randomly generates points for third part
data = np.hstack([d1, d2, d3])  # these are our final observation points
plt.plot(data)  # let's have a look at our data
plt.show()

The code above defines two changepoints at positions 1000 and 2000, in which the mean value changes from 1000 to 1100 and then from 1100 to 800. The standard deviation is set at the value 30 for the whole dataset.

We want $\small{\texttt{pymc3}}$ to find all these parameters. Here is how we can do it:

In the code above, the interesting parts are the definitions of the stochastic variables $\mu$ and $\sigma$ and the definition of the log-likelihood function which is the same as in eq. (\ref{eq:likelihood}), after applying the logarithmic function,

Note that the changepoint variables $\small{\texttt{tau1}}$ and $\small{\texttt{tau2}}$ are initialized as uniform random variables. However, while $\small{\texttt{tau1}}$ spans the whole time domain from 1 to N (where N is the total number of data points), $\small{\texttt{tau2}}$ only spans the values from $\small{\texttt{tau1}}$ to N. In this way we ensure that $\small{\texttt{tau2}}$ > $\small{\texttt{tau1}}$.

The other trick is what we were discussing at the beginning of this section: the stochastic variable $\small{\texttt{mu}}$ is defined after first defining a dummy stochastic variable $\small{\texttt{\_mu}}$. In essence, line 16 says that for all the values $\small{\texttt{t}}$ for which $\small{\texttt{tau2}}$>t (i.e., for all the values to the left of $\small{\texttt{tau2}}$), $\small{\texttt{mu}}$ takes the value $\small{\texttt{\_mu}}$. Otherwise, it takes the value $\small{\texttt{mu3}}$. In this way, by using the switch function twice we achieve the objective of having a stochastic variable that changes according to whether $t<\tau_1$, $\tau_1\leq t < \tau_2$ or $t \geq \tau_2$.
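The nested-switch logic is easy to see in plain numpy, where $\small{\texttt{np.where}}$ plays the role of Theano's switch (the values of tau and mu below are toy numbers of my own choosing):

```python
import numpy as np

N = 10
t = np.arange(N)
tau1, tau2 = 3, 7
mu1, mu2, mu3 = 1000.0, 1100.0, 800.0

mu_ = np.where(tau1 > t, mu1, mu2)   # mu1 before tau1, mu2 from tau1 onwards
mu = np.where(tau2 > t, mu_, mu3)    # keep mu_ before tau2, mu3 from tau2 onwards
# mu is [1000]*3 + [1100]*4 + [800]*3
```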

Finally we can have a look at the result by plotting the trace:

pm.traceplot(trace[500:])

(You will see more about the $\small{\texttt{traceplot}}$ function in the next sections. And if you want to see what the plots for this example look like, you can skip immediately to Section 2.4.)

In the last part of this blog post I am going to list a series of problems that are solved using the code $\small{\texttt{changepoint_bayesian.py}}$, which is written in Python 3.5 and can be downloaded below,

This code is more general (but also more obscure) than the example given above. Using $\small{\texttt{changepoint_bayesian.py}}$ I will present the solution to a series of problems that range from the single-change-point detection case that was discussed in the analytic solution above (Section 1), up to a three-change-points case. The code can easily be generalized to more change points, it is in fact pretty much ready for it.

2.2 Single changepoint in the mean and comparison with theory

In this problem there is a changepoint at the position $\tau=2500$, where the mean value changes from $\mu_1=1000$ to $\mu_2=1020$, the standard deviation being the same at $\sigma=50$,$$d_i = \left\{
\begin{array}{ll}
\mathcal{N}(1000,50) & \mathrm{for} \ \ t<2500 \\
\mathcal{N}(1020,50) & \mathrm{for} \ \ t \geq 2500
\end{array}
\right.$$

To load this problem and see what the data look like, after downloading $\small{\texttt{changepoint_bayesian.py}}$ you can run the following lines in IPython:

What is the trace plot telling us? We should first look at the right column. These are the results of the MCMC iterations. The first iterations have been filtered out; that is where you would see that the algorithm had not yet converged. In all the plots on the right column, instead, you can see that the MCMC solution does not change much, i.e. we can confidently say that the algorithm has converged to its stationary solution (the posterior distribution). This means that the algorithm is now randomly sampling from the posterior distribution.

Now, what about the left column? These are simply the histograms of the data in the right column. If, as discussed, we believe the data on the right are all representative of the posterior distribution (i.e., the MCMC algorithm has converged) then these histograms are the answers to our problems: the marginals of the posterior distribution.

With reference to the figure above, here is how we interpret the data:

the top left plot is the marginal $\mathbb{P}(\tau|\textbf{d})$

the plot just below is the marginal $\mathbb{P}(\mu_1|\textbf{d})$

the plot just below is the marginal $\mathbb{P}(\mu_2|\textbf{d})$

the bottom left plot is the marginal $\mathbb{P}(\sigma|\textbf{d})$

When I first saw the top left plot, I was a bit skeptical about the result. It is not at all a smooth solution, and I was expecting to see something “nicer”. And this is exactly why it is important to have analytical solutions to compare against!

Once we do that, once we compare the result for $\mathbb{P}(\tau|\textbf{d})$ as given by $\small{\texttt{pymc3}}$ against the analytical result given in eq. (\ref{eq:bayesian_final}), this is what we get

In [6]: d.plot_theory()

Wow, the simulation results are pretty close to the theory! And as you can see, the MCMC solution is able to reproduce non-trivial shapes of the probability distribution function. Way to go $\small{\texttt{pymc3}}$!

The resulting trace plot is given below. As before, the posterior estimates are close to the true parameters.

2.5 Three changepoints in the mean

This is our final problem. Here we are going to have not one, not two but three change points! The change point positions are $\tau_1=1000$, $\tau_2=2000$ and $\tau_3=2500$. The mean value changes from $\mu_1=1000$ to $\mu_2=1100$, then to $\mu_3=800$ and finally to $\mu_4=1020$. The standard deviation is kept constant at the value $\sigma=30$.

Once again, the trace plot shows that the MCMC algorithm has eventually converged. The Bayesian estimate for the model parameters are, once again, close to the true value. Good job $\small{\texttt{pymc3}}$!

]]>
<p>I have <a href="/2016/11/15/changepoint-frequentist">recently discussed</a> the problem of changepoint detection from a frequentist point of view. In that framework, changepoints were inferred using a maximum likelihood estimation (MLE) approach. This gave us point estimates for the positions of the changepoints.</p>
<p>In this post I will present the solution to the same problem from a Bayesian perspective, using a mix of both theory and practice (using the <a href="https://pymc-devs.github.io/pymc3/notebooks/getting_started.html" target="_blank" rel="noopener">$\small{\texttt{pymc3}}$</a> package). The frequentist and Bayesian approaches give actually very similar results, as the maximum <em>a posteriori</em> (MAP) value, which maximises the posterior distribution, coincides with the MLE for uniform priors. In general, despite the added complexity in the algorithm, the Bayesian results are rather intuitive to interpret.<br>
Changepoint Detection. Part I - A Frequentist Approach
http://www.claudiobellei.com/2016/11/15/changepoint-frequentist/
2016-11-15T11:41:56.000Z / 2019-03-03T09:42:59.917Z

Changepoint Detection (CPD) refers to the problem of estimating the time at which the statistical properties of a time series… well… change. It originates in the 1950s, as a method used to automatically detect failures in industrial processes (quality control) and it is currently an active area of research that can boast of having a website on its own.

Whatever the application, the general framework is always the same: the underlying probability distribution function of the time series is assumed to change at one (or more) moments in time.

Figure 1 shows two examples of how this might work. Suppose we follow the history of visits on a website for 1000 days. In (a), we clearly see that at the 500th day something happens (maybe Google started ranking the website at the top of its search results). After that moment, the number of visits per day is clearly different. However, things are usually not as good as in this example… we may in fact be in the situation (b) of the figure. If I didn’t tell you that there is a changepoint at the 500th day, would you be able to find it?

The cases above are examples in which the changepoint occurs because of a change in the mean. In other situations, a changepoint can occur because of a change in the variance, as in the figure below.

The idea of a CPD algorithm is to be able to automatically detect the positions of the most likely changepoints and, ideally, determine whether there is statistical evidence for claiming them to be “real”.

I will tackle this problem from a frequentist point of view, with the plan of talking about a Bayesian approach in a future post.

As in all hypothesis testing problems, there is a potential issue of multiple testing. I have already talked about this issue. To avoid it, I will make the important assumption that the analysis involves a fixed sample, so that we are going to perform one single hypothesis test. In other words, the analysis is retrospective, as opposed to online (sequential), in which a stream of data is continuously analyzed.

1. A frequentist solution.

In searching for an R package on CPD, I found $\texttt{changepoint}$ to be very well written and documented. There is a good paper in the Journal of Statistical Software that describes the package in more detail and can be freely accessed here.

In this post I will give my own interpretation on how to approach the problem, although mainly following the arguments and notation of the paper above and of this review.

First of all, for simplicity I will assume that there is a single changepoint during the whole time period, defined at the discrete times $t=1,2,3,\dots,N$. This case already captures the essence of CPD.

1.1 Model for a single changepoint

Let’s define $\tau$ as the changepoint time that we want to test. Each data point in the time series is assumed to be drawn from some probability distribution function (for example, it could be a binomial or a normal distribution). In this sense, the time series can be considered a realization of a stochastic process. The probability distribution function is assumed to be fully described by the quantity $\theta$ (in general, a set of parameters). For example, $\theta$ could be the probability $p$ of success in a binomial distribution, or the mean and variance in a normal distribution.

At this point the test goes like this: the null hypothesis is that there is no changepoint, while the alternative hypothesis assumes that there is a changepoint at the time $t=\tau$. More formally, here is our hypothesis test:$$\begin{eqnarray}
H_0 &:& \theta_1=\theta_2=\dots=\theta_{N-1}=\theta_N \nonumber \\
H_1 &:& \theta_1=\theta_2=\dots=\theta_{\tau-1}=\theta_\tau\neq\theta_{\tau+1}=\theta_{\tau+2}=\dots=\theta_{N-1}=\theta_N \nonumber
\end{eqnarray}$$

The key in the expression $H_1$ above is in the inequality $\theta_\tau\neq\theta_{\tau+1}$: at some point in the time series, and precisely between $t=\tau$ and $t=\tau+1$, the underlying distribution changes.

The algorithm for detection is based on the log-likelihood ratio. Let’s first define the likelihood. The likelihood is nothing else than the probability of observing the data that we have (in the time series), assuming that the null (or the alternative) hypothesis is true. It is a measure of how good the hypothesis is: the higher the likelihood, the better the data are fit by the $H_0$ (or $H_1$) assumption.

Assuming independent random variables, under the null hypothesis $H_0$ the likelihood $\mathcal{L}(H_0)$ is given by the probability of observing the data $\mathbf{x}=x_1,\dots,x_N$ conditional on $H_0$. In other words,$$\begin{equation}
\mathcal{L}(H_0)=p(\mathbf{x}|H_0)=\prod_{i=1}^{N}p(x_i|\theta_0)
\label{eq:L1}
\end{equation}$$

Let us define the likelihood of the alternative hypothesis,$$\begin{equation}
\mathcal{L}(H_1)=p(\mathbf{x}|H_1)=\prod_{i=1}^{\tau}p(x_i|\theta_1)\prod_{j=\tau+1}^{N}p(x_j|\theta_2)
\label{eq:L2}
\end{equation}$$

The log-likelihood ratio for a candidate changepoint $\tau$ is $\mathcal{R}_\tau=\log\left[\mathcal{L}(H_1)/\mathcal{L}(H_0)\right]$. Since $\tau$ is not known, this ratio is a function of $\tau$. We can thus define a generalized log-likelihood ratio $G$, which is the maximum of $\mathcal{R}_\tau$ over all the possible values of $\tau$,$$\begin{equation}
G = \max_{1\leq\tau\leq N}\mathcal{R}_\tau
\label{eq_likelihood}
\end{equation}$$

If the null hypothesis is rejected, then the maximum likelihood estimate of the changepoint is the value $\hat{\tau}$ that maximizes the generalized likelihood ratio,$$\begin{equation}
\hat{\tau} = \underset{1\leq\tau\leq N}{\mathrm{argmax}} \ \mathcal{R}_\tau
\end{equation}$$

In general, the null hypothesis is rejected for a sufficiently large value of $G$. In other words, there is a critical value $\lambda^*$ such that $H_0$ is rejected if$$\begin{equation}
\bbox[lightblue,5px,border:2px solid red]
{
2G=2\mathcal{R}_{\hat{\tau}}>\lambda^*
}
\label{eq:criterion}
\end{equation}$$

The factor 2 is retained to be consistent with the $\texttt{changepoint}$ package.

The problem is how to define this critical value $\lambda^*$. The package $\texttt{changepoint}$ has several ways of defining the “penalty” factor $\lambda^*$. Going through the R code, I managed to find the definition of a few of them,

where $k$ is the number of extra parameters that are added as a result of defining a changepoint. For example, it is $k=1$ if there is just a shift in the mean or a shift in the variance.
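For reference, the standard information-criterion penalties (my summary of the usual forms, not an excerpt from the package source) are
$$\lambda^*_{\mathrm{AIC}} = 2k\ , \qquad \lambda^*_{\mathrm{BIC}} = k\log N\ , \qquad \lambda^*_{\mathrm{Hannan\text{-}Quinn}} = 2k\log\log N\ ,$$
where $N$ is the number of data points.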

At this point we are ready to look into a few more specific examples.

1.2 Normally distributed random variables - change in mean

Assume the variables that compose the time series are drawn from independent normal distributions. We want to test the hypothesis that there is a change in the mean of the distribution at some discrete point in time $\tau$, while we assume that the variance $\sigma^2$ does not change. The probability density function is$$\begin{equation}
f(x|\mu,\sigma)=\frac{1}{\sqrt{2\pi\sigma^2}}\exp\left[-\frac{(x-\mu)^2}{2\sigma^2}\right]
\nonumber
\end{equation}$$

Okay, nothing really new so far. Equations (\ref{eq:L3}) and (\ref{eq:L4}) are in fact just specializations of eqs. (\ref{eq:L1}) and (\ref{eq:L2}) to the case of normally distributed random variables.

Now it is time to write the log-likelihood ratio, as in eq. (\ref{eq:LRatio}), obtaining$$\begin{equation}
\bbox[white,5px,border:2px solid red]{
\mathcal{R}_\tau=\log\left(\frac{\mathcal{L}_{H_1}}{\mathcal{L}_{H_0}}\right)=-\frac{1}{2\sigma^2}\left[\sum_{i=1}^{\tau}(x_i-\mu_1)^2+\sum_{j=\tau+1}^{N}(x_j-\mu_2)^2-\sum_{k=1}^{N}(x_k-\mu_0)^2\right]
}
\label{eq:LR1}
\end{equation}$$

after which we should calculate $G$ and then apply the criterion (\ref{eq:criterion}).
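Putting eq. (\ref{eq:LR1}) and the criterion together, a brute-force scan over all candidate $\tau$ might look like this (a sketch of my own, with MLE plug-ins for the means and a pooled variance estimate; the function name is hypothetical):

```python
import numpy as np

def mean_change_scan(x):
    """Return (tau_hat, 2G) for a single change in mean, scanning all tau."""
    N = len(x)
    sigma2 = np.var(x)                      # pooled variance estimate
    mu0 = np.mean(x)
    best_tau, best_R = 1, -np.inf
    for tau in range(1, N):                 # split between x[tau-1] and x[tau]
        mu1, mu2 = np.mean(x[:tau]), np.mean(x[tau:])
        R = -(np.sum((x[:tau] - mu1)**2)
              + np.sum((x[tau:] - mu2)**2)
              - np.sum((x - mu0)**2)) / (2 * sigma2)
        if R > best_R:
            best_tau, best_R = tau, R
    return best_tau, 2 * best_R

rng = np.random.default_rng(0)
x = np.concatenate([rng.normal(0, 1, 100), rng.normal(3, 1, 100)])
tau_hat, stat = mean_change_scan(x)         # tau_hat should land close to 100
```

One would then compare $2G$ against the chosen penalty $\lambda^*$ to accept or reject the changepoint.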

1.3 Normally distributed random variables - change in variance

If there is a change in variance, rather than in mean, the analysis proceeds much as in the previous section. Let us call $\mu_0$ and $\sigma_0^2$ the global mean and variance, respectively. Under the null hypothesis, the likelihood is

The log-likelihood ratio is somewhat more complicated than before,$$\begin{eqnarray}
\mathcal{R}_\tau=\log\left(\frac{\mathcal{L}_{H_1}}{\mathcal{L}_{H_0}}\right)= & N\log\sigma_0-\tau\log\sigma_1-(N-\tau)\log\sigma_2 \nonumber \\ & + \sum_{k=1}^{N}\frac{(x_k-\mu_0)^2}{2\sigma_0^2}-\sum_{i=1}^{\tau}\frac{(x_i-\mu_0)^2}{2\sigma_1^2}-\sum_{j=\tau+1}^{N}\frac{(x_j-\mu_0)^2}{2\sigma_2^2}
\label{eq:Rvar1}
\end{eqnarray}$$However, noting that $\sum_{k=1}^N(x_k-\mu_0)^2=N\sigma_0^2$, $\sum_{i=1}^\tau(x_i-\mu_0)^2=\tau\sigma_1^2$ and $\sum_{j=\tau+1}^N(x_j-\mu_0)^2=(N-\tau)\sigma_2^2$ the three last terms in eq. (\ref{eq:Rvar1}) cancel out. The final expression for $\mathcal{R}_\tau$ is then$$\begin{equation}
\mathcal{R}_\tau = N\log\sigma_0-\tau\log\sigma_1-(N-\tau)\log\sigma_2
\nonumber
\end{equation}$$
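In code, the variance-change log-likelihood ratio (with the maximum-likelihood standard deviations plugged in; the function name is my own) reads:

```python
import numpy as np

def variance_change_R(x, tau):
    """Log-likelihood ratio for a change in variance at tau, mean fixed at mu0."""
    N = len(x)
    mu0 = x.mean()
    s0 = np.sqrt(np.mean((x - mu0)**2))
    s1 = np.sqrt(np.mean((x[:tau] - mu0)**2))
    s2 = np.sqrt(np.mean((x[tau:] - mu0)**2))
    return N * np.log(s0) - tau * np.log(s1) - (N - tau) * np.log(s2)

# illustrative data: the standard deviation jumps from 1 to 5 at position 100
rng = np.random.default_rng(42)
x = np.concatenate([rng.normal(0, 1, 100), rng.normal(0, 5, 100)])
```

Evaluated on such data, the ratio is much larger at the true split than at a wrong one.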

1.4 Bernoulli distributed random variables - change in mean

The Bernoulli distribution is possibly the easiest distribution of all. It models binary events that have probability $p$ of happening and probability $1-p$ of not happening. This is why it is usually introduced in probability theory with the “coin flipping” example.
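Since the MLE for $p$ on each segment is just the sample mean, the Bernoulli log-likelihood ratio is easy to write down (my own sketch; segments with $\hat{p}\in\{0,1\}$ contribute zero log-likelihood):

```python
import numpy as np

def bernoulli_R(x, tau):
    """Log-likelihood ratio for a change in Bernoulli success probability at tau."""
    def max_ll(seg):
        p = seg.mean()
        if p in (0.0, 1.0):                # degenerate segment: log-likelihood is 0
            return 0.0
        return len(seg) * (p * np.log(p) + (1 - p) * np.log(1 - p))
    return max_ll(x[:tau]) + max_ll(x[tau:]) - max_ll(x)

# a clean shift from p=0 to p=1 at tau=20
x = np.array([0] * 20 + [1] * 20)
R = bernoulli_R(x, 20)                     # equals 40*log(2)
```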

1.5 Poisson distributed random variables - change in mean

Here we assume that the random variables follow a Poisson distribution,$$\begin{equation}
f(x|\lambda) = \frac{\lambda^x \textrm{e}^{-\lambda}}{x!}, \ \ x\in \mathbb{N}
\nonumber
\end{equation}$$where $x$ represents the number of events during a pre-defined time interval and $\lambda$ the expected number of events during the same time interval.

2. Applications

We are finally in a position to test the criteria of above for different potential applications. Below I am listing a series of problems with the “solution” underneath. The solution is based on a code that I have written in Python that you can find here,

The code is not optimized, the purpose of it is to show that the methods described above work well at finding single changepoints in time series. The criterion for rejecting (or not) the null is the Bayes Information Criterion (BIC). Feel free to use other criteria.

2.1 Problem 1. Normal distribution: change in mean.

Problem definition: A website receives a certain amount of visits per day. Lately, the marketing team has worked on improving the position of the website on Google searches (Search Engine Optimization). We want to test if this has resulted in an increase in visits on the website.

2.1.1 Parameters of problem 1

Note: these parameters are completely made up. You can change the parameters in the code as you wish, also making the test fail (not rejecting the null),

2.2 Problem 2. Normal distribution: change in variance.

Problem definition: Many economists and investors are interested in understanding why financial markets can show abrupt changes in volatility. Assume that such a change in volatility happens in the dataset we are provided. We need to detect when this change happens in the dataset in order to improve our investment model.

2.2.1 Parameters of problem 2

Note: these parameters are completely made up. You can change the parameters in the code as you wish, also making the test fail (not rejecting the null),

2.3 Problem 3. Bernoulli distribution: change in mean.

Problem definition: An industrial chain delivers products that need to be manually inspected. Under normal circumstances, about 10% of the products are found with small imperfections that need to be taken care of. Recently, a potential bug in the software has been found which may have caused an increase in the inspection failure rate. We want to detect when the bug was introduced and whether the failure rate has increased because of it.

2.3.1 Parameters of problem 3

Note: these parameters are completely made up. You can change the parameters in the code as you wish, also making the test fail (not rejecting the null),

2.4 Problem 4. Poisson distribution: change in mean.

Problem definition: Under normal circumstances, a network system has a load of a few tens of hits per minute. Check whether there is a sudden change in the number of hits per minute in the data provided.

2.4.1 Parameters of problem 4

Note: these parameters are completely made up. You can change the parameters in the code as you wish, also making the test fail (not rejecting the null),

Probability density distribution: $p(k)= \lambda^k\mathrm{e}^{-\lambda}/k!$ where $k$ is the observed number of hits per minute and $\lambda$ is the expected number of hits per minute.

]]>
<p>Changepoint Detection (CPD) refers to the problem of estimating the time at which the statistical properties of a time series… well… change. It originates in the 1950s, as a method used to automatically detect failures in industrial processes (quality control) and it is currently an active area of research that can boast of having <a href="http://www.changepoint.info/" target="_blank" rel="noopener">a website on its own</a>.</p>
Python implementation of Crank-Nicolson scheme
http://www.claudiobellei.com/2016/11/10/crank-nicolson/
2016-11-10T15:55:23.000Z / 2017-07-12T20:25:23.000Z

Since at this point we know everything about the Crank-Nicolson scheme, it is time to get our hands dirty. In this post, the third on the series on how to numerically solve 1D parabolic partial differential equations, I want to show a Python implementation of a Crank-Nicolson scheme for solving a heat diffusion problem.

While Python is certainly not the best choice for scientific computing, in terms of performance and optimization, it is a good language for rapid prototyping and scripting (and in many cases even for complex production-level code).

For a problem of this type, Python is more than sufficient at doing the job. For more complicated problems involving multiple dimensions, more coupled equations and many extra terms, other languages are typically preferred (Fortran, C, C++,…), often with the inclusion of parallel programming using the Message Passing Interface (MPI) paradigm.

The problem that I am going to present is the one proposed in the excellent book “Numerical Solution of Partial Differential Equations” by G.D. Smith (Oxford University Press),

Before I turn to the numerical implementation of a Crank-Nicolson scheme to solve this problem, let’s see what the solution looks like in the video below.


As you can see, the maximum of the function $u$ occurs at $t=0$, after which $u$ keeps decreasing. This behaviour is in line with the maximum principle for parabolic equations, which essentially states that the solution of a parabolic equation takes its maximum on the boundary (meaning either the initial time $t=0$ or the spatial boundaries).

The reason for this can be made intuitive by comparing to the case of a metallic one-dimensional rod whose ends are kept at some fixed temperature, and with a temperature distribution that is linear and maximum at the centre, as in the initial condition above. As time progresses, the two “heat sources” (or sinks) at the ends are kept at constant low temperature. The diffusion of heat results in the rod becoming colder and colder until its temperature becomes equal to the temperature at the boundaries.

Let’s now talk about the numerical solution of the problem above. As already discussed, the numerical solution requires solving the following matrix equation
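To make the discussion concrete, here is a minimal, self-contained sketch of what such a Crank-Nicolson time-stepper can look like. This is my own illustration, not the post’s actual code: it uses dense numpy matrices for brevity, and all function and variable names are mine.

```python
import numpy as np

def crank_nicolson(u0, D, dx, dt, nsteps, bc=0.0):
    """Advance the 1D diffusion equation u_t = D*u_xx with the
    Crank-Nicolson scheme, Dirichlet boundaries held at `bc`."""
    r = D * dt / dx**2
    n = len(u0) - 2                       # number of inner grid points
    off = np.eye(n, k=1) + np.eye(n, k=-1)
    A = (1 + r) * np.eye(n) - (r / 2) * off   # implicit (j+1) side
    B = (1 - r) * np.eye(n) + (r / 2) * off   # explicit (j) side
    u = u0.copy()
    for _ in range(nsteps):
        b = B @ u[1:-1]
        # boundary contributions from both time levels (constant bc)
        b[0] += r * bc
        b[-1] += r * bc
        u[1:-1] = np.linalg.solve(A, b)
        u[0] = u[-1] = bc
    return u
```

A production version would exploit the tridiagonal structure of $A$ (e.g. via the Thomas algorithm) rather than calling a dense `np.linalg.solve` at every step.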

Implicit solution of 1D parabolic PDE (Crank-Nicolson scheme)
http://www.claudiobellei.com/2016/11/01/implicit-parabolic/ (published 2016-11-01)

This post is the second one of the series on how to numerically solve 1D parabolic partial differential equations (PDEs). In my previous post I discussed the explicit solution of 1D parabolic PDEs and I also briefly motivated why it is interesting to study this type of problem.

We have seen how an explicit scheme of solution may require the time step to be constrained to unacceptably low values, in order to guarantee stability of the algorithm.

In this post we will see a different scheme of solution that does not suffer from the same constraint. So let’s go back to the basic equation that we want to solve numerically,\begin{equation}\frac{\partial u}{\partial t} = D\frac{\partial^2 u}{\partial x^2} \ + \mathrm{I.B.C.}\label{eq:diffusion}\end{equation}

In an explicit scheme of solution we use forward-differencing for the time derivative $\partial u/\partial t$, while the second derivative $\partial^2 u/\partial x^2$ is evaluated at the time step $j$.

In an implicit scheme of solution, we use a different way of differencing eq. (\ref{eq:diffusion}). We still use forward-differencing for the time derivative, but the spatial derivative is now evaluated at the time step $j+1/2$, instead of the time $j$ as before, by taking an average between the time step $j$ and the time step $j+1$,

Note how the terms inside the red boxes are all unknown, as they are the values at the next time step (the values at the timestep $j$ are instead assumed to be known).

Intuitively, this seems a more precise way of differencing than the explicit scheme, since we can now argue that the derivatives on both the left- and right-hand sides are evaluated at the time $j+1/2$. The scheme of eq. (\ref{eq:CN}) is called Crank-Nicolson, after the two mathematicians who proposed it. It is a popular way of solving parabolic equations and was published shortly after WWII. The Crank-Nicolson scheme has the big advantage of being a stable algorithm, as opposed to the explicit scheme that we have already seen.

The disadvantage is that eq. (\ref{eq:CN}) is computationally more difficult to solve, as there are now three unknowns in the equation instead of just one as in the explicit scheme. As the figure below shows, the solution $u(i,j+1)$ depends on quantities already known from the previous iteration, just as in the explicit scheme: $u(i-1,j)$, $u(i,j)$ and $u(i+1,j)$. However, it also depends on the neighbouring values at the current iteration: $u(i-1,j+1)$ and $u(i+1,j+1)$.

Fig 3. Stencil (domain of dependence) of the solution $u(i,j+1)$ for the Crank-Nicolson scheme.

How can we find all the values of the solution $u$ at the time step $j+1$? Since it is not possible to calculate $u(i,j+1)$ directly (explicitly) but only indirectly (implicitly), the Crank-Nicolson scheme belongs to the family of schemes named implicit. The trick is to solve for all the equations $i=2\dots N-1$ (the values at $i=1$ and $i=N$ are known via the boundary conditions) at once. These equations define a linear system, with $N-2$ unknowns in $N-2$ equations.

After rearranging the unknown values to the left and the known values to the right, equation (\ref{eq:CN}) looks like

For each inner grid point $i=2,3,\dots N-1$ (as noted, the points at $i=1$ and $i=N$ are supposed to be known from the boundary conditions) there will be an equation of the same form as eq. (\ref{eq:CN1}). The whole set of equations can be rewritten in matrix form as

Now comes the difficult part: evaluating the stability of the Crank-Nicolson scheme. It can be shown that the matrix $B’:=A^{-1}B$ is symmetric (and real), therefore its 2-norm $||\cdot||_2$ is equivalent to its spectral radius $r(B’)=\max_k |\lambda_k|$, where $\lambda_k$ are the eigenvalues of $B’$,

The result $||B'||_2<1 \ \forall \ r>0$ ensures that the Crank-Nicolson scheme is unconditionally stable, in the sense that the stability condition $||B'||_2<1$ holds regardless of the choice of grid size, time step and diffusion coefficient (the parameters that enter the definition of $r=D\Delta t/\Delta x^2$).
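This statement is easy to verify numerically. The sketch below (my own check, not part of the post’s code) builds the two Crank-Nicolson matrices for $n$ inner grid points and confirms that the spectral radius of $B' := A^{-1}B$ stays below 1 even for very large $r$:

```python
import numpy as np

def cn_spectral_radius(n=50, r=10.0):
    """Spectral radius of B' = A^(-1) B for the Crank-Nicolson scheme
    on n inner grid points, with r = D*dt/dx^2."""
    off = np.eye(n, k=1) + np.eye(n, k=-1)
    A = (1 + r) * np.eye(n) - (r / 2) * off
    B = (1 - r) * np.eye(n) + (r / 2) * off
    Bp = np.linalg.solve(A, B)              # A^{-1} B
    return np.abs(np.linalg.eigvals(Bp)).max()

for r in (0.1, 1.0, 10.0, 100.0):
    print(r, cn_spectral_radius(r=r))       # always below 1
```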

As a final note, it is important to point out that the Crank-Nicolson scheme has a truncation error of $O(\Delta t^2,\Delta x^2)$ as opposed to the explicit scheme that is $O(\Delta t,\Delta x^2)$.

In the next post of this series, I will be looking at a practical implementation of the Crank-Nicolson scheme using Python. That’s it for now, time to look at some computer-generated art…

⁂

Today’s Mondrian random generator

The multiple hypothesis testing problem
http://www.claudiobellei.com/2016/10/30/multiple-testing-part1/ (published 2016-10-30)

I must admit that I only learnt about the “multiple testing” problem in statistical inference when I started reading about A/B testing. In many ways I knew about it already, since the essence of it can be captured by a basic example in probability theory: suppose a particular event has a chance of 1% of happening. Now, if we make N attempts what is the probability that this event will have happened at least once among the N attempts?

As we will see more in detail below, the answer is $1-0.99^N$. For $N=5$, the event has already almost a $5\%$ chance of happening at least once in five attempts.
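This arithmetic is easy to check directly, for instance:

```python
# P(at least one occurrence in N attempts) = 1 - P(no occurrences) = 1 - 0.99**N
for N in (1, 5, 10, 50):
    print(N, 1 - 0.99**N)   # N=5 gives ~0.049, i.e. almost 5%
```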

Fine… so what’s the problem with this? Well, here is why it can complicate things in statistical inference: suppose this 1% event is “rejecting the null hypothesis $H_0$ when the null is actually true”, in other words, committing a Type-I error in hypothesis testing. Continuing with the parallel, “making N attempts” would mean performing N hypothesis tests. That’s the problem: if we are not careful, performing multiple hypothesis tests can lead to dangerously underestimating the Type-I error. Things are not that funny anymore, right?

The problem becomes particularly important now that streaming data are becoming the norm. In this case it may be very tempting to continue collecting data and perform test after test… until we reach statistical significance. Uh… that’s exactly when the analysis becomes biased and things go bad.

The problem of bias in hypothesis testing is much more general than what is in the example above. In particle physics searches, the need for reducing bias goes as far as performing blind analysis, in which the data are scrambled and altered until a final statistical analysis is performed.

Let’s now go back to our “multiple testing” problem and be a bit more precise about the implications. Suppose we formulate a hypothesis test by defining a null hypothesis $H_0$ and alternative hypothesis $H_1$. We then set a type-I error $\alpha$, which means that if the null hypothesis $H_0$ were true, we would incorrectly reject the null with probability $\alpha$.

In general, given $n$ tests the probability of rejecting the null in any of the tests can be written as$$\begin{equation}
P(\mathrm{rejecting\ the\ null\ in \ any \ of \ the \ tests})=P(r_1\lor r_2\lor\dots\lor r_n)
\label{eq:prob}
\end{equation}$$in which $r_j$ denotes the event “the null is rejected at the j-th test”.

While it is difficult to evaluate eq. (\ref{eq:prob}) in general, the expression greatly simplifies for independent tests as it will be clear in the next section.

where $ r_j^* $ denotes the event “the null is NOT rejected at the j-th test”.

What is the consequence of all this? Let’s give an example.

Suppose that we do a test where we fix the type-I error $\alpha=5\%$. By definition, if we do one test only we will reject the null 5% of the times if the null is actually true (making an error…). What if we make 2 tests? What are the chances of committing a type-I error then? The error will be

If multiple testing is unavoidable (for example, because we are testing multiple hypotheses in a single experiment with multiple groups), then we can correct the type-I error into an effective type-I error $\alpha_\mathrm{eff}$. In order to recover the original type-I error (thereby “controlling” the multiple testing), we must ensure that

Note that in order to arrive at eq. (\ref{eq:alpha_eff}), we have assumed $\alpha\ll 1$ and used the Taylor expansion $$(1-x)^m\approx 1-mx$$

What we have just shown is that, for independent tests, we can take into account the multiple hypothesis testing by correcting the type-I error by a factor $1/n$ such that$$\bbox[lightblue,5px,border:2px solid red]{\alpha_\mathrm{eff} = \frac{\alpha}{n} }$$

This is what goes under the name of Bonferroni correction, from the name of the Italian mathematician Carlo Emilio Bonferroni.

2. Practical application

In this section we are going to do several t-tests on independent sets of data with the null hypothesis being true. We will see that eq. (\ref{eq:typeI_independent}) fits well the real type-I error.

By our definition, the null $H_0$ is true as we are going to set $\mu_A=\mu_B=0.5$ (hence $\Delta \mu=0$). The figure below shows the probability of committing a type-I error as a function of the number of independent t-tests, assuming $\alpha=0.05$. Without correction, the Monte Carlo results are well fitted by eq. (\ref{eq:typeI_independent}) and show a rapid increase of the type-I error rate. Applying the Bonferroni correction does succeed in controlling the error at the nominal 5%.

Below is the Python code that I used to produce the figure above (code that you can also download). The parameter nsims that I used for the figure was 5000, but I had my machine running for a couple of hours so I decided to use 1000 as default. Give it a try, if you are curious!
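The downloadable script itself is not reproduced in this feed; below is a condensed reconstruction of the experiment. This is my own sketch, not the original code: it uses a large-sample z-test instead of a t-test to avoid external dependencies, and a small default number of simulations.

```python
import numpy as np
from statistics import NormalDist

def type1_rate(ntests, alpha=0.05, nsims=1000, nsamples=100, seed=0):
    """Fraction of simulations in which at least one of `ntests` independent
    two-sample tests rejects a true null (mu_A = mu_B = 0.5)."""
    zcrit = NormalDist().inv_cdf(1 - alpha / 2)   # two-sided critical value
    rng = np.random.default_rng(seed)
    hits = 0
    for _ in range(nsims):
        for _ in range(ntests):
            a = rng.normal(0.5, 1.0, nsamples)
            b = rng.normal(0.5, 1.0, nsamples)
            z = (a.mean() - b.mean()) / np.sqrt(
                a.var(ddof=1) / nsamples + b.var(ddof=1) / nsamples)
            if abs(z) > zcrit:                    # type-I error committed
                hits += 1
                break
    return hits / nsims

print(type1_rate(1))                 # ~0.05, the nominal alpha
print(type1_rate(5))                 # ~1 - 0.95**5, roughly 0.23
print(type1_rate(5, alpha=0.05/5))   # Bonferroni-corrected: back to ~0.05
```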

Explicit solution of 1D parabolic PDE
http://www.claudiobellei.com/2016/10/15/explicit-parabolic/ (published 2016-10-15)

This article started as an excuse to present a Python code that solves a one-dimensional diffusion equation using finite difference methods. I then realized that it did not make much sense to talk about this problem without giving more context, so I finally opted for writing a longer article. I have divided it into three posts, of which this is the first one.

The first two blog posts (including this one) are dedicated to some basic theory on how to numerically solve parabolic partial differential equations (PDEs). The third blog post will be dedicated to showing a short Python code that solves one particular parabolic problem.

Below is a list of the topics of this blog post:

a short introduction to parabolic PDEs

the basics for solving PDEs: finite difference methods

the simplest algorithm of solution: explicit solution

stability constraints (or, when does the explicit solution fail?)

1. Parabolic PDEs

Parabolic partial differential equations model important physical phenomena such as heat conduction (described by the heat equation) and diffusion (for example, Fick’s law). Under an appropriate transformation of variables the Black-Scholes equation can also be cast as a diffusion equation. I might actually dedicate a full post in the future to the numerical solution of the Black-Scholes equation, that may be a good idea…

where $D$ is a diffusion/heat coefficient (for simplicity, assumed to be independent of the time $t$ and space $x$ variables) and I.B.C. stands for “initial and boundary conditions”. The unknown $u$ represents a generic quantity of interest, which depends on the specific problem (could be temperature, concentration, price of an option,…).

The boundary conditions specify the constraints that the solution has to fulfill at the boundaries of the domain of interest. For example, let’s assume we have to solve a problem of heat conduction along a metallic rod as in the figure below. We keep the two sides of the rod at a fixed temperature $T^*$ and the initial temperature profile inside the domain is given by a function $T_0(x)$, which in general is dependent on the coordinate $x$.

This problem can be well approximated by a 1D model of heat conduction (as we assume that the length of the rod is much larger than the dimensions of its section). For simplicity, let’s assume $D=1$ in eq. (\ref{eq:diffusion}). The heat diffusion problem requires then to find a function $T(x,t)$ that satisfies the following equations

where this time I have explicitly written the initial and boundary conditions. This is now a well defined problem that can be solved numerically.

The numerical solution involves writing approximate expressions for the derivatives above, via a method called finite differences, which I describe briefly below.

2. Finite differences

Derivatives, such as the quantities $\partial u/\partial t$ and $\partial^2 u/\partial x^2$ that appear in eq. ($\ref{eq:diffusion}$), are a mathematical abstraction that can only be approximated when using numerical techniques.

Going back to the very definition of the derivative of a one-dimensional function $f(x)$, the derivative $f’(x)$ of this function at the point $x$ is defined as

Now, while a CPU can easily calculate the differences $\Delta f$ and $\Delta x$, it is a different story to calculate the limit $\Delta x\rightarrow 0$: that is not a basic arithmetic operation. So, how do we calculate the derivative? The trick is to calculate the ratio $\Delta f/\Delta x$ for a “sufficiently small” value of $\Delta x$. We can in fact approximate the derivative in different ways, making use of Taylor’s theorem

Higher order (or mixed) derivatives can be calculated in a similar fashion. The important point is that the approximations (\ref{eq:fd2},\ref{eq:fd3},\ref{eq:fd4},\ref{eq:fd5}) form the building blocks for the numerical solution of ordinary and partial differential equations.
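A quick numerical illustration of these building blocks (a generic sketch of my own, not from the post): differentiating $f(x)=\sin x$ at $x=1$ shows the forward difference converging at first order in $\Delta x$ and the central difference at second order.

```python
import numpy as np

def fd_errors(dx, x=1.0):
    """Absolute errors of the forward and central difference
    approximations to f'(x) for f = sin (exact derivative: cos)."""
    exact = np.cos(x)
    fwd = (np.sin(x + dx) - np.sin(x)) / dx               # forward, O(dx)
    cen = (np.sin(x + dx) - np.sin(x - dx)) / (2 * dx)    # central, O(dx^2)
    return abs(fwd - exact), abs(cen - exact)

# shrinking dx by 10x cuts the forward error ~10x, the central error ~100x
for dx in (1e-1, 1e-2, 1e-3):
    print(dx, fd_errors(dx))
```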

In the next section I will discuss a first algorithm that uses these approximations to solve eq. ($\ref{eq:diffusion}$) and goes under the name of explicit solution. It is a fast and intuitive algorithm, that however has the drawback of only working under some well defined conditions, otherwise the solution is unstable (i.e., the numerical solution will get NaN values pretty quickly…).

3. Explicit solution to the diffusion equation

Let’s go back to eq. ($\ref{eq:diffusion}$) and think about a possible way of solving this problem numerically. Using the results of the previous section, we can think of discretizing the derivative $\partial u/\partial t$ using any of the formulas above (central/forward/backward differencing) and the derivative $\partial^2 u/\partial x^2$ using eq. (\ref{eq:fd2}). As it turns out, for stability considerations it is better to avoid central differencing for $\partial u/\partial t$. We will use instead forward differencing (we could also choose backward differencing), and the fully discretized equation then looks like$$\begin{equation}
\frac{u_{i,j+1}-u_{i,j}}{\Delta t} = D\frac{u_{i+1,j}-2u_{i,j}+u_{i-1,j}}{\Delta x^2},
\label{eq:fd}
\end{equation}$$to which the initial and boundary conditions need to be added.

Note how eq. (\ref{eq:fd}) implies having already defined space and time grids. Let’s assume the spatial resolution is $\Delta x$ and the temporal resolution is $\Delta t$. Then if $u(i,j)$ is the solution at position $x$ and time $t$, $u(i+1,j+1)$ represents the solution at position $x+\Delta x$ and at time $t + \Delta t$. Visually, the space-time grid can be seen in the figure below.

Fig 2. Space-time grid for the problem defined in eq. (\ref{eq:fd}). The red dots are known through the initial and boundary conditions, the black dots are initially unknown and calculated via the explicit algorithm.

The initial and boundary conditions are represented by red dots in the figure. The space grid is represented by the index $i$ in eq. (\ref{eq:fd}), while the time is represented by the index $j$. At the beginning of the simulation, the space domain is discretized into $N$ grid points, so that $i=1\dots N$. The index $j$ starts from $j=1$ (the initial condition) and runs for as long as needed to capture the dynamics of interest.

Looking at the figure above, it is clear that the only unknown in eq. (\ref{eq:fd}) is $u_{i,j+1}$. Solving for this quantity, we get$$\begin{equation}
u_{i,j+1} = r\,u_{i+1,j} + (1-2r)u_{i,j} + r\,u_{i-1,j}
\label{eq:explicit}
\end{equation}$$where $r = D\Delta t/\Delta x^2$. As shown in Figure 3, the unknown $u_{i,j+1}$ only depends on quantities at the previous time step $j$, and only at the grid points $i-1, i$ and $i+1$.

Fig 3. Stencil (domain of dependence) of the solution $u(i,j+1)$ for the explicit scheme of solution.
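Equation (\ref{eq:explicit}) translates directly into vectorized array operations. Here is a minimal sketch of one explicit step plus a short driver (my own illustration, not the post’s code):

```python
import numpy as np

def explicit_step(u, r, bc=0.0):
    """One explicit update u_{i,j+1} = r*u_{i+1,j} + (1-2r)*u_{i,j} + r*u_{i-1,j}
    on a 1D grid, with Dirichlet boundaries held at `bc`."""
    unew = u.copy()
    unew[1:-1] = r * u[2:] + (1 - 2 * r) * u[1:-1] + r * u[:-2]
    unew[0] = unew[-1] = bc
    return unew

# decay of an initial sine profile on [0, 1] with D = 1
x = np.linspace(0.0, 1.0, 51)
u = np.sin(np.pi * x)
r = 0.25                       # within the stability limit discussed below
dt = r * (x[1] - x[0])**2      # so that r = D*dt/dx^2 = 0.25
for _ in range(1000):          # advance to t = 1000*dt = 0.1
    u = explicit_step(u, r)
```

With this choice of $r$ the profile decays towards $e^{-\pi^2 t}\sin(\pi x)$, as expected; pushing $r$ past the stability limit makes the same loop blow up after a few hundred iterations.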

Equation (\ref{eq:explicit}) can be recast in matrix form. Remember that the boundary conditions are given at the gridpoints $i=1$ and $i=N$, so the real unknowns are the values at the gridpoints $i=2,\dots,N-1$ and at the timestep $j+1$, giving

where $B$ is a square $(N-2)\times (N-2)$ matrix and $\textbf{b}_{j}$ is a vector of boundary conditions. The matrix $B$ is tri-diagonal.

It is pretty easy to implement the solution (\ref{eq:matrix2}) in code, but there is a potentially big problem that we should be aware of when using this type of (explicit) algorithm: the solution can be unstable! This means that any small and unavoidable round-off error can degenerate into a huge error (and eventually NaNs) after a few hundred or thousand iterations.

How can we make sure we don’t suffer from this problem? That’s the subject of the next section.

3.1 Stability condition

Equation ($\ref{eq:matrix2}$) is only an approximate solution of eq. ($\ref{eq:diffusion}$). In fact, it is not even possible to solve exactly eq. ($\ref{eq:matrix2}$), let alone eq. (\ref{eq:diffusion}), because of round-off errors. How can we guarantee that these initially small errors will not accumulate over many iterations causing a catastrophic runtime error during execution of our code?

Let’s do a direct calculation: let’s assume that the true solution of ($\ref{eq:matrix2}$) - which as noted is already an approximation of the original problem - is given by the quantity $\textbf{u}_1$, while due to roundoff-errors our code finds the solution $\textbf{u}_1^*$. What is going to happen to the error $\textbf{e}_1 = \textbf{u}_1 - \textbf{u}_1^*$? Is it going to be amplified at successive iterations, or will it always be possible to bound the propagated error by some number independent of the iteration step $j$?

To answer this question, we can apply eq. (\ref{eq:matrix2}) recursively to find

where $B^j=B\times B\dots \times B$ ($j$ times). It follows that at the timestep $j$ the difference of the solutions will be $$\textbf{e}_j=\textbf{u}_j-\textbf{u}^*_j=B^j \textbf{e}_1,$$which implies$$||\textbf{e}_j||\leq||B^j||\,||\textbf{e}_1||\ .$$

According to Lax and Richtmyer, for the numerical scheme to be stable there should exist a positive number $M$, independent of $j$, $\Delta x$ and $\Delta t$, such that $||B^j||\leq M \ \forall \ j\ $, so that

This ensures that a small error at the first timestep, $\textbf{e}_1$, will not propagate catastrophically at subsequent timesteps.

The necessary and sufficient condition for the Lax-Richtmyer stability condition (\ref{eq:lax-ritchmyer}) to be satisfied is $||B||\leq 1$. For the explicit method of eqs. (\ref{eq:matrix1}) and (\ref{eq:matrix2}), it can be shown that a sufficient condition to guarantee that $||B||\leq 1$ is

Unfortunately, in many practical applications it is not possible to satisfy eq. (\ref{eq:stability}) without strongly compromising the efficiency of an algorithm. For example during the implosion of deuterium (D) and tritium (T) spherical pellets in inertial confinement fusion (ICF) experiments, the D and T ions are subject to mass diffusion (à la Fick). The variable $u$ of eq. (\ref{eq:diffusion}) is in this case the species concentration. In particular conditions the diffusion coefficient $D$ may reach very large values, on the order of $10^2$ m$^2/$s. Now, usually in ICF simulations the grid size is constrained to resolve distances on the order of a $\mu$m (i.e., $\Delta x \sim 10^{-6}$ m). Applying eq. (\ref{eq:stability}), we see that to guarantee stability with an explicit scheme the time step would need to be kept to values $\Delta t\sim 10 $ fs or $10^{-14}$s. Typical ICF codes run with time steps on the order of $10^{-12}$ s. If we were to model species diffusion with an explicit scheme, we would then slow down an ICF code by 2 orders of magnitude (!!).
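Plugging the quoted numbers into the standard explicit-scheme bound $r = D\Delta t/\Delta x^2 \leq 1/2$, i.e. $\Delta t \leq \Delta x^2/(2D)$, reproduces the femtosecond estimate above (both inputs are order-of-magnitude values):

```python
# Explicit-scheme stability bound dt <= dx**2 / (2*D),
# with the ICF-like numbers quoted in the text
D = 1e2        # diffusion coefficient, m^2/s
dx = 1e-6      # grid resolution, m
dt_max = dx**2 / (2 * D)
print(dt_max)  # ~5e-15 s: a few femtoseconds, vs typical ~1e-12 s ICF steps
```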

Is it possible to avoid having a time step constrained by the condition (\ref{eq:stability})? Yes, luckily it is. We need to use a different scheme of solution called implicit. This comes at the price of having to solve a set of equations that is numerically more demanding. We will see in the next post of the series how a popular implicit scheme (Crank-Nicolson) works.

⁂

Today’s Mondrian random generator
