or "OPIG" to friends

If you have done any computational work, you must have spent some time waiting for your program to run. As an undergraduate, I expected computational biology to be all fun and games: idyllic hours passing time while the computer works hard to deliver results… well, very different from the more typical frenetically staring at the computer, wishing the program would run faster. But there is more — there are some problems that are so intrinsically expensive that, even if you had access to all the computers on Earth, it would take more than your lifetime to solve a slightly non-trivial case of them. Some examples are full configuration interaction calculations in quantum chemistry, factorisation of prime numbers, optimal planning, and a long, long, etcetera.

One of the greatest computer science discoveries in the last few years is the fact that a particular model of computation, quantum computing, can offer notable, when not game-changing improvements to this problems. Remarkably, Shor’s algorithm provides a polynomial-time algorithm for solving prime number factorisation, thus challenging RSA, one of the most extended encryption protocols. Many other algorithms exist, from linear algebra to knot theory. Note however that not all the problems show a quantum improvement – in fact there are many processes that show no advantage over the classical case. In any case, it is widely believed than within the next decade or two — depending on your level of optimism — there will be quantum computers that can challenge our best supercomputers for some of the aforementioned problems.

In this post I want to talk about a particular kind of quantum information processing: quantum annealing, or the more or less equivalent adiabatic quantum computing. This approach is quite popular, not only because its beauty and simplicity, but especially because the Canadian company D-Wave has announced many computers that — they claim — implement this model of computation. Note, however, that D-Wave computers are highly controversial, and many believe that they do not offer any advantage over classical computers. This has not prevented many big players, such as Ford or NASA, to buy some machines at the modest price of 15 million USD each.

This is how D-Wave computers look like! Fortunately, it is possible to access them through a cloud service if you don’t have an industrial storage facility nearby… Image reproduced from the D-Wave webpage.

Adiabatic quantum computing is based on a result from elementary quantum mechanics, the adiabatic theorem. This theorem states, essentially, that if you take a quantum system in the global minimum of a certain potential, and if you change that potential slowly enough to transform it into some other potential, then the system will end up in the global minimum of the last potential. Now, if you can think a clever way to write a problem as a potential whose global minimum represents the solution… well, here you have a procedure to find the solution!

In a slightly more mathematical way, quantum annealing employs two potentials: \(\hat{\mathcal{H}}_\text{start}\), the initial potential, and \(\hat{\mathcal{H}}_\text{end}\), the the “problem” potential. The initial potential is chosen to be very simple, so that we can easily prepare the computer in the global minimum. Our final, “problem” potential is designed so that its global minimum represents the solution to our problem. We then simply transform \(\hat{\mathcal{H}}_\text{start}\) into \(\hat{\mathcal{H}}_\text{end}\), as described by:

The last element to the annealing procedure is the annealing program \(A(t)\), that expresses exactly how \(\hat{\mathcal{H}}_\text{start}\) is transformed into \(\hat{\mathcal{H}}_\text{end}\). It is clearly a continuous function with \(A(0)=1\) and \(A(t_\text{max})=0\), and for most practical purposes you can assume it is a linear program. However, the form of this program determines the runtime of the computer, so there have been some interesting studies on how to design optimal programs that make the annealing process faster.

There is just one problem, and as it often happens with theorems, it lies in the conditions. The adiabatic theorem ensures that we will reach the global minimum of the final potential… if we do it slowly enough. However, what does “slowly enough” mean? In a more or less general parlance, the annealing time must be large in comparison with the difference between the global minimum $E^{(0)}$, and the next lowest energy minimum $E^{(1)}$. It is often mentioned that the scaling in adiabatic quantum computers is proportional to \(\left(1/(E^{(1)}-E^{(0)})^2\right)\).

The apparent simplicity of the annealing process has opened quite a lot of discussion regarding the actual potential of quantum annealers. In particular, it is widely believed that these computers cannot solve some really hard problems, such as, for example, the optimal way to organise a trip between cities (known as travelling salesman problem). Nevertheless, this field is advancing by leaps and bounds, and we can certainly expect interesting results — positive or negative — in the next years.

I thought I would make this blog post very meta by referring to another blog, written by Lior Pachter, which I think has something for many of us in it: http://liorpachter.wordpress.com (networks people, there’s a pretty scathing take-down of a quite well cited 2013 paper as one of the last posts – there seem to be a couple of posts labelled “network nonsense”!)

In particular I refer you to the list, that Lior Pachter has curated, which includes all variations of *-seq. You’ll see that practically all sequencing protocols take on this nomenclature of catchy descriptor + seq.

The UK Research and Innovation (UKRI) Policy Internships Scheme gives PhD students the opportunity to spend three months in a range of policy-relevant organisations, from Government departments to the Royal Society. Applications are open to research council funded PhD students (currently including EU students). The scheme includes a three-month stipend extension, and travel/accommodation expenses are covered either by the host partner or the training grant holder.

Allostery is defined as a conformational/activity change of an active site due to a binding event at a distant (allosteric) site.

The paper I presented in the journal club tried to decipher the underlying mechanics of allostery in PTP1B. It is a protein tyrosine phosphatase (the counter parts of kinases) and a validated drug target. Allosteric binding sites are known but so far neither active site nor allosteric site inhibitors have reached clinical use. Thus, an improved mechanistic understanding could improve drug discovery efforts.

For my blog post this week, I thought I would share, as the title suggests, a small collection of tools and packages that I found to make my work a bit easier over the last few months (mainly python based). I might add to this list as I find new tools that I think deserve a shout-out.

Biopandas

Reading in .pdb files for processing and writing your own parser (while being a good exercise to familiarize yourself with the format) is a pain and clutters your code with boilerplate.

Luckily for us, Sebastian Raschka has written a neat package called biopandas [1] which enables quick I/O of .pdb files via the pandas DataFrame class.

The kernel trick is a well known method in machine learning for producing a real-valued measure of similarity between data points in any number of settings. Kernel methods for network analysis provide a way of assigning real values to vertices of the graph. These values may correspond to similarity across any number of graphical properties such as the neighbours they share, or more dynamic context, the influence that change in the state of one vertex might have on another.

By using the kernel trick it is possible to approximate the distribution of features on the vertices of a graph in a way that respects the graphical relationships between vertices. Kernel based methods have long been used, for instance in inferring protein function from other proteins within Protein Interaction Networks (PINs).

The idea of bacterial superpowers is perhaps most associated with superbugs: the terrifying, drug-resistant bacterial strains that appear ever more frequently in news reports. While the notion of a world where antibiotics no longer work is chilling, this blog post will focus on a more positive aspect of the bacterial domain.

One of the more “niche” bacterial superpowers is magnetotaxis: the ability of certain bacteria to align their motion to the Earth’s magnetic field. This phenomenon was first reported in 1963 by Salvatore Bellini in the University of Pavia. While observing bog sediment under the microscope, he noticed a set of bacteria orienting themselves in the same direction: towards the Earth’s magnetic North pole. He dubbed these gram-negative bacteria “magnetosensitive”, or “batteri magnetosensibili”, but the discovery went largely unnoticed by the international scientific community [1]. The name “magnetotactic bacteria” (MTB) was introduced about a decade later, when Richard Blakemore reported the same phenomenon for bacteria found in marine sediments [2]. Through transmission electron microscopy, Blakemore was also able to capture the cellular feature that gives MTBs their unusual abilities: a rod-like structure of membrane-bound, iron-rich inorganic crystals, called magnetosomes. Later it was revealed that this structure is supported by a dedicated cytoskeletal system, which keeps it rod-shaped and prevents the aggregation of magnetosomes [4]. Magnetotaxis then results from the combination of the passive alignment of the cell to the Earth’s magnetic field, and flagellar motion. Continue reading →

Over the last decade, single-threaded CPU performance has begun to plateau, whilst the number of logical cores has been increasing exponentially.

Like it or loathe it, for the last few years, python has featured as one of the top ten most popular languages [tiobe / PYPL]. That being said however, python has an issue which makes life harder for the user wanting to take advantage of this parallelism windfall. That issue is called the GIL (Global Interpreter Lock). The GIL can be thought of as the conch shell from Lord of the Flies. You have to hold the conch (GIL) for your thread to be computed. With only one conch, no matter how beautifully written and multithreaded your code, there will still only be one thread will be executed at any point in time.

In my new research project, I investigate Non-alcoholic fatty liver disease (NAFLD). This term describes a variety of conditions that are associated with fatty livers. While the early stages of this disease are not harmful, it can lead to cirrhosis (Cirrhosis is the scaring of liver tissue that prevents a liver to function properly). Ultimately, if a liver stops working it can be fatal unless treated, for example, with a liver transplant. NAFLD is the most common liver disease in developed countries and is expected to become the leading cause of liver transplant by 2020 [1].

The disease progresses in four stages:

steatosis — a buildup of fat in the liver (healthy livers are almost fat-free)

non-alcoholic steatohepatitis — an inflammation of liver tissue due to the fat

fibrosis — a formation of scar tissue due to the inflammation

cirrhosis — a permanent lack of functionality of the liver due to the scar tissue

Currently, no specific NAFLD medication is available. Therefore, the most successful treatment is the change of one’s lifestyle. This includes the reduction of body weight, healthier dieting, exercising, and quitting smoking. [2]

While there is no detailed understanding of the processes involved with NAFLD formation, the influence of obesity and insulin-resistance are likely [3,4]. In our new project, which is in collaboration with the Novo Nordisk Research Centre Oxford, we want to understand the biological pathways associated with NAFLD. To achieve this, we combine data from multiple biological experiments, including gene expression, protein interaction data, and blood measurements.

In cheminformatics, there are many possible ways to encode chemical data represented by small molecules and proteins, such as SMILES, fingerprints, chemical descriptors etc. Recently, utilising graph-based methods for machine learning have become more prominent. In this post, we will explore why representing molecules as graphs is a natural and suitable encoding. Continue reading →