A data-driven blog

A couple of years ago, I participated in a workshop on academic data science at SICS in Stockholm. At that event, we discussed various trends in data science and machine learning and at the end of it, I participated in a discussion group, led by professor Niklas Lavesson from Blekinge Institute of Technology, where we talked about model interpretability and explanation. At the time, it felt like a fringe but interesting topic. Today, this topic seems to be all over the place. Here are some of the places I’ve seen it recently.

Ideas on interpreting machine learning. This is a very thorough blog post from O’Reilly with a lot of good ideas. It also talks about related things such as dimensionality reduction which I would not call model explanation per se, but which are still good to know.

Papers with software

Understanding Black-box Predictions via Influence Functions. The paper of this name (associated code here) won a best-paper award at ICML 2017 (again showing how hot this topic is!). The authors use something called an influence function to quantify, roughly speaking, how much a perturbation of a single example in the training data set affects the resulting model. In this way, they can identify the training data points most responsible for a given prediction. One might say that they have figured out a way to differentiate a predictive model with respect to data points in the training set.
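For reference, the central quantity in the paper (as I read it) is the influence of upweighting a training point z on the loss at a test point:

```latex
\mathcal{I}_{\text{up,loss}}(z, z_{\text{test}})
  = -\nabla_\theta L(z_{\text{test}}, \hat{\theta})^{\top}
    H_{\hat{\theta}}^{-1}
    \nabla_\theta L(z, \hat{\theta}),
\qquad
H_{\hat{\theta}} = \frac{1}{n}\sum_{i=1}^{n} \nabla^2_\theta L(z_i, \hat{\theta})
```

Intuitively, this approximates how the test loss would change if z were weighted slightly more during training, which is what makes it a kind of derivative of the model with respect to a training point.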

LIME, Local Interpretable Model-agnostic Explanations. (arXiv link, code on Github) This has been around for more than a year and can thus be called “established” in the rapidly changing world of machine learning. I have tried it myself for a consulting gig and found it useful for understanding why a certain prediction was made. The main implementation is in Python but there is also a good R port (which is what I used when I tried it.) LIME essentially builds a simplified local model around the data point you are interested in. It does this by perturbing real training data points, obtaining the predicted label for those perturbed points, and fitting a sparse linear model to those points and labels. (As far as I have understood, that is!)
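To make that description concrete, here is a minimal sketch of the idea (this is not the actual lime package API; the function and parameter names are my own): perturb the point of interest, query the black box for predictions on the perturbed points, and fit a proximity-weighted sparse linear model to them.

```python
import numpy as np
from sklearn.linear_model import Lasso

def explain_locally(predict_fn, x, n_samples=500, scale=0.5, alpha=0.01, seed=0):
    """LIME-style local surrogate sketch: perturb x, query the black box,
    and fit a sparse linear model weighted by proximity to x."""
    rng = np.random.default_rng(seed)
    # Sample perturbed points around the instance we want to explain
    X_pert = x + rng.normal(0.0, scale, size=(n_samples, x.shape[0]))
    y_pert = predict_fn(X_pert)                        # black-box predictions
    # Weight perturbed points by closeness to x (Gaussian kernel)
    dists = np.linalg.norm(X_pert - x, axis=1)
    weights = np.exp(-(dists ** 2) / (2 * scale ** 2))
    # Sparse linear surrogate: its coefficients are local feature importances
    surrogate = Lasso(alpha=alpha)
    surrogate.fit(X_pert, y_pert, sample_weight=weights)
    return surrogate.coef_

# Toy black box in which only feature 0 matters near x
black_box = lambda X: 3.0 * X[:, 0]
coefs = explain_locally(black_box, np.array([1.0, 2.0, 3.0]))
```

The coefficients of the sparse surrogate are then read as local feature importances; the real LIME implementation adds refinements such as feature discretization and smarter sampling.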

I’m sure I have missed a lot of interesting work.

If anyone is interested, I might write another blog post illustrating how LIME can be used to understand why a certain prediction was made on a public dataset. I might even try to explain the influence function paper if I get the time to try it and digest the math.

Attention conservation notice: This may mostly be interesting for Nordics.

Many of us in the Nordics are a bit obsessed with the weather. Especially during summer, we keep checking different weather apps or newspaper prognoses to find out whether we will be able to go to the beach or have a barbecue party tomorrow. In Sweden, the main source of predictions is the Swedish Meteorological and Hydrological Institute, but many also use for instance the Klart.se site/app, which uses predictions from the Finnish company Foreca. The Norwegian Meteorological Institute’s yr.no site is also popular.

Various kinds of folklore exist around these prognoses; for instance, one often hears that the ones from the Norwegian Meteorological Institute (at yr.no) are better than those from the Swedish equivalent (at smhi.se).

As a hobby project, we decided to test this claim, focusing on Stockholm as that is where we currently live. We started collecting data in May 2016, so we now (July 2017) have more than one year’s worth of data to check how well the two forecasts perform.

The main task we considered was to predict the temperature in Stockholm (Bromma, latitude 59.3, longitude 18.1) 24 hours in advance. As SMHI and YR usually don’t publish forecasts at exactly the same times, we can’t compare them directly data point by data point. However, we do have the measured temperature recorded hourly, so we can compare each forecast from either SMHI or YR to the actual temperature.

Methods

SMHI forecasts were downloaded through their API via this URL every fourth hour using crontab.

YR forecasts were downloaded through their API via this URL every fourth hour using crontab.

First, some summary statistics. On the whole, there are no dramatic differences between the two forecasting agencies. It is clear that SMHI is not worse than YR at predicting the temperature in Stockholm 24 hours in advance (and probably not significantly better either, judging from some preliminary statistical tests conducted on the absolute deviations of the forecasts from the actual temperatures).

Both institutes are doing well in terms of correlation (Pearson and Spearman correlations of ~0.98 between forecast and actual temperature). The median absolute deviation is 1, meaning that the most typical error is to get the temperature wrong by one degree Celsius in either direction. The mean squared error is around 2.5 for both (note that this is in squared degrees, since MSE squares the errors).
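These summary statistics are straightforward to compute from paired arrays of forecast and later-measured temperatures; a sketch (assuming NumPy/SciPy, with the linear model fitted as measured ~ forecast):

```python
import numpy as np
from scipy.stats import pearsonr, spearmanr

def forecast_metrics(forecast, measured):
    """Summary statistics for a set of forecasts vs. the temperatures
    actually measured at the forecast time."""
    forecast, measured = np.asarray(forecast, float), np.asarray(measured, float)
    err = forecast - measured
    slope, intercept = np.polyfit(forecast, measured, 1)  # measured ~ forecast
    return {
        "pearson": pearsonr(forecast, measured)[0],
        "spearman": spearmanr(forecast, measured)[0],
        "mse": np.mean(err ** 2),
        "median_abs_dev": np.median(np.abs(err)),
        "slope": slope,
        "intercept": intercept,
    }
```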

Forecaster | Correlation with measured temperature | Mean squared error | Median absolute deviation | Slope in linear model | Intercept in linear model
--- | --- | --- | --- | --- | ---
SMHI | 0.982 | 2.37 | 1 | 1.0 | 0.254
YR | 0.980 | 2.51 | 1 | 1.0 | 0.141

Let’s take a look at how this looks visually. Here is a plot of SMHI predictions vs temperatures measured 24 hours later. There are about 2400 data points here (6 per day, and a bit more than a year’s worth of data). The color indicates the density of points in that part of the plot.

And here is the corresponding plot for YR forecasts.

Again, there are about 2400 data points here.

Unfortunately, those 2400 data points are not exactly for the same times in the SMHI and YR datasets, because the two agencies do not publish forecasts for exactly the same times (at least the way we collected the data). Therefore we only have 474 data points where both SMHI and YR had made forecasts for the same time point 24h into the future. Here is a plot of how those forecasts look.

So what?

This doesn’t really say that much about weather forecasting unless you are specifically interested in Stockholm weather. However, the code can of course be adapted and the exercise can be repeated for other locations. We just thought it was a fun mini-project to check the claim that there was a big difference between the two national weather forecasting services.

Code and data

If anyone is interested, I will put up code and data on GitHub. Leave a message here, on my Twitter or email.

Possible extensions

Accuracy in predicting rain (probably more useful).
Accuracy as a function of how far ahead you look.

Note: this is a re-post of an analysis previously hosted at mindalyzer.com. Originally published in late December 2016, this blog post was later followed up by this extended analysis on Follow the Data.

The big picture of public discourse on Twitter by clustering metadata | Mindalyzer

Summary: We identify communities in the Swedish twitterverse by analyzing a large network of millions of reciprocal mentions in a sample of 312,292,997 tweets from 435,792 Twitter accounts in 2015, and show that politically meaningful communities, among others, can be detected without having to read or search for specific words or phrases.

Inspired by Hampus Brynolf’s Twittercensus, we wanted to perform a large-scale analysis of the Swedish Twitterverse, but from a different perspective where we focus on discussions rather than follower statistics.

All images are licensed under Creative Commons CC-BY (mention the source) and the data is released under Creative Commons Zero, which means you can freely download and use it for your own purposes, no matter what. The underlying tweets are covered by Twitter’s Developer Agreement and Policy and cannot be shared due to its restrictions, which are mainly there to protect the privacy of all Twitter users.

The dataset was created by continuously polling Twitter’s REST API for recent tweets from a fixed set of Twitter accounts during 2015. The API also gives out tweets from before the polling starts, but Twitter does not document how those are selected. A more in-depth description of how the dataset was created and what it looks like can be found at mindalyzer.com.

From the full dataset of tweets, the tweets originating from 2015 were filtered out and a network of reciprocal mentions was created by parsing out any at-mentions (e.g. ‘@mattiasostmar’) in them. Retweets of other people’s tweets have not been counted, even though they might contain mentions of other users. We look at reciprocal mention graphs, where a link between two users means that both have mentioned each other on Twitter at least once in the dataset (i.e. addressed the other user with that user’s Twitter handle, as happens by default when you reply to a tweet, for instance). We take this as a proxy for a discussion happening between those two users. The mention graphs were generated using the NetworkX package for Python. We naturally model the graph as undirected (as both users sharing a link are interacting with each other, there is no notion of directionality) and unweighted. One could easily imagine a weighted version of the mention graph, where the weight would represent the total number of reciprocal mentions between two users, but we did not feel that this was needed to achieve interesting results.
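A minimal sketch of the graph-building step (not our actual pipeline code; it assumes tweets arrive as (author, text) pairs with retweets already removed, and that handles can be compared lowercased):

```python
import re
import networkx as nx

MENTION_RE = re.compile(r"@(\w+)")

def reciprocal_mention_graph(tweets):
    """Build the undirected, unweighted reciprocal mention graph:
    an edge appears only if both users have mentioned each other."""
    mentioned = {}  # author -> set of users they mentioned
    for author, text in tweets:
        a = author.lower()
        mentioned.setdefault(a, set()).update(
            m.lower() for m in MENTION_RE.findall(text))
    g = nx.Graph()
    for a, targets in mentioned.items():
        for b in targets:
            # Keep the pair only if the mention is reciprocated
            if a != b and a in mentioned.get(b, set()):
                g.add_edge(a, b)
    return g
```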

The final graph consisted of 377,545 nodes (Twitter accounts) and 15,862,275 edges (reciprocal mentions connecting Twitter accounts). The average number of reciprocal mentions for nodes in the graph was 42. The code for the graph creation can be found here, and you can also download the pickled graph in NetworkX format (104.5 MB, license: CC0).

The visualizations of the graphs were done in Gephi using the Fruchterman-Reingold layout algorithm, after which the nodes were adjusted with the Noverlap algorithm and, finally, the labels were adjusted with the Label Adjust algorithm. Node sizes were set based on the ‘importance’ measure that comes out of the Infomap algorithm.

In order to find communities in the mention graph (in other words, to cluster the mention graph), we use Infomap, an information-theory-based approach to multi-level community detection that has been used for, e.g., mapping of biogeographical regions (Edler et al., 2015) and of scientific publications (Rosvall & Bergstrom, 2010), among many other examples. This algorithm, which can be used for directed as well as undirected, weighted as well as unweighted networks, allows for multi-level community detection, but here we only show results from a single partition into communities. (We also tried a multi-level decomposition, but did not feel that this added to the analysis presented here.)
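Infomap itself ships as a separate tool, so as a stand-in illustration of the clustering step, here is the same idea with plain modularity-based community detection from NetworkX on a toy graph (this is not the algorithm we actually used, only a sketch of graph-clustering in general):

```python
import networkx as nx
from networkx.algorithms.community import greedy_modularity_communities

# Toy graph with two obvious communities: two 5-cliques joined by one edge
g = nx.barbell_graph(5, 0)
communities = greedy_modularity_communities(g)
# Each community is returned as a frozenset of node ids
sizes = sorted(len(c) for c in communities)
```

On real mention graphs, Infomap additionally returns the per-node ranking discussed below, which modularity clustering does not provide.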

The Infomap algorithm returned a bunch of clusters along with a score for each user indicating how “central” that person was in the network, measured by a form of PageRank, the measure Google introduced to rank web pages. Roughly speaking, a person involved in a lot of discussions with other users who are in turn highly ranked gets a high score by this measure. For some clusters, a quick glance at the top-ranked users was enough to get a sense of what type of discourse defines that cluster. To be able to look at them all, we performed language analysis of each cluster’s users’ tweets to see which words were the most distinguishing. That way we also had words by which to judge the quality of the clusters.

We took the top 20 communities in terms of size, collected the tweets during 2015 from each member in those clusters, and created a textual corpus out of that (more specifically, a Dictionary using the Gensim package for Python). Then, for each community, we tried to find the most over-represented words used by people in that community by calculating the TF-IDF (term frequency-inverse document frequency) for each word in each community, and looking at the top 10 words for each community.

When looking at these overrepresented words, it was really easy to assign “themes” to our clusters. For instance, communities representing Norwegian and Finnish users (who presumably sometimes tweet in Swedish) were trivial to identify. It was also easy to spot a community dedicated to discussing the state of Swedish schools, another one devoted to the popular Swedish band The Fooo Conspiracy, and an immigration-critical cluster. In fact we have defined dozens of thematically distinct communities and continue to find new ones.

One of the communities we found, which tends to discuss military defense issues and “prepping”, is shown in a graph below. This corresponds almost eerily well to a set of Swedish Twitter users highlighted in the large Swedish daily Svenska Dagbladet (“Försvarstwittrarna som blivit maktfaktor i debatten”, roughly “the defense tweeters who have become a power factor in the debate”). In fact, of their list of the top 10 defense bloggers, we find each and every one in our top 18. Remember that our analysis uses no pre-existing knowledge of what we are looking for: the defense cluster simply fell out of the mention graph decomposition.

The largest cluster is a bit harder to summarize than many of the other ones, but we think of it as a “pundit cluster” with influential political commentators, for example political journalists and politicians from many different parties. The most influential user in this community according to our analysis is @sakine, Sakine Madon, who was also the most influential Twitter user in Mattias’ eigenvector-centrality-based analysis of the whole mention graph (i.e. not divided into communities).

One of the larger clusters consists of accounts clearly focused on immigration issues, judging by the most distinguishing words. One observation is that while all the larger Swedish political parties’ official Twitter accounts are located within the “general pundit” community, Sverigedemokraterna (the Sweden Democrats), a party born out of immigration-critical sentiments, is the only one of them located in this community. This suggests that they have (or at least had in the period up until 2015) an outsider position in the public discourse on Twitter, which may or may not reflect such a position in the general political discourse in Sweden. There is much debate and worry about filter bubbles formed by algorithms that select what people get to see. Research such as Credibility and trust of information in online environments suggests that social filtering of content is a strong factor for influence. Strong ties, such as being part of a conversation graph like this one, would most likely be an important factor in shaping one’s world view.

Since we have the pipeline ready, we can easily redo it for 2016 when the data are in hand. Possibly this will reveal dynamical changes in what gets discussed on Twitter, and may give indications on how people are moving between different communities. It could also be interesting to experiment with a weighted version of the graph, or to examine a hierarchical decomposition of the graph into multiple levels.

TL;DR

I made a community decomposition of Swedish Twitter accounts in 2015 and 2016 and you can explore it in an online app.

Background

As reported on this blog a couple of months ago (and also here), I have (together with Mattias Östmar) been investigating the community structure of Swedish Twitter users. The analysis we posted then addressed data from 2015, and we basically just wanted to get a handle on what kind of information you can get from this type of analysis.

With the processing pipeline already set up, it was straightforward to repeat the analysis on the fresh data from 2016 as soon as Mattias had finished collecting it. The nice thing about having data from two different years is that we can start to look at the dynamics: how stable communities are, which communities are born or disappear, and how people move between them.

The app

First of all, I made an app for exploring these data. If you are interested in this topic, please help me understand the communities that we have detected by using the “Suggest topic” textbox under the “Community info” tab. That is an attempt to crowdsource the “annotation” of these communities. The suggestions that are submitted are saved in a text file which I will review from time to time and update the community descriptions accordingly.

The fastest climbers

By looking at the data in the app, we can find out some pretty interesting things. For instance, the account that increased the most in influence (measured in PageRank) was @BjorklundVictor, who climbed from a rank of 3673 in 2015 in community #4 (which we chose to annotate as an “immigration” community) to a rank of 3 (!) in community #4 in 2016 (this community has also been classified as an immigration-discussion community, and it is the most similar one of all 2016 communities to the 2015 immigration community). I am not personally familiar with this account, but he must have done something to radically increase his reach in 2016.

Some other people/accounts that increased a lot in influence were professor Agnes Wold (@AgnesWold) who climbed from rank 59 to rank 3 in the biggest community, which we call the “pundit cluster” (it has ID 1 both in 2015 and 2016), @staffanlandin, who went from #189 to #16 in the same community, and @PssiP, who climbed from rank 135 to rank 8 in the defense/prepping community (ID 16 in 2015, ID 9 in 2016).

Some people have jumped to a different community and improved their rank in that way, like @hanifbali, who went from #20 in community 1 (general punditry) in 2015 to the top spot, #1 in the immigration cluster (ID 4) in 2016, and @fleijerstam, who went from #200 in the pundit community in 2015 to #10 in the politics community (#3) in 2016.

Examples of users who lost a lot of ground in their own community are @asaromson (Åsa Romson, the ex-leader of the Green Party; #7 -> #241 in the green community) and @rogsahl (#10 -> #905 in the immigration community).

The most stable communities

It turned out that the most stable communities (i.e. the communities that had the most members in common relative to their total sizes in 2015 and 2016 respectively) were the ones containing accounts using a different language from Swedish, namely the Norwegian, Danish and Finnish communities.

The least stable community

Among the larger communities in 2015, we identified the one that was furthest from having a close equivalent in 2016. This was 2015 community 9, where the most influential account was @thefooomusic. This is a boy band whose popularity arguably hit a peak in 2015. The community closest to it in 2016 is community 24, but when we looked closer at that (which you can also do in the app!), we found that many YouTube stars had “migrated” into 2016 cluster 24 from 2015 cluster 84, which upon inspection turned out to be a very clear Swedish YouTuber cluster with stars such as Clara Henry, William Spetz and Therese Lindgren.

So, in other words, the Fooo fan cluster and the YouTuber cluster from 2015 merged into a mixed cluster in 2016.

New communities

We were hoping to see some completely new communities appear in 2016, but that did not really happen, at least not for the top 100 communities. Granted, there was one that had an extremely low similarity to any 2015 community, but that turned out to be a “community” topped by @SJ_AB, a railway company that replies to a large number of customer queries and complaints on Twitter (which, by the way, makes it the top account of them all in terms of centrality.) Because this company is responding to queries from new people all the time, it’s not really part of a “community” as such, and the composition of the cluster will naturally change a lot from year to year.

Community 24, which was discussed above, was also dissimilar from all the 2015 communities, but as described, we noticed that it had absorbed users from 2015 clusters 9 (The Fooo) and 84 (YouTubers).

Movement between the largest communities

The similarity score for the “pundit clusters” (community 1 in 2015 and community 1 in 2016, respectively) somewhat surprisingly showed that these were not very similar overall, although many of the top-ranked users are the same. A quick inspection also showed that the entire top list of community 3 in 2015 moved to community 1 in 2016, which makes the 2015 community 3 the closest equivalent to the 2016 community 1. Both of these communities can be characterized as general political discussion/punditry clusters.

Comparison: The defense/prepper community in 2015 vs 2016

In our previous blog post on this topic, we presented a top-10 list of defense Twitterers and compared that to a manually curated list from Swedish daily Svenska Dagbladet. Here we will present our top-10 list for 2016.

Caveats

Of course, many parts of this analysis could be improved and there are some important caveats. For example, the Infomap algorithm is not deterministic, which means that you are likely to get somewhat different results each time you run it. For these data, we have run it a number of times and seen that you get results that are similar in a general sense each time (in terms of community sizes, top influencers and so on), but it should be understood that some accounts (even top influencers) can in some cases move around between communities just because of this non-deterministic aspect of the algorithm.

Also, it is possible that the measure we use for community similarity (the Jaccard index, which is the ratio between the number of members two communities have in common and the number of members in either or both of them; in other words, the intersection divided by the union) is too coarse, because it does not consider the influence of individual users.
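For concreteness, the similarity measure is simply:

```python
def jaccard(a, b):
    """Jaccard index between two communities given as collections of
    members: size of the intersection divided by size of the union."""
    a, b = set(a), set(b)
    return len(a & b) / len(a | b) if (a or b) else 0.0
```

A value of 1 means identical membership and 0 means no overlap; as noted, every member counts equally regardless of influence.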

generate molecular and other data using many different platforms (multi-omics), resulting in tens or hundreds of thousands of measurements for each individual data point,

use or claim to use artificial intelligence/machine learning to reach their goals.

So the heading of this blog post could just as well have been, for instance, “AI wellness companies” or “Molecular wellness monitoring companies”. The point of using “data-intensive” is that they all generate much more extensive molecular data on their users (DNA sequencing, RNA sequencing, proteomics, metagenomics, …) than, say, WellnessFX, LifeSum or more niche wellness solutions.

I associate these three companies with three big names in genomics.

Arivale was founded by Leroy Hood, who is president of the Institute for Systems Biology and was involved in developing the automation of DNA sequencing. In connection with Arivale, Hood has talked about dense dynamic data clouds that will allow individuals to track their health status and make better lifestyle decisions. Arivale’s web page also talks a lot about scientific wellness. They have different plans, including a 3,500 USD one-time plan. They sample blood, saliva and the gut microbiome and have special coaches who give feedback on findings, including genetic variants and how well you have done with your Fitbit.

Q, or q.bio (podcast about them here), seems to have grown out of Michael Snyder’s work on iPOPs, “integrative personal omics profiles”, which he first developed on himself, as the first person to combine DNA sequencing, repeated RNA sequencing, metagenomics and so on in a single personal profile. (He has also been involved in a large number of other pioneering genomics projects.) Q’s web site and blog talk about quantified health and the importance of measuring your physiological variables regularly to get a “positive feedback loop”. In one of their blog posts, they discuss dentistry as a model system where we get regular feedback, have lots and lots of longitudinal data on people’s dental health, and therefore get continuously improving dental status at cheaper prices. They also make the following point: “We live in a world where we use millions of variables to predict what ad you will click on, what movie you might watch, whether you are creditworthy, the price of commodities, and even what the weather will be like next week. Yet, we continue to conduct limited clinical studies where we try and reduce our understanding of human health and pathology to single-variable differences in groups of people, when we have enormous evidence that the results of these studies are not necessarily relevant for each and every one of us.”

iCarbonX, a Chinese company, was founded by (and is headed by) Wang Jun, the former wunderkind CEO of Beijing Genomics Institute/BGI. A couple of years ago, he gave an interview to Nature where he talked about why he was stepping down as BGI’s CEO to “devote himself to a new ‘lifetime project’ of creating an AI health-monitoring system that would identify relationships between individual human genomic data, physiological traits (phenotypes) and lifestyle choices in order to provide advice on healthier living and to predict, and prevent, disease.” iCarbonX seems to be the company embodying that idea. Their website mentions “holographic health data” and talks a lot about artificial intelligence and machine learning, more so than the other two companies I highlight here. They also mention plans to profile millions of Chinese customers and to create an “intelligent robot” for personal health management. iCarbonX has just announced a collaboration with PatientsLikeMe, in which iCarbonX will provide “multi-omics characterization services”.

What to make of these companies? They are certainly intriguing and exciting. Regarding the multi-omics part, I know from personal experience that it is very difficult to integrate omics data sets in a meaningful way (that leads to some sort of actionable results), mostly for purely conceptual/mathematical reasons but also because of technical quality issues that impact each platform in a different way. I have seen presentations by Snyder and Hood and while they were interesting, I did not really see any examples of a result that had come through integrating multiple levels of omics (although it is of course useful to have results from “single-level omics” too!).

Similarly, with respect to AI/ML, I expect that a larger number of samples than these companies currently have will be needed before, for instance, good deep learning models can be trained. On the other hand, the multi-omics aspect may prove helpful in a deep learning scenario if it turns out that information from different experiments can be combined in some sort of transfer learning setting.

As for the wellness benefits, it will likely be several years before we get good statistics on how large an improvement one can get by monitoring one’s molecular profiles (although it is certainly likely that it will be beneficial to some extent.)

PostScript

There are some related companies or projects that I do not discuss above. For example, Craig Venter’s Human Longevity Inc is not dissimilar to these companies but I perceive it as more genome-sequencing focused and explicitly targeting various diseases and aging (rather than wellness monitoring.) Google’s/Verily’s Baseline study has some similarities with respect to multi-omics but is anonymized and not focused on monitoring health. There are several academic projects along similar lines (including one to which I am currently affiliated) but this blog post is about commercial versions of molecular wellness monitoring.

Mattias Östmar and I have published an analysis of the “big picture” of discourse in the Swedish Twitterverse that we have been working on for a while, on and off. Mattias hatched the idea to take a different perspective from looking at keywords or numbers of followers or tweets, and instead to focus on engagement and interaction by looking at reciprocal mention graphs: graphs where two users get a link between them if both have mentioned each other at least once (as happens by default when you reply to a tweet, for example). He then applied an eigenvector centrality measure to that network and was able to measure the influence of each user in that way (described in Swedish here).
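The centrality step can be sketched in a few lines with NetworkX (on a toy graph here; the real input was the full reciprocal mention graph):

```python
import networkx as nx

# Toy reciprocal-mention graph: user "a" takes part in the most
# discussions, and with the best-connected users
g = nx.Graph([("a", "b"), ("a", "c"), ("a", "d"), ("b", "c"), ("d", "e")])

# Eigenvector centrality: a user scores highly when connected to
# other users who themselves score highly
centrality = nx.eigenvector_centrality(g)
most_influential = max(centrality, key=centrality.get)
```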

In the present analysis we went further and tried to identify communities in the mention network by clustering the graph. After trying some different methods we eventually went with Infomap, a very general information-theory based method (it handles both directed and undirected, weighted and unweighted networks, and can do multi-level decompositions) that seems to work well for this purpose. Infomap not only detects clusters but also ranks each user by a PageRank measure so that the centrality score comes for free.

We immediately recognized from scanning the top accounts in each cluster that there seemed to be definite themes to the clusters. The easiest to pick out were Norwegian and Finnish clusters where most of the tweets were in those languages (but some were in Swedish, which had caused those accounts to be flagged as “Swedish”.) But it was also possible to see (at this point still by recognizing names of famous accounts) that there were communities that seemed to be about national defence or the state of Swedish schools, for instance. This was quite satisfying as we hadn’t used the actual contents of the tweets – no keywords or key phrases – just the connectivity of the network!

Still, knowing about famous accounts can only take us so far, so we did a relatively simple language analysis of the top 20 communities by size. We took all the tweets from all users in those communities, built a corpus out of them, and calculated the TF-IDF for each word in each community. In this way, we were able to identify words that were over-represented in a community with respect to the other communities.

The words that fell out of this analysis were in many cases very descriptive of the communities, and apart from the school and defence clusters we quickly identified an immigration-critical cluster, a cluster about stock trading, a sports cluster, a cluster about the boy band The Fooo Conspiracy, and many others. (In fact, we have since discovered that there are a lot of interesting and thematically very specific clusters beyond the top 20 which we are eager to explore!)

As detailed in the analysis blog post, the list of top ranked accounts in our defence community was very close to a curated list of important defence Twitter accounts recently published by a major Swedish daily. This probably means that we can identify the most important Swedish tweeps for many different topics without manual curation.

This work was done on tweets from 2015, but in mid-January we will repeat the analysis on 2016 data.

For quite a while now, I have been rather mystified and intrigued by the fact that Sweden has one of the highest rates of school fires due to arson. According to the Division of Fire Safety Engineering at Lund University, “Almost every day between one and two school fires occur in Sweden. In most cases arson is the cause of the fire.” This is a lot for a small country with less than 10 million inhabitants, and the associated costs can be up to a billion SEK (around 120 million USD) per year.

It would be hard to find a suitable dataset to address the question why arson school fires are so frequent in Sweden compared to other countries in a data-driven way – but perhaps it would be possible to stay within a Swedish context and find out which properties and indicators of Swedish towns (municipalities, to be exact) might be related to a high frequency of school fires?

To answer this question, I collected data on school fire cases in Sweden between 1998 and 2014 through a web site with official statistics from the Swedish Civil Contingencies Agency. As there was no API to allow easy programmatic access to school fire data, I collected them in a quasi-manual process, downloading XLSX reports generated from the database year by year, after which I joined these with an R script into a single table of school fire cases where the suspected cause was arson (see the GitHub link below for full details!).

To complement these data, I used a list of municipal KPIs (key performance indicators) from 2014 that Johan Dahlberg put together for our contribution to Hack for Sweden earlier this year. These KPIs were extracted from Kolada (a database of Swedish municipality and county council statistics) by repeatedly querying its API.
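As a rough illustration of what “repeatedly querying its API” amounts to: Kolada’s v2 API is queried with plain HTTP GET requests against URLs of roughly the following form. The base URL is real to the best of my knowledge, but the KPI ids below are invented placeholders and the actual fetching (e.g. with urllib.request) is left out:

```python
# Sketch of the repeated Kolada v2 queries. The base URL is real as far
# as I know; the KPI ids are placeholders, not real Kolada identifiers.
BASE = "http://api.kolada.se/v2"

def kpi_url(kpi_id: str, year: int) -> str:
    """URL returning all municipalities' values for one KPI in one year."""
    return f"{BASE}/data/kpi/{kpi_id}/year/{year}"

kpi_ids = ["K0001", "K0002", "K0003"]  # placeholders
urls = [kpi_url(kpi, 2014) for kpi in kpi_ids]
print(urls[0])  # http://api.kolada.se/v2/data/kpi/K0001/year/2014
```

In the real script, each such URL would be fetched and the JSON responses joined into one municipality-by-KPI table.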

There is a Github repo containing all the data and detailed information on how it was extracted.

The open Kaggle dataset lives at https://www.kaggle.com/mikaelhuss/swedish-school-fires. So far, the process of uploading and describing the data has been smooth. I’ve learned that each Kaggle dataset has an associated discussion forum and (potentially) a bunch of “kernels”, which are analysis scripts or notebooks in Python, R or Julia. I hope that other people will contribute scripts and analyses based on these data. Please do if you find this dataset intriguing!

In talks that I have given in the past few years, I have often made the point that most of genomics has not been “big data” in the usual sense, because although the raw data files can often be large, they are often processed in a more or less predictable way until they are “small” (e.g., tables of gene expression measurements or genetic variants in a small number of samples). This in turn depends on the fact that it is hard and expensive to obtain biological samples, so in a typical genomics project the sample size is small (from just a few to tens or in rare cases hundreds or thousands) while the dimensionality is large (e.g. 20,000 genes, 10,000 proteins or a million SNPs). This is in contrast to many “canonical big data” scenarios where one has a large number of examples (like product purchases) with a small dimensionality (maybe the price, category and some other properties of the product.)

Because of these issues, I have been hopeful about using published data on e.g. gene expression based on RNA sequencing or on metagenomics to draw conclusions based on data from many studies. In the former case (gene expression/RNA-seq) it could be to build classifiers for predicting tissue or cell type for a given gene expression profile. In the latter case (metagenomics/metatranscriptomics, maybe even metaproteomics) it could also be to build classifiers but also to discover completely new varieties of e.g. bacteria or viruses from the “biological dark matter” that makes up a large fraction of currently generated metagenomics data. These kinds of analysis are usually called meta-analysis, but I am fond of the term cumulative biology, which I came across in a paper by Samuel Kaski and colleagues (Toward Computational Cumulative Biology by Combining Models of Biological Datasets.)

Of course, there is nothing new about meta-analysis or cumulative biology – many “cumulative” studies have been published about microarray data – but nevertheless, I think that some kind of threshold has been crossed when it comes to really making use of the data deposited in public repositories. There has been development both in APIs allowing access to public data, in data structures that have been designed to deal specifically with large sequence data, and in automating analysis pipelines.

Below are some interesting papers and packages that are all in some way related to analyzing public gene expression data in different ways. I annotate each resource with a couple of tags.

Sequence Bloom Trees. [data structures] These data structures (described in the paper Fast search of thousands of short-read sequencing experiments) allow indexing of a very large number of sequences into a data structure that can be rapidly queried with your own data. I first tried it about a year ago and found it to be useful to check for the presence of short snippets of interest (RNA sequences corresponding to expressed peptides of a certain type) in published transcriptomes. The authors have made available a database of 2,652 RNA-seq experiments from human brain, breast and blood which served as a very useful reference point.
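To get a feel for the primitive underlying Sequence Bloom Trees, here is a minimal plain-Python Bloom filter. The real data structure hashes k-mers from sequencing reads and stacks one such filter per node of a tree over the experiments, but the membership logic is the same:

```python
# Minimal Bloom filter: a bit array plus several hash functions.
# "No" answers are always correct; "yes" answers may be false positives.
import hashlib

M = 1024  # number of bits in the filter
K = 3     # number of hash functions

def _positions(item: str):
    # Derive K positions by salting a cryptographic hash (a toy choice;
    # real implementations use cheaper hash families).
    for i in range(K):
        digest = hashlib.sha256(f"{i}:{item}".encode()).hexdigest()
        yield int(digest, 16) % M

def add(bits: list, item: str) -> None:
    for pos in _positions(item):
        bits[pos] = 1

def might_contain(bits: list, item: str) -> bool:
    # False -> definitely absent; True -> present or false positive
    return all(bits[pos] for pos in _positions(item))

bits = [0] * M
for kmer in ["ACGTT", "GGTCA", "TTACG"]:  # toy "k-mers"
    add(bits, kmer)

print(might_contain(bits, "ACGTT"))  # True
print(might_contain(bits, "AAAAA"))  # almost certainly False
```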

The Lair. [pipelines, automation, reprocessing] Lior Pachter and the rest of the gang behind popular RNA-seq analysis tools Kallisto and Sleuth have taken their concept further with Lair, a platform for interactive re-analysis of published RNA-seq datasets. They use a Snakemake based analysis pipeline to process and analyze experiments in a consistent way – see the example analyses listed here. Anyone can request a similar re-analysis of a published data set by providing a config file, design matrix and other details as described here.

Toil. [pipelines, automation, reprocessing] The abstract of this paper, which was recently submitted to bioRxiv, states: “Toil is portable, open-source workflow software that supports contemporary workflow definition languages and can be used to securely and reproducibly run scientific workflows efficiently at large-scale. To demonstrate Toil, we processed over 20,000 RNA-seq samples to create a consistent meta-analysis of five datasets free of computational batch effects that we make freely available. Nearly all the samples were analysed in under four days using a commercial cloud cluster of 32,000 preemptable cores.” The authors used their workflow software to quantify expression in four studies: The Cancer Genome Atlas (TCGA), Therapeutically Applicable Research To Generate Effective Treatments (TARGET), Pacific Pediatric Neuro-Oncology Consortium (PNOC), and the Genotype Tissue Expression Project (GTEx).

EBI’s RNA-seq-API. [API, discovery, reprocessing, compendium] The RESTful RNA-seq Analysis API provided by the EBI currently contains raw, FPKM and TPM gene and exon counts for a staggering 265,000 public sequencing runs in 264 different species, as well as ftp locations of CRAM, bigWig and bedGraph files. See the documentation here.

Digital Expression Explorer. [reprocessing, compendium] This resource contains hundreds of thousands of uniformly processed RNA-seq data sets (e.g., >73,000 human data sets and >97,000 mouse ones). The data sets were processed into gene-level counts, which led to some Twitter debate between the transcript-level quantification hardliners and the gene-count-tolerant communities, if I may label the respective camps in that way. These data sets can be downloaded in bulk.

CompendiumDb. [API, discovery] This is an R package that facilitates the programmatic retrieval of functional genomics data (i.e., often gene expression data) from the Gene Expression Omnibus (GEO), one of the main repositories for this kind of data.

Omics Discovery Index (OmicsDI). [discovery] This is described as a “Knowledge Discovery framework across heterogeneous data (genomics, proteomics and metabolomics)” and is mentioned here both because a lot of it is gene expression data and because it seems like a good resource for finding data across different experimental types for the same conditions.

MetaRNASeq. [discovery] A browser-based query system for finding RNA-seq experiments that fulfill certain search criteria. Seems useful when looking for data sets from a certain disease state, for example.

Tradict. [applications of meta-analysis] In this study, the authors analyzed 23,000 RNA-seq experiments to find out whether gene expression profiles could be reconstructed from a small subset of just 100 marker genes (out of perhaps 20,000 available genes). The authors claim that it works well, and the manuscript contains some really interesting graphs showing, for example, how most of the variation in gene expression is driven by developmental stage and tissue.
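As a much-simplified illustration of the idea – not Tradict’s actual model – one can fit a linear map from a handful of “marker” genes to the full profile on synthetic low-rank data, and check how well held-out profiles are reconstructed from the markers alone:

```python
# Toy marker-based reconstruction. Everything here is synthetic; the
# point is only that low intrinsic dimensionality (a few latent
# "programs" driving all genes) makes reconstruction from few markers
# feasible.
import numpy as np

rng = np.random.default_rng(0)
n_samples, n_markers, n_genes = 200, 10, 300

# All genes driven by 3 latent programs plus a little noise
latent = rng.normal(size=(n_samples, 3))
loadings = rng.normal(size=(3, n_genes))
expr = latent @ loadings + 0.05 * rng.normal(size=(n_samples, n_genes))

markers = expr[:, :n_markers]        # pretend the first 10 genes are markers
train, test = slice(0, 150), slice(150, 200)

# Least-squares map from markers to the full profile
W, *_ = np.linalg.lstsq(markers[train], expr[train], rcond=None)
recon = markers[test] @ W

corr = np.corrcoef(recon.ravel(), expr[test].ravel())[0, 1]
print(round(corr, 2))  # close to 1 for this low-rank synthetic data
```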

In case you think that these types of meta-analysis are only doable with large computing clusters with lots of processing power and storage, you’ll be happy to find out that it is easy to analyze RNA-seq experiments in a streaming fashion, without having to download FASTQ or even BAM files to disk (Valentine Svensson wrote a nice blog post about this), and with tools such as Kallisto, it does not really take that long to quantify the expression levels in a sample.

Finally, I’ll acknowledge that the discovery-oriented tools above (APIs, metadata search etc) still work on the basis of knowing what kind of data set you are looking for. But another interesting way of searching for expression data would be querying by content, that is, showing a search system the data you have at hand and asking it to provide the data sets most similar to it. This is discussed in the cumulative biology paper mentioned at the start of this blog post: “Instead of searching for datasets that have been described similarly, which may not correspond to a statistical similarity in the datasets themselves, we would like to conduct that search in a data-driven way, using as the query the dataset itself or a statistical (rather than a semantic) description of it.” In a similar vein, Titus Brown has discussed using MinHash signatures for identifying similar samples and finding collaborators.
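A toy MinHash example shows why such signatures work for querying by content: each sample’s k-mer set is condensed to the minimum hash values under a battery of hash functions, and the fraction of agreeing minima estimates the Jaccard similarity of the sets. The sequences and parameters below are invented for illustration:

```python
# MinHash in miniature: short signatures whose agreement estimates
# Jaccard similarity between k-mer sets.
import hashlib

def kmers(seq: str, k: int = 4):
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}

def minhash(items, num_hashes: int = 64):
    # One salted hash function per signature slot (a toy construction)
    return [min(int(hashlib.sha256(f"{i}:{x}".encode()).hexdigest(), 16)
                for x in items)
            for i in range(num_hashes)]

def similarity(sig_a, sig_b) -> float:
    # Fraction of matching minima ~ Jaccard similarity of the sets
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)

a = kmers("ACGTACGTACGTTTGACGT")
b = kmers("ACGTACGTACGTTTGACCT")   # nearly the same sequence as a
c = kmers("GGGGCCCCAAAATTTTGGCC")  # mostly unrelated

sa, sb, sc = minhash(a), minhash(b), minhash(c)
print(similarity(sa, sb) > similarity(sa, sc))  # True: a and b share most k-mers
```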

The Allen Institute for Brain Science has done a tremendous amount of work to digitalize and make available information on gene expression at a fine-grained level both in the mouse brain and the human brain. The Allen Brain Atlas contains a lot of useful information on the developing brain in mouse and human, the aging brain, etc. – both via GUIs and an API.

Among other things, the Allen institute has published gene expression data for healthy human brains divided by brain structure, assessed using both microarrays and RNA sequencing. In the RNA-seq case (which I have been looking at for reasons outlined below), two brains have been sectioned into 121 different parts, each representing one of many anatomical structures. This gives “region-specific” expression data which are quite useful for other researchers who want to compare their brain gene expression experiments to publicly available reference data. Note that each of the defined regions is still a mix of cell types (various kinds of neurons, astrocytes, oligodendrocytes etc.), albeit one resolved by brain region. (Update 2016-07-22: The recently released R package ABAEnrichment seems very useful for a more programmatic approach than the one described here to accessing information about brain structure and cell type specific genes in Allen Brain Atlas data!)

As I have been working on a few projects concerning gene expression in the brain in some specific disease states, there has been a need to compare our own data to “control brains” which are not (to our knowledge) affected by any disease. In one of the projects, it has also been of interest to compare gene expression profiles to expression patterns in specific brain regions. As these projects both used RNA sequencing as their method of quantifying gene (or transcript) expression, I decided to take a closer look at the Allen Institute brain RNA-seq data and eventually ended up writing a small interactive app which is currently hosted at https://mikaelhuss.shinyapps.io/ExploreAllenBrainRNASeq/ (as well as a back-up location available on request if that one doesn’t work.)

A screenshot of the Allen brain RNA-seq visualization app

The primary functions of the app are the following:

(1) To show lists of the most significantly up-regulated genes in each brain structure (genes that are significantly more expressed in that structure than in others, on average). These lists are shown in the upper left corner, and a drop-down menu below the list marked “Main structure” is used to select the structure of interest. As there are data from two brains, the expression level is shown separately for these in units of TPM (transcripts per million). Apart from the columns showing the TPM for each sampled brain (A and B, respectively), there is a column showing the mean expression of the gene across all brain structures, and across both brains.
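For readers unfamiliar with the unit: TPM values are obtained from raw read counts by first dividing by transcript length and then scaling each sample so that the values sum to one million. A toy sketch (the numbers are invented, not Allen data):

```python
# TPM from raw counts: length-normalize first, then scale the sample
# to sum to one million.
def tpm(counts, lengths_kb):
    rates = [c / l for c, l in zip(counts, lengths_kb)]  # reads per kilobase
    total = sum(rates)
    return [r / total * 1e6 for r in rates]

counts = [100, 300, 600]   # reads mapped to three genes (toy values)
lengths = [1.0, 2.0, 3.0]  # gene lengths in kilobases (toy values)

values = tpm(counts, lengths)
print(round(sum(values)))  # 1000000 -- TPMs always sum to a million
```

This within-sample normalization is what makes the A-vs-B brain columns in the app directly comparable.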

(2) To show box plots comparing the distribution of TPM expression levels in the structure of interest (the one selected in the “Main structure” drop-down menu) with the TPM distribution in other structures. This can be done for one of the brains or for both. You might wonder why there is a “distribution” of expression values within a structure; the reason is simply that many samples (biopsies) were taken from the same structure.

So one simple usage scenario would be to select a structure in the drop-down menu, say “Striatum”, and press the “Show top genes” button. This would render a list of genes topped by PCP4, which has a mean TPM of >4,300 in brain A and >2,000 in brain B, but just ~500 on average in all regions. Now you could select PCP4, copy and paste it into the “gene” textbox and click “Show gene expression across regions.” This should render a (ggplot2) box plot partitioned by brain donor.

There is another slightly less useful functionality:

(3) The lower part of the screen is occupied by a principal component plot of all of the samples colored by brain structure (whereas the donor’s identity is indicated by the shape of the plotting character.) The reason I say it’s not so useful is that it’s currently hard-coded to show principal components 1 and 2, while I ponder where I should put drop-down menus or similar allowing selection of arbitrary components.

The PCA plot clearly shows that most of the brain structures are similar in their expression profiles, apart from cerebral cortex, globus pallidus and striatum, which form their own clusters consisting of samples from both donors. In other words, the gene expression profiles of these structures are distinct enough not to be overshadowed by batch effects, donor effects and other confounders.
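The computation behind such a plot is standard PCA. A minimal sketch on synthetic data (not the Allen expression matrix), where a subset of samples carries a shared expression shift the way a distinct brain structure would, could look like this:

```python
# PCA via SVD on a toy samples-x-genes matrix. The first 10 "samples"
# get a shared expression shift, mimicking a transcriptionally
# distinct structure; they separate from the rest along PC1.
import numpy as np

rng = np.random.default_rng(1)
n_samples, n_genes = 30, 50
X = rng.normal(size=(n_samples, n_genes))
X[:10] += 3 * rng.normal(size=n_genes)  # shared shift for one "structure"

Xc = X - X.mean(axis=0)                  # center each gene
U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
pcs = U * S                              # principal-component scores

print(pcs.shape)  # (30, 30): one score per sample and component
```

In the app, the first two columns of such a score matrix are what gets plotted, colored by structure and shaped by donor.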

The frequently miserable Swedish weather often makes me think “Is it just me, or is it unusually cold today?” Occasionally, it’s the reverse scenario – “Hmm, seems weirdly warm for April 1st – I wonder what the typical temperature this time of year is?” So I made myself a little Shiny app which is now hosted here. I realize it’s not so interesting for people who don’t live in Stockholm, but then again I have many readers who do … and it would be dead simple to create the same app for another Swedish location, and probably many other locations as well.

The app uses three different data sources, all from the Swedish Meteorological and Hydrological Institute (SMHI). The estimate of the current temperature is taken from the “latest hour” data for Stockholm-Bromma (query). For the historical temperature data, I use two different sources with different granularity. There is a data set that goes back to 1756 which contains daily averages, and another one that goes back to 1961 but which has temperatures at 06:00 (6 am), 12:00 (noon) and 18:00 (6 pm). The latter one makes it easier to compare to the current temperature, at least if you happen to be close to one of those times.
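The comparison the app makes boils down to placing the current reading in the historical distribution for the same calendar day and time of day. A toy sketch with invented temperatures (not actual SMHI data):

```python
# Is the current temperature unusual for this date? Compare it to the
# historical distribution for the same calendar day via a z-score.
from statistics import mean, pstdev

# Noon temperatures on April 1st in previous years (invented values)
historical = [2.1, 4.5, 3.0, 6.2, 1.8, 5.1, 3.9, 4.4, 2.7, 5.6]
current = -1.0

mu, sigma = mean(historical), pstdev(historical)
z = (current - mu) / sigma

print(z < -2)  # True: more than two standard deviations colder than typical
```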