Planetarium:

July 28, 2015

There are several different low-pass filters, and as many high-pass, band-pass, band-stop… filters. In Audio Toolkit, several of the usual implementations are available:

Bessel

Butterworth

Chebyshev type 1

Chebyshev type 2

Second order

Linkwitz-Riley

RBJ

and it is possible to implement others, with different orders as well…

Here is a display of the behavior of these filters (at order 2 for those that support different orders), in amplitude and phase:

Low-pass filter comparison

What is interesting is that for low-pass filters, the Butterworth filter is the same as the second-order filter implementation. Linkwitz-Riley filters have a lower amplitude than the others, and Chebyshev type 2 behaves quite differently from all the others (because its prototype is different: it is based on a band-stop design rather than a band-pass one).
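
For readers who want to reproduce this kind of comparison without Audio Toolkit, here is a minimal SciPy sketch; the cutoff, ripple and attenuation values below are arbitrary placeholders, not the ones used for the figure above.

import numpy as np
import matplotlib.pyplot as plt
from scipy import signal

# Order-2 low-pass designs at an arbitrary normalized cutoff of 0.2 (Nyquist = 1).
filters = {
    'Butterworth': signal.butter(2, 0.2),
    'Bessel': signal.bessel(2, 0.2),
    'Chebyshev type 1 (1 dB ripple)': signal.cheby1(2, 1, 0.2),
    'Chebyshev type 2 (20 dB stopband)': signal.cheby2(2, 20, 0.2),
}

for name, (b, a) in filters.items():
    w, h = signal.freqz(b, a)
    plt.semilogx(w / np.pi, 20 * np.log10(np.maximum(abs(h), 1e-10)), label=name)
plt.xlabel('normalized frequency')
plt.ylabel('amplitude (dB)')
plt.legend()
plt.show()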

July 27, 2015

So, for some unknown reason I recently thought it might be good to start
putting together some video tutorials for Python. I thought I’d start with
something I was sure I knew a fair bit about: writing GUI applications with
PyQt (+PySide).

It turned out to be a lot more work than I thought, but rewarding to get the
finished product. Like most programming tutorials it is predominantly screencast
based, but I’ve included short lecture-like segments in each part to get
across the key points. For those interested in such things, the setup I used was —

The course is built around communicating the basics needed to get up and running, and then moves on to create a demo application — a tabbed web browser — in PyQt. Since it’s my first attempt at doing this I’m more interested in getting feedback on:

Is it any good?

How can I improve it?

I would also be interested to hear your thoughts on Udemy as a platform - both
on course quality and value for money. It’s obvious that the pricing
strategy there is to price high and discount hard, which seems a bit disingenuous.
But there is no arguing with the audience that it brings.

So sign up
and let me know what you think, in either a review, a message or the comments.

Weeks 8 and 9

In weeks 8 and 9, I implemented DirichletProcessGaussianMixture, but its behavior looks similar to BayesianGaussianMixture. Both of them can infer the best number of components. DirichletProcessGaussianMixture took slightly more iterations than BayesianGaussianMixture to converge on the Old Faithful data set, around 60 iterations.

If we solve the Dirichlet process mixture by Gibbs sampling, we don't need to specify the
truncation level $T$; the concentration parameter $\alpha$ is enough. On the other hand, with variational inference, we still need to specify the maximal possible number of components, i.e., the truncation level.
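
To make the truncation concrete, here is a small NumPy sketch (illustration only, not scikit-learn code) of the truncated stick-breaking construction that variational inference works with; Gibbs sampling needs no such truncation level:

import numpy as np

def stick_breaking_weights(alpha, T, rng=np.random.default_rng(0)):
    # Truncated stick-breaking: beta_t ~ Beta(1, alpha) and
    # pi_t = beta_t * prod_{s<t} (1 - beta_s); the stick at the
    # truncation level T takes all of the remaining mass.
    beta = rng.beta(1.0, alpha, size=T)
    beta[-1] = 1.0
    remaining = np.concatenate([[1.0], np.cumprod(1.0 - beta[:-1])])
    return beta * remaining

print(stick_breaking_weights(alpha=1.0, T=10))  # the T weights sum to 1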

At first, the lower bound of DirichletProcessGaussianMixture seemed a little strange. It is not always going up: when some clusters disappear, it goes down a little bit, then goes straight up. I think it is because the estimation of the parameters is ill-posed when these clusters have fewer data samples than the number of features. I did the math derivation of Dirichlet process mixture models again, and found it was a bug in the code for a very long equation.

I also finished the code of BayesianGaussianMixture for 'tied', 'diag' and 'spherical' precision.

My mentor pointed out the style problems in my code and docstrings. I knew the PEP 8 convention, but had no idea there was also a convention for docstrings, PEP 257. It took me a lot of time to fix the style problems.

Progress report 2

During the last 5 weeks (since the progress report 1), I finished the

GaussianMixture with four kinds of covariance

Most test cases of GaussianMixture

BayesianGaussianMixture with four kinds of covariance

DirichletProcessGaussianMixture

Although I spent some time on unsuccessful attempts, such as decoupling the observation models and hidden models as mixin classes and double-checking the DP equations, I did finish the most essential part of my project and did some visualization. In the following 4 weeks, I will finish all the test cases for BayesianGaussianMixture and DirichletProcessGaussianMixture, and do some optional tasks, such as different covariance estimators and incremental GMM.

July 24, 2015

Before I get started I just want to let you know that in this post I will talk about the future of my career and moving beyond the GSoC so this will only be indirectly related to the summer of code.

As you may or may not know, I will start my MSc in Applied Computing at the University of Toronto in September (2015, in case you're reading this in the future). Well, I have decided to steer towards topics like Machine Learning, Computer Vision and Natural Language Processing.

While I still don't know what I will end up having as my main area of focus, nor where this new journey will take me, I am pretty sure it will have to do with Data Science and Python. I am also sure that I will keep contributing to SciPy and most likely start contributing to other related communities like NumPy, pandas and scikit-learn. So you could say that the GSoC has had a positive impact by helping me find areas that make my motivation soar and introducing me to people who have been working in this realm for a very long time and know a ton of stuff that makes me want to pick up a book and learn.

In my latest meeting with Ralf (my mentor), we had a discussion regarding the growing scope of the GSoC project and my concern about dealing with all the unforeseen and ambiguous details that arise along the way. He seemed oddly pleased when I proposed to keep in touch with the project even after the "pencils down" date for the GSoC. He then explained that this is the purpose of the summer of code (to bring together students and organisations), and their hope when they choose a student to participate is that he/she will become a long-term active member of the community, which is precisely what I would like to do.

I have many thanks to give and there is still a lot of work to be done with the project so I will save the thank you speech for later. For now I just want to say that this has been a great experience and I have already gotten more out of it than I had hoped (which was a lot).

A lot of stuff has happened in the last couple of weeks. The project is coming along nicely and I am now getting into some of the bulky parts of it.

There is an issue with the way NaN (not a number) checks are handled that spans beyond SciPy. Basically, there is no consensus on how to deal with NaN values when they show up. In statistics they are often assumed to be missing values (e.g. there was a problem when gathering statistical data and the value was lost), but there is also the IEEE NaN, which is defined as 'undefined' and can be used to indicate out-of-domain values that may point to a bug in one's code or a similar problem.

Long story short, the outcome of this will largely depend on the way projects like pandas and NumPy decide to deal with it in the future, but right now for SciPy we decided that we should not get into the business of assuming that NaN values signify 'missing', because that is not always the case and it may end up silently hiding bugs, leading to incorrect results without the user's knowledge. Therefore, I am now implementing a backwards-compatible API addition that will allow the user to define whether to ignore NaN values (assume they are missing), treat them as undefined, or raise an exception. This is a long-term effort that may span the entire stats module and beyond, so the work I am doing now is set to spearhead future development.
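
For illustration only, the kind of keyword this leads to looks roughly like the sketch below; the actual parameter name and option strings that end up in scipy.stats may differ.

import numpy as np

def _check_nan(a, nan_policy='propagate'):
    # Hypothetical helper showing the three behaviours described above.
    a = np.asarray(a)
    contains_nan = np.isnan(a).any()
    if contains_nan and nan_policy == 'raise':
        raise ValueError("The input contains NaN values")
    if contains_nan and nan_policy == 'omit':
        # Treat NaNs as missing values and drop them.
        return a[~np.isnan(a)], contains_nan
    # 'propagate': leave the NaNs in place, so the statistic will be NaN.
    return a, contains_nan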

Another big issue is the consistency of the `scipy.stats` module with its masked arrays counterpart `scipy.stats.mstats`. The implementation will probably not be complicated, but it encompasses somewhere around 60 to 80 functions, so I expect it to be a large and time-consuming effort. I expect to work on this for the next month or so.

During the course of the last month or two there have been some major developments in my life that are indirectly related to the project, so I feel like they should be addressed, but I intend to do so in a separate post. For now I bid you farewell and thank you for reading.

Note: at the Lab for Data Intensive Biology, we're trying out a new journal club
format where we summarize our thoughts on the paper in a blog post.
For this blog post, Luiz wrote the majority of the text and the rest
of us added questions and comments.

Summary

The method is described as a five-stage read funneling approach, which
successively reduces the number and complexity of the decisions that
need to be made.

Stage 1: Region selection

Gapped spaced seeds are not the same as what we discussed in our last journal club, but
they are an interesting strategy for selecting seeds to extend alignments.

Based on Levenshtein distance, it uses gaps
inside k-mers to consider mismatches and indels. Three shapes are
used, with one or two positions being DC ("don't care") bases:

1110111: DC base can be either a match or mismatch

111111: one deletion at the specified position (?)

11100111: DC base and following base are skipped.
At most one insertion and one match/mismatch.

(Can we use longer shapes?
These ones are for fairly small _k_;
if we can extend the idea to arbitrary _k_, it might be useful for seeding on graphalign.)

Hough transforms are used in a clever way to bin seeds into viable regions,
but it depends on reference coordinates (so it is not so useful for us
in the absence of a decent reference).

Stage 2: Graph-based vertex-centric construction of anchors

The 'graph' part of GraphMap comes from the next step,
where the seeds are processed to construct alignment anchors.
Given a target and a query,
a "kmer mapping graph" (a DAG) is built for the target.
In a "kmer mapping graph" distinct nodes can represent the same k-mer.
For example,
the first step in constructing a 2-mer mapping graph for CTAATATC would be
CT -> TA -> AA -> AT -> TA -> AT -> TC
(note nodes TA, AT appearing twice).
Then, for each vertex an edge is added for every successor until the last node is reached.
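
A small Python sketch of our reading of this construction (one node per k-mer occurrence, then an edge from each node to every node that follows it); this is our interpretation, not code from the paper:

from collections import defaultdict

def kmer_mapping_graph(seq, k=2):
    # One node per k-mer occurrence, so the same k-mer can appear as several nodes.
    nodes = [seq[i:i + k] for i in range(len(seq) - k + 1)]
    edges = defaultdict(list)
    for i in range(len(nodes)):
        for j in range(i + 1, len(nodes)):
            edges[i].append(j)  # edge from occurrence i to the later occurrence j
    return nodes, dict(edges)

nodes, edges = kmer_mapping_graph("CTAATATC", k=2)
print(nodes)  # ['CT', 'TA', 'AA', 'AT', 'TA', 'AT', 'TC']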

These graphs are small,
since target and query are a read sequence and a single region of the reference
(for memory consumption purposes,
read sequences are usually smaller and so are used as the target).
A new index is constructed for the target on the fly,
with a much smaller seed size (defaults to k=6).
For k < 10 perfect hashing is used,
for k >= 10 a suffix array.

After the graph is ready,
the mapping works by walking along target and query simultaneously.
The query is processed as a sliding window,
and an edge is followed to extend the walk each step in the target,
while keeping track of all walks corresponding to potential mapping sites.

There are similarities to how partial order alignment works,
but how is this stage any different than just doing DP?

Stage 3: Extending anchors into alignments using LCS

Just a summary of stages 4 and 5: after we have extended anchors in
stage 3, we will have a set of points representing the alignments from
LCSk, mixed with a set of noise (indels, sequencing errors, etc.). To
refine these alignments, we need to draw a line that best fits these
points. This is done using linear regression, which fits the
alignment's "predictive model" to the observed points, the list of
anchors.

The points that lie within a given distance dl1 on either side of the line
represent our best alignments; the points that deviate more should
be discarded.

Then the standard deviation of anchors from this line, the number of exact k-mers
covered by anchors around the line, the length of the query that matched the
target, the number of bases covered by anchors (normalized) and the read
length are used to compute an f score. The region with the highest f
score is picked for the final alignment.

(I think the reference coordinates c can be estimated from the position on the read and the position of the hit, so we still can use Hough Transform?)

Stages 4 and 5 seem heavyweight.

Questions and comments:

We are unclear on how the index would be constructed in the absence of a
reference, as in the case of a de novo assembly. The last paragraph of the
Discussion states: "GraphMap’s sensitivity and specificity as a
mapper could thus serve as the basis for fast computation of overlap
alignments and de novo assemblies in the future." But the algorithm from
Figure 1b and the Methods (Stage I: region selection) seems to be based on
seed finding between query and reference. What is used as the reference
when there isn't one?

We were impressed with Figure 3a, the difference between GraphMap and the LAST
mapping tool in the consensus sequences (colored = low
confidence, grey = higher confidence). We would have liked to see
their BWA-MEM and BLASR results, although Figure 3b suggests LAST
was closer to GraphMap with higher coverage.

Reasons why we care include new ONT technology in field
applications, e.g. identifying pathogens in remote locations with a
local install. For the species predictions in Table 3, the F1 score (the harmonic
mean of precision and recall) is higher for GraphMap. There is a need for more
testing with real ONT data (just 3 species were tested in this paper) and with
higher complexity, e.g. pathogenic microbial eukaryotes?

We were a bit surprised that longest-common-subsequence works so
well with ONT, but that's why they did it with only the subsequences
extracted after the graph approach.

"Our comparisons with BLAST suggest that reads that cannot be mapped
by GraphMap may essentially be unmappable." Did they characterize
these reads at all?

What's the memory usage? Largely or entirely unmentioned.

We were confused by the gapped spaced seeds / gapped q-gram filter
stuff. (p. 10)

We do not think they tested for genome scaling appropriately. They
need to show an example for, say, a whole human genome. As Lisa
noticed, there is no change in precision with their bigger genomes.

The clinical application is very interesting. They did not compare the
precision of other mappers using strain-specific sequences.

July 23, 2015

Update, July 25, 2015: I included some new plots suggested by my colleague Ariel Rokem. Scroll to the end!

Last year I wrote a blog post examining trends in Seattle bicycling and how they relate to weather, daylight, day of the week, and other factors.

Here I want to revisit the same data from a different perspective: rather than making assumptions in order to build models that might describe the data, I'll instead wipe the slate clean and ask what information we can extract from the data themselves, without reliance on any model assumptions. In other words, where the previous post examined the data using a supervised machine learning approach for data modeling, this post will examine the data using an unsupervised learning approach for data exploration.

Along the way, we'll see some examples of importing, transforming, visualizing, and analyzing data in the Python language, using mostly Pandas, Matplotlib, and Scikit-learn. We will also see some real-world examples of the use of unsupervised machine learning algorithms, such as Principal Component Analysis and Gaussian Mixture Models, in exploring and extracting meaning from data.

To spoil the punchline (and perhaps whet your appetite) what we will find is that from analysis of bicycle counts alone, we can make some definite statements about the aggregate work habits of Seattleites who commute by bicycle.

The counts show both a strong seasonal variation, as well as a local structure that can be partially accounted for by temperature, time of year, precipitation, and other factors.

Extracting Knowledge from the Data

From here, we could do a variety of other visualizations based on our intuition about what might affect bicycle counts. For example, we could look at the effect of the days of the week, the effect of the weather, and other factors that I explored previously. But we could also proceed by letting the dataset speak for itself, and use unsupervised machine learning techniques (that is, machine learning without reference to data labels) to learn what the data have to tell us.

We will consider each day in the dataset as its own separate entity (or sample, in usual machine learning parlance). For each day, we have 48 observations: two observations (east and west sidewalk sensors) for each of the 24 hour-long periods. By examining the days in light of these observations and doing some careful analysis, we should be able to extract meaningful quantitative statements from the data themselves, without the need to lean on any other assumptions.

Transforming the Data

The first step in this approach is to transform our data; essentially we will want a two-dimensional matrix, where each row of the matrix corresponds to a day, and each column of the matrix corresponds to one of the 48 observations. We can arrange the data this way using the pivot_table() function in Pandas. We want the "East" and "West" column values, indexed by date, and separated by hour of the day. Any missing values we will fill with zero:
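
A rough sketch of that pivot, assuming the hourly counts live in a DataFrame called data with a DatetimeIndex and "East"/"West" columns (the exact code from the original notebook is not reproduced here):

import pandas as pd

# 'data': hourly bicycle counts with a DatetimeIndex and East/West columns.
pivoted = data.pivot_table(values=['East', 'West'],
                           index=data.index.date,
                           columns=data.index.hour,
                           fill_value=0)
X = pivoted.values
print(X.shape)  # roughly (number of days, 48)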

Our data consists of just over 1000 days, each with the aforementioned 48 measurements.

Visualizing the Data

We can think of this data now as representing 1001 distinct objects which live in a 48-dimensional space: the value of each dimension is the number of bicycle trips measured on a particular side of the bridge at a particular hour. Visualizing 48-dimensional data is quite difficult, so instead we will use a standard dimensionality reduction technique to project this to a more manageable size.

The technique we'll use is Principal Component Analysis (PCA), a fast linear projection which rotates the data such that the projection preserves the maximum variance. We can ask for components preserving 90% of the variance as follows:
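
A minimal scikit-learn sketch of this step, using the X matrix from the pivot above (the exact call in the original notebook may differ):

from sklearn.decomposition import PCA

# Keep enough components to explain 90% of the variance in the daily patterns.
pca = PCA(n_components=0.9)
Xpca = pca.fit_transform(X)
print(Xpca.shape)  # (number of days, 2)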

The output has two dimensions, which means that these two projected components describe at least 90% of the total variance in the dataset. While 48-dimensional data is difficult to plot, we certainly know how to plot two-dimensional data: we'll do a simple scatter plot, and for reference we'll color each point according to the total number of trips taken that day:

We see that the days lie in two quite distinct groups, and that the total number of trips increases along the length of each projected cluster. Further, the two groups begin to be less distinguishable when the number of trips during the day is very small.

I find this extremely interesting: from the raw data, we can determine that there are basically two primary types of days for Seattle bicyclists. Let's model these clusters and try to figure out what these types-of-day are.

Unsupervised Clustering

When you have groups of data you'd like to automatically separate, but no previously-determined labels for the groups, the type of algorithm you are looking at is a clustering algorithm. There are a number of clustering algorithms out there, but for nicely-defined oval-shaped blobs like we see above, Gaussian Mixture Models are a very good choice. We can compute the Gaussian Mixture Model of the data using, again, scikit-learn, and quickly plot the predicted labels for the points:
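
A sketch of that fit, written with the current scikit-learn class name GaussianMixture (at the time of this post the class was still called GMM); the parameter values are assumptions:

import matplotlib.pyplot as plt
from sklearn.mixture import GaussianMixture

# Two oval-shaped blobs in the projected space -> two mixture components.
gmm = GaussianMixture(n_components=2, covariance_type='full', random_state=0)
gmm.fit(Xpca)
cluster = gmm.predict(Xpca)

plt.scatter(Xpca[:, 0], Xpca[:, 1], c=cluster, cmap='rainbow')
plt.xlabel('projected component 1')
plt.ylabel('projected component 2')
plt.show()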

These plots give us some insight into the interpretation of the two clusters: the first cluster shows a sharp bimodal traffic pattern, while the second shows a wide unimodal pattern.

In the bimodal cluster, we see a peak around 8:00am which is dominated by cyclists on the west sidewalk, and another peak around 5:00pm which is dominated by cyclists on the east sidewalk. This is very clearly a commute pattern, with the majority of cyclists riding toward downtown Seattle in the morning, and away from downtown Seattle in the evening.

In the unimodal cluster, we see fairly steady traffic in each direction beginning early in the morning and going until late at night, with a peak around 2:00 in the afternoon. This is very clearly a recreational pattern of use, with people out riding through the entire day.

I find this is fascinating: from simple unsupervised dimensionality reduction and clustering, we've discovered two distinct classes of days in the data, and found that these classes have very intuitive explanations.

Seattle's Work Habits

Let's go one step deeper and figure out what we can learn about people (well, bicycle commuters) in Seattle from just this hourly commute data. As a rough approximation, you might guess that these two classes of data might be largely reflective of workdays in the first cluster, and non-work days in the second. We can check this intuition by re-plotting our projected data, except labeling them by day of the week:

We see that the weekday/weekend intuition holds, but only to a point: in particular, it is clear that there are a handful of weekdays which follow the typical weekend pattern! Further, it's interesting to note that Fridays tend to be pulled closer to weekend days in this plot, though as a whole they still fall solidly in the work-day cluster.

Let's take a closer look at the "special" weekdays that fall in the "wrong" cluster. We start by constructing a dataset listing the cluster id and the day of the week for each of the dates in our dataset:
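
Roughly, and reusing names from the sketches above (pivoted, Xpca, gmm), the construction and the holiday lookup that explains the outliers could look like the following; the date range passed to the calendar is an assumption:

import pandas as pd
from pandas.tseries.holiday import USFederalHolidayCalendar

dates = pd.DatetimeIndex(pivoted.index)
results = pd.DataFrame({'cluster': gmm.predict(Xpca),
                        'dayofweek': dates.dayofweek}, index=dates)

# Weekdays (dayofweek < 5) that land in the weekend-like cluster line up
# with the US federal holiday calendar:
cal = USFederalHolidayCalendar()
holidays = cal.holidays('2012', '2016', return_name=True)
holidays.head()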

2012-01-02 New Years Day
2012-01-16 Dr. Martin Luther King Jr.
2012-02-20 Presidents Day
2012-05-28 MemorialDay
2012-07-04 July 4th
dtype: object

Just for completeness, we will add to the list the day before and day after each of these holidays:

In [18]:

holidays_all = pd.concat([holidays,
                          "Day Before " + holidays.shift(-1, 'D'),
                          "Day After " + holidays.shift(1, 'D')])
holidays_all = holidays_all.sort_index()
holidays_all.head()

Out[18]:

2012-01-01 Day Before New Years Day
2012-01-02 New Years Day
2012-01-03 Day After New Years Day
2012-01-15 Day Before Dr. Martin Luther King Jr.
2012-01-16 Dr. Martin Luther King Jr.
dtype: object

Note that these are observed holidays, which is why New Years Day 2012 falls on January 2nd. With this ready to go, we can compute the complete list of non-weekend days on which Seattle bicycle commuters as a whole chose to stay home from work:

Update: What's up with Fridays?

A colleague of mine, Ariel Rokem, saw the first version of this post and noticed something interesting. For the most part, Fridays tend to lie on the upper side of the weekday cluster, closer in this parameter space to the typical weekend pattern. This pattern holds nearly universally for Fridays, all except for three strange outliers which lie far on the other side of the cluster.

We can see these more clearly if we highlight the Friday points in the plot:

Apparently these three strange Fridays are days with extreme amounts of bicycle commuting. But what makes them so special?

After some poking-around on the internet, the answer becomes clear: we've discovered Seattle's annual bike to work day. Mystery solved!

Summary

We have seen here that by taking a close look at raw bicycle counts and using some basic visualization and unsupervised machine learning, we can make some very definite statements about the overall work habits of people in Seattle who bicycle to work across the Fremont bridge. In summary, this is what we have learned:

Seattle cyclists, as a whole, tend to take off work New Year's Day, Memorial Day, Independence Day, Labor Day, and the days surrounding Thanksgiving and Christmas.

Seattle cyclists, as a whole, tend to head to the office on the more minor US holidays: Columbus Day, Martin Luther King Jr. Day, Presidents Day, and Veterans Day.

Seattle cyclists, as a whole, would never, ever, be caught at work on a weekend.

Thanks for reading!

This post was written entirely in the IPython notebook. You can download this notebook, or see a static view here.

Starting Friday, July 24th, the Continuum Analytics team will be in Washington state for PyData Seattle. Join us at the
Microsoft Conference Center for three days of Python tutorials, talks, and social events.

Three years ago, Travis and I founded a company with a mission to promote and foster the growth of better tools for computational science. Both of us are scientists by training, and both of us have been math and science nerds from childhood. Neither of us can resist geeking out on math and physics when we chat with customers and users about their analytical problems.

General Catalyst Partners and BuildGroup Co-Invest

AUSTIN, TX. — July 23, 2015 —Continuum Analytics, which develops Anaconda, the leading modern open source analytics platform powered by Python, has secured a $24 million Series A funding round led by General Catalyst Partners and BuildGroup. The funds will be leveraged to accelerate product development, expand sales and marketing, and continue to invest in strengthening the Python and Anaconda communities. The latest round of $24 million in funding comes after $10 million in initial funding from a combination of angel investors and a DARPA award, bringing total funding to-date to $34 million.

tl;dr: We motivate the expansion of parallel programming beyond big
collections. We discuss the usability of custom dask graphs.

Recent Parallel Work Focuses on Big Collections

Parallel databases, Spark, and Dask collections all provide large distributed
collections that handle parallel computation for you. You put data into the
collection, program with a small set of operations like map or groupby, and
the collections handle the parallel processing. This idea has become so
popular that there are now a dozen projects promising big and friendly Pandas
clones.

This is good. These collections provide usable, high-level interfaces for a
large class of common problems.

Custom Workloads

However, many workloads are too complex for these collections. Workloads might
be complex either because they come from sophisticated algorithms
(as we saw in a recent post on SVD) or because they come from the real world,
where problems tend to be messy.

In these cases I tend to see people do two things

Fall back to multiprocessing, MPI or some other explicit form of parallelism

Perform mental gymnastics to fit their problem into Spark using a
clever choice of keys. These cases often fail to achieve much speedup.

Direct Dask Graphs

Historically I’ve recommended the manual construction of dask graphs in these
cases. Manual construction of dask graphs lets you specify fairly arbitrary
workloads that then use the dask schedulers to execute in parallel.
The dask docs hold the
following example of a simple data processing pipeline:
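
A sketch along the lines of that example (the load/clean/analyze/store functions are placeholders standing in for real work, not dask API):

def load(filename):
    ...

def clean(data):
    ...

def analyze(sequence_of_data):
    ...

def store(result):
    ...

dsk = {'load-1': (load, 'myfile.a.data'),
       'load-2': (load, 'myfile.b.data'),
       'load-3': (load, 'myfile.c.data'),
       'clean-1': (clean, 'load-1'),
       'clean-2': (clean, 'load-2'),
       'clean-3': (clean, 'load-3'),
       'analyze': (analyze, ['clean-1', 'clean-2', 'clean-3']),
       'store': (store, 'analyze')}

from dask.threaded import get
get(dsk, 'store')  # execute the graph with the threaded scheduler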

Feedback from users is that this is interesting and powerful, but that
programming directly in dictionaries is not intuitive, doesn’t integrate well
with IDEs, and is prone to error.

Introducing dask.do

To create the same custom parallel workloads using normal-ish Python code we
use the dask.do function.
This do function turns any normal Python function into a delayed version that
adds to a dask graph. The do function lets us rewrite the computation above
as follows:
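
A sketch of what that rewrite might look like, reusing the placeholder functions above; the exact calling convention (including the final compute()) is from memory of the 2015 dask.do API, which later became dask.delayed:

from dask import do

filenames = ['myfile.a.data', 'myfile.b.data', 'myfile.c.data']
loads = [do(load)(fn) for fn in filenames]
cleaned = [do(clean)(x) for x in loads]
analysis = do(analyze)(cleaned)
result = do(store)(analysis)

result.compute()  # build the dask graph lazily, then execute it in parallel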

Example: Nested Cross Validation

I sat down with a machine learning student, Gabriel
Krummenacher, and worked to parallelize a
small piece of code to do nested cross validation. Below is a comparison of a
sequential implementation and a version that has been parallelized using dask.do:

You can safely skip reading this code in depth. The take-away is that it’s
somewhat involved but that the addition of parallelism is light.

The parallel version runs about four times faster on my notebook.
Disclaimer: The sequential version presented here is just a downgraded version
of the parallel code, which is why they look so similar. This is available
on github.

So the result of our normal imperative-style for-loop code is a fully
parallelizable dask graph. We visualize that graph below.

test_score.visualize()

Help!

Is this a useful interface? It would be great if people could try this out
and generate feedback on dask.do.

July 21, 2015

I’m happy to announce the release of a side-chain stereo compressor based on the Audio Toolkit. It is available on Windows and OS X (min. 10.8) in different formats. This stereo compressor can work on two channels, left/right or middle/side, possibly in linked mode (only one set of parameters), and can be set up to mix the input signal with the compressed signal (serial/parallel compression). The side chain channels can be used to steer the gain stage (the same setup will be used, right/left or middle/side).

July 19, 2015

Note: A year ago, I wrote this in response to an editorial request.
Ultimately they weren't interested in publishing it, and I got distracted
and this languished on my hard disk. So when I remembered it recently,
I decided to just push it out to my blog, where I should have put it in
the first place. --titus

All things come to an end eventually, including funding for
computational services. What is a field to do?

As biology steadily becomes more and more data intensive, the need for
community-wide analysis software, databases, data curation, and
Internet services grows. While writing software and gathering data
are both expensive, the cost of maintaining software and curating data
sets over time can dwarf the upfront costs; in the software industry,
for example, software maintenance and support costs can be 10- or
100-fold the initial investment to develop the software. Despite
this, there is often little sustained academic funding for software
maintenance, data curation, and Internet service support; in part,
this may be because the maintenance costs for existing data and
software could easily consume all of the available funding! Yet this
lack of infrastructure is increasingly problematic, as data set sizes
and curation needs grow, and tools to interpret and integrate data
sets become ever more critical to forward progress in biology. How
can we develop and maintain robust infrastructure services while
enabling creative development of new resources and software?

The volume, velocity, and variety of data in biology is stunning, and
would challenge even the handling capacity of more established
data-intensive fields with larger computational investments. For
example, by analogy with astronomy, Golden et al. (2012) propose
bringing processing pipelines closer to the data generating instrument
(in this case, the sequencing machine). While this approach can
certainly help address the volume and velocity of sequencing data, it
fails to address the variety -- there are dozens of types of
sequencing output, with perhaps hundreds of different processing
pipelines, the choice of which depends critically on the biological
system being analyzed and the questions being asked of the data. Some
subfields of biology may well be able to standardize -- for example,
variation analysis for the human genome is increasingly using only a
few processing pipelines -- but for environmental sequencing, the
types of systems and the metadata being gathered are extremely diverse
and nowhere near standardization. We must recognize that our
knowledge of the natural biological world is so shallow, and the data
gathering needs so great, that the field is very immature compared to
other data-intensive sciences like particle physics and astronomy.

How can we build sustained computational analysis and data storage
services, in the face of increasingly large and diverse biological
data sets, with fast-moving analysis needs? This question has been
brought into sharp relief in recent years, with the lapses in funding
of TAIR, Tranche, and CAMERA.

While substantial investments have been made in a variety of genomic
and transcriptomic analysis services, only a few projects have
achieved sustained funding independent of large host institutes.
Chief among these are the biomedically relevant projects, which
include Wormbase, Flybase, and SGD, all of which have been funded for
well over a decade by the NIH. Many others, including iPlant
Collaborative and KBase, are in a ramp-up phase and are still
exploring options for long-term support. With rare exceptions, it is
safe to say that no large cyberinfrastructure effort has successfully
weaned itself from continued large-scale support from a granting
agency - and some have failed to find this continued funding, and have
no clear future.

The challenges for sustainability of cyberinfrastructure are
significant. The necessary mix of data storage, research software
development, database curation, and service hosting requires
substantial and diverse computational expertise, large compute
resources, and extensive community involvement to ensure relevance.
Even individually, these can be hard to find, and yet projects often
try to combine all four of these: to a large extent they buy their own
hardware, manage it with their own infrastructure software, develop
their own research analysis software, store their data, and curate
their databases. Hybrid models exist -- for example, iPlant
Collaborative works with a number of external computational biologists
to develop and integrate tools -- but these efforts are often
centrally managed and continue to require substantial funding for this
integration.

Another challenge is that of maintaining innovation in algorithm and
software development while continuing to provide robust services.
Many innovative computational tools have emerged from small labs and
become more broadly useful, but it can be hard for small labs to
engage with large, centralized infrastructure projects. Moreover,
even in these models, the larger efforts can only engage deeply with a
few collaborators; these choices privilege some tools over others, and
may not be based on technical merit or community need. This may also
arise from the tension between engineering and research needs: large
projects prize engineering stability, while research innovation is
inherently unstable.

There is the hope of a more sustainable path, rooted in two approaches
-- one old, and one new. The old and proven approach is that of open
source. The open source community has existed for almost half a
century now, and has proven to be remarkably capable: open source
languages such as R and Python are widely used in data analysis and
modeling, and the Linux operating system dominates scientific
computing. Moreover, the open source workflow closely tracks the
ideal of a scientific community, with a strong community ethic,
widespread collaboration, and high levels of reproducibility and good
computational practice (Perez and Millman, 2014). The new approach
is cloud computing, where the advent of ubiquitous virtualization
technology has made it possible for entire companies to dynamically
allocate disk and compute infrastructure as needed with no upfront
hardware cost. Open source approaches provide an avenue for long-term
research software sustainability, while cloud computing allows
cyberinfrastructure projects to avoid upfront investment in hardware
and lets them grow with their needs.

Interestingly, two notable exceptions to the cyberinfrastructure
sustainability dilemma exploit both open source practices and cloud
computing. The Galaxy Project develops and maintains an open source
Web-based workflow interface that can be deployed on any virtual
machine, and in recent years has expanded to include cloud-enabled
services that let users manage larger clusters of computers for more
compute-intensive tasks. Importantly, users pay for their own compute
usage in the cloud: tasks that consume more compute resources will
cost more. Since Galaxy is also locally deployable, heavy users can
eschew the cost of the cloud by installing it on existing local
compute resources. And, finally, large providers such as iPlant
Collaborative can host Galaxy instances for their user communities.
Likewise, the Synapse project is an open source project developed by
Sage Bionetworks that hosts data and provenance information for
cooperative biomedical analysis projects. While less broadly used
than Galaxy, Synapse is -- from an infrastructure perspective --
infinitely expandable: the Sage Bionetworks developers rely entirely
on the Amazon cloud to host their infrastructure, and scale up their
computing hardware as their computing needs increase.

A general approach using open source and cloud computing approaches
could separate data set storage from provision of services, active
database curation, and software development. One example could look
like this: first, long-term community-wide cyberinfrastructure efforts
would focus on static data storage and management, with an emphasis on
building and extending metadata standards and metadata catalogs.
These efforts would place data in centralized cloud storage,
accessible to everyone. There, separately funded data services would
actively index and serve the data to address the questions and
software stacks of specific fields. In tandem, separately funded new
data curation and research software development efforts would work to
refine and extend capabilities.

If we follow this path, substantial upfront investment in tool
development and data curation will still be needed -- there's no such
thing as a free lunch. However, when the project sunsets or funding
lapses, with the open source/open data route there will still be
usable products at the end. And, if it all rests on cloud computing
infrastructure, communities can scale their infrastructure up or down
with their needs and only pay for what they use.

Funders can help push their projects in this direction by requiring
open data and open source licenses, encouraging or even requiring
multiple deployments of the core infrastructure on different cloud
platforms, and ultimately by only funding projects that build in
sustainability from the beginning. Ultimately, funders must request,
and reviewers require, an end-of-life plan for all infrastructure
development efforts, and this end-of-life plan should be "baked in" to
the project from the very beginning.

In the end, providing robust services while developing research
software and storing and curating data is both challenging and
expensive, and this is not likely to change with a top-down funding or
management paradigm. We need new processes and approaches that enable
bottom-up participation by small and large research groups; open
approaches and cloud computing infrastructure can be a solution.

July 17, 2015

Note: Last week, I submitted my review of Stephen R. Piccolo, Adam
B. Lee, and Michael B. Frampton's paper, Tools and techniques for
computational reproducibility. Soon after,
Dan Katz wrote a blog post about notebooks,
and in a comment I mentioned Piccolo's paper; and, after dropping a note to
Dr. Piccolo, he put it up on bioRxiv! This, in turn, meant that I felt
comfortable posting my review - since I sign my reviews, the authors
already know who I am, so where's the harm? See below.

In this paper, Piccolo et al. do a nice (and I think comprehensive?)
job of outlining six strategies for computational reproducibility.
The point is well made that science is increasingly dependent on
computational reproducibility (and that in theory we should be able to
do computational reproducibility easily and well) and hence we should
explore effective approaches that are actually being used.

I know of no other paper that covers this array of material, and this
is a quite nice exposition that I would recommend to many. I can't
evaluate how broadly it will appeal to a diverse audience but it seems
very readable to me.

The following comments are offered as helpful suggestions, not
criticisms -- make of them what you will.

The paper almost completely ignores HPC. I'm good with that, but it's
a bit surprising (since many computational scientists seem to think
that reproducible orchestration of many processors is an unachievable
task). Noted in passing.

I was somewhat surprised by the lack of emphasis of version control
systems. These are really critical in programming for ensuring
reproducibility. I also found a missing citation! You should look at
journals.plos.org/plosbiology/article?id=10.1371/journal.pbio.1001745
(yes, sorry, I'm on the paper).

Speaking of which, I appreciate the completeness of references (and
even the citation of my blog post ;) but it would be interesting to
see if Millman and Perez have anything to offer:
http://www.jarrodmillman.com/oss-chapter.html. Certainly a good
citation (I think you hit the book, but this is a particularly good
chapter.)

I would suggest (in the section that mentions version control systems,
~line 170 of p9) recommending that authors "tag" specific versions for
the publication, even if they later recommend using updated versions.
(Too many people say "use this repo!" without specifying a revision.)

The section on literate programming could usefully mention that these
literate programming environments do not offer good mechanisms for
long running programs, so they may not be appropriate for things
that take more than a few minutes to run.

Also, and perhaps most important, these literate programming
environments provide a REPL and can thus track exploratory data analysis
and "harden" it when it works and the author moves on to another data
analysis - so even if the authors don't want to clean up their
notebook before publication, you can track exactly how they got their
final results. I think this is important for practical
reproducibility. I don't know quite what to suggest in the context of
the paper but it seems like an important point to me.

Both the virtual machine and container sections should mention the
challenges of raw data bundling, which is one of the major drawbacks
here - not only is the VM large, but (unless you are partnering with
e.g. Amazon to "scale out") you must distribute potentially large data
sets. I think this is one of the biggest practical issues facing data
intensive sciences. (There was a nice commentary recently by folk in
human genomics begging the NIH to make human genomic data available
via the cloud; I can track it down if the authors haven't seen it.)

I think it's important to emphasize how transparent most Dockerfiles
are (and how this is a different culture than the VM deployment scene,
where configuration systems are often not particularly emphasized
except in the devops community). I view this as one of the most
important cultural differences driving container adoption, and for
once it's good for science!

The docker ecosystem also seems quite robust, which is important, I think.

July 14, 2015

I’ve started working with the HPC Toolkit, especially the part where it can use PAPI to read hardware counters and display the result in a nice GUI.

Installation

Installation is a little bit of a pain, but it is simplified somewhat, as all external dependencies are available in a single repository, except for PAPI, which you have to install by hand.

What is also handy is that all the GUIs are available on all platforms and can be used right away. The not-so-handy part is that HPC Toolkit can be used on different platforms, but PAPI cannot. And as I like PAPI…

Usage

Usage is really simple. Multithreaded applications will have all their threads, and even MPI processes, collected separately, and in the viewer the results are shown with a specific [thread, process] index. This is really neat.

But before you can do that, there are a couple of commands to execute. The first one is hpcstruct, which just creates an XML file, named binary.hpcstruct, with all the known functions inside:

Here there are 3 PAPI counters and one general counter (WALLCLOCK is relevant as current processors have a variable frequency, meaning that the total number of cycles is not enough to actually infer the time spent). After the @, there is the sampling period. I haven’t said what kind of profiling PAPI provides; it’s now clear that it’s sampling. Sampling means that from time to time, the application will stop and the profiler will look at its state. If the sampling period is too short, the profile will be biased and will show profiling errors, or it can even break if the sampling period is so small that the application gets hit while doing the report…

A list of available counters can be obtained with hpcrun -L, or papi_avail -a for PAPI counters.

Once this is done, the profile has to be converted into a “database”:

hpcprof binary -S binary.hpcstruct -I include_folder

The include folder is relevant if you want to browse through the source folders while analyzing your profile. There are two applications that can be used to review profiles, available on all platforms: hpcviewer and hpctraceviewer.

Here is an example of hpcviewer:

HPCViewer main window

There are three ways of displaying results:

Top-down, starting with top functions, diving into more specific functions, following the call tree

Bottom-up, starting with all the leaves in the call tree, a really informative view that captures the biggest offenders in the blink of an eye

A flat view, file by file, then function by function and finally line by line

Discussion

There is one main warning in these trees: if you compiled in optimized mode with GCC, some layers will be missing where gcc inlined functions. So it takes some agility to actually understand what happens. Another solution is to use the Intel compiler, which has an option to properly generate debug info. The case where this bugs me the most is lambda functions, as these get displayed as an operator() function! Without the source code, there is no way of knowing where such a function actually comes from.

Another thing to remember when checking values is what the third view gives. Each time a sample is taken, it’s at some point in the execution of the program, and the sampling period is attributed to the line where the program stopped/was sampled. It’s like ergodicity for a random variable: we assume that, over enough samples, taking a mean over the samples is the same as taking a mean over the timings. So the profiling time has to be large compared to the number of functions profiled and the sampling period.

Also, when comparing with other tools like Parallel Studio or Valgrind, I miss at least a bar allowing me to compare different functions’ costs (be it for implicit or explicit counters). I think a callgrind-like profile could be created out of the HPC Toolkit profile. For the moment, it’s an XML file, but they plan on moving to another, binary format…

I thought I could do something with the csv export, but it only exports timings for unfolded lines, without giving line or file names and without any hierarchy; but it is doable from the original database, as it is stored as XML and the callgrind file format is open.

One key advantage of HPC Toolkit over Callgrind is the way the profile is computed (sampling vs emulation), so profiles are faster to create, and it shows a good picture of what happens within the processor, which an emulation profile can’t exactly show (it depends on the model of the processor, which is never exactly the right model). But it is not reproducible, as it depends on the CPU load, so more care is required to analyze these profiles. Also it is only usable on Linux…

It is a great complement to Valgrind/Callgrind, and the fact that it is open source means that there is no excuse for not profiling your application for bottlenecks.

The demo repeats the experiment of PRML, page 480, Figure 10.6.
VB on BGMM has shown its capability of inferring the number of components automatically. It converged
in 47 iterations.

The ELBO looks a little weird. It is not always going up: when some clusters disappear, the ELBO goes down a little bit, then
goes straight up. I think it is because the estimation of the parameters is ill-posed when these clusters have fewer data samples
than the number of features.

The BayesianGaussianMixture has many more parameters than GaussianMixture; there are six parameters per component.
I feel it is not easy to control so many functions and parameters. The initial design of BaseMixture is also not so good.
I took a look at bnpy, which is a more complicated implementation of VB on various mixture
models. Though I don't need to go for such a complicated implementation, the decoupling of the observation model, i.e. $X$, $\mu$, $\Lambda$,
and the mixture model, i.e. $Z$, $\pi$, is quite nice. So I tried to use mixin classes to represent these two models. I split MixtureBase into three abstract classes, ObsMixin, HiddenMixin and MixtureBase(ObsMixin, HiddenMixin). I also implemented subclasses
for Gaussian mixtures, ObsGaussianMixin(ObsMixin), MixtureMixin(HiddenMixin) and GaussianMixture(MixtureBase, ObsGaussianMixin, MixtureMixin), but Python does not allow me to do this, since there is no correct MRO. :-|. I changed them back, but this
unsuccessful experiment gave me a nice base class, MixtureBase.

I also tried to use cached_property to store intermediate variables such as $\ln \pi$, $\ln \Lambda$, and the Cholesky-decomposed $W^{-1}$, but didn't get much benefit. It is almost the same as saving these variables as private attributes on the instances.

I also implemented DirichletProcessGaussianMixture, but currently it looks the same as BayesianGaussianMixture.
Both of them can infer the best number of components. DirichletProcessGaussianMixture took slightly more iterations
than BayesianGaussianMixture. If we infer the Dirichlet process mixture by Gibbs sampling, we don't need to specify the
truncation level; the concentration parameter $\alpha$ alone is enough. But with variational inference, we still need
to give the model the maximal possible number of components, i.e., the truncation level $T$.

I agree with a lot of what Lior says: most bioinformatics software is
not very good quality (#1), most bioinformatics software is not built
by a team (#2), licensing is at best a minor component of what makes
software widely used (#3), software should have an expiration date
(#5), most URLs are unstable (#6), software should not be "idiot
proof" (#7), and it shouldn't matter whether you use a specific
programming language (#8).

I strongly disagree with Lior's point #4, in almost every way. I try
to make my software free for everyone, including companies, for both
philosophical reasons and for simplicity; I explained my reasoning in
my blog post.
(Anyone who doesn't think linking against GPL software is reasonably
complicated and nuanced should read through the tweets and comments on that
post!) From my few involvements with working on non-free software, I
would also add that selling software is a tough business, and not one
that automatically leads to any profits; there's a long tail, just as
with everything else, and I long ago decided that my time is worth
more to me than the expected income from selling software would be.
(I would be thrilled if a student wanted to try to make money off of
our work, but my academic work would remain open source.)

What surprises me most about Lior's post, though, is that he's
describing the present situation rather accurately, but he's not angry
about it. I'm angry, frustrated, and upset by it, and I really
want to see a better future -- I'm in science, and biology, partly
because I think it can have a real impact on society and health.
Software is a key part of that.

Biology and genomics are changing. Large scale data analysis is
becoming more and more important to the biomedical sciences, and
software packages like kallisto and khmer are almost certainly going
to be used in the clinic at some point. (I believe some of Broad's
variant calling software is already used in diagnosis and treatment
for cancer, for example, although I don't know the details.) Our
software is certainly being used by people doing basic biomedical
research, although it may not be directly clinical yet - and I think
the quality of computation in basic research matters too.

And this means bioinformatics should grow up a bit. If
bioinformatics is a core component of the future of biology (which I
think is obvious), then the quality of bioinformatics software
matters.

To quote Lior, "Who wants to read junk software, let alone try to edit
it or build on it?" Certainly not me - but then why are we producing
it? Are we settling for this kind of software in biomedical research?
Are we just giving up on producing decent quality software altogether,
because, uh, it's hard? How is this different from doing bad math, or
publishing bad biology - topics that Lior and others get really mad
about?

Lior also quotes a Computational Biology interview with James Taylor,
who says,

A lot of traditional software engineering is about how to build
software effectively with large teams, whereas the way most
scientific software is developed is (and should be)
different. Scientific software is often developed by one or a
handful of people.

That was true a decade ago, and it may have been a reasonable
reason to avoid using decent software engineering techniques then, but
the landscape has changed significantly in the last decade, with a
wide variety of rapid prototyping, test-driven development, and
lean/agile methodologies being put into practice in startups and large
companies. So I think James is mistaken here.

I wager that the reason a lot of scientists do bad software
engineering is because they can get away with it, not because there
are no techniques they could profitably use. Heck, if they wanted to
learn something about it, Software Carpentry will come teach workshops for you
on this very topic, and I'd be happy to offer both Lior and James a
workshop to bring them up to speed. (Note: I don't think either of
them needs my advice, which is actually kind of my point.)

(As for languages, Lior's point #8, there is a persistent expansion of
the Python and R toolchains around bioinformatics and a convergence on
them as the daily workhorses of bioinformatics data analysis. So even
that's changing.)

Fundamentally the blithe acceptance of badly engineered software in
science baffles me. I can understand (and even endorse)
not requiring good software engineering for algorithmic proofs of
concept, but clearly we want to have good, robust libraries for
serious work.
To claim otherwise would seem to lead to the conclusion that much of
bioinformatics and genomics should seek to be incorrect and
irrelevant.

I want there to be a robust community of computational scientists
and software developers in biology. I want people to be able to
build a new variant caller without having to reimplement a FASTQ
or SAM parser. I think we need people to file bug reports,
catch weird off-by-one problems, and otherwise spot check all the
software they are using. And I don't think it's impossible or even
terribly difficult to achieve this.

If we went back to the 80s and 90s we'd see that many of the same
arguments that Lior is making were applied to open source software in
contrast to commercial software. We know how that ended - open source
software now runs most of the Internet infrastructure. And open
source has had other benefits, too; to quote Bill Janeway, "open
source and the cloud have dramatically decreased the friction of
innovating", and the scientific community has certainly benefited from
the tremendous innovation in software and high tech. I would love to
see that same innovation enabled in genomics and bioinformatics. And
that's why we try to practice good software development in my lab;
that's why we release our software under the BSD license; and that's
why we encourage people to do whatever they want with our software.

Ultimately I think we should develop our software (and choose our
licenses), for the future we want to see, not the present that we're
stuck with.

July 09, 2015

Alexandre D'Aspremont and I are looking for a data scientist to work on large-scale non-smooth optimization methods and other topics. You can find more information at this link.

The job description is intentionally vague, and depending on the candidate this can be a postdoc-like job with most of their time devoted to research or an engineering-like job with more emphasis on coding tasks and contributing to open source projects.

July 07, 2015

Convolution is an algorithm that is often used for reverberation. While the equation is easy to understand and to implement, the implementation is costly. The other way of doing it is to use the Fast Fourier Transform (FFT), but the direct/crude implementation introduces latency. While it is possible to optimize the basic convolution code, it is sometimes more interesting to use a different algorithm, as is the case here.

Basic theory

Let’s start with the convolution equation, with x[n] the input signal, y[n] the output signal and i[n] the impulse we want to convolve the input signal with:

$y[n] = \sum_{k=0}^{K-1} i[k]\, x[n-k]$

The cost of the operation is the size of the impulse, O(K), which can be quite high. For N output elements, the cost is thus O(KN).

The other way of doing this is in the frequency domain, where the convolution becomes a simple multiplication. The issue is that you require all N input elements from the start. The transform to the frequency domain costs O(M log M), with M the number of elements in the FFT. As we are talking about convolution, M = N + K - 1, so there are more output elements than input elements, and the extra ones will need to be added to the following convolution block.

The issue is the selection of the size N. With a big N, we get efficiency, but we also require N elements before being able to output any signal at all. This is worrying, as there is no way around this latency with a plain FFT-based implementation.
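To make the cost trade-off concrete, here is a small NumPy sketch (not from the original post) comparing the direct convolution with a single frequency-domain convolution over the whole signal; the signal lengths are arbitrary:

import numpy as np

rng = np.random.RandomState(0)
x = rng.randn(4096)                  # input signal, N samples
h = rng.randn(1024)                  # impulse, K samples

# Direct convolution: O(K) work per output sample, O(KN) in total.
y_direct = np.convolve(x, h)

# Frequency-domain convolution: one multiplication per bin, but the whole
# block of N input samples is needed before any output can be produced.
M = len(x) + len(h) - 1              # M = N + K - 1
y_fft = np.fft.irfft(np.fft.rfft(x, M) * np.fft.rfft(h, M), M)

print(np.allclose(y_direct, y_fft))  # True (up to floating point)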

Rewriting the equation

Ok, now let’s rewrite the equation as a double sum. Let’s split the impulse into S pieces of K elements each (let’s say 64 elements, with the original length a multiple of 64):

\[
y[n] = \sum_{j=0}^{S-1} \sum_{k=0}^{K-1} i[jK + k]\, x[n - jK - k]
\]

What is interesting is that the x[k] terms in the equation can also be regrouped in blocks of K elements. This means that we have S convolutions of K elements each. There are several advantages to thinking about the input samples as pieces. The original algorithm had to keep a memory of S*K samples: either we use a circular buffer, or we have to shift the entire signal, and both solutions have their drawbacks. By regrouping the samples in pieces, there is no buffer management and no shifting of the signal (or, more precisely, far less).

Unfortunately, we still have latency: we need K elements before we can use the FFT. Another approach is to process the first partial convolution in the usual time-domain way and the rest with FFTs. Each time we have processed K samples, we create a new chunk, transform it to the frequency domain and push it onto the list of FFT chunks. Then we can directly multiply the partial impulses (already transformed to the frequency domain) by their respective input chunks, sum the products together and transform the result back to the time domain. After the inverse transform we get 2*K elements: the first K elements are added to the output of the discrete (time-domain) convolution for the next K samples, and the last K elements are put aside and added to the first K elements of the next block.
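As an illustration of this block-based scheme, here is a rough NumPy sketch of a uniformly partitioned convolution. It is not the Audio Toolkit implementation, and it skips the time-domain handling of the first partial impulse, so it keeps K samples of block latency:

import numpy as np

def partitioned_convolve(x, impulse, K=64):
    # Split the impulse into S partial impulses of K samples and pre-transform them.
    S = -(-len(impulse) // K)
    impulse = np.pad(impulse, (0, S * K - len(impulse)), mode='constant')
    partial_ffts = [np.fft.rfft(impulse[s * K:(s + 1) * K], 2 * K) for s in range(S)]

    nblocks = -(-len(x) // K)
    x = np.pad(x, (0, nblocks * K - len(x)), mode='constant')
    y = np.zeros(nblocks * K + K)
    past_blocks = []                              # FFTs of past input blocks, newest first
    for b in range(nblocks):
        past_blocks.insert(0, np.fft.rfft(x[b * K:(b + 1) * K], 2 * K))
        acc = np.zeros(K + 1, dtype=complex)
        for s, block_fft in enumerate(past_blocks[:S]):
            acc += block_fft * partial_ffts[s]    # multiply each chunk with its partial impulse
        chunk = np.fft.irfft(acc, 2 * K)          # 2*K samples: K for this block, K of spill-over
        y[b * K:(b + 1) * K + K] += chunk         # overlap-add the spill-over into the next block
    return y[:nblocks * K]

The first len(x) output samples should match np.convolve(x, impulse) up to floating-point error.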

Implementation in Audio Toolkit

This is a pretty basic approach to partitioned convolution. It’s just a rewrite of the convolution equation, with no fancy science behind it.

The reference implementation will be the FIR filter (just reverse the impulse). One of the nice features of ATK is that you can make sure that you always have K input elements available in the buffer: even if you are asked to process 1 element, the framework has a way of always keeping the last K values, so there is no additional bookkeeping needed. Once we have processed K elements, we can directly create a new chunk. This is done through the following code in the main process method:

int processed_size = 0;
do
{
  // We can only process split_size elements at a time, but if we already have
  // some elements in the buffer, we need to take them into account.
  int64_t size_to_process = std::min(split_size - split_position, size - processed_size);
  // Process first part of the impulse
  process_impulse_beginning(processed_size, size_to_process);
  split_position += size_to_process;
  processed_size += size_to_process;
  if(split_position == split_size)
  {
    process_new_chunk(processed_size);
    split_position = 0;
  }
} while(processed_size != size);

process_new_chunk() copies the last K input samples and computes an FFT on them. Once this is finished, the FFT result (2*K elements) is stored, and the last K elements of the FFT result are added to the buffer. This buffer part will be added to the convolution in process_impulse_beginning.

Some profiling

I always like to profile my code, so I used a small 10000-sample impulse (a ramp) and convolved it with a triangle signal, comparing the performance to the reference algorithm (the FIR filter). With pieces of 64 elements, the new implementation is 10 times faster, and it improves further if the pieces are bigger or the impulse longer:

First, a simple block convolution already improves timings by a factor of 3. Then, using FFTs to compute all the convolutions except for the first, we can go further. Let’s look at a profile of my original implementation:

Original profile with FFT convolution

Now, all the complex and vector calls inside the two convolution main methods can be removed to streamline the code:

Final profile after optimizations

Besides the optimization of the FFT code, the classic IIR code was also optimized somewhat, leading to a 15% performance increase. The conversion code was also enhanced, leading to a gain of more than 50% in input/output handling for all plugins.

Conclusion

There is still room for improvement, as several copies are still required, but a basic implementation is actually easy to achieve. There is a compromise to be found between the computation of the convolution through the FFT, dominated by the complex multiplication and accumulation before the inverse FFT, and the convolution of the first block through an FIR filter. If the impulse is long (several seconds), a block size of 256 to 512 elements seems to be the best compromise, whereas for a shorter impulse (1 second), a block size of 64 elements is best.

Of course, by improving the complex multiplication and addition (SIMD? FMA?), the compromise would shift towards smaller block sizes.

I didn’t put any big equations on how the block convolution is achieved. This would almost require a paper or a book chapter…

July 06, 2015

If a piece of bioinformatics software is not fully open source, my lab and I will generally seek
out alternatives to it for research, teaching and training. This holds
whether or not the software is free for academic use.

If a piece of bioinformatics software is only available under the GNU
Public License or another
"copyleft" license,
then my lab and I will absolutely avoid integrating any part of it
into our own source code, which is mostly BSD.

Why avoid non-open-source software?

We avoid non-open-source software because it saves us future
headaches.

Contrary to some assumptions, this is
not because I'm anti-company or against making money from software,
although I have certainly chosen to forego that in my own life. It's
almost entirely because such software is an evolutionary dead end, and
hence time spent working with it is ultimately wasted.

More specifically, here are some situations that I want to avoid:

we invest a lot of time in building training materials for a piece
of software, only to find out that some of our trainees can't make
use of the software in their day job.

we need to talk to lawyers about whether or not we can use a piece
of software or include some of its functionality in a workflow we're
building.

we find a small but critical bug in a piece of bioinformatics software,
and can't reach any of the original authors to OK a new release.

Our software, khmer, is
available under the BSD license. Any open source project (indeed,
any project) is welcome to take any part of our source code and
include it in theirs.

However, we cannot use GPL software in our code base at all. We can't
call GPL library functions, we can't include GPL code in our codebase,
and I'm not 100% sure we can even look closely at GPL code. This is
because if we do so, we must license our own software under the GPL.

Why did we choose BSD and not GPL for our own code?

Two reasons: first, I'm an academic, funded by government grants;
second, I want to maximize the utility of my work, which means
choosing a license that encourages the most participation in the
project, and encourages the most reuse of my code in other projects.

Jake covers the second line of reasoning really nicely in his blog
post,
so I will simply extract his summary of John Hunter's reasoning:

To summarize Hunter's reasoning: the most important two predictors
of success for a software project are the number of users and the
number of contributors. Because of the restrictions and subtle
legal issues involved with GPL licenses, many for-profit companies
will not touch GPL-licensed code, even if they are happy to
contribute their changes back to the community. A BSD license, on
the other hand, removes these restrictions: Hunter mentions several
specific examples of vital industry partnership in the case of
matplotlib. He argues that in general, a good BSD-licensed project
will, by virtue of opening itself to the contribution of private
companies, greatly grow its two greatest assets: its user-base and
its developer-base.

The first line of reasoning is a little more philosophical, but
basically it comes down to a wholesale rejection of the logic in the
Bayh-Dole act, which tries
to encourage innovation and commercialization of federally funded
research by assigning intellectual property to the university. I
think this approach is bollocks. While I am not an economic expert, I
think it's clear that most innovation in the university is probably
not worth that much and should be made maximally open. From talking to
Dr. Bill Janeway, I gather that he
agrees that pre-Bayh-Dole was a time of more openness, although I'm
not sure of the evidence for more innovation during this period.
Regardless, to me it's intuitively obvious that the prospect of
commercialization causes more researchers to keep their research
closed, and I think this is obviously bad for science. (The Idea
Factory
talks a lot about how Bell Labs spurred immense amounts of innovation
because so much of their research was open for use. Talent Wants to
be Free is a
pop-sci book that outlines research supporting openness leading to
more innovation.)

So, basically, I think my job as an academic is to come up with cool
stuff and make it as open as possible, because that encourages innovation
and progress. And the BSD fits that bill. If a company wants to make
use of my code, that's great! Please don't talk to us - just grab it and
go!

I should say that I'm very aware of the many good reasons why GPL
promotes a better long-term future, and until I became a grad student
I was 100% on board. Once I got more involved in scientific
programming, though, I switched to a more selfish rationale, which is
that my reputation is going to be enhanced by more people using my
code, and the way to do that is with the BSD. If you have good
arguments about why I'm wrong and everyone should use the GPL, please
do post them (or links to good pieces) in the comments; I'm happy to
promote that line of reasoning, but for now I've settled on BSD for my
own work.

One important note: universities like releasing things under the GPL,
because they know that it virtually guarantees no company will use it
in a commercial product without paying the university to relicense it
specifically for the company. While this may be in the best
short-term interests of the university, I think it says all you need
to know about the practical outcome of the GPL on scientific
innovation.

Pragmatically, Illumina is the only game in town for most of my
students, while there are plenty of RNA-seq analysis programs. So
unless I settled on kallisto being the super-end-all-be-all of
RNAseq analysis, I can indulge my leanings towards freedom by
ignoring kallisto and teaching something else that's free-er.

Illumina has a clear pricing model and their sequencing is essentially
a commodity that needs little to no engagement from me. This is not
generally how bioinformatics software works :)

There's no danger of Illumina claiming dibs on any of my results or
extensions - we're all clear that I pays my money, and I gets my
sequence. I'm honestly not sure what would happen if I modified
kallisto or built on it to do something cool, and then wanted to let
a company use it. (I bet it would involve talking to a lot of
lawyers, which I'm not interested in doing.)

So that's my reasoning. I don't want to pour fuel on any licensing
fire, but I wanted to explain my reasoning to people. I also think
that people should fight hard to make their bioinformatics software
available under a permissive license, because it will benefit everyone
:).

(or, Yes You Can Fit Models With More Parameters Than Data Points)

An oft-repeated rule of thumb in any sort of statistical model fitting is "you can't fit a model with more parameters than data points". This idea appears to be as wide-spread as it is incorrect. On the contrary, if you construct your models carefully, you can fit models with more parameters than datapoints, and this is much more than mere trivia with which you can impress the nerdiest of your friends: as I will show here, this fact can prove to be very useful in real-world scientific applications.

A model with more parameters than datapoints is known as an under-determined system, and it's a common misperception that such a model cannot be solved in any circumstance. In this post I will consider this misconception, which I call the model complexity myth. I'll start by showing where this model complexity myth holds true, first from an intuitive point of view, and then from a more mathematically-heavy point of view. I'll build from this mathematical treatment and discuss how underdetermined models may be addressed from a frequentist standpoint, and then from a Bayesian standpoint. (If you're unclear about the general differences between frequentist and Bayesian approaches, I might suggest reading my posts on the subject). Finally, I'll discuss some practical examples of where such an underdetermined model can be useful, and demonstrate one of these examples: quantitatively accounting for measurement biases in scientific data.

The Root of the Model Complexity Myth

While the model complexity myth is not true in general, it is true in the specific case of simple linear models, which perhaps explains why the myth is so pervasive. In this section I want to motivate the reason for the underdetermination issue in simple linear models, first from an intuitive view, and then from a more mathematical one.

I'll start by defining some functions to create plots for the examples below; the plotting code lives in the original notebook and can safely be skipped.

Fitting a Line to Data

The archetypical model-fitting problem is that of fitting a line to data: A straight-line fit is one of the simplest of linear models, and is usually specified by two parameters: the slope m and intercept b. For any observed value \(x\), the model prediction for \(y\) under the model \(M\) is given by

\[
y_M = m x + b
\]

for some particular choice of \(m\) and \(b\). Given \(N\) observed data points \(\{x_i, y_i\}_{i=1}^N\), it is straightforward (see below) to compute optimal values for \(m\) and \(b\) which fit this data:

In [2]:

plot_simple_line()

The simple line-plus-intercept is a two-parameter model, and it becomes underdetermined when fitting it to fewer than two datapoints. This is easy to understand intuitively: after all, you can draw any number of perfectly-fit lines through a single data point:

In [3]:

plot_underdetermined_line()

The single point simply isn't enough to pin down both a slope and an intercept, and the model has no unique solution.
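As a rough, stand-alone sketch (not the plot_underdetermined_line helper used in the notebook), the following draws a family of perfect fits through that single point:

import numpy as np
import matplotlib.pyplot as plt

# Every slope m admits an intercept b = y0 - m * x0 that passes exactly
# through the single observed point (x0, y0) = (0.37, 0.95).
x0, y0 = 0.37, 0.95
xgrid = np.linspace(0, 1, 100)
for m in np.linspace(-2, 2, 9):
    plt.plot(xgrid, m * xgrid + (y0 - m * x0), alpha=0.5)
plt.plot(x0, y0, 'ok')
plt.show()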

Fitting a More General Linear Model

While it's harder to see intuitively, this same notion extends to models with more terms. For example, let's consider fitting a general cubic curve to data. In this case our model is

\[
y_M = \theta_0 + \theta_1 x + \theta_2 x^2 + \theta_3 x^3
\]

Note that this is still a linear model: the "linear" refers to the linearity of model parameters \(\theta\) rather than linearity of the dependence on the data \(x\). Our cubic model is a four-parameter linear model, and just as a two-parameter model is underdetermined for fewer than two points, a four-parameter model is underdetermined for fewer than four points. For example, here are some of the possible solutions of the cubic model fit to three randomly-chosen points:

In [4]:

plot_underdetermined_cubic()

For any such simple linear model, an underdetermined system will lead to a similar result: an infinite set of best-fit solutions.

The Mathematics of Underdetermined Models

To make more progress here, let's quickly dive into the mathematics behind these linear models. Going back to the simple straight-line fit, we have our model

\[
y_M(x~|~\theta) = \theta_0 + \theta_1 x
\]

where we've replaced our slope \(m\) and intercept \(b\) by a more generalizable parameter vector \(\theta = [\theta_0, \theta_1]\). Given some set of data \(\{x_n, y_n\}_{n=1}^N\) we'd like to find \(\theta\) which gives the best fit. For reasons I'll not discuss here, this is usually done by minimizing the sum of squared residuals from each data point, often called the \(\chi^2\) of the model in reference to its expected theoretical distribution:

\[
\chi^2 = \sum_{n=1}^N [y_n - y_M(x_n~|~\theta)]^2
\]

We can make some progress by re-expressing this model in terms of matrices and vectors; we'll define the vector of \(y\) values:

\[
y = [y_1, y_2, y_3, \cdots y_N]
\]

We'll also define the design matrix which we'll call \(X\); this contains all the information about the form of the model:

\[
X = \left[\begin{array}{cc}
1 & x_1 \\
1 & x_2 \\
\vdots & \vdots \\
1 & x_N
\end{array}\right]
\]

With this formalism, the vector of model values can be expressed as a matrix-vector product:

\[
y_M = X\theta
\]

and the \(\chi^2\) can be expressed as a simple linear product as well:

\[
\chi^2 = (y - X\theta)^T(y - X\theta)
\]

We'd like to minimize the \(\chi^2\) with respect to the parameter vector \(\theta\), which we can do by the normal means of differentiating with respect to the vector \(\theta\) and setting the result to zero (yes, you can take the derivative with respect to a vector!):

\[
\frac{d\chi^2}{d\theta} = -2X^T(y - X\theta) = 0
\]

Solving this for \(\theta\) gives the Maximum Likelihood Estimate (MLE) for the parameters,

\[
\hat{\theta}_{MLE} = [X^T X]^{-1} X^T y
\]

Though this matrix formalism may seem a bit over-complicated, the nice part is that it straightforwardly generalizes to a host of more sophisticated linear models. For example, the cubic model considered above requires only a larger design matrix \(X\):

\[
X = \left[\begin{array}{cccc}
1 & x_1 & x_1^2 & x_1^3 \\
1 & x_2 & x_2^2 & x_2^3 \\
\vdots & \vdots & \vdots & \vdots \\
1 & x_N & x_N^2 & x_N^3
\end{array}\right]
\]

The added model complexity is completely encapsulated in the design matrix, and the expression to compute \(\hat{\theta}_{MLE}\) from \(X\) is unchanged!
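For concreteness, here is a short sketch (not a cell from the original notebook) that builds both design matrices for some synthetic data and solves the normal equations for each; only the construction of \(X\) changes:

import numpy as np

rng = np.random.RandomState(42)
x = rng.uniform(0, 1, 20)
y = 1.0 + 2.0 * x + rng.normal(0, 0.1, 20)                   # noisy straight line

X_line = np.vstack([np.ones_like(x), x]).T                   # N x 2 design matrix
X_cubic = np.vstack([np.ones_like(x), x, x ** 2, x ** 3]).T  # N x 4 design matrix

for X in (X_line, X_cubic):
    # theta_MLE = [X^T X]^{-1} X^T y, computed via a linear solve
    theta_mle = np.linalg.solve(np.dot(X.T, X), np.dot(X.T, y))
    print(theta_mle)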

Why Underdetermined Models Break

Taking a look at this Maximum Likelihood solution for \(\theta\), we see that there is only one place that it might go wrong: the inversion of the matrix \(X^T X\). If this matrix is not invertible (i.e. if it is a singular matrix) then the maximum likelihood solution will not be well-defined.

The number of rows in \(X\) corresponds to the number of data points, and the number of columns in \(X\) corresponds to the number of parameters in the model. It turns out that a matrix \(C = X^TX\) will always be singular if \(X\) has fewer rows than columns, and this is the source of the problem. For underdetermined models, \(X^TX\) is a singular matrix, and so the maximum likelihood fit is not well-defined.

Let's take a look at this in the case of fitting a line to the single point shown above, \((x=0.37, y=0.95)\). For this value of \(x\), here is the design matrix:

In [5]:

import numpy as np

X = np.array([[1, 0.37]])

We can use this to compute the normal matrix, which is the standard name for \(X^TX\):

In [6]:

C = np.dot(X.T, X)

If we try to invert this, we will get an error telling us the matrix is singular:
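Something like the following (the original cell is not reproduced above) demonstrates the failure:

try:
    np.linalg.inv(C)
except np.linalg.LinAlgError as err:
    print(err)   # "Singular matrix"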

Evidently, if we want to fix the underdetermined model, we'll need to figure out how to modify the normal matrix so it is no longer singular.

Fixing an Underdetermined Model: Conditioning

One easy way to make a singular matrix invertible is to condition it: that is, you add to it some multiple of the identity matrix before performing the inversion (in many ways this is equivalent to "fixing" a divide-by-zero error by adding a small value to the denominator). Mathematically, that looks like this:

\[
C = X^TX + \sigma I
\]

For example, by adding \(\sigma = 10^{-3}\) to the diagonal of the normal matrix, we condition the matrix so that it can be inverted:
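A minimal sketch of this conditioning, reusing the X and C arrays defined above together with the observed value \(y = 0.95\):

sigma = 1E-3
C_conditioned = C + sigma * np.eye(2)
theta_cond = np.dot(np.linalg.inv(C_conditioned), np.dot(X.T, [0.95]))
print(theta_cond)   # one particular (near-)perfect fit, picked out by the conditioning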

This conditioning caused the model to settle on a particular one of the infinite possibilities for a perfect fit to the data. Numerically we have fixed our issue, but this arbitrary conditioning is more than a bit suspect: why is this particular result chosen, and what does it actually mean in terms of our model fit? In the next two sections, we will briefly discuss the meaning of this conditioning term from both a frequentist and Bayesian perspective.

Frequentist Conditioning: Regularization

In a frequentist approach, this type of conditioning is known as regularization. Regularization is motivated by a desire to penalize large values of model parameters. For example, in the underdetermined fit above (with \((x, y) = (0.37, 0.95)\)), you could fit the data perfectly with a slope of one billion and an intercept near negative 370 million, but in most real-world applications this would be a silly fit. To prevent this sort of canceling parameter divergence, in a frequentist setting you can "regularize" the model by adding a penalty term to the \(\chi^2\):

\[
\chi^2_{reg} = \chi^2 + \lambda~\theta^T\theta
\]

Here \(\lambda\) is the degree of regularization, which must be chosen by the person implementing the model.

Using the expression for the regularized \(\chi^2\), we can minimize with respect to \(\theta\) by again taking the derivative and setting it equal to zero:

\[
\frac{d\chi^2_{reg}}{d\theta} = -2X^T(y - X\theta) + 2\lambda\theta = 0
\]

This leads to the following regularized maximum likelihood estimate for \(\theta\):

\[
\hat{\theta}_{MLE} = [X^TX + \lambda I]^{-1} X^T y
\]

Comparing this to our conditioning above, we see that the regularization degree \(\lambda\) is identical to the conditioning term \(\sigma\) that we considered above. That is, regularization of this form is nothing more than a simple conditioning of \(X^T X\), with \(\lambda = \sigma\). The result of this conditioning is to push the absolute values of the parameters toward zero, and in the process make an ill-defined problem solvable.

I'll add that the above form of regularization is known variously as L2-regularization or Ridge Regularization, and is only one of the possible regularization approaches. Another useful form of regularization is L1-regularization, also known as Lasso Regularization, which has the interesting property that it favors sparsity in the model.

Bayesian Conditioning: Priors

Regularization illuminates the meaning of matrix conditioning, but it still sometimes seems like a bit of black magic. What does this penalization of model parameters within the \(\chi^2\) actually mean? Here, we can make progress in understanding the problem by examining regularization from a Bayesian perspective.

As I pointed out in my series of posts on Frequentism and Bayesianism, for many simple problems, the frequentist likelihood (proportional to the negative exponent of the \(\chi^2\)) is equivalent to the Bayesian posterior – albeit with a subtly but fundamentally different interpretation.

The Bayesian posterior probability on the model parameters \(\theta\) is proportional to the likelihood times the prior,

\[
P(\theta~|~D) \propto P(D~|~\theta)\,P(\theta),
\]

and choosing a Gaussian prior centered at \(\theta = 0\), \(P(\theta) \propto \exp[-\theta^T\theta / (2\sigma_P^2)]\), adds exactly the quadratic penalty term that appears in the regularized \(\chi^2\) above.

So we see that ridge regularization is equivalent to applying a Gaussian prior to your model parameters, centered at \(\theta=0\) and with a width \(\sigma_P = (2\lambda)^{-1/2}\). This insight lifts the cover off the black box of regularization, and reveals that it is simply a roundabout way of adding a Bayesian prior within a frequentist paradigm. The stronger the regularization, the narrower the implicit Gaussian prior is.

Returning to our single-point example above, we can quickly see how this intuition explains the particular model chosen by the regularized fit; it is equivalent to fitting the line while taking into account prior knowledge that both the intercept and slope should be near zero:

In [11]:

plot_conditioned_line()

The benefit of the Bayesian view is that it helps us understand exactly what this conditioning means for our model, and given this understanding we can easily extend it to use more general priors. For example, what if you have reason to believe your slope is near 1, but have no prior information on your intercept? In the Bayesian approach, it is easy to add such information to your model in a rigorous way.

But regardless of which approach you use, this central fact remains clear: you can fit models with more parameters than data points, if you restrict your parameter space through the use of frequentist regularization or Bayesian priors.

Underdetermined Models in Action

There are a few places where these ideas about underdetermined models come up in real life. I'll briefly discuss a couple of them here, and then walk through the solution of a simple (but rather interesting) problem that demonstrates these ideas.

Compressed Sensing: Image Reconstruction

One area where underdetermined models are often used is in the field of Compressed Sensing. Compressed sensing comprises a set of models in which underdetermined linear systems are solved using a sparsity prior, the classic example of which is the reconstruction of a full image from just a handful of its pixels. As a simple linear model this would fail, because there are far more unknown pixels than known pixels. But by carefully training a model on the structure of typical images and applying priors based on sparsity, this seemingly impossible problem becomes tractable. This 2010 Wired article has a good popular-level discussion of the technique and its applications, and includes this image showing how a partially-hidden input image can be iteratively reconstructed from a handful of pixels using a sparsity prior:

Kernel-based Methods: Gaussian Processes

Another area where a classically underdetermined model is solved is in the case of Gaussian Process Regression. Gaussian Processes are an interesting beast, and one way to view them is that rather than fitting, say, a two-parameter line or four-parameter cubic curve, they actually fit an infinite-dimensional model to data! They accomplish this by judicious use of certain priors on the model, along with a so-called "kernel trick" which solves the infinite dimensional regression implicitly using a finite-dimensional representation constructed based on these priors.

In my opinion, the best resource to learn more about Gaussian Process methods is the Gaussian Processes for Machine Learning book, which is available for free online (though it is a bit on the technical side). You might also take a look at the scikit-learn Gaussian Process documentation. If you'd like to experiment with a rather fast Gaussian Process implementation in Python, check out the george library.

Imperfect Detectors: Extrasolar Planets

Another place I've seen effective use of the above ideas is in situations where the data collection process has unavoidable imperfections. There are two basic ways forward when working with such noisy and biased data:

you can pre-process the data to try to remove and/or correct data imperfections, and then fit a conventional model to the corrected data.

you can account for the data imperfections along with interesting model parameters as part of a unified model: this type of unified approach is often preferable in scientific settings, where even the most careful pre-processing can lead to biased results.

If you'd like to see a great example of this style of forward-modeling analysis, check out the efforts of David Hogg's group in finding extrasolar planets in the Kepler survey's K2 data; there's a nice Astrobites post which summarizes some of these results. While the group hasn't yet published any results based on truly underdetermined models, it is easy to imagine how this style of comprehensive forward-modeling analysis could be pushed to such an extreme.

Example: Fitting a Model to Biased Data

While it might be fun to dive into the details of models for noisy exoplanet searches, I'll defer that to the experts. Instead, as a more approachable example of an underdetermined model, I'll revisit a toy example in which a classically underdetermined model is used to account for imperfections in the input data.

Imagine you have some observed data you would like to model, but you know that your detector is flawed such that some observations (you don't know which) will have a bias that is not reflected in the estimated error: in short, there are outliers in your data. How can you fit a model to this data while accounting for the possibility of such outliers?

To make this more concrete, consider the following data, which is drawn from a line with noise, and includes several outliers:

If we fit a line to this data with standard \(\chi^2\) minimization, the fit is visibly pulled away from the bulk of the points by the few outliers. This reflects a well-known deficiency of \(\chi^2\) minimization: it is not robust to the presence of outliers.

What we would like to do is propose a model which somehow accounts for the possibility that each of these points may be the result of a biased measurement. One possible route is to add \(N\) new model parameters: one associated with each point which indicates whether it is an outlier or not. If it is not an outlier, we use the standard model likelihood; if it is, we use a likelihood with a much larger error. The result for our straight-line fit will be a model with \(N + 2\) parameters, where \(N\) is the number of data points. An overzealous application of lessons from simple linear models might lead you to believe this model can't be solved. But, if carefully constructed, it can!

Let's show how it can be done.

Our linear model is:

\[
y_M(x~|~\theta) = \theta_0 + \theta_1 x
\]

For a non-outlier (let's call it an "inlier") point at \(x\), \(y\), with error on \(y\) given by \(dy\), the likelihood is

\[
P(y~|~x, dy, \theta) = \frac{1}{\sqrt{2\pi\,dy^2}} \exp\left[\frac{-(y - y_M(x~|~\theta))^2}{2\,dy^2}\right]
\]

For a point flagged as an outlier, we use the same Gaussian form but with a much larger error (in the code below, the standard deviation of all the \(y\) values). Each point then gets an indicator variable \(g_i\), equal to 1 if the point is an outlier and 0 otherwise, which selects between the two likelihoods.

We will put a prior on these indicator variables \(g\) which encourages sparsity of outliers; this can be accomplished with a simple L1 prior, which penalizes the sum of the \(g\) terms:

\[
P(g) = \exp\left[-\sum_i g_i\right]
\]

where, recall, \(g_i \in \{0, 1\}\).

Though you could likely solve for a point estimate of this model, I find the Bayesian approach to be much more straightforward and interpretable for a model this complex. To fit this, I'll make use of the excellent emcee package. Because emcee doesn't have categorical variables, we'll instead allow \(g_i\) to range continuously between 0 and 1, so that any single point will be some mixture of "outlier" and "inlier".

We start by defining a function which computes the log-posterior given the data and model parameters, using some computational tricks for the sake of floating-point accuracy:

In [15]:

# theta will be an array of length 2 + N, where N is the number of points
# theta[0] is the intercept, theta[1] is the slope,
# and theta[2 + i] is the weight g_i

def log_prior(theta):
    g = theta[2:]
    # g_i needs to be between 0 and 1
    if (np.any(g < 0) or np.any(g > 1)):
        return -np.inf  # recall log(0) = -inf
    else:
        return -g.sum()

def log_likelihood(theta, x, y, dy):
    sigma_y = np.std(y)
    y_model = theta[0] + theta[1] * x
    g = np.clip(theta[2:], 0, 1)  # g < 0 or g > 1 leads to NaNs in logarithm
    # log-likelihood for in-lier
    logL_in = -0.5 * (np.log(2 * np.pi * dy ** 2) + ((y - y_model) / dy) ** 2)
    # log-likelihood for outlier
    logL_out = -0.5 * (np.log(2 * np.pi * sigma_y ** 2) + ((y - y_model) / sigma_y) ** 2)
    return np.sum(np.logaddexp(np.log(1 - g) + logL_in, np.log(g) + logL_out))

def log_posterior(theta, x, y, dy):
    return log_prior(theta) + log_likelihood(theta, x, y, dy)

Now we use the emcee package to run this model. Note that because of the high dimensionality of the model, the run_mcmc command below will take a couple minutes to complete:

In [16]:

import emcee

ndim = 2 + len(x)  # number of parameters in the model
nwalkers = 50      # number of MCMC walkers
nburn = 10000      # "burn-in" period to let chains stabilize
nsteps = 15000     # number of MCMC steps to take

# set walkers near the maximum likelihood, adding some random scatter
# (theta1 holds the maximum-likelihood [intercept, slope] computed earlier in the notebook)
rng = np.random.RandomState(0)
starting_guesses = np.zeros((nwalkers, ndim))
starting_guesses[:, :2] = rng.normal(theta1, 1, (nwalkers, 2))
starting_guesses[:, 2:] = rng.normal(0.5, 0.1, (nwalkers, ndim - 2))

sampler = emcee.EnsembleSampler(nwalkers, ndim, log_posterior, args=[x, y, dy])
sampler.run_mcmc(starting_guesses, nsteps)

sample = sampler.chain  # shape = (nwalkers, nsteps, ndim)
sample = sampler.chain[:, nburn:, :].reshape(-1, ndim)
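The notebook then plots the results; a minimal way to summarize the chain (not shown in the excerpt above) is to take posterior means:

# posterior mean of the slope/intercept and of each outlier weight g_i
theta_best = sample[:, :2].mean(axis=0)   # marginalized intercept and slope
g_best = sample[:, 2:].mean(axis=0)       # per-point degree of "outlierness"
outliers = g_best > 0.5                   # flag points that are mostly outlier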

The red circles mark the points that were determined to be outliers by our model, and the black line shows the marginalized best-fit slope and intercept. For comparison, the grey line is the standard maximum likelihood fit.

Notice that we have successfully fit an \((N + 2)\)-parameter model to \(N\) data points, and the best-fit parameters are actually meaningful in a deep way – the \(N\) extra parameters give us individual estimates of whether each of the \(N\) data points has misreported errors. I think this is a striking example of how a model that would be considered impossible under the model complexity myth can be solved in practice to produce very useful results!

Conclusion

I hope you will see after reading this post that the model complexity myth, while rooted in a solid understanding of simple linear models, should not be assumed to apply to all possible models. In fact, it is possible to fit models with more parameters than datapoints: and for the types of noisy, biased, and heterogeneous data we often encounter in scientific research, you can make a lot of progress by taking advantage of such models. Thanks for reading!

This post was written entirely in the IPython notebook. You can download this notebook, or see a static view here.

July 03, 2015

We would like to announce the release of version 0.7.0 of Sumatra, a tool for automated tracking of simulations and computational analyses so as to be able to easily replicate them at a later date. This version of Sumatra brings some major improvements for users, including an improved web browser interface, improved support for the R language, Python 3 compatibility, a plug-in interface making Sumatra easier to extend and customize, and support for storing data using WebDAV.

In addition, there have been many changes under the hood, including a move to Github and improvements to the test framework, largely supported by the use of Docker.

Last but not least, we have changed licence from the CeCILL licence (GPL-compatible) to a BSD 2-Clause Licence, which should make it easier for other developers to use Sumatra in their own projects.

Updated and extended web interface

Thanks to Felix Hoffman’s Google Summer of Code project, the web browser interface now provides the option of viewing the history of your project either in a “process-centric” view, as in previous versions, where each row in the table represents a computation, or in a “data-centric” view, where each row is a data file. Where the output from one computation is the input to another, additional links make it possible to follow these connections.

The web interface has also had a cosmetic update and several other improvements, including a more powerful comparison view (see screenshot). Importantly, the interface layout no longer breaks in narrower browser windows.

BSD licence

The Sumatra project aims to provide not only tools for scientists as end users (such as smt and smtweb), but also library components for developers to add Sumatra’s functionality to their own tools. To support this second use, we have switched licence from CeCILL (GPL-compatible) to the BSD 2-Clause Licence.

Python 3 support

In version 0.6.0, Sumatra already supported provenance capture for projects using Python 3, but required Python 2.6 or 2.7 to run. Thanks to Tim Tröndle, Sumatra now also runs in Python 3.4.

Plug-in interface

To support the wide diversity of workflows in scientific computing, Sumatra has always had an extensible architecture. It is intended to be easy to add support for new database formats, new programming languages, new version control systems, or new ways of launching computations.

Until now, adding such extensions has required that the code be included in Sumatra’s code base. Version 0.7.0 adds a plug-in interface, so you can define your own local extensions, or use other people’s.

WebDAV support

The option to archive output data files has been extended to allow archiving to a remote server using the WebDAV protocol.

Support for the R language

Sumatra will now attempt to determine the versions of all external packages loaded by an R script.

Other changes

For developers, there has been a significant change - the project has moved from Mercurial to Git, and is now hosted on Github. Testing has also been significantly improved, with more system/integration testing, and the use of Docker for testing PostgreSQL and WebDAV support.

Parsing of command-line parameters has been improved. The ParameterSet classes now have a diff() method, making it easier to see the difference between two parameter sets, especially for large and hierarchical sets.

We're reaching the halfway mark for the GSoC and it's been a great journey so far.

I have had some off court issues. I was hesitant to write about them because I don't want my blog to turn into me ranting and complaining but I have decided to briefly mention them in this occasion because they are relevant and at this point they are all but overcome.

Long story short, I was denied the scholarship that I needed to be able to go to Sheffield so I had to start looking for financing options from scratch. Almost at the same time I was offered a place at the University of Toronto (which was originally my first choice). The reason why this is relevant to the GSoC is because it coincided with the beginning of the program so I was forced to cope with not just the summer of code but also with searching/applying for funding and paperwork for the U of T which combined to make for a lot of work and a tough first month.

I will be honest and say that I got a little worried at around week 3 and week 4 because things didn't seem to be going the way I had foreseen in my proposal to the GSoC. In my previous post I wrote about how I had to make a change to my approach and I knew I had to commit to it so it would eventually pay off.

At this point I am feeling pretty good with the way the project is shaping up. As I mentioned, I had to make some changes, but out of about 40 open issues only 23 now remain; I have lined up PRs for another 8, and I have started discussion (either with the community or with my mentor) on almost all that remain, including some of the longer ones like NaN handling, which will span the entire scipy.stats module and is likely to become a long-term community effort depending on what road NumPy and Pandas take on this matter in the future.

I am happy to look at the things that are still left and find that I at least have a decent idea of what I must do. This was definitely not the case three or four weeks ago and I'm glad with the decision that I made when choosing a community and a project. My mentor is always willing to help me understand unknown concepts and point me in the right direction so that I can learn for myself and the community is engaging and active which helps me keep things going.

My girlfriend, Hélène has also played a major role in helping me keep my motivation when it seems like things amount to more than I can handle.

I realise that this blog (since the first post) has been a lot more about my personal journey than about technical details of the project. I do apologise if this is not what you expect, but I reckon that this makes it easier to appreciate for a reader who is not familiar with 'scipy.stats'; and if you are familiar with it, you probably follow the issues or the developer's mailing list (where I post a weekly update), so technical details would be redundant to you. I also think that the setup of the project, which revolves around solving many issues, makes it too difficult to write about specific details without branching into too many tangents for a reader to enjoy.

If you would like to know more about the technical aspect of the project you can look at the PRs, contact me directly (via a comment here or the SciPy community) or even better, download SciPy and play around with it. If you find something wrong with the statistics module, chances are it's my fault, feel free to let me know. If you like it, you can thank guys like Ralf Gommers (my mentor), Evgeni Burovski and Josef Perktold (to name just a few of the most active members in 'scipy.stats') for their hard work and support to the community.

July 02, 2015

Continuum is a sponsor of this year’s SciPy Conference in Austin, TX. We invite
you to join us at the fourteenth annual Scientific Computing with Python conference, and to check out some really fantastic talks and
tutorials from our talented engineers and developers.

The Continuum Team is hitting some big conferences this month, and we hope you’ll join us for exciting talks on Bokeh, Numba,
Blaze, Anaconda, and more. From our hometown of Austin, Texas all the way to Spain, our team is ready to talk Python.

July 01, 2015

I'm happy to announce that Python(x, y) 2.7.10.0 is available for immediate download from any of the mirrors. The full change log can be viewed here. Please post your comments and suggestions to the mailing list.

Work on the Python 3.x 64 bit version will resume once Visual Studio 2015 RTM is released.

What's new in general:

All packages (NumPy, SciPy, etc.) have been updated to their latest versions.

ipython has been held back at 2.4.1 to avoid potential compatibility issues.

OpenCV has been held back at 2.4.11 to avoid potential compatibility issues.

June 30, 2015

When I first heard about transient shapers, I wondered “what’s the difference with a compressor? Is there one?”. I tried to extract the transients without relying on the transient energy, with a relative power filter (ratio between the instantaneous power and the mean power) or its derivative, but nothing worked, until someone explained that the gain is driven by the difference between a fast-attack filtered power and a slower one. So here it goes.

First, let’s have a look at the filter graph:

Transient shaper pipeline

I’ve surrounded the transient-shaper-specific part with a dotted line. This is the difference from a compressor/limiter/expander: the way the signal that steers the gain computation is generated.

Let’s start from a kick. From it, the fast envelope follower (red curve) and the slow envelope follower (green curve) can be generated. The difference between them is always positive (provided the two followers have the same release time), so we can use it to compute a gain through our usual GainCompressorFilter. Depending on whether you want to increase the transient or attenuate it, the ratio will be below or above 1 (respectively), which is what is shown in the last two plots here:

Transient shaper plot
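As a rough illustration of the pipeline (a generic NumPy sketch, not the Audio Toolkit filters, with a simplified compressor-style gain law), the gain can be driven by the difference of two envelope followers like this:

import numpy as np

def envelope(signal, attack, release):
    # One-pole follower on the squared signal; attack/release are time constants in samples.
    a_att = np.exp(-1.0 / attack)
    a_rel = np.exp(-1.0 / release)
    power = signal ** 2
    env = np.zeros_like(signal)
    for n in range(1, len(signal)):
        a = a_att if power[n] > env[n - 1] else a_rel
        env[n] = a * env[n - 1] + (1 - a) * power[n]
    return env

def transient_shaper(signal, ratio, fast_attack=5, slow_attack=200, release=2000):
    fast = envelope(signal, fast_attack, release)
    slow = envelope(signal, slow_attack, release)
    diff = np.maximum(fast - slow, 0.0)        # positive while a transient is rising
    # Compressor-style gain: ratio < 1 boosts the transient, ratio > 1 attenuates it.
    gain = (1.0 + diff) ** (1.0 / ratio - 1.0)
    return signal * gain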

In the end, it’s all about the right algorithms. If you have a proper framework, you may already have everything you need for some filters. In the case of a transient shaper, I already had all the filters in my toolbox. Now I just need to make a plugin out of this simple pipeline!

The code for the last plots can be found on Github here: https://github.com/mbrucher/AudioTK/blob/master/Examples/dynamic/display_transient_shaper.py

June 29, 2015

Week 5 began with a discussion about whether we should deprecate params.
I fixed some bugs in the checking functions, the random number generator and one of the covariance updating methods.
In the following days, I completed the main functions of GaussianMixture and all test cases, except
the AIC, BIC and sampling functions. The tests are somewhat challenging, since the current implementation
in the master branch contains very old test cases imported from Weiss's implementation, which never
got improved. I simplified the test cases, and wrote more tests for things not covered by the current implementation,
such as covariance estimation, ground-truth parameter recovery, and user-friendly warnings and errors.

June 26, 2015

Using Fortran code from Python is an old story. The major player in the game is
f2py, which is shipped with NumPy, the array library for
Python. Since f2py is also used by the SciPy library, it is widely
distributed.

The point of f2py is to help a developer embed regular Fortran code in a
Python module. f2py was developed before the existence of the iso_c_binding
Fortran module, which provides a consistent interface between Fortran and C.
Since iso_c_binding exists (i.e. from Fortran 2003 onwards), there is a viable,
platform-independent mechanism to use Fortran code from Python.

I'll add a contribution regarding the build issues. In short: the regular Python
extension mechanism does not handle Fortran natively, and a solution is to build
your Fortran code as a plain library. Then, just bind a Cython wrapper to that
library and you're set!
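As a rough illustration of the idea (using ctypes instead of Cython for brevity; the library and function names below are made up for the example), calling into such a Fortran-built shared library from Python could look like this:

import ctypes
import numpy as np

# Hypothetical shared library built from Fortran sources with iso_c_binding
# interfaces, exposing a C-compatible function:
#   real(c_double) function sum_array(data, n) bind(C)
lib = ctypes.CDLL("./libmycode.so")
lib.sum_array.argtypes = [ctypes.POINTER(ctypes.c_double), ctypes.c_int]
lib.sum_array.restype = ctypes.c_double

data = np.arange(10, dtype=np.float64)
total = lib.sum_array(data.ctypes.data_as(ctypes.POINTER(ctypes.c_double)),
                      len(data))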

This is great; these are simple problems to solve efficiently in parallel.
Generally these simple computations occur at the beginning of our analyses.

Sophisticated Algorithms can be Complex

Later in our analyses we want more complex algorithms for statistics,
machine learning, etc. Often this stage fits
comfortably in memory, so we don’t worry about parallelism and can use
statsmodels or scikit-learn on the gigabyte result we’ve gleaned from
terabytes of data.

However, if our reduced result is still large then we need to think about
sophisticated parallel algorithms. This is fresh space with lots of exciting
academic and software work.

This algorithm computes the exact SVD (up to numerical precision) of a large
tall-and-skinny matrix in parallel in many small chunks. This allows it to
operate out-of-core (from disk) and use multiple cores in parallel. At the
bottom we see the construction of our trivial array of ones, followed by many
calls to np.linalg.qr on each of the blocks. Then there is a lot of
rearranging of various pieces as they are stacked, multiplied, and undergo more
rounds of np.linalg.qr and np.linalg.svd. The resulting arrays are
available in many chunks at the top and second-from-top rows.
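With dask.array the whole computation can be expressed in a few lines; this is a sketch rather than the exact code behind the profile above:

import dask.array as da

# Tall-and-skinny array, chunked into 1000x1000 blocks (a single chunk across columns).
x = da.ones((30000, 1000), chunks=(1000, 1000))

# Builds the task graph lazily; nothing is computed yet.
u, s, v = da.linalg.svd(x)

# Executing the graph runs the many small QR/SVD tasks, possibly in parallel.
singular_values = s.compute()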

So to write complex parallel algorithms we write down dictionaries of tuples of
functions.
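A toy example of such a graph, as a plain dictionary executed by the threaded scheduler (a sketch, not taken from the post):

from operator import add, mul
from dask.threaded import get

dsk = {'x': 1,
       'y': 2,
       'z': (add, 'x', 'y'),   # z = x + y
       'w': (mul, 'z', 'z')}   # w = z * z

print(get(dsk, 'w'))           # 9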

The dask schedulers take care of executing this graph in parallel using
multiple threads. Here is a profile result of a larger computation on a
30000x1000 array:

Low Barrier to Entry

Looking at this graph you may think “Wow, Mariano is awesome” and indeed he is.
However, he is more an expert at linear algebra than at Python programming.
Dask graphs (just dictionaries) are simple enough that a domain expert was able
to look at them and say “Yeah, I can do that”, then write down the very complex
algorithms associated with his domain, leaving the execution of those algorithms
up to the dask schedulers.

You can see the source code that generates the above graphs
on GitHub.