1 Cleaning up fasterlmmd

This week I have been working with Prasun to clean up fasterlmmd,
mostly renaming things and introducing immutable datatypes, which
brought out a number of issues. The current code base is
single-threaded and meant to be easy to read. It is already faster
than the old pylmm.

I read Artem's fast-lmm BLOG again, and Karl's. Next I went through
the current code bases of pylmm, lmmlite (Karl), Lippert and OmicABEL
again and remembered why we are doing this: they are all lacking in
the engineering department. Karl's implementation is the best one, I
think. If he had not used R and C++ I would have gone for that. But at
least it is short and clear and we can use it as a reference (it was
based on pylmm too).

After cleaning up we will implement a CUDA version using cuBLAS, which
supposedly can handle matrices larger than GPU RAM now (alternatively
we decompose them ourselves), and attach that to the REST API. After
that we should be able to get it into GN2.

And then on to the Phi! We are ordering a KNL for the purpose.

2 Beacon and Phi

We are working with the JICS team at ORNL to get some of our tools
running on the Beacon supercomputer. First: log in to Beacon and jump
to a node.

3 fast-lmm-d

[2017-03-16 Thu] Over the last months we have put in a lot of work to
make fast-lmm-d happen; the single-core version is faster than the
Python version, even without real optimizations. We are now using the
profiler to see where we can do even better.

I also wrote relocatable GNU Guix support to deploy packages on
Beacon. There is a video of my FOSDEM2017 presentation in the HPC room.

4 Beacon: getting Python to run with MKL

But now I find numpy is linked against OpenBLAS instead of MKL! Well,
that is useful for benchmarking, but not what I wanted to do.
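
Checking this is quick; numpy will report what it was built against
(the output format varies between numpy versions):

# Show which BLAS/LAPACK numpy was linked against (OpenBLAS vs MKL)
import numpy
numpy.show_config()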

The Intel numpy build instructions are somewhat enlightening though
they do not mention Phi as a target. What is interesting is explicit
linking against MKL and that it is even possible to use the GNU
compilers. The gcc provided is

gcc (GCC) 4.8.2 Copyright (C) 2013

which should be good enough to bootstrap Python and even Guix and
friends.

5 Beacon: what software to use?

[2016-10-20 Thu] Beacon support wrote:

`Sorry about the empty module. We did support OpenCL at a time (when
intel MPSS is at 3.3), but AFAIK the intel doesn't officially support
OpenCL for MIC beyond MPSS 3.3. On Beacon we currently have MPSS 3.7.'

and suggested using Intel's pyMIC library
(https://github.com/01org/pyMIC) for Python. In this library Python is
bound against Intel's MKL library; see for example the BLAS functions.

Of course, the MKL is fast. But (like CUDA) it is proprietary
closed-source software, which makes it a no-go beyond the short term
for me.

By the looks of it, as stated by Beacon support, Intel stopped
supporting OpenCL on the Phi in favour of its own 'Manycore Platform
Software Stack', which is nasty because it rules out all FOSS. In
fact, Boost, BLAST and other tools on Beacon all want OpenCL support!
And Beacon even has the intel-opencl compiler (which won't work
without OpenCL drivers). The advantage of OpenCL is that it is easier
for programmers to target multiple platforms. With my software it is
not too hard to target different backends explicitly.

My conclusion: all non-Intel solutions on Beacon require significant
software compilation and dependency resolution (read deployment) at
this point. For example to get OpenCL going is non-trivial and GNU
Guix would be a real help. Once bootstrapped I could even use D and
LLVM. Also we may need to build Erlang and Elixir - something Guix
is great at.

The current LMM implementation is in Python with numpy, already
targeting BLAS on CUDA. To get LMM running on the Phi there are 3
routes now:

1. Python + MKL (pyMIC or a numpy built against MKL)
2. OpenCL
3. D + LLVM

Python+MKL (1) appears to be the easiest, provided I get pyMIC or
numpy-mkl to compile. OpenCL (2) would be great to have, to be able to
roll out to different platforms - a true FOSS solution. D+LLVM (3) is
the experimental one, with great scope for performance
improvements. Both (2) and (3) would need GNU Guix to succeed, in my
opinion.

It looks like the pre-installed Python on Beacon already supports MKL
to some degree:
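
One crude check is to time a large matrix multiply; if MKL is being
used (with automatic offload) the difference should show. A sketch,
with arbitrary sizes:

# Time a large double-precision matrix multiply as a rough BLAS probe
import time
import numpy as np
n = 4096
a = np.random.rand(n, n)
b = np.random.rand(n, n)
t0 = time.time()
a.dot(b)
print("%dx%d dgemm took %.2fs" % (n, n, time.time() - t0))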

So, if these libraries know how to farm out to the Phis, we should be
good to go. At least, it is worth exploring the default Python
installation with numpy first. In parallel I'll try to install Guix
packages using Roel's method, so at least I'll be able to generate my
own compiler environment (we'll probably want this for Elixir support
too). Roel's method essentially has one build Guix packages on a build
host, then copy and run them in $HOME. This means Guix itself cannot
be run on Beacon, but the packages can, including a more recent
gcc. I'll use Penguin2 to target my $HOME, which is /nics/b/home/pjotrp.

6 First steps on the Beacon Intel Phi supercomputer

The first step was getting access again. I had locked myself out
somehow… But that was fixed in about 6 hours and I have ssh access
again. One of the first things to note is that Beacon is running a
2012 Linux kernel (Red Hat 4.4.7).

For OpenCL, Beacon has the intel-opencl/1.2-3.0.67279 compiler.
PLASMA, a higher-level linear algebra library, is provided as the
plasma and dplasma modules. There is also MAGMA, another linear
algebra library. Boost is available too, but lacks OpenCL support
(yet - coming in v1.61). Then there is a JVM targeting the Phi. Hmmm.
Choice, but not (exactly) what I am looking for.

The next step is to see if I can get Python to work with OpenCL
using the intel-opencl compiler. I found there is a module

7 Progress on GN2

[2016-10-18 Tue] Many good things happening on GN2. New datasets are
being added at a rapid rate. We have a new genotype map for the
BXD. The REST API gives access to almost all data. A genome browser
has been embedded and the list just goes on… The coming period I am
going to focus on running a pilot on a super computer named Beacon.

8 Updating the deployment of GN2

[2016-08-16 Tue] Christian has done a lot of work getting BD ready for
GN2. My job is to get it on staging by updating the Guix packages. The
last update was from February. Unfortunately Guix was misbehaving: the
GNU servers are giving errors and the latest checkout of the Guix tree
is also erroneous. One can live with one or the other, but not both! I
spent quite a few hours and rolled back to a version from a month ago
to start rebuilding GN2. One of the first problems I encountered was

Now, python-pil: I packaged it and it is really, really old. According
to this, python-pillow is the modern package and, in principle, a
drop-in replacement. It is also Python 3 compatible; we have to aim
for that. So my first attempt is simply replacing pil with pillow.
I'll probably run into a few issues in GN2.

This error is coming from piddle, another really old package from the
nineties - the latest version dates from 2002. We should really get
rid of it, but it is used by the GN1-style plot and (apparently) a
plink plot.

Anyway, after a long look, I decided it was easier to fix the old
python-pil package so it can find fterrors.h again. Isn't deployment
fun?!

Apart from that, the major changes are R running at 3.3.0 and critical
updates of python-numpy, python-scipy and openblas (see the bug in
OpenBLAS). I'll also upgrade R/qtl and add a new package for the
excellent R/qtl2. There are more upgrades, mostly thanks to the
rolling updates of GNU Guix.

The short and long of it is that GN2 builds again using a recent
checkout of GNU Guix. Mind, I also updated plink-ng in the current
build. Some of the parallel tests fail. May need to fix that before we
roll into production (and add a test in mechanize).

9 Getting the latest pylmm-gn2 into GN2

The current version of pylmm we are using is my multi-core build from
May 2015. That is over a year ago and does not include the CUDA
features I added later. The main reason for not updating was a bug in
OpenBLAS and that the new code has a different data transport
requirement. Now that we have the new gnserver running, I am ready to
bring the CUDA stuff into GN2. First, I tested that the multi-core
version gives the same results as the earlier one:
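
Something along these lines (a sketch; the file names are
illustrative, not the actual run):

# Compare association results from the two pylmm versions
import numpy as np
old = np.loadtxt("pylmm_2015_results.txt")
new = np.loadtxt("pylmm_multicore_results.txt")
print(np.allclose(old, new, rtol=1e-6))  # agreement up to round-off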

So, with Karl's previous lmmlite-pylmm comparison, we should be
reasonably confident that pylmm-gn2 is doing the right job. The
round-off difference is interesting, and due to one extra SNP (I
think I remember one SNP was excluded in the original). Actually,

10 Updating Genotype data for the BXD and reproducibility

The REST API serves genotypes. Apparently the version in GN2 was
outdated; for one, BXD103 was removed/renamed. To ascertain
reproducibility we need to introduce rigorous versioning and we need
to be able to serve the different versions.

The genotypes exist as files, so we can easily calculate an MD5
checksum and pass that in as metadata. We will also name the files
using the partial checksum and publishing date, so BXD.csv becomes,
for example, BXD_20150722_83771d79.csv. Passing this to the user (in
the form of metadata) will allow them to uniquely identify the dataset
for the purpose of reproducibility. These values are captured in the
downloadable metadata BXD.json (also a file). The YAML version:
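
As for the naming itself, a minimal sketch of the checksum-based
scheme (the helper is hypothetical, not the production code):

# Derive the versioned file name from the MD5 of the genotype file
import hashlib
def versioned_name(path, date):
    with open(path, "rb") as f:
        md5 = hashlib.md5(f.read()).hexdigest()
    base = path.rsplit(".", 1)[0]
    return "%s_%s_%s.csv" % (base, date, md5[:8])
# versioned_name("BXD.csv", "20150722") => "BXD_20150722_83771d79.csv"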

Phenotypes, meanwhile, are stored in a 'rolling' SQL database. This
raises challenges for reproducibility. Some people have the opinion
that it does not matter (the latest is the so-called greatest - an
opinion also often voiced on the semantic web), i.e., they argue the
researcher should archive the base data that was used for the
analysis. That is possible, but I get uncomfortable when data changes
just like that.

What we want is some way to signal a change to a dataset.

My proposal is to send a HASH value and a date with the phenotypes in
the form of metadata. This will tell the researcher that a dataset
has changed - if not how (if an archive exists a comparison should
be possible). To achieve this we will add columns to the dataset
description table and use an external tool to regularly test the
datasets for changes. When a change happens and the HASH value has not
been updated, an alarm needs to be raised.
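
A sketch of what that external check could look like (the function
names and hashing choices are assumptions):

# Recompute a hash over the phenotype rows and compare with the stored one
import hashlib
def dataset_hash(rows):
    h = hashlib.sha256()
    for row in sorted(rows):  # stable order makes the hash reproducible
        h.update(repr(row).encode())
    return h.hexdigest()
def check_dataset(rows, stored_hash):
    if dataset_hash(rows) != stored_hash:
        raise RuntimeError("dataset changed but its HASH was not updated")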

11 A great summer of code

[2016-08-04 Thu] Time to update this feed. The last two months saw
plenty of activity. We have a publication in JOSS. Zach has introduced
sessions into GN2. We have a scalable REST server running and we are
adding functionality as we go. And Christian is adding the
biodalliance genome browser to GN2, now with genotype and QTL tracks.

In the next phase we'll be adding improved LMM support, more tracks
and software running in heterogeneous environments.

12 REST: fetching phenotypes (2)

[2016-05-24 Tue] Lots of activity on the #genenetwork IRC channel now
that the Google Summer of Code has started. Also distracting, of
course, but I think good things come out of it. I am reverse
engineering the GN database to provide the REST services.

13 Adding a multi-phenotype QTL plot

[2016-05-23 Mon] Harm has modified Karl's QTL plot so it can show
multiple QTL plots in one figure. I am looking at embedding this form
into GN. Basically the user selects an experiment and a gene
(presumably as an expression phenotype) and has it look for all genes
that correlate. At this point I am not so interested in the logic of
the correlations, but I am interested in the plot itself, which could
plot multiple phenotypes in one GN collection (read: bag of
phenotypes). I also like the way it adjusts the chromosome sizes.

Looking at the plot's source code this should be possible.

Interestingly, Karl also has a multiplot in chart 3, but written in
clean CoffeeScript. I am severely tempted to use that one. Harm agreed
to look into it and maybe clean up the code. It would be better to
make it so that Karl can reuse it.

14 REST: fetching phenotypes

[2016-05-18 Wed] We are adding a new REST service to GN. The main
reason is to provide data to people using R, Python, Ruby
etc. Introducing the REST server is also an opportunity to split
functionality out of the main Python webserver; one thing we want to
avoid is long-running jobs in Python, as it blocks on those. The third
reason for a REST interface is to provide Web 2.0 support to the
genome browser and other UI tools. I started writing a maru-based REST
server.

15 Everyone on Guix

[2016-05-15 Sun] We have a Guix powered staging server now running on
http://test-gn2.genenetwork.org/. Others have been installing and we
will all be byte-identical and reproducible soon.

Now the main challenge is to get Zach on my version. The first step is
to update Guix on penguin, followed by the guix genenetwork2
install. This pulls in all packages except for ctl.

Dennis updated the r-preprocesscore package - used for something (it
is unbelievable what gets pulled in by all these tools) - so that
triggers a rebuild of some of the dependencies:

For Zach I set up the new repositories: genenetwork2 and all
dependencies (except Danny's CTL) are now on penguin, so we can start
synchronizing our efforts using GNU Guix. To use guix add
~/.guix-profile/bin to the PATH. Install system-wide guix
packages, e.g.

guix package -i hello

For deployment we use a slightly modified setup, involving two git
repositories as described here. Now

You can see we set the TEMPDIR now and we use a settings file (take a
look inside ~/defaultsettings.py) that is similar to the old one.

The command started up the new server on port 5003. Your original
server is running on port 5002, so you may want to change that setting
or ask Lei to make 5003 visible from outside.

See if you can get it running.

Note that I have only added directories and files to your HOMEDIR, in
particular ~/.guix-profile/, ~/genenetwork, ~/opt and
~/defaultsettings.py. Nothing else was changed or overwritten.

Your old system and setup should still work fine - there is no
interference.

16 Getting a server running for testing

[2016-04-28 Thu] Testing is crucial. I am putting in a test framework for GN2 - I named
it 'Mechanical Rob' :).

A few years back Zach put in a few Selenium tests and we will
continue to use Selenium to generate tests (it can generate code using
a browser plugin - I tried that and it is helpful, because turning
clicks into code is pretty tedious). Unlike the previous edition by
Zach, the new tests will be converted to run without Selenium
itself. I want to be able to run the tests on the server without a
browser (any time).

Dennis is interested in helping us to get full test coverage. I am not
sure how much he will do, but we have a good start. My aim is to cover
all functionality in GN2 with these tests.

I am adding dead-link detection to the testing framework. This is how
I found that the GeneRIF link on the main page no longer works. I
replaced it with the Wikipedia reference which, funnily, also contains
the broken link. Go, go NCBI! There are more dead links, however…

Next we need a server to test against. I got one started using the
merged trees on Penguin2 (currently running on http://penguin2:5003/
for those with VPN access).

This server should contain the latest changes by Zach, Danny and
myself. WGCNA is broken (it is a driver problem on Penguin2) as well
as the CTL functionality (I need to bring in Danny's CTL code). But
otherwise the thing should run. I'll fix these two problems. Please
notify me, or raise an issue, if you see any unexpected differences
between Zach's and my server.

One of the most intrusive changes is that the large files have moved
to a different directory and temporary files are now generated in
/tmp, rather than in the source tree (that is a no go!). This may
explain a missing figure or a failing export function. Again, do
notify me if you miss something that ought to be there. One figure that
is missing is the bar plot for interval mapping. I'll fix that too.

To get the server up and running (on top of Guix) I expanded the
README documentation.

17 A git merge nightmare

[2016-04-21 Thu] Working on three branches (Zach, Danny and myself) we
needed to merge our work. This proved less easy than normal, mostly
because we were working on essentially different repositories (after I
forked off the diet version described below). I tried merging,
cherry-picking and hand patching - it all proved too difficult because
we had been intrusively removing and renaming on both ends!

Since Zach's tree is still the authoritative one, I decided to drop
the diet for now and bring in Danny's and my changes file by file
(kind of a rebase). Again, non-trivial, as all three of us worked on
the same files. The 'git am' command came in handy - another trick I
learnt from GNU Guix. It was a most tedious job. I guess I learnt
another lesson here (even though we could merge, better not to diverge
git source trees). For a little while we'll use the sumo tree again.

I had to remove the first entry from .gitignore as it is now supposed
to be generic.

There are a few things to fix. R/qtl output is broken (seen that one
before) and CTL lacks the binary dependency. Also I need to tick off
gemma mapping and Zach's work.

18 Guix distribution of GN2

I spent significant time trying to get 'guix archive' to work, but
unfortunately there is a problem with R packages. So I am working
around that for Danny.

There are two options: (1) he builds himself and (2) I create multiple
archives. The first option is a no go - I think - because there are
some problems with the latest guix versions. Multiple archives will
work, but I need to recreate the symlink structure. I ought to try
Docker now (grumble) as it offers a unified environment, something
Danny can use.

Funnily (well not so funny) the guix builds are working again, so
Danny is doing that on his machine following these instructions.

19 Preparing GN2 for distribution

19.1 Removing large files out of the git repository

In my git repository I have removed the genotype data files (they
are now outside the source tree). This trimmed it down from 350MB
to 30MB. We also need to rewrite git history to remove the large
files from the git database.

git filter-branch --tree-filter 'rm -rf genotype_files' HEAD

Next I found there was another copy of the directory in the git
history under web/genotypes; a similar filter takes care of that,
presumably along these lines:
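
git filter-branch --tree-filter 'rm -rf web/genotypes' HEAD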

Now GN2 runs completely from GNU Guix: the beast named GN2 has been
tamed! There will probably be more hiccups and I still need to move
the javascript libraries to Guix, but at least we can move forward
from here.

21 Fixing GN2 file paths and a working GN2 server

Lots of file paths in GN2 are hard-coded. To make the beast work, and
to move the genotype files out of the git repo (reducing it from 350MB
to a more manageable 30MB), I rejigged the hard-coded paths.

One hiccup was that GN2 generates images and puts them in a folder
inside the source repository. Scary! Flask supports 'static' serving
from a subdirectory of the same name. We don't want that, but we can
rewire the path, which is what I do in this commit. Now not only do we
avoid writing images into the source tree (one problem is that Guix
won't allow it, because the store is immutable), but we also have
everything in a temporary directory, which is easier to monitor.
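
A minimal sketch of the rewiring (illustrative, not the actual
commit):

# Serve generated images from a temporary directory outside the source tree
import tempfile
from flask import Flask
GENERATED_IMAGES = tempfile.mkdtemp(prefix="gn2_images_")
app = Flask(__name__,
            static_folder=GENERATED_IMAGES,  # not the immutable store
            static_url_path="/image")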

I spent much of my time this week getting paths and dependencies
sorted. One ball-breaker was the piddle version I found on penguin:
despite its version number, piddle-1.0.15-py2.7 turned out to be
different from the online version. Good grief, I am happy we are
running Guix now!

Anyway, as it stands, GN2 runs fine on Guix, which means it will run
anywhere. pylmm, R/qtl and interval mapping work out of the box. I
still need to work on the plink and gemma wiring, as much of it is
hard-coded. We ought to move that out to the REST server anyway,
so that is next.

But at least we have something we can roll out.

22 Packaging Genenetwork (cont 5)

[2016-02-22 Mon] Now that pylmm is packaged, the final GN2 dependency
is the database. MySQL allows for read-only databases these days, so
we can create a nice testing/development database that is embedded in
a GNU Guix package. The idea is to do a

guix package -i genenetwork2-database

which will install GN2 with the database and GN2 will fire up with
one single command

genenetwork2

against which we can run the test system and even do some integration
testing (as long as it does not want to write data).

As we are probably going to split private (mutable) and public
(immutable) data in the future, I think this is a brilliant way of
deploying public datasets for testing. It should work right out of the
box. Multiple MySQL servers can even run off one single database, for
example.

The first step is to fetch a database from somewhere. We now have
http://files.genenetwork.org/raw_database/ for that purpose. Even the
small database is still too large (1.5GB), so we may want to trim it
down.

Then there are the flat files which we are going to move out of the
various directories they live in today. There are some .map and .bed
files living in a plinkgemma dir, for example, others are in the
source code git repo (on github). I'll consolidate these on
http://files.genenetwork.org/.

23 Packaging pylmm and sambamba

[2016-02-20 Sat] The current GN is running an old version of pylmm,
git@github.com:genenetwork/pylmm.git commit
b8a15885ed3701e079170d6a8bf69bb8d8349f9c ('Importing multi-core version
of pylmm for GN2'). So that got pylmm-multicore packaged in
guix-bioinformatics, thanks to Nick allowing for a proper FOSS license.

[2016-02-21 Sun] Roel started work on packaging sambamba, a fast NGS
tool. To make it work I wrote a new sambamba makefile for Guix, which
builds fine in my environment. Over to him again.

24 Expanding the GeneNetwork effort (the team is growing)

[2016-02-15 Mon] The story continues. The good news is that Roel and
Dennis are joining the effort this week. And Harm.

Harm is working on a UI using Karl's D3 materials, named the
'Arabidopsis eQTL analysis platform' or AraQTL; the Django source code
(an MVC) is here. Dennis started work on GPU-related stuff and Roel is
packaging general bioinformatics software in GNU Guix.

To make it easier to collaborate we have created an IRC channel on
freenode. Go to irc.freenode.net and join with

26 Packaging Genenetwork (cont 4)

[2016-02-13 Sat] qtlreaper I packaged a few weeks back; today it was
htmlgen's turn (a piece of software dating from 1998!). The version
that comes with GN is slightly different from the standard version. In
the Guix repo the python files are now correctly compiled and I only
included the .pyc files we are using. Guix just makes me happy.

So, the good news is that all the GN2 code bootstraps using the new
./bin/genenetwork2 from Guix up to

Can't connect to local MySQL server through socket '/run/mysqld/mysqld.sock'

This error is because I am not yet running mysql. I am getting close,
really close.

Lei has opened http-accessible storage for large files. This means
large static files can be removed from the git repos and split out in
GNU Guix deployment. This includes genotype files, source packages and
even test databases. GNU Guix checksums these files, so we can make
sure they do not change (just like that). I do think we should add the
date and source to the file name, so it is clear where each file came
from. E.g., these filenames are not so descriptive and they are
duplicates

27 Packaging Slurm

[2016-02-12 Fri] Today I packaged slurm, a resource manager for
clusters, and submitted the patch to GNU Guix for inclusion. The first
step towards using the cluster effectively. It was a bit harder than I
expected, mostly because non-free software had to be removed and the
build system adapted.

28 Managing power of the cluster

[2016-02-11 Thu] Yesterday I got more of Victor's old boxes and it
looks like some are in bad shape and some can be rescued. I have one
node screaming behind me now and it has PXE, so we can boot over the
network. One thing I want to do is easy provisioning of new images, to
make this a true throwaway GNU Guix cluster. A description of a simple
PXE server can be found here; required is DHCP plus tftp. When it gets
serious we should implement that. I also looked into wake-on-lan
(WOL), which should also work on these machines. It is good we have
these capabilities now, because a running node pulls 200W, while the
power supply for two nodes idling pulls 30W. I think we also need a
master switch for every 8 nodes (240W sleeping, 1600W awake!). I am
going to shut down 6 of the 8 existing nodes today. Waking them up is
as easy as

wakeonlan hwaddr

To list the hwaddrs (where machines.txt contains the output of ifconfig):

cat machines.txt |grep HWa|awk '{print $5}'
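
For reference, the magic packet that wakeonlan sends is simple enough
to construct by hand; a minimal Python sketch (standard library only,
example MAC):

# Broadcast a WOL magic packet: 6 bytes of 0xFF, then the MAC repeated 16 times
import socket
def wake(mac):
    payload = bytes.fromhex("FF" * 6 + mac.replace(":", "") * 16)
    s = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    s.setsockopt(socket.SOL_SOCKET, socket.SO_BROADCAST, 1)
    s.sendto(payload, ("255.255.255.255", 9))  # WOL listens on UDP port 9 (or 7)
    s.close()
wake("00:11:22:33:44:55")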

With the nodes sleeping they still pull 15W each. So that is
8x15W = 120W, or almost 3 kWh per day; at roughly EUR 0.20 per kWh
that is about EUR 0.60 p.d. or EUR 17 per month.

When the nodes double that will be about EUR 35 per month (sleeping;
running is 10x as much! You can see the point of cloud computing).

I still think we need a 220V master switch which can be managed
remotely, so all cost drops to zero when the cluster is not in use.

Next time I am with the machines I need to configure the BIOS to
accept PXE. After that we can roll out the provisioning.

On the hacker space mailing list the following question came up, with
my answer:

Q: Would it be an idea to transform (some of) the nodes into a more
generalized tinkering platform - for instance I heard someone
mention botnets recently, and it would be a shame to rig a separate
server for that if one could just spool a VM to a cluster node.
Building a "general network tinkering and test deployment platform"
might be a nice project in itself.

A: The cluster is exactly meant for experiments and I am happy to share.

The concept is a throwaway setup. That means you should write your
installation in such a way that it can be regenerated every time. This
is done through GNU Guix scripts which allow for VMs, containers and
PXE installs on the nodes.

Running experiments needs some coordination with me and perhaps
others. Every software deployment and configuration should be captured
in Guix.

The idea is NOT to have ad hoc installation of software on
the main node (though you are free to mess within your VM) and also
NOT to have stuff running full-time.

Think of it as proof-of-concepts which can run for a while. I'll be
happy to facilitate with 8x nodes with 8 cores and soon another 8x8.

29 Back in NL and hacking GNU Guix

[2016-02-08 Mon] Returned a week ago and most time went into FOSDEM
and submitting a grant proposal. My FOSDEM talk can be found here -
with slides and video.

This week I am ramping up GeneNetwork packaging again (after building
and testing), so GN is on the latest Guix. I am also pushing the munge
and slurm packages to Guix, as well as updating the documentation,
including an impressive graph (I'll add one for GN when it is
complete).

I am purging LFS (because it is proprietary to github) from the GN2
git repo. I.e.

The rest of the day was spent on a regression of python2-sqlalchemy
(grrr). It turned out python2-sqlalchemy does not work with a more
recent version of sqlite, so I introduced an older version again. It
is amazing how many tools use sqlite. Anyway, the good news is that
the build system is moving incrementally forward.

30 Packaging Genenetwork (cont 3)

[2016-01-23 Sat] The GNU Guix GN2 installer is pretty much finished,
though I don't have the bandwidth to test the project. Next up is the
GN1 installer. I dug up an E-mail with instructions from Lei which
points out that Python 2.4.3 may be required, as well as a bunch of
modules that are in a 3rd-party tar ball: graphviz-2.22.2, htmlgen,
json, numarray-1.5.2, piddle, PIL (in guix), pp-1.5.7, pyx, pyXLWriter,
svg. I.e., a few more modules to package. At least all of these come
with source, because even a cursory look shows these modules are old
and HTMLgen, for example, no longer has a download location. Since we
are not going to write software for GN1, I will just use the tar ball
with an installer script. The json, pp and numarray modules may be
worth adding to GNU Guix. Anyway, it all looks doable now. Today I
packaged python2-numarray as a GNU Guix package.

31 Packaging Genenetwork (cont 2)

[2016-01-22 Fri] The latest GNU Guix installer copies the full webserver
code into the store

With that we have achieved clean versioning as a first step. Now I
need to make the dependencies available to the webserver, which is
somewhat tedious. But once it is done it should last for a long
time. Luckily a lot of the packages already exist in GNU Guix -
sometimes it pays to wait in informatics ;). I think most of the
dependencies are in the package now. It looks like I need to package
another 4-5 myself. Hopefully they don't depend on too many modules
themselves (r-wgcna has a long list I still need to do! The new
Bioconductor importer of GNU Guix will be a great help).

Today I added a GNU Guix package for qtlreaper. But looking into the
GN2 source code I don't think it is actually used. Likewise a bunch of
other modules.

32 Packaging Genenetwork (cont)

Today I managed to complete the python installer for GN2. This has led
to a request for handling genotype data outside the tree and a look at
contained packages not written by us. I also reported a bug in GNU
Guix related to fetching repositories from git, and updated R/qtl to
the latest version in GNU Guix. I have started adding and testing
dependencies following Artem's Docker recipe, with its included
requirements for the Python installation. It will be interesting to
see what kind of graph comes out!

33 Started packaging Genenetwork

Both Harm Nijveen and I are working on the GeneNetwork
installation. The first hiccup is that the git repo for GN1 is not
complete and the tar-ball is 33GB. Meanwhile I am porting GN2 to GNU
Guix. The first step was to create a python installer, because the .py
files need to be compiled to execute in a read-only store. The net
effect is a GNU Guix install in

At least the python stuff looks complete, but the html and other files
are still missing.

34 Install Guix

To solve the R/lmmlite install I am opting for a GNU Guix package.
We are setting up an Arvados cluster based on GNU Guix and agreed that
all installation should be through Guix. Progress is recorded here.
I needed to do the base install anyway so Roel can continue later.
Today I got to the point where I can experiment with R/lmmlite.

35 Install R/lmmlite

Working from Zanzibar, the time has come to align my version of pylmm
with R/lmmlite by Prof. Karl Broman (lmmlite is an R/C++ hybrid
implementation of Nick's pylmm). It ought to be interesting to compare
the output and performance of both tools and see how we can move
forward hosting one or more of them in GN. Speed and memory usage are
of interest when hosting such services.

For R/lmmlite the R libraries need to be installed, and pulling over
100MB for the R devtools and dependencies over narrow band is no
joy. I am on narrow band, and the 'lite' part obviously does not
reflect the implicit dependencies. Karl sent me an update that I
should use 'R CMD', but that in turn requires knitr etc. This is stuff
I can only do at night, so as not to disrupt other users.

Next to the CUDA stuff I also did some work on the deploy tool. It
does some basic file copying now from YAML descriptions, the most
important part for deployment. I can use it now for configuring
emacs, vim, sshd etc. Simple file editing is the next step.
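
The deploy descriptions are plain YAML; the tool does something along
these lines (a sketch; the "copy"/"src"/"dest" field names are my
assumptions, not the tool's actual format):

# Copy files as described in a YAML document (needs pyyaml)
import shutil
import yaml
def deploy(description_file):
    with open(description_file) as f:
        spec = yaml.safe_load(f)
    for item in spec.get("copy", []):
        shutil.copy(item["src"], item["dest"])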

37 Installing penguin2

The new production GN server (Penguin2) is sweet: 56 2GHz Intel Xeon
cores, 256GB RAM and an NVIDIA K80m on board. Ready to R&R! At the
moment it is not running anything - just a barebones Ubuntu 14.04 with
a Linux 3.19 kernel. I'll add CUDA (now where did I document that
again? Aha, it is in my old BLOG):

First step is to install the CUDA driver and software provided by
NVIDIA.

The toolkit (with driver!) is a 1.9GB download and binary
install. Luckily the download link can be used from the server
directly. Initially, I installed the local .deb file

UVM is used to share memory between the CPU and GPU in CUDA
programs. We won't need that for the time being.

To install the software I used screen (the installer is probably not
atomic).

Driver: Not Selected
Toolkit: Installed in /usr/local/cuda-7.
Samples: Installed in /home/pjotr, but missing recommended libraries
Please make sure that
- PATH includes /usr/local/cuda/bin
- LD_LIBRARY_PATH includes /usr/local/cuda/lib64, or, add /usr/local/cuda/lib64 to /etc/ld.so.conf and run ldconfig as root
Logfile is /tmp/cuda_install_31036.log

The install did warn that a default install may fail and that it was
using unoptimized lapack. Running the following example crashed the
machine on one Tesla the first time and segfaulted on another. Great.

is correct. However, when we want to install the CUDA packages it
tries to install the more recent driver. WHAT NOW? I decided to use
the binary installer (rather than the .deb) of the OLDER CUDA 7.0
version, to align with Penguin1. Note, however, that the newer CUDA
libraries are more efficient at floating point storage, so we may want
to upgrade when we start running out of CUDA RAM (sometime down the
line). After installation:

Driver: Installed
Toolkit: Installed in /usr/local/cuda-7.0
Samples: Installed in /home/pjotr
Please make sure that
- PATH includes /usr/local/cuda-7.0/bin
- LD_LIBRARY_PATH includes /usr/local/cuda-7.0/lib64, or, add /usr/local/cuda-7.0/lib64 to /etc/ld.so.conf and run ldconfig as root
To uninstall the CUDA Toolkit, run the uninstall script in /usr/local/cuda-7.0/bin
To uninstall the NVIDIA Driver, run nvidia-uninstall
Please see CUDA_Getting_Started_Guide_For_Linux.pdf in /usr/local/cuda-7.0/doc/pdf for detailed information on setting up CUDA.
Logfile is /tmp/cuda_install_13279.log

which differs from the Guix version, which is libc-2.22. Hmmmm. Major
HMMM. This does not look good. So next I tried virtualenv with the
Debian/Ubuntu packages. And, oh (banging head on table), I forgot
to activate the virtualenv! That means I was mixing Pythons.

So I can work back towards using the GNU Guix version with the later
CUDA 7.5.

Since we are aiming for reproducible installation paths it will take a
bit of effort to install the server properly using GNU Guix, deploy
etc. I also intend to write a new tool that takes care of monitoring
running services. This has to happen in conjunction with the other
stuff we are doing (not least getting pylmm to work with human data).

Next to programming I always end up working on (software)
deployments. Maybe it is because few people are really interested. For
reproducible software it is key.

38 Installing an Arvados cluster

[2015-10-14 Wed] For GN we are going to use a cluster for back-end
computations. These weeks, to evaluate Arvados, I am building a ~180
core compute cluster. The installation process is documented here. The
cluster will be hosted in Groningen. The installation is mostly
automated. For development and testing of automation I use a KVM
virtual machine as described here.

39 R/qtl control, genotype and phenotype parsers added

[2015-10-02 Fri] Today I added control, genotype and phenotype parsers
for R/qtl support. R/qtl can have multiple phenotypes, so we need to
add a switch for selecting the phenotype (or are we going to run them
all by default?).

The F2 Iron set is also nice because it has a lot of missing
data. We'll see how that comes out. In the process I am adding support
for column headers.

40 Teaching R/qtl to pylmm

Because of an increase in human data and more complex crosses, such as
the Diversity Outbred mice, there is a demand for using LMMs from
R/qtl. The current R/lme4 library could arguably be used, but it
misses the speed of tools like pylmm and FaST-LMM to crunch through
the data. So, where we recently introduced R/qtl in GN2, now we are
introducing pylmm for R/qtl (I rather enjoy inside-out use
cases). With Karl using pylmm, scrutiny of the results will also get
another impetus - which is important.

The first step is to teach pylmm the genotype and phenotype data
formats that can be emitted from R/qtl, as described here.
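
A sketch of reading such a CSV (the layout is my assumption: first
column an individual id, remaining columns markers or phenotypes):

# Read an R/qtl-style CSV into header, ids and values
import csv
def read_rqtl_csv(path):
    with open(path) as f:
        rows = list(csv.reader(f))
    header = rows[0][1:]                    # marker or trait names
    ids = [row[0] for row in rows[1:]]
    values = [row[1:] for row in rows[1:]]
    return header, ids, values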

41 A new page - starting on pylmm again

[2015-10-01 Thu] The coming months I will be working on pylmm again,
so it is time to restart this BLOG. Since my last BLOG we achieved a
number of things. First, the pylmm license has been changed to a true
FOSS license. Thanks Nick! Second, we got a new CUDA K80 server which
should allow for some LMM heavy lifting. Third, we got Arvados to work
and we are setting up a dedicated cluster as a pilot for backend
computations. Fourth, GNU Guix is flying and reproducible software
installations are around the corner. Fifth, we are working on
embedding pylmm into R/qtl. And, sixth, we did a Google Summer of Code
project with LMMs.

There are still some outstanding TODOs for pylmm from the older
BLOG. In a nutshell, pylmm/CUDA works, but it needs tweaking for the
human data in GN2 (mostly issues around file size/transfer) and
covariates need to be passed in. This work matches the missing links
for R/qtl to use pylmm, so it is clear what I should be working on.

But first we need to bring in the test system (currently sitting in a
separate github repository) and teach pylmm to read R/qtl
formats. I'll do that in parallel.

The outstanding pylmm issues on github match the list from the
previous BLOG and read (in no particular order):