Welcome to Cloe

Cloe (pronounced like the name Chloë) is a computational biology tool to infer
the clonal structure of heterogeneous tumour samples. It implements a
phylogenetic latent feature model that discovers hierarchically-related
patterns (clonal genotypes) in the samples, and uses them to explain the
observed mutation data.

News

Version 1.0 (2018-11-22)

A Chinese Restaurant Process clusters mutations with an infinite mixture
of binomial distributions, and automatically identifies how many clusters
are needed. Do check the resulting clusters before proceeding, to ensure
that different biological signals have not been placed in the same
cluster. This step works best if you have multiple samples.

The code for clustering is written in C++11 with RcppArmadillo.

Optimisation of clonal fractions

Clonal fractions are no longer sampled by the MCMCMC sampler, but
optimised (limSolve::lsei) given the data and the current genotypes.
This is faster and helps mixing.
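
As a rough illustration of this kind of constrained fit, the sketch below uses
limSolve::lsei to find non-negative fractions that sum to one and best
reproduce a vector of observed frequencies. The toy linear model (expected
frequency = genotype matrix times fractions), the helper name fit_fractions
and all numbers are assumptions made for illustration; this is not Cloe's
internal code.

```r
# Hypothetical helper: least-squares clone fractions for one sample.
# genotypes: J x K binary matrix (mutations x clones), treated as fixed.
# observed:  length-J vector of observed mutation frequencies.
fit_fractions <- function(genotypes, observed) {
  K <- ncol(genotypes)
  fit <- limSolve::lsei(
    A = genotypes, B = observed,               # minimise ||A x - B||^2
    E = matrix(1, nrow = 1, ncol = K), F = 1,  # fractions sum to 1
    G = diag(K), H = rep(0, K)                 # every fraction >= 0
  )
  as.numeric(fit$X)
}

if (requireNamespace("limSolve", quietly = TRUE)) {
  # Toy data: 3 mutations, 3 clones; C2 and C3 both descend from C1.
  genotypes <- cbind(C1 = c(1, 0, 0), C2 = c(1, 1, 0), C3 = c(1, 0, 1))
  true_fractions <- c(0.5, 0.3, 0.2)
  observed <- as.numeric(genotypes %*% true_fractions)
  est <- fit_fractions(genotypes, observed)
  print(round(est, 3))
}
```

With noise-free toy data the fit should recover the fractions used to generate
it; with real data the constraints keep the estimate on the simplex.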

Revised tree updates

The tree is now updated by Gibbs sampling and with a prune-regraft step.
Gibbs sampling goes through each node k and looks for a new parent
among all nodes outside of k's subtree. Prune-regraft is a joint update
of tree, genotypes (and fractions): genotypes of the moved subtree are
updated so as to fit with the new parent; fractions are optimised given
the new genotypes.
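
The candidate set used by the Gibbs step described above can be sketched in a
few lines of base R. The parent-vector encoding of the tree and the function
names are assumptions made for this example and are not taken from Cloe's code.

```r
# Tree stored as a parent vector: parent[k] is the parent of node k,
# and the root has parent 0 (an assumed encoding for this sketch).
descendants <- function(parent, k) {
  found <- k
  frontier <- k
  while (length(frontier) > 0) {
    frontier <- which(parent %in% frontier)  # children of the current layer
    found <- c(found, frontier)
  }
  found  # k together with every node below it
}

# Nodes that may become the new parent of k: everything outside k's subtree.
candidate_parents <- function(parent, k) {
  setdiff(seq_along(parent), descendants(parent, k))
}

# Example tree: 1 is the root; 2 and 3 are children of 1; 4 and 5 of 2.
parent <- c(0, 1, 1, 2, 2)
candidate_parents(parent, 2)  # 1 3
```

Note that the current parent of k remains among the candidates, so the Gibbs
step can also leave the tree unchanged.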

Genotypes updated a random portion at a time

Because genotypes and fractions constrain each other during inference,
each genotype update is now smaller and changes only a random portion
of the mutations.

Added AIC and WAIC for model selection

Simpler ISA

Parallel mutations are mutations gained by a clone (the current clone
has the mutation, its parent does not) that have already appeared
elsewhere in the tree. Such a previously seen mutation occurs with
probability mu * nu instead of mu, where nu is the ISA (infinite sites
assumption) penalty.
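
The rule above amounts to a one-line function. The name gain_probability and
the values of mu and nu below are made up for illustration; they are not
Cloe's defaults.

```r
# Probability that a clone gains a mutation its parent lacks.
# seen_before: TRUE if the mutation has already appeared in the tree.
# mu: base mutation probability; nu: ISA penalty (illustrative values only).
gain_probability <- function(seen_before, mu, nu) {
  ifelse(seen_before, mu * nu, mu)
}

# First mutation is new to the tree, second is a parallel mutation.
gain_probability(c(FALSE, TRUE), mu = 0.1, nu = 0.01)  # 0.1 0.001
```

With nu below 1, parallel mutations become less likely than first
occurrences, and nu = 1 removes the penalty altogether.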

cowplot is now used for all plots

Clones have been renamed

The normal clone is now called N (instead of C1), while the first
non-normal clone is now C1 (instead of C2).

Classes have changed somewhat

All three classes have changed slightly to accommodate the changes above.

Leaner code

Thanks to Jack Kuipers for useful discussions on some of these updates.

Requirements

Cloe has been developed with R >= 3.2.1. It has been tested on Linux (Debian
stable) and Mac OS X (10.8.5 and later).

In the optional, but recommended, step 1.5, you can cluster mutations with a
Chinese Restaurant Process. Plot the resulting object to ensure that the
clustering has not mixed different biological signals into the same cluster. If
that happened, rerun crp with a larger value of alpha (see ?crp for more
information).
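
To see why a larger alpha tends to split the data into more clusters, the
sketch below simulates cluster labels from the Chinese Restaurant Process
prior alone. It ignores the binomial likelihood that the actual clustering
uses, and the functions crp_prior_draw and expected_clusters are hypothetical
helpers, not part of Cloe.

```r
# Draw cluster labels for n items from a CRP prior with concentration alpha.
crp_prior_draw <- function(n, alpha) {
  labels <- integer(n)
  labels[1] <- 1L
  for (i in seq_len(n)[-1]) {
    sizes <- tabulate(labels[seq_len(i - 1)])   # current cluster sizes
    probs <- c(sizes, alpha) / (i - 1 + alpha)  # existing clusters, then a new one
    labels[i] <- sample.int(length(probs), size = 1, prob = probs)
  }
  labels
}

# Expected number of clusters under the prior: sum of alpha / (alpha + i - 1).
expected_clusters <- function(n, alpha) {
  sum(alpha / (alpha + seq_len(n) - 1))
}

set.seed(1)
table(crp_prior_draw(100, alpha = 1))
table(crp_prior_draw(100, alpha = 5))
expected_clusters(100, alpha = 1)
expected_clusters(100, alpha = 5)
```

The expected number of clusters grows with alpha, which is why raising it is
the suggested remedy when distinct signals end up in the same cluster.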

In step 2, the sampler runs our MCMCMC algorithm using the number of clones K
that you specify. If you do not know how many clones are present in the data,
you should run the sampler for several likely values, and select "the best
model" in step 4.

By default, Cloe runs 4 parallel tempered chains. You can change this behaviour
by specifying how many chains you wish and their temperatures (e.g.
chains=2, temperatures=c(1, 0.9)). There is no point in running multiple
parallel chains if they do not swap their states efficiently and throughout the
run. To check that all went smoothly, plot the cloe_mcmc object returned by
sampler(). If some chains are not swapping, reduce the temperature intervals
between them.
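
The advice on temperature intervals follows from the usual swap rule in
Metropolis-coupled MCMC. The sketch below assumes that each temperature acts
as an exponent on the posterior (so the chain with temperature 1 targets the
posterior itself); that reading of the temperatures argument, the function
name and the numbers are assumptions for illustration.

```r
# Probability of accepting a state swap between two tempered chains.
# temp_i, temp_j: exponents applied to the posterior by each chain.
# logpost_i, logpost_j: untempered log-posterior of each chain's current state.
swap_probability <- function(temp_i, temp_j, logpost_i, logpost_j) {
  min(1, exp((temp_i - temp_j) * (logpost_j - logpost_i)))
}

# Cold chain at log-posterior -100, heated chain at -110.
swap_probability(1, 0.9, -100, -110)  # exp(-1), about 0.37
swap_probability(1, 0.5, -100, -110)  # exp(-5), about 0.007
```

For the same pair of states, the closer the two temperatures, the more likely
the swap is accepted, which is why narrowing the intervals helps stuck chains.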

The summarise function of step 3 discards iterations at the beginning of the
chain with the burn option (it takes a proportion of the iterations, e.g.
burn=0.5 discards the first half of the chain), thins the chain by keeping
every i-th iteration with thin=i, and returns a number of solutions sorted by
decreasing log-posterior probability.
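
As a minimal sketch of what burn and thin do to the iteration indices, the
base R helper below mimics the behaviour described above. The function
kept_iterations and the log-posterior values are hypothetical; summarise may
differ in details such as rounding.

```r
# Indices of the iterations that survive burn-in and thinning.
# burn: proportion of iterations discarded from the start (0 <= burn < 1).
# thin: keep every thin-th iteration after burn-in.
kept_iterations <- function(n_iter, burn = 0.5, thin = 1) {
  stopifnot(burn >= 0, burn < 1, thin >= 1)
  first <- floor(burn * n_iter) + 1
  seq(from = first, to = n_iter, by = thin)
}

# Made-up log-posterior trace of a 10-iteration chain.
log_posterior <- c(-120, -95, -101, -99, -97, -96, -98, -94, -96.5, -93)
keep <- kept_iterations(length(log_posterior), burn = 0.5, thin = 2)
keep                                                 # 6 8 10
keep[order(log_posterior[keep], decreasing = TRUE)]  # 10 8 6
```

The last line orders the retained iterations by decreasing log-posterior,
which is the order in which the solutions are reported.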

Note: you can plot all of Cloe's classes, and plots are automatically
written to disk. This behaviour may change in the future.

Model selection

select_model returns a list of cloe_summary objects sorted by the chosen
criterion (see ?select_model for more information). The model selection plots
show the log-likelihood, log-posterior, AIC and WAIC. Choose the simplest
model that best explains the data. As proxies for this, look for
high log-posterior and log-likelihood values, and low AIC and WAIC.
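
For reference, the two information criteria can be written down in a few lines
using their textbook definitions. How Cloe counts parameters and evaluates the
pointwise log-likelihood is not stated here, so n_params, pointwise_ll and the
numbers below are assumptions for illustration.

```r
# AIC: penalise the maximised log-likelihood by the number of parameters.
aic <- function(log_lik, n_params) {
  2 * n_params - 2 * log_lik
}

# WAIC from an S x N matrix of pointwise log-likelihoods
# (S posterior samples in rows, N data points in columns).
waic <- function(pointwise_ll) {
  lppd <- sum(log(colMeans(exp(pointwise_ll))))  # log pointwise predictive density
  p_waic <- sum(apply(pointwise_ll, 2, var))     # effective number of parameters
  -2 * (lppd - p_waic)
}

# A richer model fits slightly better but is penalised for its size.
aic(log_lik = -100, n_params = 5)  # 210
aic(log_lik = -98, n_params = 9)   # 214
```

In this toy comparison the smaller model has the lower AIC, so it would be
preferred despite its slightly lower log-likelihood.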