
GiniClust

GiniClust is a clustering method implemented in Python and R for detecting rare cell-types from large-scale single-cell gene expression data.

GiniClust can be applied to datasets originating from different platforms, such as multiplex qPCR data, traditional single-cell RNAseq or newly emerging UMI-based single-cell RNAseq, e.g. inDrops and Drop-seq.

GiniClust is created and maintained by the GC Yuan Lab at Harvard University and the Dana-Farber Cancer Institute and comes with a graphical user interface for convenience.

Installation

Please ensure that you have Python 2.7 in your environment. The graphical user interface of GiniClust relies on wxPython, a Python wrapper for the cross-platform wxWidgets API. Instructions on how to install wxPython are available on the corresponding website. On Fedora Linux, the following at the command-line interface worked just fine:

$ sudo dnf install wxPython

In addition, GiniClust relies on the following libraries:

Gooey (version 0.9.2.3 or later);

setuptools (version 24.0.2 or later).

Those packages should be automatically installed or upgraded via a pip installation. For instance, to install Gooey, proceed as follows:

start a terminal session;

run $ pip install Gooey --upgrade.

If in doubt, please check that those libraries got installed properly by trying to import them or some of their modules in your Python interpreter: >>> import gooey, pkg_resources.
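The import check above only confirms that the libraries are present. As a minimal sketch, the minimum versions listed earlier could also be verified from Python. The helper names (`version_tuple`, `meets_minimum`, `installed_version`) are hypothetical and not part of GiniClust, and the version comparison is a simplified numeric one that ignores suffixes such as `.post1`.

```python
import re

# Minimum versions quoted in this README.
MINIMUM_VERSIONS = {"Gooey": "0.9.2.3", "setuptools": "24.0.2"}


def version_tuple(version):
    """Turn '0.9.2.3' into (0, 9, 2, 3); stop at the first non-numeric piece."""
    parts = []
    for piece in version.split("."):
        match = re.match(r"\d+", piece)
        if match is None:
            break
        parts.append(int(match.group()))
    return tuple(parts)


def meets_minimum(installed, minimum):
    """Return True if the installed version is at least the minimum."""
    return version_tuple(installed) >= version_tuple(minimum)


def installed_version(name):
    """Return the installed version string of a distribution, or None."""
    try:
        from importlib.metadata import version  # Python 3.8+
        return version(name)
    except Exception:
        pass
    try:
        import pkg_resources  # shipped with older setuptools (Python 2.7)
        return pkg_resources.get_distribution(name).version
    except Exception:
        return None


if __name__ == "__main__":
    for name, minimum in sorted(MINIMUM_VERSIONS.items()):
        found = installed_version(name)
        if found is None:
            print("%s: not installed" % name)
        elif meets_minimum(found, minimum):
            print("%s %s: OK" % (name, found))
        else:
            print("%s %s: older than %s" % (name, found, minimum))
```

If a library is reported as missing or too old, rerunning the pip command shown above with `--upgrade` should bring it up to date.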

As for the R code at the core of much of GiniClust's computations, on macOS and Windows only the official R installation file is supported and tested. Using other installation methods, such as brew, may lead to runtime errors.

Besides, some users might experience issues installing another of GiniClust's dependencies: the MAST R package. If this happens, please visit the MAST website (https://github.com/RGLab/MAST) for detailed instructions. We recommend that users upgrade the MAST package to the newest version. If you are using an old version, you may need to replace the file DE_MAST.R in 'Rfunction' with the DE_MAST.R found in 'Archive'.

Input file format

The input file is a gene expression matrix in comma-separated value (csv) format.

Specifically, for qPCR data, each row holds the log2 gene expression levels of one gene; for RNA-seq data, each row holds UMI counts per cell or raw read counts per cell. Note: log2-transformed RNA-seq data may not work with GiniClust! We suggest that users use featureCounts (http://subread.sourceforge.net/) or htseq-count (http://www-huber.embl.de/users/anders/HTSeq/doc/counting.html) to obtain raw read counts. The first row contains cell IDs. The first column contains unique gene names.

You can take a look at one of our test datasets (stored in the sample_data folder within GiniClust's repository):

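|             | MGH26 | MGH26.1 | MGH26.2 | MGH26.3 |
|-------------|-------|---------|---------|---------|
| 1/2-SBSRNA4 | 0     | 47      | 0       | 0       |
| A1BG        | 41    | 80      | 3       | 0       |
| A1BG-AS1    | 0     | 0       | 0       | 0       |
| A1CF        | 0     | 0       | 0       | 0       |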
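Before handing a file to GiniClust, it can help to confirm that it follows the layout described in this section. The sketch below is not part of GiniClust: `check_expression_matrix` is a hypothetical helper, written for Python 3 with the standard library only. It assumes the first cell of the header row is empty (or holds a label that can be ignored), and its optional `raw_counts` flag is only a heuristic for spotting RNA-seq values that are not non-negative integers, for instance log2-transformed data.

```python
import csv
import io

# The first rows of the sample dataset, written out as csv text.
# Assumption: the top-left header cell is empty.
SAMPLE_CSV = """,MGH26,MGH26.1,MGH26.2,MGH26.3
1/2-SBSRNA4,0,47,0,0
A1BG,41,80,3,0
A1BG-AS1,0,0,0,0
A1CF,0,0,0,0
"""


def check_expression_matrix(handle, raw_counts=False):
    """Read a gene-by-cell matrix and return (cell_ids, genes, problems)."""
    rows = [row for row in csv.reader(handle) if row]
    if not rows:
        return [], [], ["file is empty"]
    header, body = rows[0], rows[1:]
    cell_ids = header[1:]               # first row holds the cell IDs
    genes = [row[0] for row in body]    # first column holds the gene names
    problems = []
    if len(set(genes)) != len(genes):
        problems.append("gene names are not unique")
    for row in body:
        if len(row) != len(header):
            problems.append("row %s has %d fields, expected %d"
                            % (row[0], len(row), len(header)))
            continue
        for value in row[1:]:
            try:
                number = float(value)
            except ValueError:
                problems.append("non-numeric value %r for gene %s"
                                % (value, row[0]))
                break
            # Heuristic: raw read or UMI counts are non-negative integers.
            if raw_counts and (number < 0 or not number.is_integer()):
                problems.append("value %s for gene %s is not a raw count"
                                % (value, row[0]))
                break
    return cell_ids, genes, problems


if __name__ == "__main__":
    cells, genes, problems = check_expression_matrix(
        io.StringIO(SAMPLE_CSV), raw_counts=True)
    print("%d cells, %d genes, problems: %s" % (len(cells), len(genes), problems))
```

For a file on disk, the same function can be called on `open(path, newline="")`, where `path` stands for your own csv file.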

Usage

To run GiniClust, please download the GiniClust GitHub repository, unzip it and move to the extracted directory so that it becomes your current working directory.

Then, in a Linux environment, proceed as follows:

start a terminal session;

enter $ python GiniClust.py.

From an OS X or Windows environment, proceed as follows:

launch a terminal session;

enter $ pythonw GiniClust.py.
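The only difference between the two sets of steps above is the interpreter command. As a small sketch, a wrapper script could pick it from the current platform. The functions `launcher_for` and `gui_command` are hypothetical and are not shipped with GiniClust; the sketch assumes it is run from the extracted GiniClust directory.

```python
import platform
import subprocess


def launcher_for(system_name):
    """Return the interpreter command this README suggests for a platform."""
    # platform.system() reports 'Darwin' on OS X and 'Windows' on Windows.
    if system_name in ("Darwin", "Windows"):
        return "pythonw"
    return "python"


def gui_command(system_name=None):
    """Build the argument list that starts the GiniClust GUI."""
    if system_name is None:
        system_name = platform.system()
    return [launcher_for(system_name), "GiniClust.py"]


if __name__ == "__main__":
    # Starts the GUI; GiniClust.py must be in the current working directory.
    subprocess.call(gui_command())
```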

A graphical user interface will open and prompt you to choose a file to process from your directory tree, specify the type of data at hand (qPCR or RNA-seq), and name the folder where you would like to store GiniClust's output (see the section below for more information about those files). A screenshot is provided herewith:

Alternatively, GiniClust can be run directly as an R script at the command-line interface:

$ Rscript Giniclust_Main.R [options]

You can specify the following options:

-f CHARACTER or --file=CHARACTER, input dataset file name

-t CHARACTER or --type=CHARACTER, input dataset type: choose from 'qPCR' or 'RNA-seq'
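When GiniClust is driven from another script, the two options above can be assembled programmatically. The following is a minimal sketch, not part of GiniClust: `build_rscript_command` and `run_giniclust` are hypothetical helpers, `my_counts.csv` is a placeholder file name, and only the `--file` and `--type` options documented above are used.

```python
import subprocess

# The two dataset types accepted by the --type option.
VALID_TYPES = ("qPCR", "RNA-seq")


def build_rscript_command(input_file, data_type, script="Giniclust_Main.R"):
    """Return the Rscript argument list for one GiniClust run."""
    if data_type not in VALID_TYPES:
        raise ValueError("data_type must be one of %s, got %r"
                         % (", ".join(VALID_TYPES), data_type))
    return ["Rscript", script, "--file=" + input_file, "--type=" + data_type]


def run_giniclust(input_file, data_type):
    """Run GiniClust from the extracted directory; return the exit code."""
    return subprocess.call(build_rscript_command(input_file, data_type))


if __name__ == "__main__":
    # Print the command rather than running it; 'my_counts.csv' is a placeholder.
    print(" ".join(build_rscript_command("my_counts.csv", "RNA-seq")))
```

Passing the arguments as a list avoids shell quoting problems when the input path contains spaces.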

Dataname_RareCluster_overlapgene_rawCounts_bar_plot.genename.pdf: barplot of rare cluster and major cluster for the overlap genes

Furthermore, a folder named 'Library' will be created, which includes a wealth of newly installed packages.

Reference

The GiniClust software was developed in support of a research project conducted at the GC Yuan Lab (Harvard University & DFCI). If you find it useful to your own investigations, please cite the following publication:

Credits

Lan Jiang (lan_jiang at hms dot harvard dot edu), the main developer of GiniClust, wrote the R scripts and started the README file. Gregory Giecold (ggiecold at jimmy dot harvard dot edu) developed the graphical user interface, reorganized the R packaging and edited the README file. Huidong Chen (hdchen at jimmy dot harvard dot edu) wrote the R command-line interface and contributed to the graphical user interface. Qian Zhu (qzhu at princeton dot edu) contributed to the graphical user interface. We would like to give special thanks to Luca Pinello who introduced the Gini-index and advised on the development and implementation of the GiniClust software.