Principal component analysis (PCA) is a statistical procedure that can be used for exploratory data
analysis. PCA uses linear combinations of the original variables (e.g. gene expression values) to define
a new set of uncorrelated variables (principal components). These new variables are orthogonal to each
other, which avoids redundant information.

Thus, PCA can be used to reduce the dimensions of a data set, so that data sets and their variance
can be described with a reduced number of variables. The similarity between data sets is reflected
by their distances in the projection onto the space defined by the principal components,
so PCA can also be used to identify outliers with respect to the principal components.

It is often sufficient to look at the first two components, as these account for the largest
share of the variability.
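
The idea of orthogonal linear combinations can be illustrated with a minimal sketch. It assumes NumPy is available and uses a singular value decomposition on made-up toy values. The function name pca_svd is hypothetical, and this is not the pcaMethods implementation used by the task.

```python
import numpy as np

def pca_svd(data, n_components=2):
    """Return (scores, components, explained_ratio) for a samples x variables matrix."""
    x = np.asarray(data, dtype=float)
    centered = x - x.mean(axis=0)  # center each variable (column)
    # Rows of vt are orthonormal directions, ordered by decreasing singular value.
    u, s, vt = np.linalg.svd(centered, full_matrices=False)
    components = vt[:n_components]
    scores = centered @ components.T  # sample coordinates in the new space
    explained_ratio = (s ** 2) / np.sum(s ** 2)
    return scores, components, explained_ratio[:n_components]

# Toy matrix: 4 samples (rows) x 3 variables (columns); the values are made up.
toy = [[2.0, 4.1, 1.0],
       [3.0, 6.2, 1.1],
       [8.0, 15.9, 0.9],
       [9.0, 18.1, 1.0]]
scores, components, ratio = pca_svd(toy)
```

Each row of components is one linear combination of the original variables, and the columns of scores are the coordinates that would be drawn in a score plot.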

What the program will do

This task can be used to get an impression of the similarity of RNA-sequencing samples, e.g. to identify subgroups or outliers.

The variance in RNA-Seq data usually grows with the expression mean. PCA on the matrix of normalized read counts will therefore often lead to principal components that are dominated by the variance of a few highly expressed genes. The DESeq2 regularized-logarithm transformation (rlog) transforms the data matrix of read counts per gene (or transcript) to log scale, but specifically adapts to the high random noise of low count data. This is explained in detail in "RNA-Seq workflow: gene-level exploratory analysis and differential expression". The matrix of raw counts is input to the DESeq2 rlog function, and the resulting transformed matrix is used as input for the principal component analysis (PCA, using the R package pcaMethods).
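
The effect of the mean-variance dependence can be illustrated with a small sketch on made-up counts. Note that it uses a plain log2(count + 1) as a simple stand-in and is not the DESeq2 rlog function. In particular, it does not reproduce the shrinkage that rlog applies to low count data. The function name top_variance_gene is hypothetical.

```python
import math
from statistics import pvariance

def top_variance_gene(matrix):
    """Return (row index, share of summed variance) of the most variable gene."""
    variances = [pvariance(gene_values) for gene_values in matrix]
    index = max(range(len(variances)), key=variances.__getitem__)
    return index, variances[index] / sum(variances)

# Made-up counts: rows are genes, columns are 4 samples.
counts = [
    [10000, 12000, 9000, 11000],  # highly expressed gene, modest relative change
    [10, 40, 12, 45],             # lowly expressed gene, about 4-fold difference
    [100, 105, 98, 102],          # stable gene
]
# Stand-in transformation (assumption): log2(count + 1), NOT rlog.
log_counts = [[math.log2(c + 1) for c in gene] for gene in counts]

raw_index, raw_share = top_variance_gene(counts)
log_index, log_share = top_variance_gene(log_counts)
```

On the raw scale nearly all of the variance comes from the highly expressed gene, whereas on the log scale the gene with the largest fold change between samples contributes most.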

Rlog transformation is the default. Although not recommended, it is possible to do PCA directly on normalized expression values.
Based on the read distribution in the input files a normalized expression value (NE)
will be calculated for each locus (or transcript) for each input file.
The NE-value is based on the number of reads located in the exons of the locus/transcript
and is normalized to the length of the locus/transcript and the density of the data set.
The resulting matrix for NEs is then used as input for the principal component analysis as described.
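
The exact NE formula is not given here. As a rough sketch, the following assumes an RPKM-style normalization (reads per kilobase of exon length per million mapped reads) and takes the "density of the data set" to mean the total number of mapped reads. Both the formula and the function name normalized_expression are assumptions for illustration only.

```python
def normalized_expression(read_count, exon_length_bp, total_mapped_reads):
    """RPKM-style value (assumed formula): reads per kb of exon per million mapped reads."""
    if exon_length_bp <= 0 or total_mapped_reads <= 0:
        raise ValueError("length and total read count must be positive")
    length_kb = exon_length_bp / 1_000
    reads_in_millions = total_mapped_reads / 1_000_000
    return read_count / length_kb / reads_in_millions

# Made-up example: 500 reads in 2 kb of exons, 10 million mapped reads in the sample.
ne = normalized_expression(read_count=500, exon_length_bp=2_000,
                           total_mapped_reads=10_000_000)
```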

Input data containing the input regions are accepted in
BED / bigBed file format or
BAM file format.
For some tasks BAM support might not be available.
The maximum number of input regions and their maximum length can differ for the various tasks.
The limits are usually shown at the top of the input pages.

Within this section you can either

choose from previously uploaded BED/BAM files

or add a new BED or BAM file to the list (by clicking "Add BED/BAM file...")

For those tasks that allow replicate data to be chosen as input, you can use the shift/ctrl keys to select multiple files
from the list. All selected files will then be treated as replicates.

When adding a new file, a new window will open, asking you to either

upload one or several BED/BAM files from your local computer

or import one or several BED/BAM files from the GMS

or import one or several BED/BAM files from the GGA

For the new BED/BAM files, you will have to select the correct organism, as the
organism and the genome build are associated with each file for future use
(the default is your latest choice in the current session).
Note that the files critically depend on the underlying genome build,
which can be changed by selecting a different ElDorado version at the top right of the page
before uploading a file.
The list of available genomes can be viewed in ElDorado.

Note that almost all browsers have a general upload limit of 2 GB,
so files bigger than this size should be zipped before uploading from your local computer.
This restriction does not apply when using the direct import from the GGA/GMS.

Optionally you can specify a name for saving uploaded files on the server,
otherwise the name of the uploaded file will be used.
If several files are uploaded, the string given here will be used as prefix for each file name.

If any of the regions in the input file cannot be completely assigned to the selected genome
(e.g. wrong chromosome numbering or wrong positions within a chromosome),
an error message will appear and the regions will be skipped. If no valid region is found in an uploaded file,
the complete file will be skipped.
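
The kind of check described above can be sketched as follows. This is only an illustration under assumptions: the chromosome sizes are made up, the function name filter_bed_regions is hypothetical, and the actual validation performed on the server may differ.

```python
def filter_bed_regions(bed_lines, chrom_sizes):
    """Split BED lines into (valid, skipped); BED uses 0-based, half-open coordinates."""
    valid, skipped = [], []
    for line in bed_lines:
        text = line.strip()
        if not text or text.startswith(("#", "track", "browser")):
            continue  # blank lines and header lines carry no region
        fields = text.split()
        try:
            chrom, start, end = fields[0], int(fields[1]), int(fields[2])
        except (IndexError, ValueError):
            skipped.append(text)  # fewer than three fields or non-numeric positions
            continue
        size = chrom_sizes.get(chrom)
        if size is None or not (0 <= start < end <= size):
            skipped.append(text)  # unknown chromosome name or position outside it
        else:
            valid.append((chrom, start, end))
    return valid, skipped

# Hypothetical genome with two short chromosomes.
sizes = {"chr1": 1000, "chr2": 500}
lines = ["chr1\t100\t200", "chr1\t900\t1200", "1\t10\t20", "chr2\t50\t40"]
valid, skipped = filter_bed_regions(lines, sizes)
file_is_usable = bool(valid)  # a file without any valid region would be skipped
```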

After one or several BED/BAM files have been uploaded successfully and the popup window has been closed,
the list of available BED/BAM files will be updated automatically.

Uploaded BED or BAM files can be deleted from the project at any time via the
project management.

Options

rlog Transformation:
Transform the count data matrix with the DESeq2 rlog function [default].
Alternatively, the normalized expression values (NEs) are used for PCA.

Parameters for PCA

Number of Groups:
Here you can select the number of groups for your samples
(e.g. the 3 groups control, treatment 1, and treatment 2), with a maximum of 13 groups.

Group properties:
A box will appear for each group: here, you can use drag and drop
to assign the samples from the available-files-list above to your groups.
You can also edit the group names by clicking on the little pencil icon,
and select the color that will be used for the specified group in the output graphics.

Transcript/Locus

The expression analysis can be based on different units of underlying data:

Locus-based expression analysis:
The exons of all transcripts with the same GeneID within a Genomatix locus are taken
together and this "gene body" is used for counting reads
(i.e. reads in overlapping exons of transcripts within the same locus are counted once)

Transcript-based expression analysis:
All transcripts are considered separately when counting reads in exons
(and reads within overlapping transcripts/exons might be counted several times)
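
The difference between the two counting modes can be sketched as follows. For simplicity each read is reduced to a single position, and the exon coordinates and transcript names are made up. The helper names merge_intervals and count_reads are hypothetical, and the actual read assignment rules of the program are not specified here.

```python
def merge_intervals(intervals):
    """Merge overlapping or adjacent half-open intervals into a 'gene body'."""
    merged = []
    for start, end in sorted(intervals):
        if merged and start <= merged[-1][1]:
            merged[-1][1] = max(merged[-1][1], end)
        else:
            merged.append([start, end])
    return [tuple(interval) for interval in merged]

def count_reads(read_positions, exons):
    """Count reads whose position falls into at least one exon."""
    return sum(any(s <= pos < e for s, e in exons) for pos in read_positions)

# Two made-up transcripts of one locus with partly overlapping exons.
transcripts = {"tx_a": [(100, 200), (300, 400)],
               "tx_b": [(150, 250), (300, 400)]}
reads = [120, 160, 220, 350, 500]

# Locus-based: exons of all transcripts are merged, each read is counted once.
gene_body = merge_intervals([ex for exons in transcripts.values() for ex in exons])
locus_count = count_reads(reads, gene_body)

# Transcript-based: each transcript is counted separately.
transcript_counts = {name: count_reads(reads, exons)
                     for name, exons in transcripts.items()}
```

In this toy case the per-transcript counts add up to more than the locus count, because reads in shared exons contribute to both transcripts.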

If the transcript-based expression analysis is checked, the transcripts used for expression analysis
can additionally be constrained by their source (e.g. NCBI RefSeq).
By default, all non-redundant transcripts available in ElDorado are used.
Depending on the organism, several transcript sources are available.
For example, human and mouse transcripts are available from

NCBI RefSeq

Ensembl

NCBI GenBank

For plants, additional sources may be available (e.g. Phytozome for Glycine max).

An email with the URL of the results will be sent
to the email address provided by the user when the analysis is finished.

The results will be available for a limited time on our server.
For details of how long your results will be kept please see the result-email.
After that period they will be deleted unless protected in the project management!

The score plot displays each sample in the data set with respect to the first
two principal components and can therefore be used to interpret the relations among the samples.
This information can be used to identify outliers.

In the example below, the replicate samples show a high similarity with respect to the first two principal components,
a small within-group variance and a good separation of groups.

The scree plot visualizes which principal components account for which fraction of the
total variance in the data. The principal components are listed in decreasing order
of contribution to the total variance. The bars show the proportion of variance represented
by each component (R2) and the points show the cumulative variance (R2cum).
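
The two quantities in the scree plot can be computed as in the following sketch. It assumes that the variances of all components are supplied, so that their sum equals the total variance. The input values are made up and the function name scree_values is hypothetical.

```python
from itertools import accumulate

def scree_values(component_variances):
    """Return (R2, R2cum): variance fractions and their running sum, largest first."""
    ordered = sorted(component_variances, reverse=True)
    total = sum(ordered)  # assumption: all components are given
    r2 = [variance / total for variance in ordered]
    r2cum = list(accumulate(r2))
    return r2, r2cum

# Made-up component variances.
r2, r2cum = scree_values([3.0, 6.0, 1.0])
```

If only the first few components are available, the total variance of the data has to be supplied separately, and R2cum will then stay below 1.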

The loadings plot is a plot of the relationship between original variables (genes)
and subspace dimensions. It allows the identification of genes that are most strongly
correlated or anti-correlated with the first two principal components.
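
Reading a loadings plot amounts to ranking genes by the absolute value of their loading on a component. The sketch below uses made-up loading values and placeholder gene names. The function name top_loading_genes is hypothetical.

```python
def top_loading_genes(loadings, component, n=2):
    """Return the n genes with the largest absolute loading on one component."""
    ranked = sorted(loadings.items(),
                    key=lambda item: abs(item[1][component]),
                    reverse=True)
    return [(gene, values[component]) for gene, values in ranked[:n]]

# Made-up loadings: gene -> (loading on PC1, loading on PC2).
loadings = {
    "geneA": (0.70, 0.05),
    "geneB": (-0.65, 0.10),
    "geneC": (0.10, 0.80),
    "geneD": (0.02, -0.55),
}
top_pc1 = top_loading_genes(loadings, component=0)
top_pc2 = top_loading_genes(loadings, component=1)
```

A positive sign indicates correlation with the component and a negative sign indicates anti-correlation, so both ends of the ranking are of interest.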