How to use this tool?

Bulk segregant analysis (BSA) coupled to high throughput sequencing
is a powerful method to map genomic regions related with phenotypes
of interest. It relies on crossing two parents, one inferior and one
superior for a trait of interest. Segregants displaying the trait of
the superior parent are pooled, the DNA extracted and sequenced.
Genomic regions linked to the trait of interest are identified by
searching the pool for overrepresented alleles that normally originate
from the superior parent. BSA data analysis is non-trivial due to
sequencing, alignment and screening errors.

Create an account

First, you can create your acount or login using the buttons on the
top right corner:

If you do not want to create an account, don't worry, it is not required.
But you will need to store the results URL to be able to access them again.

Create a new Experiment

Step 1: Experiment Information

After cliking the "New Experiment" button on the main page, a form will be displayed
to add the information of the BSA experiment. A name, a description,
the recombination rate of the organism and the number of segregants:

Recombination Rate: The known or expected rate of recombination in the chromosomes (e.g. 0.0000035 for Yeast)

No. of Segregants: The number of segregants sequenced on the Pool (e.g. 30 for the test data)

After adding the experiment click the "Next" button to continue to the next step.

Step 2: Upload count data

Variant Call Format (vcf):

You can upload a standard .vcf file and enter the name of the sample that want to analysed. If the name is empty, the first sample will be analysed.

In the test .vcf there is only one sample (i.e. Pool1), but it could contain information about more samples (e.g. parent genotyping or other pools data).
If your have a .vcf file with information from multiple experiments, this fields becomes useful.

Tab Delimited Format:

You can upload a tab delimited file with the following format:

ChromosomepositiontotalCountaltCount

You can see an example downloading this file. We normally use alt counts based on the
assumption that that the reference used for the DNA mapping is the inferior parent. Therefore, alt counts are the supperior parent counts
at the marker sites.

Step 3: Select EXPLoRA parameters

EXPLoRA is a Hidden Markov Model (HMM) that models the effects of linkage disequilibrium
to explain the dependencies between neighboring variant frequencies.
EXPLoRA models the relationship between a genomic variant and the phenotype of interest as
a hidden state and use beta-binomial distributions to calculate emission probabilities.

A: marker sites are modeled to be in a neutral state (N-state, blue circles) or in a state of being linked to the
phenotype of interest (P-state, orange circles) based on its relative variant frequency in the pool of segregants.
B: emission probabilities for the neutral (blue curve) and the phenotype-linked states (orange line) as a
function of the variant frequencies, modeled by a beta-binomial distribution with parameters α and β.
C: transition probability as a function of the physical distance between neighboring marker sites.

In this step you need to select different values to the parameters of the beta-distributions that model the emission probabilities.
As default parameters we use three different values: α= 5, β=1 is meant to have high sensitivity.α=10, β=1 is meant to be a middle ground between sensitivity and specificity.α=30, β=1 is meant to have high specificty.

The ratio between alfa and beta: You can run the tool for different ratio's. The higher alfa becomes, the more stringent the
search i.e.
the more the effect of the QTL needs to be (the more overrepresented the number of superior allele counts versus the total number of
allele counts at a marker site. Regions that will be detected with the most stringent setting will also be detected with
less stringent settings. The more stringent the setting at which a region can still be detected, the more reliably the QTL association
is. Note also that increasing the stringency of the setting allows to decrease the region that associates: because of LD,
the number of markers that correlate with the truly associating marker site decreases with the distance from the truly associating
site. The more stringent the setting the less 'correlating markers' still meet the criteria to be selected.

Additional parameters can be added. The parameters can be easily modified by moving the horizontal sliding bars.

Changing the parameters allow to see how they relate to the background frequency distribution (shown in Black)

You should select parameters so that the mass of the distribution representing the neutral emission probability is leftsided from the mass
represented by the distribution describing the emission probabilities in the phenotype linked sites.
The smaller the intersection between the neutral and the phenotype linked emission probability distributions, the more stringent the selection criteria.
Check these simulations results (here and
here) for more information about the performance
of the methods with different parameters and scenarios.

Alpha Field

Beta Field

Alpha Field

Beta Field

Alpha Field

Beta Field

Undesrtanding the parameters

EXPLoRA models for each marker site, two possible states: one state (P-state) expresses that the variants in the pool at
that marker site originate predominantly (but not always in all segregants) from the superior parent and are thus linked
to the phenotype of interest. A second state (N-state) models that the variants in the pool at a given marker site originate
to an equal extent from either parent, in which case the marker site is assumed to be located in a neutral region not linked
to the phenotype of interest.

The effect of linkage disequilibrium is modeled by the transition probabilities t between two neighboring marker sites.
The transition probability t models the chance that a neighboring site remains in the same state as its preceding site state.
Its distribution is described by a negative exponential model as a function of the recombination rate r and the physical distance
between neighboring marker sites. The probability to change states upon transition from one marker site to its direct neighboring
marker site (from a neutral N-state to a phenotype-linked P-state or vice versa) is then described by 1-t and takes into account
the true distance between them (i.e. no distance binning is involved). The model captures the fact that marker sites located in
each other's physical neighborhood are likely to be in linkage disequilibrium and less likely to change their state (from P to N or from N to P).
Optimally the parameter r should approximate the true recombination rate in the organism of interest. When r is overestimated marker
phenotype associations are treated indepdently and the effect of exploiting LD to increase the power of the analysis is lost.

Each state in the model emits a random variable nA, corresponding to the number of variant counts at a given marker site originating
from the superior parent. nA ranges from 0 to n, with n being equal to the (known) total variant count for the marker site. nA is
described by a beta binomial distribution, which allows capturing different emission probabilities in phenotype linked versus neutral
states by choosing different α and β parameters for their corresponding distributions. We modeled all neutral states with
the parameters αN and βN, and all phenotype-linked states with the parameters αP and βP. While for the neutral states αN should almost
equal βN to make values of nA closer to n/2 more likely to be sampled, for the phenotype-linked states αP should be much larger than βP
to make values of nA close to n more likely to be sampled. In the application the parameters αN and βN of the emission probabilities
in the neutral state can be estimated directly from the observed variant frequencies as described in Pulido et al. The ratio between αP
and βP defines the degree to which the relative variant frequency at a marker site needs to differ from the one obtained through
random inheritance for it to be called linked to the phenotype (stringency of the method). Changing the ratio affects the probability
with which an observed relative variant frequency is interpreted by the model as a phenotype linked region. In the application
the ratio between αP and βP can be altered by fixing βP equal to 1 and testing different values of αP.

After the parameters have been selected, submit the experiment for processing.

Results

The results of EXPLoRA is shown as a posterior probability of being a phenotype-linked marked. The results are presented
for each set of parameters (α β ratio) selected in Step 3 in a different color.

Notice that for parameter settings too stringent the results will be a posterior probability of zero in all marker sites.
If this is happening in your sample, please use the Edit Experiment button where new sets of more relaxed
parameters can be selected.

The figure show in the X-axis the position in the chrmosome and in the Y-axis the posterior probability.

The results can be downloaded in two different formats:

Marker site information: as a coma separated (.csv) file
listing the posterior distributions per marker and parameter setting

Detected regions: a BED
file showing the regions linked to a phenotype detected.
The BED file can easily be used to further analysis by importing it into other tools like the
Integrative Genomic Viewer