Analysing the sample

Once a sample has been constructed it can be analysed. ICECUP includes
a powerful statistical ‘knowledge discovery’ tool which
can explore many combinations of variables quietly in the background.
The tool generates independent hypotheses about the sample which
are sent to a new ‘hypothesis panel’ in the sample viewer.

Every hypothesis is tested for statistical significance. If it
is significant, it is then scored according to a number of factors
(described below). The best-scoring hypotheses are then reported.
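
The test-then-score-then-report sequence can be sketched in Python as follows. This is a minimal illustration and not ICECUP's own code: the text does not say which significance test is used, so a 2 x 2 chi-square test at p = 0.05 is assumed here, and the names Hypothesis, chi_square_2x2 and report_best are hypothetical.

```python
from dataclasses import dataclass

# Chi-square critical value for 1 degree of freedom at p = 0.05.
# Assumption: the source does not name the test or the error level.
CHI2_CRITICAL = 3.841


@dataclass
class Hypothesis:
    label: str      # hypothetical description of the hypothesis
    table: tuple    # 2 x 2 counts (a, b, c, d), see chi_square_2x2
    utility: float  # score already assigned to the hypothesis


def chi_square_2x2(a, b, c, d):
    """Pearson chi-square for the table [[a, b], [c, d]], no continuity correction.

    a = covered and in target,     b = covered and not in target,
    c = not covered and in target, d = not covered and not in target.
    """
    n = a + b + c + d
    denom = (a + b) * (c + d) * (a + c) * (b + d)
    if denom == 0:
        return 0.0  # an empty row or column carries no evidence
    return n * (a * d - b * c) ** 2 / denom


def report_best(hypotheses, top_k=2):
    """Drop non-significant hypotheses, then return the top_k by utility."""
    significant = [
        h for h in hypotheses if chi_square_2x2(*h.table) > CHI2_CRITICAL
    ]
    return sorted(significant, key=lambda h: h.utility, reverse=True)[:top_k]
```

In this sketch a hypothesis with a high utility score is still discarded if it fails the significance test, which matches the order of operations described above.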

In this example, ICECUP has found two hypotheses, which show (a) that
the transitivity of the verb phrase has an impact on the dependent
variable, and (b) that, in particular, copular and ‘trans’ cases are
reliable predictors for the form being relative.

The statistics reported for each hypothesis are listed below.
Hypotheses are rated for utility, which is calculated as a
combination of four factors: coverage, fitness, accuracy
and swing. Once there is a measure of which hypotheses
are “better” than others, the discovery algorithm can
prioritise its search. A more complicated hypothesis is considered
only if it improves on a less complex one.
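
One way to picture the last rule is a search that starts from single-condition hypotheses and extends one only when the extension scores higher than the hypothesis it was built from. The Python sketch below rests on several assumptions that are not taken from ICECUP: a hypothesis is modelled as a set of conditions, score is any caller-supplied function returning a utility-like number, and discover is a hypothetical name.

```python
def discover(conditions, score, max_size=2):
    """Return (hypothesis, score) pairs, best first.

    A hypothesis is a frozenset of conditions. A hypothesis with more
    than one condition is kept only if it scores higher than at least
    one kept hypothesis with one condition fewer.
    """
    # Start with every single-condition hypothesis.
    frontier = {frozenset([c]): score(frozenset([c])) for c in conditions}
    kept = dict(frontier)

    for _ in range(max_size - 1):
        next_frontier = {}
        for parent, parent_score in frontier.items():
            for c in conditions:
                if c in parent:
                    continue
                child = parent | {c}
                child_score = score(child)
                # Keep the more complex hypothesis only if it improves.
                if child_score > parent_score:
                    next_frontier[child] = child_score
        kept.update(next_frontier)
        frontier = next_frontier  # only improvements are extended further

    return sorted(kept.items(), key=lambda item: item[1], reverse=True)
```

Because only improving hypotheses enter the next frontier, a branch that stops improving is never extended, which keeps the number of combinations examined manageable.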

label      summary               explanation
---------  --------------------  -----------------------------------------------------
+ve        true positives        number of cases correctly predicted by the hypothesis
-ve        false positives       number of cases incorrectly predicted by hypothesis
coverage   proportion covered    cases covered by hypothesis / total cases
fitness    inverse accuracy      positive examples / cases in target value
accuracy   accuracy              positive examples / cases covered by hypothesis
swing      accuracy improvement  accuracy - (cases in target / total cases), scaled
utility    composite measure     weighted product of coverage x fitness x… swing

What the hypothesis statistics mean
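
As a worked illustration of these definitions, the following Python sketch computes coverage, fitness, accuracy and swing from raw counts. It is an assumption-laden sketch and not ICECUP's code: the function name hypothesis_stats and its parameter names are hypothetical, swing is left unscaled because the scaling is not specified here, and utility is omitted because its weights are not given.

```python
def hypothesis_stats(true_pos, false_pos, target_total, total):
    """Compute the per-hypothesis statistics from raw counts.

    true_pos     -- cases correctly predicted by the hypothesis (+ve)
    false_pos    -- cases incorrectly predicted by the hypothesis (-ve)
    target_total -- all cases in the sample that have the target value
    total        -- all cases in the sample
    """
    covered = true_pos + false_pos
    if covered == 0 or target_total == 0 or total == 0:
        raise ValueError("counts must be non-zero to form the ratios")

    coverage = covered / total         # cases covered / total cases
    fitness = true_pos / target_total  # positive examples / cases in target
    accuracy = true_pos / covered      # positive examples / cases covered
    prior = target_total / total       # cases in target / total cases
    swing = accuracy - prior           # unscaled (scaling is an open assumption)

    return {
        "coverage": coverage,
        "fitness": fitness,
        "accuracy": accuracy,
        "swing": swing,
    }


stats = hypothesis_stats(true_pos=30, false_pos=10, target_total=50, total=200)
```

With these invented counts the hypothesis covers 40 of 200 cases (coverage 0.2), finds 30 of the 50 target cases (fitness 0.6), is right in 30 of its 40 predictions (accuracy 0.75), and improves on the 0.25 base rate by 0.5 (swing, before any scaling).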

ICECUP can help you evaluate these hypotheses in terms of the cases
they cover in the corpus.