5.1.3 Functions

ANOVA-ONE-WAY-VARIABLES (IV DV &OPTIONAL (SCHEFFE-TESTS-P T)
CONFIDENCE-INTERVALS)
Performs a one-way analysis of variance (ANOVA) on the input data, which
should be two equal-length sequences: ‘iv’ is the independent variable,
represented as a sequence of categories or group identifiers, and ‘dv’ is the
dependent variable, represented as a sequence of numbers. The ‘iv’ variable
must be “sorted,” meaning that AAABBCCCCCDDDD is okay but ABCDABCDABDCDC is
not, where A, B, C and D are group identifiers. Furthermore, each group should
consist of at least 2 elements.

The significance of the result indicates that the group means are not all equal;
that is, at least two of the groups have significantly different means. If
there were only two groups, this would be semantically equivalent to an
unmatched, two-tailed t-test, so you can think of the one-way ANOVA as a
multi-group, two-tailed t-test.

This function returns five values: 1. an ANOVA table; 2. a list of group
means; 3. either a Scheffe table or nil, depending on ‘scheffe-tests-p’; 4. an
alternate value for SST; and 5. a list of confidence intervals in the form
‘(,mean ,lower ,upper)’ for each group, if ‘confidence-intervals’ is a number
between zero and one giving the kind of confidence interval, such as 0.9. The
fourth value is only interesting if you think there are numerical accuracy
problems; it should be approximately equal to the SST value in the ANOVA
table. This function differs from ‘anova-one-way-groups’ only in its input
representation. See the manual for more information.
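
For illustration, here is a sketch of the underlying computation in Python, not the CLASP API; the function name and the group-of-groups input representation are illustrative only. It computes the F statistic as the ratio of the between-group mean square to the within-group mean square.

```python
def one_way_anova_f(groups):
    """F statistic for a one-way ANOVA; `groups` is a sequence of
    sequences of numbers, one inner sequence per group."""
    n = sum(len(g) for g in groups)            # total observations
    k = len(groups)                            # number of groups
    grand_mean = sum(sum(g) for g in groups) / n
    # Between-group and within-group sums of squares
    ssg = sum(len(g) * (sum(g) / len(g) - grand_mean) ** 2 for g in groups)
    sse = sum((x - sum(g) / len(g)) ** 2 for g in groups for x in g)
    msg = ssg / (k - 1)                        # mean square between groups
    mse = sse / (n - k)                        # mean square within groups
    return msg / mse

# Three groups with clearly different means give a large F.
f = one_way_anova_f([[1, 2, 3], [4, 5, 6], [7, 8, 9]])
```

A large F (here, 27) would be referred to an F distribution with k-1 and n-k degrees of freedom to obtain the significance reported in the ANOVA table.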

ANOVA-TWO-WAY-VARIABLES (DV IV1 IV2)
Calculates the analysis of variance when there are two factors that may
affect the dependent variable, specifically ‘iv1’ and ‘iv2.’ Unlike the one-way
ANOVA, there are mathematical difficulties with the two-way ANOVA if there are
unequal cell sizes; therefore, we require all cells to be the same size; that
is, the same number of values (of the dependent variable) for each combination
of the independent factors.

The result of the analysis is an anova-table, as described in the manual. This
function differs from ‘anova-two-way-groups’ only in its input representation.
See the manual for further discussion of analysis of variance.
The row effect is ‘iv1’ and the column effect is ‘iv2.’

ANOVA-TWO-WAY-VARIABLES-UNEQUAL-CELL-SIZES (IV1 IV2 DV)
Calculates the analysis of variance when there are two factors that may
affect the dependent variable, specifically ‘iv1’ and ‘iv2.’

Unlike the one-way ANOVA, there are mathematical difficulties with the two-way
ANOVA if there are unequal cell sizes. This function differs from the standard
two-way ANOVA by (1) the use of cell means as single scores, (2) the division of
squared quantities by the number of cell means contributing to the quantity
that is squared and (3) the multiplication of a "sum of squares" by the harmonic
mean of the sample sizes.

The result of the analysis is an anova-table, as described in the manual.
See the manual for further discussion of analysis of
variance. The row effect is ‘iv1’ and the
column effect is ‘iv2.’

AUTOCORRELATION (SAMPLE MAX-LAG &OPTIONAL (MIN-LAG 0))
Autocorrelation is merely a cross-correlation between a sample and itself.
This function returns a list of correlations, where the i’th element is the
correlation of the sample with itself shifted by a lag of ‘min-lag’ plus i.
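
The computation can be sketched in Python as follows; this illustrates the math, not the CLASP implementation, and the helper names are illustrative.

```python
def correlation(xs, ys):
    # Pearson correlation of two equal-length lists
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / (sxx * syy) ** 0.5

def autocorrelation(sample, max_lag, min_lag=0):
    # Correlate the sample with itself at each lag from min_lag to max_lag
    return [correlation(sample[:len(sample) - lag], sample[lag:])
            for lag in range(min_lag, max_lag + 1)]
```

A strictly increasing sample correlates perfectly with itself at every lag, so `autocorrelation([1, 2, 3, 4, 5, 6], 1)` yields correlations of 1.0 at lags 0 and 1.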

Suppose an event occurs with probability ‘p’ per trial. This function
computes the probability of ‘k’ or more events occurring in ‘n’ trials. Note
that this is the complement of the usual definition of cdf. This function
approximates the actual computation using the incomplete beta function, but is
preferable for large ‘n’ (greater than a dozen or so) because it avoids
summing many tiny floating-point numbers.

Returns the binomial coefficient, ‘n’ choose ‘k,’ as an integer. The result
may not be exactly correct, since the computation is done with logarithms. The
result is rounded to an integer. The implementation follows Numerical Recipes
in C, section 6.1.
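
The same log-gamma approach is easy to sketch in Python (an illustration of the technique, not the CLASP code):

```python
import math

def binomial_coefficient(n, k):
    # n choose k computed via logarithms of the gamma function and then
    # rounded, as in the Numerical Recipes approach described above.
    return round(math.exp(math.lgamma(n + 1)
                          - math.lgamma(k + 1)
                          - math.lgamma(n - k + 1)))
```

For example, `binomial_coefficient(10, 3)` is 120. For very large arguments the floating-point result can drift far enough that rounding no longer recovers the exact integer, which is the caveat noted above.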

Returns the probability of ‘k’ successes in ‘n’ trials, where at each trial
the probability of success is ‘p.’ This function uses floating-point
approximations, and so is computationally efficient but not necessarily exact.

Computes the complement of the cumulative distribution function for a
Chi-square random variable with ‘dof’ degrees of freedom evaluated at ‘x.’ The
result is the probability that the observed chi-square for a correct model
should be greater than ‘x.’ The implementation follows Numerical Recipes in C,
section 6.2. Small values suggest that the null hypothesis should be rejected;
in other words, this computes the significance of ‘x.’

CONFIDENCE-INTERVAL-PROPORTION (X N CONFIDENCE)
Suppose we have a sample of ‘n’ things and ‘x’ of them are “successes.” We
can estimate the population proportion of successes as x/n; call it ‘p-hat.’
This function computes the estimate and a confidence interval on it. This
function is not appropriate for small samples with p-hat far from 1/2: ‘x’
should be at least 5, and so should ‘n’-‘x.’ This function returns three values:
p-hat, and the lower and upper bounds of the confidence interval. ‘Confidence’
should be a number between 0 and 1, exclusive.
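
The normal-approximation version of this interval can be sketched in Python; this is an illustration under the stated large-sample assumption, and may differ in detail (e.g. continuity correction) from the CLASP implementation.

```python
from statistics import NormalDist

def confidence_interval_proportion(x, n, confidence):
    # Normal approximation to the sampling distribution of p-hat = x/n;
    # requires x >= 5 and n - x >= 5, as noted above.
    p_hat = x / n
    z = NormalDist().inv_cdf((1 + confidence) / 2)   # two-sided critical value
    half_width = z * (p_hat * (1 - p_hat) / n) ** 0.5
    return p_hat, p_hat - half_width, p_hat + half_width
```

With 40 successes in 100 trials at 95 percent confidence, this gives p-hat of 0.4 with an interval of roughly (0.304, 0.496).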

CONFIDENCE-INTERVAL-T (DATA CONFIDENCE)
Suppose you have a sample of 10 numbers and you want to compute a 90 percent
confidence interval on the population mean. This function is the one to use.
This function uses the t-distribution, and so it is appropriate for small sample
sizes. It can also be used for large sample sizes, but the function
‘confidence-interval-z’ may be computationally faster. It returns three values:
the mean and the lower and upper bound of the confidence interval. True, only
two numbers are necessary, but the confidence intervals of other statistics may
be asymmetrical and these values would be consistent with those confidence
intervals. ‘Sample’ should be a sequence of numbers. ‘Confidence’ should be a
number between 0 and 1, exclusive.

This function is just like ‘confidence-interval-t,’ except that instead of
its arguments being the actual data, it takes the following summary statistics:
‘mean,’ which is the estimator of some t-distributed parameter; ‘dof,’ which is
the number of degrees of freedom in estimating the mean; and the
‘standard-error’ of the estimator. In general, ‘mean’ is a point estimator of
the mean of a t-distribution, which may be the slope parameter of a regression,
the difference between two means, or other practical t-distributions.
‘Confidence’ should be a number between 0 and 1, exclusive.

CONFIDENCE-INTERVAL-Z (DATA CONFIDENCE)
Suppose you have a sample of 50 numbers and you want to compute a 90 percent
confidence interval on the population mean. This function is the one to use.
Note that it makes the assumption that the sampling distribution is normal, so
it’s inappropriate for small sample sizes. Use confidence-interval-t instead.
It returns three values: the mean and the lower and upper bound of the
confidence interval. True, only two numbers are necessary, but the confidence
intervals of other statistics may be asymmetrical and these values would be
consistent with those confidence intervals. This function handles 90, 95 and 99
percent confidence intervals as special cases, so those will be quite fast.
‘Sample’ should be a sequence of numbers. ‘Confidence’ should be a number
between 0 and 1, exclusive.
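
For illustration, the normal-theory interval can be sketched in Python (this shows the math, not the CLASP special-casing of 90, 95 and 99 percent):

```python
from statistics import NormalDist, mean, stdev

def confidence_interval_z(data, confidence):
    # Normal-theory confidence interval on the population mean;
    # appropriate for large samples, per the description above.
    m = mean(data)
    se = stdev(data) / len(data) ** 0.5          # standard error of the mean
    z = NormalDist().inv_cdf((1 + confidence) / 2)
    return m, m - z * se, m + z * se
```

The returned values are the mean and the lower and upper bounds, matching the three-value convention described above.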

Computes the correlation of two variables given summary statistics of the
variables. All of these arguments are summed over the variable: ‘x’ is the sum
of the x’s, ‘x2’ is the sum of the squares of the x’s, and ‘xy’ is the sum of
the cross-products, which is also known as the inner product of the variables x
and y. Of course, ‘n’ is the number of data values in each variable.
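
The formula in terms of these sums can be sketched in Python; the function name and argument order here are illustrative, not the CLASP signature.

```python
import math

def correlation_from_summaries(x, x2, y, y2, xy, n):
    # Pearson r from summary sums:
    #   r = (n*Sxy - Sx*Sy) / sqrt((n*Sx2 - Sx^2) * (n*Sy2 - Sy^2))
    num = n * xy - x * y
    den = math.sqrt((n * x2 - x * x) * (n * y2 - y * y))
    return num / den
```

For the perfectly linear data x = (1 2 3), y = (2 4 6), the sums are 6, 14, 12, 56, 28 with n = 3, and the correlation comes out exactly 1.0.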

COVARIANCE (SAMPLE1 SAMPLE2 &KEY START1 END1 START2 END2)
Computes the covariance of two samples, which should be equal-length
sequences of numbers. Covariance is the inner product of differences between
sample elements and their sample means. For more information, see the manual.

CROSS-CORRELATION (SEQUENCE1 SEQUENCE2 MAX-LAG &OPTIONAL (MIN-LAG 0))
Returns a list of the correlation coefficients for all lags from
‘min-lag’ to ‘max-lag,’ inclusive, where the ‘i’th list element is the
correlation of the first (length-of-sequence1 - i) elements of
sequence1 with the last i elements of sequence2. Both sequences
should be sequences of numbers and of equal length.

D-TEST (SAMPLE-1 SAMPLE-2 TAILS &KEY (TIMES 1000) (H0MEAN 0))
Two-sample test for difference in means. Competes with the unmatched,
two-sample t-test. Each sample should be a sequence of numbers. We calculate
the mean of ‘sample-1’ minus the mean of ‘sample-2’; call that D. Under the null
hypothesis, D is zero. There are three possible alternative hypotheses: D is
positive, D is negative, and D is either, and they are selected by the ‘tails’
parameter, which must be :positive, :negative, or :both, respectively. We count
the number of chance occurrences of D in the desired rejection region, and
return the estimated probability.

DATA-LENGTH (DATA &KEY START END KEY)
Returns the number of data values in ‘data.’ Essentially, this is the Common
Lisp ‘length’ function, except it handles sequences where there is a ‘start’ or
‘end’ parameter. The ‘key’ parameter is ignored.

This function computes the complement of the error function, “erfc(x),”
defined as 1-erf(x). See the documentation for ‘error-function’ for a more
complete definition and description. Essentially, applying this function to
z/sqrt(2) returns the two-tailed significance of z in a standard Gaussian
distribution.

This function implements the function that Numerical Recipes in C calls erfcc,
see section 6.3; that is, it’s the one using the Chebyshev approximation, since
that is the one they call from their statistical functions. It is quick to
compute and has fractional error everywhere less than 1.2x10^-7.
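
The relationship to the two-tailed Gaussian significance can be checked in Python, where `math.erfc` plays the role of this function:

```python
import math
from statistics import NormalDist

# Two-tailed significance of z = 1.96 in a standard Gaussian, computed two
# ways: via erfc(z / sqrt(2)) and via the normal CDF directly.
z = 1.96
via_erfc = math.erfc(z / math.sqrt(2))
via_cdf = 2 * (1 - NormalDist().cdf(z))
```

Both routes give approximately 0.05, the familiar two-tailed significance of z = 1.96.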

This function occurs in the statistical test of whether two observed samples
have the same variance. A certain statistic, F, essentially the ratio of the
observed dispersion of the first sample to that of the second one, is
calculated. This function computes the tail areas of the null hypothesis: that
the variances of the numerator and denominator are equal. It can be used for
either a one-tailed or two-tailed test. The default is two-tailed, but
one-tailed can be computed by setting the optional argument ‘one-tailed-p’ to
true.

For a two-tailed test, this function computes the probability that F would be as
different from 1.0 (larger or smaller) as it is, if the null hypothesis is
true.

For a one-tailed test, this function computes the probability that F would be as
LARGE as it is if the first sample’s underlying distribution actually has
SMALLER variance than the second’s, where ‘numerator-dof’ and ‘denominator-dof’
are the numbers of degrees of freedom in the numerator sample and the
denominator sample. In other words, this computes the significance level at
which the
hypothesis “the numerator sample has smaller variance than the denominator
sample” can be rejected.

A small numerical value implies a very significant rejection.

The ‘f-statistic’ must be a non-negative floating-point number. The degrees of
freedom arguments must be positive integers. The ‘one-tailed-p’ argument is
treated as a boolean.

This implementation follows Numerical Recipes in C, section 6.3 and the ‘ftest’
function in section 13.4. Some of the documentation is also drawn from their
section 6.3, since I couldn’t improve on their explanation.

Returns the factorial of ‘n,’ which should be a non-negative integer. The
result will be returned as a floating-point number, single-float if possible,
otherwise double-float. If it is returned as a double-float, it won’t
necessarily be integral, since the actual computation is

(exp (gamma-ln (1+ n)))

Implementation is loosely based on Numerical Recipes in C, section 6.1. On the
TI Explorer, the largest argument that won’t cause a floating overflow is 170.

Returns the factorial of ‘n,’ which should be an integer. The result will be
returned as an integer or bignum. This implementation is exact, but is more
computationally expensive than ‘factorial,’ which is to be preferred.

Returns the natural logarithm of the Gamma function evaluated at ‘x.’
Mathematically, the Gamma function is defined to be the integral from 0 to
Infinity of t^x exp(-t) dt. The implementation is copied, with extensions for
the reflection formula, from Numerical Recipes in C, section 6.1. The argument
‘x’ must be positive. Full accuracy is obtained for x>1. For x<1, the
reflection formula is used. The computation is done using double-floats, and
the result is a double-float.

Computes the cumulative distribution function for a Gaussian random variable
(defaults: mean=0.0, s.d.=1.0) evaluated at ‘x.’ The result is the probability
of getting a random number less than or equal to ‘x,’ from the given Gaussian
distribution.

Computes the significance of ‘x’ in a Gaussian distribution with mean=‘mean’
(default 0.0) and standard deviation=‘sd’ (default 1.0); that is, it returns
the area of the distribution that lies farther from the mean than ‘x.’

The null hypothesis is roughly that ‘x’ is zero; you must specify your
alternative hypothesis (H1) via the ‘tails’ parameter, which must be :both,
:positive or :negative. The first corresponds to a two-tailed test: H1 is that
‘x’ is not zero, but you are not specifying a direction. If the parameter is
:positive, H1 is that ‘x’ is positive, and similarly for :negative.

INTERQUARTILE-RANGE (DATA)
The interquartile range is similar to the variance of a sample because both
are statistics that measure how “spread out” a sample is. The interquartile
range is the difference between the 3/4 quantile (the upper quartile) and the
1/4 quantile (the lower quartile).
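
A Python sketch of the computation, assuming the linear-interpolation quantile scheme described under the ‘quantile’ function below (an illustration, not the CLASP code):

```python
def interquartile_range(data):
    # Difference between the 3/4 and 1/4 quantiles, with linear
    # interpolation between neighboring order statistics.
    xs = sorted(data)
    def q(p):
        pos = p * (len(xs) - 1)
        i = int(pos)
        frac = pos - i
        return xs[i] if frac == 0 else xs[i] + frac * (xs[i + 1] - xs[i])
    return q(0.75) - q(0.25)
```

For the data (1 2 3 4 5) the quartiles fall exactly on 2 and 4, so the interquartile range is 2.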

Returns the correlation of ‘sequence1’ with ‘sequence2’ after
shifting ‘sequence1’ by ‘lag’. This means that for all n, element n
of ‘sequence1’ is paired with element n+‘lag’ of ‘sequence2’, where
both of those elements exist.

Calculates the main statistics of a linear regression: the slope and
intercept of the line, the coefficient of determination, also known as r-square,
the standard error of the slope, and the p-value for the regression. This
function takes two equal-length sequences of raw data. Note that the dependent
variable, as always, comes first in the argument list.

You should first look at your data with a scatter plot to see if a linear model
is plausible. See the manual for a fuller explanation of linear regression
statistics.

Calculates the main statistics of a linear regression: the slope and
intercept of the line, the coefficient of determination, also known as r-square,
the standard error of the slope, and the p-value for the regression. This
function differs from ‘linear-regression-brief’ in that it takes summary
variables: ‘x’ and ‘y’ are the sums of the independent variable and dependent
variables, respectively; ‘x2’ and ‘y2’ are the sums of the squares of the
independent variable and dependent variables, respectively; and ‘xy’ is the sum
of the products of the independent and dependent variables.

You should first look at your data with a scatter plot to see if a linear model
is plausible. See the manual for a fuller explanation of linear regression
statistics.

Calculates the slope and intercept of the regression line. This function
differs from ‘linear-regression-minimal’ in that it takes summary statistics:
‘x’ and ‘y’ are the sums of the independent variable and dependent variables,
respectively; ‘x2’ and ‘y2’ are the sums of the squares of the independent
variable and dependent variables, respectively; and ‘xy’ is the sum of the
products of the independent and dependent variables.

You should first look at your data with a scatter plot to see if a linear model
is plausible. See the manual for a fuller explanation of linear regression
statistics.
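
The slope and intercept formulas in terms of these sums can be sketched in Python; the function name and argument order are illustrative, not the CLASP signature, and ‘y2’ is carried only for parity with the documented argument list.

```python
def linear_regression_minimal_summaries(x, y, x2, y2, xy, n):
    # Least-squares slope and intercept from summary sums:
    #   slope = (n*Sxy - Sx*Sy) / (n*Sx2 - Sx^2)
    #   intercept = (Sy - slope*Sx) / n
    # (y2 is needed only by the fuller -brief and -verbose variants)
    slope = (n * xy - x * y) / (n * x2 - x * x)
    intercept = (y - slope * x) / n
    return slope, intercept
```

For x = (1 2 3), y = (2 4 6), the sums 6, 12, 14, 56, 28 with n = 3 give slope 2 and intercept 0, as expected for y = 2x.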

Calculates almost every statistic of a linear regression: the slope and
intercept of the line, the standard error on each, the correlation coefficient,
the coefficient of determination, also known as r-square, and an ANOVA table as
described in the manual.

This function takes two equal-length sequences of raw data. Note that the
dependent variable, as always, comes first in the argument list. If you don’t
need all this information, consider using the “-brief,” or “-minimal”
functions, which do less computation.

You should first look at your data with a scatter plot to see if a linear model
is plausible. See the manual for a fuller explanation of linear regression
statistics.

Calculates almost every statistic of a linear regression: the slope and
intercept of the line, the standard error on each, the correlation coefficient,
the coefficient of determination, also known as r-square, and an ANOVA table as
described in the manual.

If you don’t need all this information, consider using the “-brief” or
“-minimal” functions, which do less computation.

This function differs from ‘linear-regression-verbose’ in that it takes summary
variables: ‘x’ and ‘y’ are the sums of the independent variable and dependent
variables, respectively; ‘x2’ and ‘y2’ are the sums of the squares of the
independent variable and dependent variables, respectively; and ‘xy’ is the sum
of the products of the independent and dependent variables.

You should first look at your data with a scatter plot to see if a linear model
is plausible. See the manual for a fuller explanation of linear regression
statistics.

Performs successive multiplications of the elements of ‘args’. If two
elements are scalars, their product is the ordinary product i * j. If a
scalar is multiplied by a matrix, each element of the matrix is multiplied
by the scalar. If two matrices are multiplied, standard matrix
multiplication is applied, and the ranks must be compatible: if ARGi is
rank a x b and ARGj is rank c x d, then b must be equal to c.

MAXIMUM (DATA &KEY START END KEY)
Returns the element of the sequence ‘data’ whose ‘key’ is maximum. Signals
‘no-data’ if there is no data. If there is only one element in the data
sequence, that element will be returned, regardless of whether it is valid (a
number).

MEDIAN (DATA &KEY START END KEY)
Returns the median of the subsequence of ‘data’ from ‘start’ to ‘end’, using
‘key’. The median is just the 0.5 quantile, and so this function returns the
same values as the ‘quantile’ function.

MINIMUM (DATA &KEY START END KEY)
Returns the element of the sequence ‘data’ whose ‘key’ is minimum. Signals
‘no-data’ if there is no data. If there is only one element in the data
sequence, that element will be returned, regardless of whether it is valid (a
number).

MODE (DATA &KEY START END KEY)
Returns the most frequent element of ‘data,’ which should be a sequence. The
algorithm involves sorting, and so the data must be numbers or the ‘key’
function must produce numbers. Consider ‘sxhash’ if no better function is
available. Also returns the number of occurrences of the mode. If there is
more than one mode, this returns the first mode, as determined by the sorting of
the numbers.
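
A sketch of the sorting-based approach in Python (an illustration of the algorithm described, not the CLASP code): sorting groups equal elements into runs, and the first longest run is the mode.

```python
def mode(data):
    # Most frequent element and its count; ties are broken in favor of
    # the first mode in sorted order, matching the description above.
    xs = sorted(data)
    best, best_count = xs[0], 0
    i = 0
    while i < len(xs):
        j = i
        while j < len(xs) and xs[j] == xs[i]:
            j += 1                      # advance past the run of equal values
        if j - i > best_count:
            best, best_count = xs[i], j - i
        i = j
    return best, best_count
```

For example, `mode([3, 1, 2, 2, 3, 2])` returns the pair (2, 3).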

This is an internal function for the use of the multiple-linear-regression
functions. It takes the lists of values given by CLASP and puts them into a
pair of arrays, A and b, suitable for solving the matrix equation Ax=b, to find
the regression equation. The values are A and b. The first column of A is the
constant 1, so that an intercept will be included in the regression model.

Let m be the number of independent variables, ‘ivs.’ This function returns a
vector of length m which are the coefficients of a linear equation that best
predicts the dependent variable, ‘dv,’ in the least squares sense. It also
returns, as the second value, the sum of squared deviations of the data from the
fitted model, aka SSE, aka chi-square. The third value is the number of degrees
of freedom for the chi-square, if you want to test the fit.

This function returns an intermediate amount of information. Consider using the
sibling functions -minimal and -verbose if you want less or more information.

Let m be the number of independent variables, ‘ivs.’ This function returns
a vector of length m which are the coefficients of a linear equation that best
predicts the dependent variable, ‘dv,’ in the least squares sense.

This function returns the minimal information for a least squares regression
model, namely a list of the coefficients of the ivs, with the constant term
first. Consider using the sibling functions -brief and -verbose if you want
more information.

Let m be the number of independent variables, ‘ivs.’ This function returns
fourteen values:
1. the intercept
2. a list of coefficients
3. a list of correlations of each iv to the dv and to each iv
4. a list of the t-statistic for each coefficient
5. a list of the standardized coefficients (betas)
6. the fraction of variance accounted for, aka r-square
7. the ratio of MSR (see #12) to MSE (see #13), aka F
8. a list of the portion of the SSR due to each iv
9. a list of the fraction of variance accounted for by each iv
10. the sum of squares of the regression, aka SSR
11. the sum of squares of the residuals, aka SSE, aka chi-square
12. the mean squared error of the regression, aka MSR
13. the mean squared error of the residuals, aka MSE
14. a list of indices of “zeroed” independent variables

This function returns a lot of information about the regression. Consider using
the sibling functions -minimal and -brief if you need less information.

MULTIPLE-MODES (DATA K &KEY START END KEY)
Returns the ‘k’ most frequent elements of ‘data,’ which should be a sequence.
The algorithm involves sorting, and so the data must be numbers or the ‘key’
function must produce numbers. Consider #’sxhash if no better function is
available. Also returns the number of occurrences of each mode. The value is
an association list of modes and their counts. This function is a little more
computationally expensive than ‘mode,’ so only use it if you really need
multiple modes.

Computes the cumulative distribution function for a Poisson random variable
with mean ‘x’ evaluated at ‘k.’ The result is the probability that the number of
Poisson random events occurring will be between 0 and k-1 inclusive, if the
expected number is ‘x.’ The argument ‘k’ should be an integer, while ‘x’ should
be a float. The implementation follows Numerical Recipes in C, section 6.2

QUANTILE (DATA Q &KEY START END KEY)
Returns the element which is the ‘q’th quantile of the data when accessed by
‘key.’ That is, it returns the element such that a fraction ‘q’ of the data is
smaller than it and 1-‘q’ is larger, where ‘q’ is a number between zero and
one, inclusive.
For example, if ‘q’ is .5, this returns the median; if ‘q’ is 0, this returns
the minimum (although the ‘minimum’ function is more efficient).

This function uses the bisection method, doing linear interpolation between
elements i and i+1, where i=floor(q(n-1)). See the manual for more information.
The function returns three values: the interpolated quantile and the two
elements that determine the interval it was interpolated in. If the quantile
was exact, the second two values are the same element of the data.
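
The interpolation scheme and the three-value return convention can be sketched in Python (an illustration, not the CLASP code; ‘start,’ ‘end’ and ‘key’ are omitted for brevity):

```python
def quantile(data, q):
    # Interpolated quantile plus the two elements bracketing it, using
    # i = floor(q*(n-1)) and linear interpolation, as described above.
    xs = sorted(data)
    pos = q * (len(xs) - 1)
    i = int(pos)
    frac = pos - i
    if frac == 0:
        return xs[i], xs[i], xs[i]       # exact: all three values coincide
    return xs[i] + frac * (xs[i + 1] - xs[i]), xs[i], xs[i + 1]
```

For four elements and q = 0.5, the position 1.5 falls midway between the second and third order statistics, so `quantile([3, 1, 4, 2], 0.5)` returns (2.5, 2, 3).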

Equivalent to (* factor (round n factor)). For example, ‘round-to-factor’ of
65 and 60 is 60. Useful for converting to certain units, say when converting
minutes to the nearest hours. See also ‘truncate-to-factor.’
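
Both this function and ‘truncate-to-factor’ are one-liners; a Python sketch (note that Lisp’s ‘round’ rounds ties to even, as Python’s does, but ‘truncate’ in Lisp truncates toward zero):

```python
def round_to_factor(n, factor):
    # Round n to the nearest multiple of factor.
    return factor * round(n / factor)

def truncate_to_factor(n, factor):
    # Truncate n toward zero to a multiple of factor.
    return factor * int(n / factor)
```

So `round_to_factor(65, 60)` and `truncate_to_factor(65, 60)` are both 60, while `round_to_factor(95, 60)` is 120 and `truncate_to_factor(95, 60)` is still 60.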

Performs all pairwise comparisons between group means, testing for
significance using Scheffe’s F-test. Returns an upper-triangular table in a
format described in the manual. Also see the function ‘print-scheffe-table.’

‘Group-means’ and ‘group-sizes’ should be sequences. The arguments ‘ms-error’
and ‘df-error’ are the mean square error within groups and its degrees of
freedom, both of which are computed by the analysis of variance. An ANOVA test
should always be run first, to see if there are any significant differences.

Smooths ‘data’ by successive smoothing: 4,median; then 2,median; then
5,median; then 3,median; then hanning. The ends are handled by duplicating the
end elements. This function is not destructive; it returns a list the same
length as ‘data,’ which should be a list of numbers.

Smooths ‘data’ by replacing each element with the weighted mean of it and its
two neighbors. The weights are 1/2 for itself and 1/4 for each neighbor. The
ends are handled by duplicating the end elements. This function is not
destructive; it returns a list the same length as ‘data,’ which should be a
sequence of numbers.
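
A Python sketch of this hanning-style smoother (an illustration of the weighting scheme, not the CLASP code):

```python
def smooth_hanning(data):
    # Weighted mean of each element with its neighbors, weights 1/4, 1/2,
    # 1/4; the ends are handled by duplicating the end elements.
    padded = [data[0]] + list(data) + [data[-1]]
    return [padded[i - 1] / 4 + padded[i] / 2 + padded[i + 1] / 4
            for i in range(1, len(padded) - 1)]
```

For example, `smooth_hanning([0, 4, 0, 4])` returns [1.0, 2.0, 2.0, 3.0], a list the same length as the input.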

Smooths ‘data’ by replacing each element with the mean of it and its two
neighbors. The ends are handled by duplicating the end elements. This function
is not destructive; it returns a list the same length as ‘data,’ which should be
a sequence of numbers.

Smooths ‘data’ by replacing each element with the mean of it, its left
neighbor, and its two right neighbors. The ends are handled by duplicating the
end elements. This function is not destructive; it returns a list the same
length as ‘data,’ which should be a sequence of numbers.

Smooths ‘data’ by replacing each element with the median of it, its two left
neighbors and its two right neighbors. The ends are handled by duplicating the
end elements. This function is not destructive; it returns a list the same
length as ‘data,’ which should be a sequence of numbers.

Smooths ‘data’ by replacing each element with the median of it and its
neighbor on the left. A median of two elements is the same as their mean. The
end is handled by duplicating the end element. This function is not
destructive; it returns a list the same length as ‘data,’ which should be a
sequence of numbers.

Smooths ‘data’ by replacing each element with the median of it and its two
neighbors. The ends are handled by duplicating the end elements. This function
is not destructive; it returns a list the same length as ‘data,’ which should be
a sequence of numbers.

Smooths ‘data’ by replacing each element with the median of it, its left
neighbor, and its two right neighbors. The ends are handled by duplicating the
end elements. This function is not destructive; it returns a list the same
length as ‘data,’ which should be a sequence of numbers.

Smooths ‘data’ by replacing each element with the median of it, its two left
neighbors and its two right neighbors. The ends are handled by duplicating the
end elements. This function is not destructive; it returns a list the same
length as ‘data,’ which should be a sequence of numbers.

Student’s distribution is much like the Gaussian distribution except with
heavier tails, depending on the number of degrees of freedom, ‘dof.’ As ‘dof’
goes to infinity, Student’s distribution approaches the Gaussian. This function
computes the significance of ‘t-statistic.’ Values range from 0.0 to 1.0: small
values suggest that the null hypothesis—that ‘t-statistic’ is drawn from a t
distribution—should be rejected. The ‘t-statistic’ parameter should be a
float, while ‘dof’ should be an integer.

The null hypothesis is roughly that ‘t-statistic’ is zero; you must specify your
alternative hypothesis (H1) via the ‘tails’ parameter, which must be :both,
:positive or :negative. The first corresponds to a two-tailed test: H1 is that
‘t-statistic’ is not zero, but you are not specifying a direction. If the
parameter is :positive, H1 is that ‘t-statistic’ is positive, and similarly for
:negative.

T-TEST (SAMPLE-1 SAMPLE-2 &OPTIONAL (TAILS BOTH) (H0MEAN 0))
Returns the t-statistic for the difference in the means of two samples, which
should each be a sequence of numbers. Let D=mean1-mean2. The null hypothesis
is that D=0. The alternative hypothesis is specified by ‘tails’: ‘:both’ means
D/=0, ‘:positive’ means D>0, and ‘:negative’ means D<0. Unless you’re using
:both tails, be careful what order the two samples are in: it matters!

The function also returns the significance, the standard error, and the degrees
of freedom. Signals ‘standard-error-is-zero’ if that condition occurs. Signals
‘insufficient-data’ unless there are at least two elements in each sample.

T-TEST-MATCHED (SAMPLE1 SAMPLE2 &OPTIONAL (TAILS BOTH))
Returns the t-statistic for two matched samples, which should be equal-length
sequences of numbers. Let D=mean1-mean2. The null hypothesis is that D=0. The
alternative hypothesis is specified by ‘tails’: ‘:both’ means D/=0, ‘:positive’
means D>0, and ‘:negative’ means D<0. Unless you’re using :both tails, be
careful what order the two samples are in: it matters!

The function also returns the significance, the standard error, and the degrees
of freedom. Signals ‘standard-error-is-zero’ if that condition occurs. Signals
‘insufficient-data’ unless there are at least two elements in each sample.

T-TEST-ONE-SAMPLE (DATA TAILS &OPTIONAL (H0-MEAN 0) &KEY START END KEY)
Returns the t-statistic for the mean of the data, which should be a sequence
of numbers. Let D be the sample mean. The null hypothesis is that D equals the
‘H0-mean.’ The alternative hypothesis is specified by ‘tails’: ‘:both’ means D
/= H0-mean, ‘:positive’ means D > H0-mean, and ‘:negative’ means D < H0-mean.

The function also returns the significance, the standard error, and the degrees
of freedom. Signals ‘zero-variance’ if that condition occurs. Signals
‘insufficient-data’ unless there are at least two elements in the sample.
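
The t statistic itself is straightforward to sketch in Python (illustrative only; looking up the significance requires the t distribution, which is not shown here):

```python
import math
from statistics import mean, stdev

def t_statistic_one_sample(data, h0_mean=0):
    # t = (sample mean - H0 mean) / standard error, with n - 1 degrees
    # of freedom; the significance is then read from a t distribution.
    n = len(data)
    se = stdev(data) / math.sqrt(n)      # standard error of the mean
    dof = n - 1
    return (mean(data) - h0_mean) / se, se, dof
```

For the data (1 2 3 4 5) against an H0 mean of 0, the statistic is 3/sqrt(0.5), about 4.24, with 4 degrees of freedom.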

TRIMMED-MEAN (DATA PERCENTAGE &KEY START END KEY)
Returns a trimmed mean of ‘data.’ A trimmed mean is an ordinary, arithmetic
mean of the data, except that an outlying percentage has been discarded. For
example, suppose there are ten elements in ‘data,’ and ‘percentage’ is 0.1: the
result would be the mean of the middle eight elements, having discarded the
biggest and smallest elements. If ‘percentage’ doesn’t result in a whole number
of elements being discarded, then a fraction of the remaining biggest and
smallest is discarded. For example, suppose ‘data’ is ’(1 2 3 4 5) and
‘percentage’ is 0.25: the result is (.75(2) + 3 + .75(4))/(.75+1+.75) or 3. By
convention, the 0.5 trimmed mean is the median, which is always returned as a
number.
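
The fractional-discard rule can be sketched in Python (an illustration of the scheme in the examples above, not the CLASP code):

```python
import math

def trimmed_mean(data, percentage):
    # Discard `percentage` of the sorted data from each end; when
    # n*percentage is not a whole number, the next-innermost elements
    # keep only the leftover fractional weight, as in the examples.
    xs = sorted(data)
    g = len(xs) * percentage
    k = math.floor(g)                  # whole elements dropped per end
    frac = g - k                       # fractional weight removed next
    xs = xs[k:len(xs) - k]
    weights = [1.0] * len(xs)
    weights[0] = weights[-1] = 1 - frac
    return sum(w * x for w, x in zip(weights, xs)) / sum(weights)
```

This reproduces both examples: ten elements at 0.1 give the plain mean of the middle eight, and (1 2 3 4 5) at 0.25 gives (.75(2) + 3 + .75(4))/(.75+1+.75) = 3.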

Equivalent to (* factor (truncate n factor)). For example,
‘truncate-to-factor’ of 65 and 60 is 60. Useful for converting to certain
units, say when converting minutes to hours and minutes. See also
‘round-to-factor.’

TUKEY-SUMMARY (DATA &KEY START END KEY)
Computes a Tukey five-number summary of the data. That is, it returns, in
increasing order, the extremes and the quartiles: the minimum, the 1/4 quartile,
the median, the 3/4 quartile, and the maximum.

VARIANCE (DATA &KEY START END KEY)
Returns the variance of ‘data,’ that is, the ‘sum-of-squares’ divided by
n-1. Signals ‘no-data’ if there is no data. Signals ‘insufficient-data’ if
there is only one datum.

5.2.3 Macros

Generate error if the value of ARG-NAME doesn’t satisfy PREDICATE.
PREDICATE is a function name (a symbol) or an expression to compute.
TYPE-STRING is a string to use in the error message, such as "a list".
ERROR-TYPE-NAME is a keyword that tells condition handlers what type was desired.

In clasp, statistical objects have two parts, a class which stores the
various parts of the object and a computing function which computes the value
of the object from arguments. The define-statistic macro allows the
definition of new statistical types. The define-statistic macro must be
provided with all the information necessary to create a statistical object,
that is, everything required to create a new class, everything required to
create a computing function and some information to connect the two. This
last part consists of a list of arguments and their types and a list which
determines how the values of a statistical function should be used to fill the
slots of a statistical object.

When define-statistic is invoked, two things happen. First, a class is defined
which is a subclass of ’statistic and of any other named ‘superclasses’.
Second, a pair of functions is defined. ‘clasp-statistics::name’ is an internal
function which has the supplied ‘body’ and ‘lambda-list’ and must return as
many values as there are slots in the class ‘name’. The function ‘name’ is
also defined; it is essentially a wrapper which converts its arguments
to those accepted by ‘body’ and then calls ‘clasp-statistics::name’.
The parameter clasp:*create-statistical-objects* determines whether the
wrapper function packages the values returned by the internal function into a
statistical object or simply returns them as multiple values.

The ‘argument-types’ argument must be an alist in which the keys are the
names of arguments as given in ‘lambda-list’ and the values are lisp types
to which those arguments will be converted before calling the internal
statistical function. The primary purpose of this is to allow for coercion of
clasp variables to sequences, but any coercion allowed by lisp is
acceptable. The ‘values’ argument allows the programmer to specify which
slots in the statistical object are filled by which of the values returned by
the statistical function. By default, the values are assumed to fill the
direct slots in order of specification, followed by the inherited slots in
order of specification in those superclasses which are also statistics.

5.2.4 Functions

Performs a one-way analysis of variance (ANOVA) on the ‘data,’ which should
be a sequence of sequences, where each interior sequence is the data for a
particular group. Furthermore, each sequence should consist entirely of
numbers, and each should have at least 2 elements.

The significance of the result indicates that the group means are not all equal;
that is, at least two of the groups have significantly different means. If
there were only two groups, this would be semantically equivalent to an
unmatched, two-tailed t-test, so you can think of the one-way ANOVA as a
multi-group, two-tailed t-test.

This function returns five values: 1. an ANOVA table; 2. a list of group means;
3. either a Scheffe table or nil depending on ‘scheffe-tests-p’; 4. an
alternate value for SST; and 5. a list of confidence intervals in the form
‘(,mean ,lower ,upper)’ for each group, if ‘confidence-intervals’ is a number
between zero and one, giving the kind of confidence interval, such as 0.9. The fourth
value is only interesting if you think there are numerical accuracy problems; it
should be approximately equal to the SST value in the ANOVA table. This
function differs from ‘anova-one-way-variables’ only in its input
representation. See the manual for more information.
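The heart of the computation partitions the total sum of squares into between-group and within-group components and forms the F ratio from their mean squares. A Python sketch of just that arithmetic, not the full five-value interface:

```python
def one_way_f(groups):
    # groups: a sequence of sequences of numbers, one inner sequence per group.
    all_data = [x for g in groups for x in g]
    n = len(all_data)
    grand_mean = sum(all_data) / n
    group_means = [sum(g) / len(g) for g in groups]
    # Between-group (treatment) and within-group (error) sums of squares.
    ssb = sum(len(g) * (m - grand_mean) ** 2
              for g, m in zip(groups, group_means))
    ssw = sum((x - m) ** 2
              for g, m in zip(groups, group_means) for x in g)
    df_between = len(groups) - 1
    df_within = n - len(groups)
    f = (ssb / df_between) / (ssw / df_within)
    return f, group_means
```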

Calculates the analysis of variance when there are two factors that may
affect the dependent variable. Because the input is represented as an array, we
can refer to these two factors as the row-effect and the column effect. Unlike
the one-way ANOVA, there are mathematical difficulties with the two-way ANOVA if
there are unequal cell sizes; therefore, we require all cells to be the same
size, and so the input is a three-dimensional array.

The result of the analysis is an anova-table, as described in the manual. This
function differs from ‘anova-two-way-variables’ only in its input
representation. See the manual for further discussion of analysis of variance.

Performs a chi-square test for independence of the two variables, ‘v1’ and
‘v2.’ These should be categorical variables with only two values; the function
will construct a 2x2 contingency table by counting the number of occurrences of
each combination of the variables. See the manual for more details.

Performs a chi-square test for independence of the two variables, ‘v1’ and
‘v2.’ These should be categorical variables; the function will construct a
contingency table by counting the number of occurrences of each combination of
the variables. See the manual for more details.

Calculates the chi-square statistic and corresponding p-value for the given
contingency table. The result says whether the row factor is independent of the
column factor. Does not apply Yates’s correction.
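The statistic is the familiar sum of (observed - expected)^2 / expected over the cells, with expected values taken from the row and column margins. A Python sketch of the arithmetic (no Yates correction, matching the entry above):

```python
def chi_square_counts(table):
    # table: a list of rows of observed counts.
    rows, cols = len(table), len(table[0])
    row_tot = [sum(r) for r in table]
    col_tot = [sum(table[i][j] for i in range(rows)) for j in range(cols)]
    grand = sum(row_tot)
    chi2 = 0.0
    for i in range(rows):
        for j in range(cols):
            expected = row_tot[i] * col_tot[j] / grand
            chi2 += (table[i][j] - expected) ** 2 / expected
    dof = (rows - 1) * (cols - 1)
    return chi2, dof
```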

This function is just like ‘confidence-interval-z,’ except that instead of
its arguments being the actual data, it takes the following summary statistics:
‘mean’, a point estimator of the mean of some normally distributed population;
and the ‘standard-error’ of the estimator, that is, the estimated standard
deviation of the normal population. ‘Confidence’ should be a number between 0
and 1, exclusive.

Returns the critical value of some statistic. The function ‘p-function’
should be a unary function mapping statistics—x values—to their
significance—p values. The function will find the value of x such that the
p-value is ‘p-value.’ The function works by binary search. A secant method
might be better, but this seems to be acceptably fast. Only positive values of
x are considered, and ‘p-function’ should be monotonically decreasing from its
value at x=0. The binary search ends when either the function value is within
‘y-tolerance’ of ‘p-value’ or the size of the search region shrinks to less than
‘x-tolerance.’
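The search described above can be sketched in Python. The initial bracketing loop, which doubles the upper bound until the target p-value is enclosed, is an assumption about initialization; the manual only specifies the binary search itself.

```python
import math

def critical_value(p_function, p_value, x_tolerance=1e-6, y_tolerance=1e-6):
    # p_function must be monotonically decreasing for x >= 0.
    lo, hi = 0.0, 1.0
    while p_function(hi) > p_value:     # expand until the target is bracketed
        lo, hi = hi, hi * 2.0
    # Binary search within [lo, hi].
    while hi - lo > x_tolerance:
        mid = (lo + hi) / 2.0
        p = p_function(mid)
        if abs(p - p_value) <= y_tolerance:
            return mid
        if p > p_value:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2.0

# Example: the two-tailed normal p-value is erfc(x / sqrt(2)), so the
# critical value at p = 0.05 comes out near 1.96.
z = critical_value(lambda x: math.erfc(x / math.sqrt(2)), 0.05)
```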

Calculates the G-test for a contingency table. The formula for the
G-test statistic is

2 * sum[f_ij log [f_ij/f-hat_ij]]

where f_ij is the ith by jth cell in the table and f-hat_ij is the
expected value of that cell. If an expected-value-matrix is supplied,
it must be the same size as table and it is used for expected values,
in which case the G-test is a test of goodness-of-fit. If the
expected value matrix is unsupplied, it is calculated using the
formula

e_ij = [f_i* * f_*j] / f_**

where f_i*, f_*j and f_** are the row, column and grand totals
respectively. In this case, the G-test is a test of independence. The degrees
of freedom are the same as for the chi-square statistic, and the significance
is obtained by comparing the G statistic to the chi-square distribution with
those degrees of freedom.
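Putting the two formulas together, here is a Python sketch of the G statistic. Skipping cells with zero observed count (which contribute nothing to the sum) is an assumption of the sketch.

```python
import math

def g_statistic(table, expected=None):
    # table: a list of rows of observed counts f_ij.
    rows, cols = len(table), len(table[0])
    if expected is None:
        # Test of independence: e_ij = f_i* * f_*j / f_**
        row_tot = [sum(r) for r in table]
        col_tot = [sum(table[i][j] for i in range(rows)) for j in range(cols)]
        grand = sum(row_tot)
        expected = [[row_tot[i] * col_tot[j] / grand for j in range(cols)]
                    for i in range(rows)]
    # G = 2 * sum f_ij log(f_ij / e_ij)
    g = 2.0 * sum(table[i][j] * math.log(table[i][j] / expected[i][j])
                  for i in range(rows) for j in range(cols)
                  if table[i][j] > 0)
    dof = (rows - 1) * (cols - 1)
    return g, dof
```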

Returns the inner product of the two samples, which should be sequences of
numbers. The inner product, also called the dot product or vector product, is
the sum of the pairwise multiplication of the numbers. Stops when either sample
runs out; it doesn’t check that they have the same length.
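In Python, zip has exactly the stop-at-the-shorter behavior described:

```python
def inner_product(a, b):
    # zip stops when either sequence runs out, so lengths need not match.
    return sum(x * y for x, y in zip(a, b))
```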

Returns the most frequent element of ‘data,’ which should be a sequence. The
algorithm involves sorting, and so the data must be numbers or the ‘key’
function must produce numbers. Consider ‘sxhash’ if no better function is
available. Also returns the number of occurrences of the mode. If there is
more than one mode, this returns the first mode, as determined by the sorting of
the numbers.
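The sort-then-scan algorithm can be sketched as follows; with ties, the first run in sorted order wins, as described. This is an illustration in Python, not the clasp implementation.

```python
def mode(data, key=lambda x: x):
    # Sorting brings equal keys together; then scan for the longest run.
    xs = sorted(data, key=key)
    best, best_count = xs[0], 1
    current, count = xs[0], 1
    for x in xs[1:]:
        if key(x) == key(current):
            count += 1
        else:
            current, count = x, 1
        if count > best_count:
            best, best_count = x, count
    return best, best_count
```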

Keep in mind that if the data has multiple runs of like values that are bigger
than the window size (which currently defaults to 10% of the size of the data),
this function will blindly pick the first one. If this is the case, you
probably should be calling ‘mode’ instead of this function.

Multiply matrices MATRIX-1 and MATRIX-2, storing into MATRIX-3 if supplied.
If MATRIX-3 is not supplied, then a new (ART-Q type) array is returned;
otherwise MATRIX-3 must have exactly the right dimensions for holding the
result of the multiplication. Both MATRIX-1 and MATRIX-2 must be either one-
or two-dimensional. The first dimension of MATRIX-2 must equal the second
dimension of MATRIX-1, unless MATRIX-1 is one-dimensional, in which case the
first dimensions must match (thus allowing multiplications of the form
VECTOR x MATRIX).

Prints ‘scheffe-table’ on ‘stream.’ If the original one-way anova data had N
groups, the Scheffe table prints as an n-1 x n-1 upper-triangular table. If
‘group-means’ is given, it should be a list of the group means, which will be
printed along with the table.

Solves A X = B for a vector ‘X,’ where A is specified by the mxn array U, ‘n’
vector W, and nxn matrix V as returned by svdcmp. ‘m’ and ‘n’ are the
dimensions of ‘A,’ and will be equal for square matrices. ‘B’ is the 1xm input
vector for the right-hand side. ‘X’ is the 1xn output solution vector. All
arrays are of double-floats. No input quantities are destroyed, so the routine
may be called sequentially with different B’s. See the discussion in Numerical
Recipes in C, section 2.6.

This routine assumes that near zero singular values have already been zeroed.
It returns no values, storing the result in ‘X.’ It does use some auxiliary
storage, which can be passed in as ‘tmp,’ a double-float array of length ‘n,’ if
you want to avoid consing.
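The backsubstitution itself is short: with A = U diag(w) V^T, the solution is x = V diag(1/w) U^T b, skipping zeroed singular values. A plain-Python sketch using nested lists for the arrays (not the clasp routine itself):

```python
def sv_back_substitute(u, w, v, b):
    # u: m x n, w: length-n singular values (small ones already zeroed),
    # v: n x n, b: length-m right-hand side.  Returns x of length n.
    m, n = len(u), len(w)
    # tmp = diag(1/w) U^T b, skipping zeroed singular values.
    tmp = []
    for j in range(n):
        s = 0.0
        if w[j] != 0.0:
            s = sum(u[i][j] * b[i] for i in range(m)) / w[j]
        tmp.append(s)
    # x = V tmp
    return [sum(v[j][k] * tmp[k] for k in range(n)) for j in range(n)]
```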

Solves A X = B for a vector ‘X,’ where A is specified by the mxn array U, ‘n’
vector W, and nxn matrix V as returned by svdcmp. ‘m’ and ‘n’ are the
dimensions of ‘A,’ and will be equal for square matrices. ‘B’ is the 1xm input
vector for the right-hand side. ‘X’ is the 1xn output solution vector. All
arrays are of single-floats. No input quantities are destroyed, so the routine
may be called sequentially with different B’s. See the discussion in Numerical
Recipes in C, section 2.6.

This routine assumes that near zero singular values have already been zeroed.
It returns no values, storing the result in ‘X.’ It does use some auxiliary
storage, which can be passed in as ‘tmp,’ a single-float array of length ‘n,’ if
you want to avoid consing.

Computes the inverse of a matrix that has been decomposed into ‘u,’ ‘w’ and
‘v’ by singular value decomposition. It assumes the “small” elements of ‘w’
have already been zeroed. It computes the inverse by taking advantage of the
known zeros in the full 2-dimensional ‘w’ matrix. It uses the backsubstitution
algorithm, only with the B vectors fixed at the columns of the identity matrix,
which lets us take advantage of its zeros. It’s about twice as fast as the slow
version and conses a lot less. Note that if you are computing the inverse
merely to solve one or more systems of equations, you are better off using the
decomposition and backsubstitution routines directly.

Computes the inverse of a matrix that has been decomposed into ‘u,’ ‘w’ and
‘v’ by singular value decomposition. It assumes the “small” elements of ‘w’
have already been zeroed. It computes the inverse by taking advantage of the
known zeros in the full 2-dimensional ‘w’ matrix. It uses the backsubstitution
algorithm, only with the B vectors fixed at the columns of the identity matrix,
which lets us take advantage of its zeros. It’s about twice as fast as the slow
version and conses a lot less. Note that if you are computing the inverse
merely to solve one or more systems of equations, you are better off using the
decomposition and backsubstitution routines directly.

Computes the inverse of a matrix that has been decomposed into ‘u,’ ‘w’ and
‘v’ by singular value decomposition. It assumes the “small” elements of ‘w’
have already been zeroed. It computes the inverse by constructing a diagonal
matrix ‘w2’ from ‘w’ (which is just a vector of the diagonal elements), and
then explicitly multiplying u^t, w2 and v. Note that if you are computing the inverse
merely to solve one or more systems of equations, you are better off using the
decomposition and backsubstitution routines directly.

Computes the inverse of a matrix that has been decomposed into ‘u,’ ‘w’ and
‘v’ by singular value decomposition. It assumes the “small” elements of ‘w’
have already been zeroed. It computes the inverse by constructing a diagonal
matrix ‘w2’ from ‘w’ (which is just a vector of the diagonal elements), and
then explicitly multiplying u^t, w2 and v. Note that if you are computing the inverse
merely to solve one or more systems of equations, you are better off using the
decomposition and backsubstitution routines directly.

Use singular value decomposition to compute the inverse of ‘A.’ If an exact
inverse is not possible, then zero the otherwise infinite inverted singular
value and compute the inverse. The inverse is returned; ‘A’ is not destroyed.
If you’re using this to solve several systems of equations, you’re better off
computing the singular value decomposition and using it several times, because
this function computes it anew each time.

Returns solution of linear system matrix * solution = b-vector. Employs the
singular value decomposition method. See the discussion in Numerical Recipes in
C, section 2.6, especially as to the semantics of ‘threshold.’

If the relative magnitude of an element in ‘w’ compared to the largest
element is less than ‘threshold,’ then zero that element. Returns a list of
indices of the zeroed elements. This function is just a convenient wrapper for
‘svzero-sf’ and ‘svzero-df.’
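The thresholding itself amounts to this (a Python sketch of the arithmetic):

```python
def sv_zero(w, threshold):
    # Zero elements of `w` whose magnitude relative to the largest element
    # falls below `threshold`; return the indices that were zeroed.
    w_max = max(w)
    zeroed = [i for i, wi in enumerate(w) if wi < threshold * w_max]
    for i in zeroed:
        w[i] = 0.0
    return zeroed
```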

Given an ‘m’x‘n’ matrix ‘A,’ this routine computes its singular value
decomposition, A = U W V^T. The matrix U replaces ‘A’ on output. The diagonal
matrix of singular values W is output as a vector ‘W’ of length ‘n.’ The matrix
‘V’ – not the transpose V^T – is output as an ‘n’x‘n’ matrix ‘V.’ The row
dimension ‘m’ must be greater than or equal to ‘n’; if it is smaller, then ‘A’ should
be filled up to square with zero rows. See the discussion in Numerical Recipes
in C, section 2.6.

This routine returns no values, storing the results in ‘A,’ ‘W,’ and ‘V.’ It
does use some auxiliary storage, which can be passed in as ‘rv1,’ a double-float
array of length ‘n,’ if you want to avoid consing.

Given an ‘m’x‘n’ matrix ‘A,’ this routine computes its singular value
decomposition, A = U W V^T. The matrix U replaces ‘A’ on output. The diagonal
matrix of singular values W is output as a vector ‘W’ of length ‘n.’ The matrix
‘V’ – not the transpose V^T – is output as an ‘n’x‘n’ matrix ‘V.’ The row
dimension ‘m’ must be greater than or equal to ‘n’; if it is smaller, then ‘A’ should
be filled up to square with zero rows. See the discussion in Numerical Recipes
in C, section 2.6.

This routine returns no values, storing the results in ‘A,’ ‘W,’ and ‘V.’ It
does use some auxiliary storage, which can be passed in as ‘rv1,’ a single-float
array of length ‘n,’ if you want to avoid consing. All input arrays should be
of single-floats.

Given ‘v’ and ‘w’ as computed by singular value decomposition, computes the
covariance matrix among the predictors. Based on Numerical Recipes in C,
section 15.4, algorithm ‘svdvar.’ The covariance matrix is returned. It can be
supplied as the third argument.

If the relative magnitude of an element in ‘w’ compared to the largest
element is less than ‘threshold,’ then zero that element. If ‘report?’ is true,
the indices of zeroed elements are printed. Returns a list of the indices of
zeroed elements. This routine uses double-floats.

If the relative magnitude of an element in ‘w’ compared to the largest
element is less than ‘threshold,’ then zero that element. If ‘report?’ is true,
the indices of zeroed elements are printed. Returns a list of indices of the
zeroed elements. This routine uses single-floats.