Laboratory in Oceanography: Data and Methods

The bootstrap method involves choosing random samples with replacement from a data set and analyzing each sample the same way as the original data set. The number of elements in each bootstrap sample equals the number of elements in the original data set. The spread of the resulting sample estimates provides a means of estimating the uncertainty of the quantity being estimated.

In general, the bootstrap method can be used to compute the uncertainty of any functional calculation, provided the sample data set is ‘representative’ of the true distribution.
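The resampling loop described above can be sketched in a few lines of MATLAB. This is a minimal illustration under stated assumptions, not the course's own code: the data vector, the number of bootstrap samples (nboot), and the choice of the mean as the statistic are all made up for the example. The Statistics Toolbox function bootstrp performs the same resampling in one call.

```matlab
% Bootstrap estimate of the uncertainty of a sample mean.
% The data vector is made up for illustration only.
rng(1);                                   % fix the seed so the run is repeatable
x     = [2.1 3.4 1.9 5.0 4.2 3.8 2.7 4.9 3.1 3.6]';
n     = numel(x);
nboot = 2000;                             % number of bootstrap samples (assumed)

bootMeans = zeros(nboot, 1);
for k = 1:nboot
    idx          = randi(n, n, 1);        % n indices drawn with replacement
    bootMeans(k) = mean(x(idx));          % same analysis as for the original data
end

seBoot = std(bootMeans);                  % bootstrap standard error of the mean
srt    = sort(bootMeans);
ci95   = srt(round([0.025 0.975] * nboot))';   % simple percentile interval
```

Each bootstrap sample has the same number of elements as x, as the method requires; only the statistic inside the loop needs to change to study a different quantity.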

Jackknife Method

The jackknife is similar to the bootstrap, but it re-samples by leaving out one observation at a time and uses the resulting estimates to assess the bias and variance of sample statistics.
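A leave-one-out jackknife can be sketched as follows. The data and the choice of statistic (the divide-by-n variance, which is known to be biased) are assumptions made for illustration; the Statistics Toolbox also provides a jackknife function that returns the leave-one-out statistics directly.

```matlab
% Jackknife estimate of the bias and variance of a statistic.
% Data are made up; the statistic is the biased (divide-by-n) variance.
x    = [2.1 3.4 1.9 5.0 4.2 3.8 2.7 4.9 3.1 3.6]';
n    = numel(x);
stat = @(v) var(v, 1);                    % statistic of interest

thetaHat = stat(x);                       % estimate from the full sample
thetaLoo = zeros(n, 1);
for i = 1:n
    thetaLoo(i) = stat(x([1:i-1, i+1:n]));    % leave observation i out
end

biasJack  = (n - 1) * (mean(thetaLoo) - thetaHat);              % bias estimate
varJack   = (n - 1) / n * sum((thetaLoo - mean(thetaLoo)).^2);  % variance estimate
thetaCorr = thetaHat - biasJack;          % bias-corrected estimate
```

For this particular statistic the bias-corrected value coincides with the usual divide-by-(n-1) sample variance, which makes a convenient check on the loop.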

null hypothesis – an assertion about a population. It is "null" in that it represents a status quo belief, such as the absence of a characteristic or the lack of an effect.

alternative hypothesis – a contrasting assertion about the population that can be tested against the null hypothesis

H1: µ ≠ null hypothesis value (two-tailed test)

H1: µ > null hypothesis value (right-tail test)

H1: µ < null hypothesis value (left-tail test)

test statistic – a quantity computed from a random sample of the population to characterize that sample. The statistic varies with the type of test, but its distribution under the null hypothesis must be known (or assumed).

p-value – the probability, under the null hypothesis, of obtaining a value of the test statistic as extreme or more extreme than the value computed from the sample.

significance level – a probability threshold α; a typical value of α is 0.05. If p-value < α, the test rejects the null hypothesis; if p-value > α, there is insufficient evidence to reject the null hypothesis.

confidence interval – an estimated range of values with a specified probability of containing the true population value of a parameter.

Intro to Statistics Toolbox

Statistics Toolbox/Hypothesis Tests

Hypothesis Testing

Hypothesis tests make assumptions about the distribution of the random variable being sampled in the data. These must be considered when choosing a test and when interpreting the results.

The z-test (ztest) and the t-test (ttest) both assume that the data are independently sampled from a normal distribution.

Both tests are relatively robust to departures from this assumption, so long as the sample size n is large enough.

The difference between the z-test and the t-test lies in what is assumed about the standard deviation σ of the underlying normal distribution. A z-test assumes that σ is known; a t-test does not, so the t-test must estimate σ from the sample as the sample standard deviation s.

http://www.stats4students.com/Essentials/Standard-Score/Overview.php

ztest

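The test requires σ (the standard deviation of the population) to be known.

The formula for calculating the z score for the z-test is:

$$ z = \frac{\bar{x} - \mu}{\sigma / \sqrt{n}} $$

where:

x̄ is the sample mean

μ is the mean of the population

σ is the standard deviation of the population

n is the sample size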

The z-score is compared to a z-table, which contains the percent of area under the normal curve between the mean and the z-score. This table will indicate whether the calculated z-score is within the realm of chance, or if it is so different from the mean that the sample mean is unlikely to have happened by chance.
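In MATLAB the table lookup is replaced by the normal cumulative distribution function. The sketch below computes the z score and a two-tailed p-value by hand and then calls ztest for comparison; the sample, the null-hypothesis mean mu0, and the "known" sigma are all made-up values chosen for illustration.

```matlab
% One-sample z-test by hand and with the toolbox function ztest.
% All numbers are illustrative assumptions.
x     = [10.2 9.8 10.5 10.1 9.9 10.4 10.3 10.0];
mu0   = 10;        % mean under the null hypothesis
sigma = 0.3;       % population standard deviation, assumed known
n     = numel(x);

z = (mean(x) - mu0) / (sigma / sqrt(n));   % z score
p = 2 * normcdf(-abs(z));                  % two-tailed p-value

% Toolbox equivalent (significance level 0.05 by default):
% h = 1 means the null hypothesis is rejected, h = 0 that it is not.
[h, pTool, ci, zTool] = ztest(x, mu0, sigma);
```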

http://www.stats4students.com/Essentials/Standard-Score/Overview.php

ttest

Like the z-test, except that the t-test does not require σ to be known.

The formula for calculating the t score for the t-test is:

$$ t = \frac{\bar{x} - \mu}{s / \sqrt{n}} $$

where:

x̄ is the sample mean

μ is the mean of the population

s is the sample standard deviation

n is the sample size

Under the null hypothesis that the population is distributed with mean μ, the z-statistic has a standard normal distribution, N(0,1). Under the same null hypothesis, the t-statistic has Student's t distribution with n – 1 degrees of freedom.
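The t statistic and its n – 1 degrees of freedom can be checked against the toolbox function ttest. The sample and the null-hypothesis mean mu0 below are made up for illustration.

```matlab
% One-sample t-test by hand and with the toolbox function ttest.
% The sample and mu0 are illustrative assumptions.
x   = [10.2 9.8 10.5 10.1 9.9 10.4 10.3 10.0];
mu0 = 10;                                % mean under the null hypothesis
n   = numel(x);
s   = std(x);                            % sample standard deviation

t = (mean(x) - mu0) / (s / sqrt(n));     % t score
p = 2 * tcdf(-abs(t), n - 1);            % two-tailed p-value, n-1 degrees of freedom

% Toolbox equivalent; stats holds the fields tstat, df and sd.
[h, pTool, ci, stats] = ttest(x, mu0);
```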

http://www.socialresearchmethods.net/kb/stat_t.php

ttest2

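Performs a t-test of the null hypothesis that the data in the vectors x and y are independent random samples from normal distributions with equal means. By default the variances are assumed to be equal but unknown; the test can also be run without assuming that the unknown variances are equal.

The formula for calculating the t score for ttest2 is:

$$ t = \frac{\bar{x} - \bar{y}}{\sqrt{s_x^2 / n + s_y^2 / m}} $$

where:

x̄ and ȳ are the sample means

s_x and s_y are the sample standard deviations (replaced by the pooled standard deviation when equal variances are assumed)

n and m are the sample sizes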

The null hypothesis is that the two samples are distributed with the same mean.
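The equal-variance and unequal-variance forms of the two-sample statistic can be compared side by side. This is a sketch with made-up samples x and y; it uses the 'Vartype' option of ttest2 to select the unequal-variance form.

```matlab
% Two-sample t-test: pooled (equal-variance) and unequal-variance statistics.
% Both samples are made up for illustration.
x = [5.1 4.9 5.6 5.8 5.2 5.5];
y = [4.4 4.8 4.1 5.0 4.6 4.3 4.7];
n = numel(x);
m = numel(y);

% Unequal variances: each sample contributes its own variance.
tUnequal = (mean(x) - mean(y)) / sqrt(var(x)/n + var(y)/m);

% Equal variances: both sample variances are replaced by a pooled estimate.
sp      = sqrt(((n-1)*var(x) + (m-1)*var(y)) / (n + m - 2));
tPooled = (mean(x) - mean(y)) / (sp * sqrt(1/n + 1/m));

% Toolbox equivalents.
[h1, p1, ci1, st1] = ttest2(x, y);                          % equal variances (default)
[h2, p2, ci2, st2] = ttest2(x, y, 'Vartype', 'unequal');    % unequal variances
```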

Statistics Toolbox/Analysis of Variance

ANOVA (ANalysis Of VAriance)

ANOVA is like a t-test applied to multiple (typically more than two) data sets simultaneously.

t-tests can be done between two data sets, or between one data set and a “true” value.

ANOVA uses the F-distribution instead of the t-distribution.

ANOVA assumes that all of the data sets have equal variances.

One-way ANOVA is a simple special case of the linear model. The one-way ANOVA form of the model is

$$ y_{ij} = \alpha_{.j} + \varepsilon_{ij} $$

where:

y_ij is a matrix of observations in which each column represents a different group.

α_.j is a matrix whose columns are the group means. (The "dot j" notation means that α applies to all rows of column j. That is, the value α_ij is the same for all i.)

ε_ij is a matrix of random disturbances.

The model assumes that the columns of y are a constant plus a random disturbance. ANOVA tests if the constants are all the same.

One-way ANOVA

Example: Hogg and Ledolter bacteria counts in milk. Columns represent different shipments, rows are bacteria counts from cartons chosen randomly from each shipment. Do some shipments have higher counts than others?

The p-value comes from the F statistic of a hypothesis test of whether the mean bacteria counts are the same for all shipments.
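To make the origin of that F statistic concrete, the sketch below builds it from the between-shipment and within-shipment sums of squares and compares the result with anova1. It assumes the hogg sample data set shipped with the Statistics Toolbox is on the path; the variable names are chosen for this example.

```matlab
% One-way ANOVA on the hogg data: F statistic by hand and with anova1.
load hogg                                  % toolbox sample data, one column per shipment
[n, k] = size(hogg);                       % n cartons per shipment, k shipments

groupMeans = mean(hogg);                   % mean count of each shipment
grandMean  = mean(hogg(:));
SSb = n * sum((groupMeans - grandMean).^2);              % between-shipment sum of squares
SSw = sum(sum((hogg - repmat(groupMeans, n, 1)).^2));    % within-shipment sum of squares

Fhand = (SSb / (k - 1)) / (SSw / (k * (n - 1)));   % ratio of mean squares
pHand = 1 - fcdf(Fhand, k - 1, k * (n - 1));       % upper-tail probability

[p, tbl, stats] = anova1(hogg, [], 'off');   % 'off' suppresses the table and box plot
Ftool = tbl{2, 5};                           % F entry of the ANOVA table
```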

One-way ANOVA (cont’d)

In this case the p-value is about 0.0001, a very small value. This is a strong indication that the bacteria counts from the different shipments are not the same. An F statistic as extreme as this would occur by chance only once in 10,000 times if the counts were truly equal.

The p-value returned by anova1 depends on assumptions about the random disturbances εij in the model equation. For the p-value to be correct, these disturbances need to be independent, normally distributed, and of constant variance.

Multiple Comparisons

Sometimes we need to determine not just whether there are differences among means, but which pairs of means are significantly different.

In a t-test, we compute a t-statistic and compare it to a critical value. When testing multiple pairs, however, the chance of a false positive grows: if the probability of the t-statistic exceeding the critical value by chance is 5% for a single pair, then with 10 pairs it is much more likely that at least one of them will exceed it by chance alone.
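The size of this effect can be estimated with a short calculation. The sketch assumes the 10 tests are independent, which pairwise comparisons of the same groups generally are not, so the number is only indicative.

```matlab
% Chance of at least one false positive among several tests at level alpha.
% Assumes independent tests; pairwise comparisons only approximate this.
alpha  = 0.05;                             % per-test false-positive probability
nPairs = 10;                               % number of pairwise tests

pAtLeastOne = 1 - (1 - alpha)^nPairs;      % about 0.40 for these values
```

Under that assumption, a 5% per-test rate becomes roughly a 40% chance of at least one spurious "significant" pair.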

A multiple comparison test can be performed with the multcompare function by supplying it with the stats output from anova1.

Example:

>>load hogg

>>[p,tbl,stats] = anova1(hogg);

>>[c,m] = multcompare(stats)

Example:

see Light_DO.m

Two-way ANOVA

Determine whether data from several groups have a common mean. Differs from one-way ANOVA in that the groups in two-way ANOVA have two categories of defining characteristics instead of one (e.g., think of two independent variables/dimensions)

Two-way ANOVA is again a special case of the linear model. The two-way ANOVA form of the model is

$$ y_{ijk} = \mu + \alpha_{.j} + \beta_{i.} + \gamma_{ij} + \varepsilon_{ijk} $$

where y_ijk is the matrix of observations (with row index i, column index j, and repetition index k), and:

μ is a constant matrix of the overall mean.

α_.j is a matrix whose columns are the deviations of each observation attributable to the first independent variable. All values in a given column of α_.j are identical, and the values in each row sum to 0.

β_i. is a matrix whose rows are the deviations of each observation attributable to the second independent variable. All values in a given row of β_i. are identical, and the values in each column sum to 0.

γ_ij is a matrix of interactions. The values in each row sum to 0, and the values in each column sum to 0.

ε_ijk is a matrix of random disturbances.

The model assumes that the columns of y are a series of constants plus a random disturbance. You want to know if the constants are all the same.

Two-way ANOVA

Example: Determine effect of car model and factory on the mileage rating of cars.

There are three models (columns) and two factories (rows). Data from first factory is in first three rows, data from second factory is in last three rows. Do some cars have different mileage than others?

>> load mileage

mileage =

33.3000 34.5000 37.4000

33.4000 34.8000 36.8000

32.9000 33.8000 37.6000

32.6000 33.4000 36.6000

32.5000 33.7000 37.0000

33.0000 33.9000 36.7000

>> cars = 3;

>> [p,tbl,stats] = anova2(mileage,cars);

Two-way ANOVA (cont’d)

In this case the p-value for the first effect is zero to four decimal places. This indicates that the effect of the first predictor varies from one sample to another.

An F statistic as extreme as this would occur by chance only once in 10,000 times if the samples were truly equal.

The p-value for the second effect is 0.0039, which is also highly significant. This indicates that the effect of the second predictor varies from one sample to another.

There does not appear to be any interaction between the two predictors. The p-value, 0.8411, means that the observed result is quite likely (84 out of 100 times) given that there is no interaction.

The p-values returned by anova2 depend on assumptions about the random disturbances ε in the model equation. For the p-values to be correct, these disturbances need to be independent, normally distributed, and of constant variance.

In addition, anova2 requires that data be balanced, which means there must be the same number of samples for each combination of control variables. Other ANOVA methods support unbalanced data with any number of predictors.
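The three p-values discussed above come back from anova2 as a single vector. The sketch below shows one way to pull them apart; it assumes the mileage sample data set shipped with the toolbox, and the variable names are chosen for this example.

```matlab
% Two-way ANOVA on the mileage data: separate the three p-values.
load mileage                               % toolbox sample data: 6 rows, 3 columns
reps = 3;                                  % rows per factory (replicates per cell)

[p, tbl, stats] = anova2(mileage, reps, 'off');   % 'off' suppresses the table figure

pColumns     = p(1);    % first effect: car model (columns)
pRows        = p(2);    % second effect: factory (rows)
pInteraction = p(3);    % model-by-factory interaction
```

Because the data have 6 rows and reps = 3, each of the two factories contributes exactly 3 rows per model, which is the balanced layout anova2 requires.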

Statistics Toolbox/Regression Analysis

Linear Regression Models

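In statistics, linear regression models take the form of a summation of terms, each a coefficient multiplied by a function of the predictors.

More general linear regression models represent the relationship between a continuous response y and a continuous or categorical predictor x in the form:

$$ y = \beta_1 f_1(x) + \cdots + \beta_p f_p(x) + \varepsilon $$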

Example (system of equations):

Suppose we have a series of measurements of stream discharge and stage, measured at n different times.

time (day) = [0 14 28 42 56 70]

stage (m) = [0.612 0.647 0.580 0.629 0.688 0.583]

discharge (m3/s) = [0.330 0.395 0.241 0.338 0.531 0.279]

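Suppose we now wish to fit a rating curve to these measurements. Let x = stage and y = discharge; then we can write this series of measurements as:

y_i = m x_i + b, with i = 1:n.

This in turn can be written as y = Xb, or:

$$ \begin{bmatrix} y_1 \\ y_2 \\ \vdots \\ y_n \end{bmatrix} = \begin{bmatrix} x_1 & 1 \\ x_2 & 1 \\ \vdots & \vdots \\ x_n & 1 \end{bmatrix} \begin{bmatrix} m \\ b \end{bmatrix} $$

where the parameter vector on the right holds the slope m and the intercept b.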

yi = mxi + b

y = Xb
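With the stage and discharge values listed above, the system y = Xb can be solved in the least-squares sense with MATLAB's backslash operator. This is a minimal sketch; the variable names are chosen for the example, and the Statistics Toolbox function regress(y, X) should return the same coefficients.

```matlab
% Least-squares fit of discharge = m*stage + b using the measurements above.
stage     = [0.612 0.647 0.580 0.629 0.688 0.583]';   % x, in m
discharge = [0.330 0.395 0.241 0.338 0.531 0.279]';   % y, in m^3/s

X = [stage, ones(size(stage))];    % design matrix: one column for m, one for b
b = X \ discharge;                 % least-squares solution of y = X*b

slope     = b(1);                  % m
intercept = b(2);                  % b
resid     = discharge - X * b;     % residuals of the fit
```

The column of ones is what carries the intercept; dropping it would force the fitted line through the origin.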

Example: Harmonic Analysis:

sin(θ + φ) = sin(θ)cos(φ) + sin(φ)cos(θ)

Let: A = C cos(φ), B = C sin(φ)

=> C sin(ωt + φ) = A sin(ωt) + B cos(ωt)

Linear regression y = Xb
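Because the model is linear in A and B, a single harmonic can be fitted with the same least-squares machinery. The sketch below uses a synthetic signal: the amplitude C0, phase phi0, noise level, and sampling are all made up, and the period of 12.42 hours is the approximate M2 tidal period.

```matlab
% Harmonic analysis of one constituent by linear least squares.
% The signal is synthetic; C0, phi0 and the noise level are assumptions.
rng(0);
T     = 12.42;                       % approximate M2 period, hours
omega = 2*pi / T;                    % angular frequency, rad/hour
t     = (0:0.5:48)';                 % sample times, hours
C0    = 50;
phi0  = 0.7;
y     = C0 * sin(omega*t + phi0) + 2*randn(size(t));

X = [sin(omega*t), cos(omega*t)];    % regressors multiplying A and B
b = X \ y;                           % b = [A; B]

C   = hypot(b(1), b(2));             % amplitude, sqrt(A^2 + B^2)
phi = atan2(b(2), b(1));             % phase, from A = C*cos(phi), B = C*sin(phi)
```

Further constituents can be handled by appending a sine and cosine column for each additional frequency.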

Note: Tidal Harmonics can cause tidal cycle to appear asymmetric.

[Figure: time series of current speed (cm s-1) against time (hours, 0 to 24), and power spectral density, PSD (cm s-1)^2, against frequency (cycles day-1).]

www.soes.soton.ac.uk/teaching/courses/oa311/tides_3.ppt

Example: Harmonic analysis (cont’d)

Southampton Surface Currents:

Harmonic analysis for M2, M4=2xM2, M6=3xM2 ...

Generalized linear models (GLM) are a flexible generalization of ordinary least squares regression. They relate the random distribution of the measured variable of the experiment (the distribution function) to the systematic (non-random) portion of the experiment (the linear predictor) through a function called the link function.
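As an illustration of the three ingredients (distribution, linear predictor, link function), the sketch below fits a Poisson GLM with a log link using the toolbox function glmfit. The data are synthetic and the "true" coefficients 0.5 and 1.2 are made up for the example.

```matlab
% Poisson regression with a log link, fitted with glmfit.
% Data are synthetic; the true coefficients are illustrative assumptions.
rng(2);
x  = linspace(0, 2, 200)';
mu = exp(0.5 + 1.2*x);               % mean = inverse link of the linear predictor
y  = poissrnd(mu);                   % counts drawn from the Poisson distribution

b    = glmfit(x, y, 'poisson');      % log link is the default for 'poisson'
                                     % b(1) is the intercept, b(2) the slope
yhat = glmval(b, x, 'log');          % fitted means on the original scale
```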

Generalized additive models (GAMs) are another extension to GLMs in which the linear predictor η is not restricted to be linear in the covariates X but is an additive function of the x_i:

$$ \eta = \beta_0 + f_1(x_1) + f_2(x_2) + \cdots $$

The smooth functions fi are estimated from the data. In general this requires a large number of data points and is computationally intensive.