128 Cards in this Set

an estimate from a sample trying to describe the true regression line from the population

observational study

a statistical study in which the subjects are not modified (just observed) so that researchers can measure and record certain characteristics

experiment (experimental study)

A statistical study in which a "treatment" is applied to the subjects (i.e. they are modified) and researchers measure the effect of the treatment

lurking variable (confounding variable)

-other variables that may influence the response that are not studied

explanatory variable

variable that explains or causes the differences in another variable (the "x" or independent variable)

response variable

variable which is thought to depend on the value of the explanatory variable (the "y" or dependent variable)

study question

the question about the population that the study is attempting to answer

population

the complete set of all individuals/objects the study is attempting to answer a question about, the whole group of individuals we are interested in

study subjects

the individuals actually measured in the study (i.e. the selected sample of individuals/objects from the population)

treatment

what the researcher does/gives to some or all of the study subjects; the factor whose effect is under study; also called the explanatory variable

response variable

the quantity or characteristic that is measured to determine the treatment effect

control group

group of subjects that has the same sources of variability as the group receiving the treatment but does NOT receive the treatment; sometimes called the placebo group

confounding factor

any factor other than the experimental treatment that can affect the response variable in the experiment

completely randomized design

a design in which the treatments in the experiment are randomly assigned to the experimental units without using matched pairs or blocks

researchers

people who make measurements

single blinding

subject doesn't know if he/she is in the treatment or control group

double blinding

neither RESEARCHERS nor SUBJECTS know where the participants are assigned between the control and treatment group

matched pair design

makes two measures on each subject

blocking design

-extension of completely randomized design
- put similar subjects into blocks, expect the blocks to differ with respect to the response variable
-then do a completely randomized experiment within each block
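
A minimal sketch of the steps above, assuming two hypothetical blocks ("younger" and "older") and two treatments. The function names, block names, and subject labels are illustrative only. Subjects are shuffled within each block and then dealt out to the treatments, which is a completely randomized design run once per block.

```python
import random

def completely_randomized(units, treatments, rng):
    """Randomly assign treatments to units in (nearly) equal group sizes."""
    shuffled = list(units)
    rng.shuffle(shuffled)
    # Deal the shuffled units out to the treatments in round-robin order.
    return {unit: treatments[i % len(treatments)] for i, unit in enumerate(shuffled)}

def randomized_block(blocks, treatments, seed=None):
    """Run a completely randomized assignment separately inside each block."""
    rng = random.Random(seed)
    assignment = {}
    for members in blocks.values():
        assignment.update(completely_randomized(members, treatments, rng))
    return assignment

# Hypothetical blocks of similar subjects.
blocks = {
    "younger": ["A", "B", "C", "D"],
    "older": ["E", "F", "G", "H"],
}
assignment = randomized_block(blocks, ["treatment", "control"], seed=1)
```

Because the shuffle happens inside each block, every block ends up with the same mix of treatments.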

block

a group of subjects that are similar in some way

"blocks" refers to ...

individuals

"experimental units" refers to...

repeated time periods in which the blocks receive the varying treatments

scatter plot

used to compare variables
-must measure two variables on a common individual (an individual can be a person, place, or even time)
-then plot the two variables

positive association

this type of association occurs when the value of one variable tends to increase as the value of the other variable increases

negative association

this type of association occurs when the value of one variable tends to decrease as the value of the other variable tends to increase

non-linear association

this type of association occurs when the relationship between two variables follows a curve rather than a straight line

correlation

a number that indicates the strength and the direction of a straight-line relationship between two quantitative variables

strength of correlation

determined by the absolute value of the correlation, indicates the overall closeness of the points to a straight line

direction of the correlation

determined by the sign of the correlation

magnitude of r

absolute value of r, indicates the strength of the relationship

r = 1 or r = -1

indicates that there is a perfect linear relationship and all data points fall on a straight line

squared correlation, r²

this is the proportion of variation in the response variable that is explained by the explanatory variable. It is always between 0 and 1.

Referring to a correlation

r

correlation coefficient, used to measure linear relationship between x and y
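
A small sketch of how r and r² can be computed by hand, using the standard Pearson formula. The data (`hours`, `scores`) and the function name are made up for illustration.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

# Hypothetical data: hours studied (x) and exam score (y).
hours = [1, 2, 3, 4, 5]
scores = [52, 60, 61, 70, 77]

r = pearson_r(hours, scores)   # sign gives direction, absolute value gives strength
r_squared = r ** 2             # proportion of variation in y explained by x
```

Here r is positive and close to 1, so the association is positive and strong.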

the line of best fit

-this estimates the average value of y when you know x; an individual's value will vary around the predicted value
- can be used to give a prediction of a value of y, given a specific value of x
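
A sketch of fitting the line by least squares, which is the usual meaning of "best fit" (an assumption, since the card does not name the method). The data and function name are hypothetical.

```python
def least_squares_line(xs, ys):
    """Return (slope, intercept) of the least-squares regression line."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept

# Hypothetical data: hours studied (x) and exam score (y).
hours = [1, 2, 3, 4, 5]
scores = [52, 60, 61, 70, 77]

slope, intercept = least_squares_line(hours, scores)
predicted_score = intercept + slope * 6   # predicted average y when x = 6
```

The prediction is an estimate of the average score at x = 6; an individual's score would vary around it.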

randomization test

a test on two groups when paired data is NOT available
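
One common way to run such a test is to shuffle the group labels many times and see how often the shuffled difference in means is at least as large as the observed one. The sketch below assumes a two-sided comparison of means; the data, the function names, and the choice of 10,000 shuffles are illustrative.

```python
import random

def mean(values):
    return sum(values) / len(values)

def randomization_test(group_a, group_b, n_shuffles=10_000, seed=0):
    """Approximate two-sided p-value for a difference in group means."""
    rng = random.Random(seed)
    observed = abs(mean(group_a) - mean(group_b))
    pooled = list(group_a) + list(group_b)
    n_a = len(group_a)
    at_least_as_extreme = 0
    for _ in range(n_shuffles):
        rng.shuffle(pooled)  # re-deal the values to the two groups at random
        diff = abs(mean(pooled[:n_a]) - mean(pooled[n_a:]))
        if diff >= observed - 1e-12:  # small tolerance for float rounding
            at_least_as_extreme += 1
    return at_least_as_extreme / n_shuffles

# Hypothetical measurements from two independent groups.
treatment = [23, 25, 28, 30, 31]
control = [18, 20, 21, 22, 24]
p_value = randomization_test(treatment, control)
```

A small `p_value` suggests the observed difference would be unusual if group membership made no difference.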

sampling frame

the list of individuals from which the sample is actually selected (ideally a list of all individuals in the population)

in hypothesis testing, population parameter =

null value

null hypothesis

-the statement being tested
-a statement that describes some aspect of the statistical behavior of a set of data
-this statement is treated as valid unless the actual behavior of the data contradicts this assumption

null value

-the specific # the parameter equals if the null hypothesis is true
- value of population parameter being tested in the null hypothesis

alternative hypothesis

- a statement that something is happening
- researchers typically hope to find evidence supporting this
- it may be a statement that the assumed status quo is false, or that there is a relationship, or there is a difference

two types of alternative hypothesis

one sided test, two sided test

one-sided test

when Ha specifies a single direction

two-sided test

when Ha includes values in both directions

p-value

the probability of obtaining a test statistic at least as extreme as the one that was actually observed, assuming Ho is true

level of significance

(α) is the borderline (cutoff) for deciding that the p-value is low enough to justify rejecting the null hypothesis in favor of the alternative hypothesis

hypothesis testing about paired differences

matched pairs design

matched pairs design

taking two measures on the same subject to see if there is a difference between the two measurements

paired t-test

a one-sample t-test used on the sample of differences to examine whether the sample mean difference is significantly different from 0
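
A sketch of the test statistic for a paired t-test: take the differences, then compute t = (mean difference) / (standard error of the differences). The before/after data and the function name are hypothetical. The p-value would then be looked up from a t distribution with n − 1 degrees of freedom, which is not computed here.

```python
import math
import statistics

def paired_t_statistic(before, after):
    """Return (t, degrees of freedom) for H0: mean difference = 0."""
    diffs = [a - b for b, a in zip(before, after)]
    n = len(diffs)
    mean_diff = statistics.mean(diffs)
    standard_error = statistics.stdev(diffs) / math.sqrt(n)
    return mean_diff / standard_error, n - 1

# Hypothetical paired measurements on the same five subjects.
before = [10, 12, 9, 11, 13]
after = [12, 15, 10, 14, 15]

t_stat, df = paired_t_statistic(before, after)
```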

sampling distribution

-describes the possible values the statistic might have when random samples are taken from a population

the distribution of statistics ("xbar" or "p hat") for all possible samples from the same population of a given sample size (n)

statistical inference

gives us methods for drawing conclusions about a population based on data from samples

confidence interval

an interval of values computed from sample data that is likely to include the true population value (the parameter)

standard error

is the estimated standard deviation of the sampling distribution of the statistic

confidence level

proportion of samples for which the confidence interval will capture the true parameters, % of time we expect the procedure to work, determines how frequently the observed interval contains the parameter

standard error of sample mean

s / √n, where (s) is the sample standard deviation and (n) is the sample size
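
The standard error of the sample mean is commonly computed as s / √n. The sketch below uses that formula and then builds a rough confidence interval as mean ± multiplier × SE. Using a normal (z) multiplier instead of a t multiplier is an assumption made to stay within the standard library; it is only reasonable for larger samples. The data and function names are illustrative.

```python
import math
import statistics

def standard_error_of_mean(sample):
    """s / sqrt(n), with s the sample standard deviation."""
    return statistics.stdev(sample) / math.sqrt(len(sample))

def approximate_confidence_interval(sample, confidence=0.95):
    """Mean +/- z * SE, using a normal multiplier (an approximation)."""
    z = statistics.NormalDist().inv_cdf(0.5 + confidence / 2)
    center = statistics.mean(sample)
    margin = z * standard_error_of_mean(sample)
    return center - margin, center + margin

# Hypothetical sample.
sample = [2, 4, 4, 4, 5, 5, 7, 9]
se = standard_error_of_mean(sample)
low, high = approximate_confidence_interval(sample)
```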

statistic

a number that summarizes some characteristic of the sample data, computed from the sample values, a known value that varies from sample to sample

is the distribution of possible values of the statistic for repeated samples of the same size taken from the same population

sampling distribution

mean of a sampling distribution

the average of all possible values of the statistic for repeated samples of the same size from a population

the standard deviation(SD) of a sampling distribution

measures the average distance of the possible values of the statistic from the mean of the sampling distribution, roughly speaking

there is a difference between N and n!
N=
n=

n= sample size (number of values in one sample/subgroup)
N= number of samples (number of subgroups)
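
A simulation sketch of the distinction: draw N samples, each of size n, and record the mean of each one. The collection of N means approximates the sampling distribution of the sample mean. The population (the integers 1 to 100), the values of n and N, and the function name are all assumptions chosen for illustration.

```python
import random
import statistics

def simulate_sample_means(population, n, N, seed=0):
    """Draw N random samples of size n (without replacement) and return their means."""
    rng = random.Random(seed)
    return [statistics.mean(rng.sample(population, n)) for _ in range(N)]

# Hypothetical population with mean 50.5.
population = list(range(1, 101))
sample_means = simulate_sample_means(population, n=25, N=2000)

center_of_means = statistics.mean(sample_means)   # close to the population mean
spread_of_means = statistics.stdev(sample_means)  # much smaller than the population SD
```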

Law of Large Numbers (LLN)

as you average more observations, sample mean settles down at population mean
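
A quick simulation sketch of this idea, assuming a fair six-sided die (population mean 3.5). The function name and the number of rolls are illustrative.

```python
import random

def running_means(n_rolls, seed=0):
    """Return the running average after each roll of a fair die."""
    rng = random.Random(seed)
    total = 0
    means = []
    for i in range(1, n_rolls + 1):
        total += rng.randint(1, 6)
        means.append(total / i)
    return means

die_means = running_means(100_000)
# Early entries of die_means bounce around; later entries settle near 3.5.
```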

graphs used for categorical variables

1. pie chart
2. bar graph

graphic representations for quantitative variables

1. histogram
2. stem-and-leaf plot
3. box plot

standard deviation

a value that measures the variability (spread) of data.

density curve

the outline of the histogram which approximates the overall pattern of a distribution

1. It is always on or above the horizontal axis
2. It has area of exactly 1 underneath it

standard normal distribution

-this is a normal distribution with a mean of 0 and a standard deviation of 1
-all other normal distributions are compared to this

z-score

(a standardized value) that is the distance between a specified value and the mean, measured in number of standard deviations
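
In symbols, z = (value − mean) / standard deviation. A short sketch with hypothetical numbers (mean 100, standard deviation 15); the last line additionally assumes the data follow a normal distribution.

```python
from statistics import NormalDist

def z_score(value, mean, sd):
    """Number of standard deviations that value lies from the mean."""
    return (value - mean) / sd

z = z_score(130, mean=100, sd=15)   # 2.0: two standard deviations above the mean

# Under a normal model, the proportion of values below this z-score:
proportion_below = NormalDist().cdf(z)
```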

observation (individual)

an individual or the value of a single measurement

variable

a characteristic that can differ from one individual to the next

categorical variables

the observational units are being divided into categories, there is no special ordering of the categories

ordinal variables

the observational units are being divided into categories which have an order

basically a categorical variable with ordered categories

quantitative variables

-variables that take numerical values
- you should be able to do mathematical operations with these numbers such as adding, multiplying, etc.
(A social security number would not be one of these)

graphs for quantitative variables

1. Histogram
2. Stem-and-Leaf Plot
3. Dot Plot

Pie Chart

each slice of a pie corresponds to a category and the size of the angle of the slice shows the percentage of the individuals in the corresponding category

Bar Graph

-each category is presented as a bar
- the height of the bar represents the number (or percentage) of individuals in the corresponding category

range

highest value minus the lowest value

histogram

a bar graph for a quantitative variable in which the range of possible values is broken into intervals

frequency

actual number of individuals who fall into each interval (of a histogram)

relative frequency

proportion or percentage that are in an interval (of a histogram)
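
A sketch of how the frequency and relative frequency of each histogram interval can be tallied. The scores, the interval width of 10, and the function name are made up for illustration.

```python
def histogram_counts(values, start, width, n_bins):
    """Count how many values fall in each interval [start + i*width, start + (i+1)*width)."""
    counts = [0] * n_bins
    for v in values:
        index = int((v - start) // width)
        if 0 <= index < n_bins:
            counts[index] += 1
    return counts

# Hypothetical exam scores, grouped into 60-69, 70-79, 80-89, 90-99.
exam_scores = [61, 65, 68, 70, 72, 75, 77, 81, 84, 95]
frequencies = histogram_counts(exam_scores, start=60, width=10, n_bins=4)   # [3, 4, 2, 1]
relative_frequencies = [count / len(exam_scores) for count in frequencies]  # [0.3, 0.4, 0.2, 0.1]
```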

stem and leaf plot

every individual data value is shown

dot plot

display a dot for each observation along a number line

distribution

the overall pattern of how often the possible values occur

shape of a distribution

shows how values are distributed in a distribution

center

location, average, mean and median measure this

outlier

unusual values that do not fit with the rest of the pattern
(may be due to data entry errors or may be actual unusual values)

symmetric distribution

one half of the distribution is the mirror image of the other (e.g. a bell shape)

bimodal distributions

has two peaks which can be caused by two or more groups of values in the sample

multimodal distribution

distribution with several peaks

median

the middle number of the data when it is ordered, 50% of the data is above it and 50% of the data is below it

two measures of the center

mean and median

symmetric distribution
(mean ? median)

mean = median

right skewed distribution
(mean ? median)

mean>median
mean is greater than median

left skewed distribution
(mean ? median)

mean<median
mean is less than median
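
A small check of these relationships on made-up data sets: a long right tail pulls the mean above the median, and a long left tail pulls it below.

```python
import statistics

# Hypothetical data sets.
right_skewed = [1, 2, 2, 3, 3, 3, 4, 20]         # one large value in the right tail
left_skewed = [1, 17, 18, 18, 18, 19, 19, 20]    # one small value in the left tail

right_mean, right_median = statistics.mean(right_skewed), statistics.median(right_skewed)
left_mean, left_median = statistics.mean(left_skewed), statistics.median(left_skewed)
# right_mean (4.75) > right_median (3); left_mean (16.25) < left_median (18)
```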

First Quartile (Q1)

25% of the data is at or below this number

Third Quartile (Q3)

75% of the data is at or below this number

Inter-Quartile Range (IQR)

A value describing the spread over approximately the middle 50% of the data (IQR = Q3 - Q1)

the five number summary includes

1) minimum
2) Q1
3) median
4) Q3
5) maximum

boxplot

a graphical representation of the 5 number summary

1.5*IQR Rule

an outlier is any value that lies more than one and a half times the length of the box (the IQR) below Q1 or above Q3
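
A sketch of the five number summary and the 1.5*IQR check. It assumes one common textbook convention for quartiles (Q1 and Q3 are the medians of the lower and upper halves of the ordered data, leaving out the middle value when n is odd); other conventions give slightly different quartiles. The data and function names are hypothetical.

```python
import statistics

def five_number_summary(values):
    """Return (minimum, Q1, median, Q3, maximum) using the median-of-halves rule."""
    data = sorted(values)
    n = len(data)
    half = n // 2
    lower = data[:half]
    upper = data[half:] if n % 2 == 0 else data[half + 1:]
    return (data[0], statistics.median(lower), statistics.median(data),
            statistics.median(upper), data[-1])

def outliers_by_iqr_rule(values):
    """Values more than 1.5 * IQR below Q1 or above Q3."""
    _, q1, _, q3, _ = five_number_summary(values)
    iqr = q3 - q1
    low_fence, high_fence = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    return [v for v in values if v < low_fence or v > high_fence]

data = [2, 4, 5, 6, 7, 8, 9, 10, 30]
summary = five_number_summary(data)    # (2, 4.5, 7, 9.5, 30)
outliers = outliers_by_iqr_rule(data)  # [30], since the upper fence is 9.5 + 1.5*5 = 17
```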

variance

measures the average squared distance of the values from the mean; it is the square of the standard deviation
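
A sketch of the usual sample variance formula: the sum of squared distances from the mean, divided by n − 1. The standard deviation is its square root. The data and function name are illustrative.

```python
import math

def sample_variance(values):
    """Sum of squared deviations from the mean, divided by n - 1."""
    n = len(values)
    mean = sum(values) / n
    return sum((v - mean) ** 2 for v in values) / (n - 1)

# Hypothetical data with mean 5.
measurements = [2, 4, 4, 4, 5, 5, 7, 9]
variance = sample_variance(measurements)   # 32 / 7
standard_deviation = math.sqrt(variance)
```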

strata

sub groups of population which might have different responses to the question of interest

stratified sample

is a collection of samples taken in each stratum of the population

cluster samples

sampling technique used when natural groups are evident in a statistical population

systematic samples

select every k-th individual from the sampling frame
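
A minimal sketch, assuming the sampling frame is an ordered list and the starting point is chosen at random among the first k positions (a common convention that the card does not state). The frame and function name are hypothetical.

```python
import random

def systematic_sample(frame, k, seed=None):
    """Pick a random start among the first k entries, then take every k-th one."""
    rng = random.Random(seed)
    start = rng.randrange(k)
    return frame[start::k]

# Hypothetical sampling frame of 100 ID numbers.
frame = list(range(1, 101))
chosen = systematic_sample(frame, k=10, seed=3)
```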

under coverage

sampling frame does not include all of the population

over coverage

sampling frame includes individuals who are not in the population being examined

data entry errors

person recording the data makes mistakes

question wording error

the set up of the question can have a big influence on the answers

definition of statistics

a collection of procedures and principles for gathering data and analyzing information to help people make decisions when faced with uncertainty

individuals

the objects described by the data set
(each student in the class is an observational unit or individual)

variables

characteristics of the individuals
(max speed, sex of the students, height, time of sleep)

sample

subgroup of the population examined to measure the variables and gather information

parameter

a number that describes a characteristic of the population. It is mostly a summary of a population. Its value is usually unknown.

statistic

summary of a sample, the value of this is usually known

census

taken to measure ALL individuals in the population

selection bias

this method of selection of participants favors a particular outcome

non response bias

some of the individuals in the sample cannot be reached or do not respond; this creates a bias because respondents may differ in meaningful ways from non-respondents.

response bias

participants give incorrect information

response rate

the proportion of the sample that responded to the question

non-response rate

the proportion of the sample that didn't respond to the question

convenience samples

investigators choose individuals that are easy to reach

volunteer response samples

individuals decide whether to answer the questions or not

simple random sample

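a sample of size n chosen so that every possible group of n individuals from the population has the same chance of being selected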
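
A simple random sample is usually taken to mean that every possible sample of the chosen size is equally likely. A minimal sketch using Python's `random.sample`, which draws without replacement; the sampling frame and sample size are hypothetical.

```python
import random

def simple_random_sample(frame, n, seed=None):
    """Draw n distinct individuals so that every subset of size n is equally likely."""
    rng = random.Random(seed)
    return rng.sample(frame, n)

# Hypothetical sampling frame of 50 students.
students = [f"student_{i}" for i in range(1, 51)]
srs = simple_random_sample(students, n=5, seed=7)
```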

statistical significance

a result is unlikely to have occurred just by chance

practical significance

the difference from the claimed value we observe is actually meaningful

numbers in "stem" column of stem and leaf plot

the leading digit(s) of each number in the data set (everything except the last digit)

numbers in "leaf" column of stem and leaf plot

contains only the last digit of the number, regardless of whether it falls before or after the decimal point
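
A sketch of building a stem and leaf plot, assuming non-negative whole-number data so that the stem is the value with its last digit removed and the leaf is the last digit. The data and function name are made up.

```python
from collections import defaultdict

def stem_and_leaf(values):
    """Map each stem (all digits but the last) to its sorted list of leaves (last digits)."""
    stems = defaultdict(list)
    for v in sorted(values):
        stems[v // 10].append(v % 10)
    return dict(sorted(stems.items()))

# Hypothetical data set.
plot = stem_and_leaf([23, 25, 31, 38, 38, 42, 47, 50])
for stem, leaves in plot.items():
    print(f"{stem} | {' '.join(str(leaf) for leaf in leaves)}")
```

Every individual data value can be read back from the plot by joining a stem with one of its leaves.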