Probability and normal curve

Social Statistics: Are your curves
normal?
This week
Why is understanding probability important?
What is the normal curve?
How to compute and interpret z scores

What is probability?
The chance of winning a lottery
The chance of getting a head on one flip of a coin
The degree of confidence with which we can state a finding

Normal distribution

Percentages Under the Normal Curve
Almost 100% (about 99.7%) of the scores fall between -3 SD and +3 SD
Around 34% of the scores fall between the mean and +1 SD

Are all distributions normal?

Normal distribution
The distance between   contains              Range (if mean=100, SD=10)
Mean and 1SD           34.13% of all cases   100-110
1SD and 2SD            13.59% of all cases   110-120
2SD and 3SD            2.15% of all cases    120-130
>3SD                   0.13% of all cases    >130
Mean and -1SD          34.13% of all cases   90-100
-1SD and -2SD          13.59% of all cases   80-90
-2SD and -3SD          2.15% of all cases    70-80
< -3SD                 0.13% of all cases    <70
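The percentages under the normal curve can be reproduced from the standard normal cumulative distribution. The following is a minimal sketch, assuming Python 3.8+ and its standard-library `statistics.NormalDist`; the helper name `area_between` is illustrative and not part of the slides. The band from 2 SD to 3 SD computes to about 2.14%; the 2.15% commonly printed in tables appears to be a rounding convention that makes each half of the curve sum to 50%.

```python
from statistics import NormalDist

# Standard normal distribution: mean 0, standard deviation 1.
std_normal = NormalDist(mu=0, sigma=1)

def area_between(z_low, z_high):
    """Percentage of cases between two z scores (illustrative helper)."""
    return 100 * (std_normal.cdf(z_high) - std_normal.cdf(z_low))

# Bands above the mean, with the matching raw-score range for mean=100, SD=10.
mean, sd = 100, 10
for z_low, z_high in [(0, 1), (1, 2), (2, 3)]:
    low, high = mean + z_low * sd, mean + z_high * sd
    print(f"{z_low} SD to {z_high} SD ({low}-{high}): {area_between(z_low, z_high):.2f}%")

# Area beyond +3 SD (raw scores above 130).
beyond_3sd = 100 * (1 - std_normal.cdf(3))
print(f"beyond 3 SD: {beyond_3sd:.2f}%")
```

Because the curve is symmetric, the same percentages apply to the bands below the mean.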
Z score – standard score
If you want to compare individuals in different
distributions
 Z scores are comparable because they are
standardized in units of standard deviations.

Z score

Standard score:

$$z = \frac{X - \bar{X}}{s}$$

X: the individual score
$\bar{X}$: the mean
s: the standard deviation
Sample or
population?
Standard Normal Distribution

Mean=0, standard deviation=1

Z score
What are the mean and SD of the z distribution?
Mean=25, SD=2: what are the z scores for 23, 27, and 30?
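
The z score calculation can be sketched in a few lines of Python. The function name `z_score` is an illustrative choice; it simply applies (score - mean) / SD to the three raw scores from the question.

```python
def z_score(x, mean, sd):
    """Number of standard deviations that x lies from the mean."""
    return (x - mean) / sd

# Raw scores from the question, with mean=25 and SD=2.
z_values = {x: z_score(x, mean=25, sd=2) for x in (23, 27, 30)}
for x, z in z_values.items():
    print(f"raw score {x}: z = {z:+.2f}")
```

A negative result means the raw score is below the mean; a positive result means it is above.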
Z score
Z scores across different distributions are comparable
Z scores represent distances from the mean in the same unit of measurement (standard deviations)

Raw score 12.8 (mean=12, SD=2) → z=+0.4
Raw score 64 (mean=58, SD=15) → z=+0.4
Equal distances from the mean
Comparing apples and oranges:
Eric competes in two track events: standing long jump and javelin. His long jump was 49 inches, and his javelin throw was 92 ft. He then measures all the other competitors in both events and calculates the mean and standard deviation:
Javelin: M = 86 ft, s = 10 ft
Long jump: M = 44 in, s = 4 in
In which event did Eric do best?

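Feet and inches cannot be compared directly, but z scores can. A minimal sketch of the comparison, using the means and standard deviations given for the two events (the function and variable names are illustrative):

```python
def z_score(x, mean, sd):
    """Number of standard deviations that x lies from the mean."""
    return (x - mean) / sd

javelin_z = z_score(92, mean=86, sd=10)    # measured in feet
long_jump_z = z_score(49, mean=44, sd=4)   # measured in inches

# The event with the larger z score is the stronger relative performance.
best = "long jump" if long_jump_z > javelin_z else "javelin"
print(f"javelin z = {javelin_z:.2f}, long jump z = {long_jump_z:.2f}, best: {best}")
```

Each result is in standard-deviation units, so the two events end up on the same scale.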
Excel for z score
STANDARDIZE(x, mean, standard_dev)
Equivalent formula: (x-AVERAGE(array))/STDEV(array)

What do z scores represent?
Raw scores below the mean have negative z scores
Raw scores above the mean have positive z scores
A z score is the number of standard deviations a raw score lies from the mean
The more extreme the z score, the further the raw score is from the mean

What do z scores represent?
84% of all the scores fall below a z score of +1 (why?)
16% of all the scores fall above a z score of +1 (why?)
These percentages represent the probability of a certain score occurring, or of an event happening
If the probability is less than 5%, the event is considered unlikely to happen
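
The 84% figure is the 50% of cases below the mean plus the 34.13% between the mean and +1 SD; the remaining 16% lies above. A short sketch that checks this with Python's standard-library `statistics.NormalDist` (Python 3.8+ assumed; variable names are illustrative):

```python
from statistics import NormalDist

std_normal = NormalDist()  # mean 0, SD 1

below_plus_one = std_normal.cdf(1)     # area to the left of z = +1
above_plus_one = 1 - below_plus_one    # area to the right of z = +1

print(f"below z=+1: {below_plus_one:.2%}")
print(f"above z=+1: {above_plus_one:.2%}")

# Apply the 5% rule of thumb for an "unlikely" event.
print("unlikely (< 5%)?", above_plus_one < 0.05)
```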

Exercise

In a normal distribution with a mean of 100
and a standard deviation of 10, what is the
probability that any one score will be 110 or
above?
What about 6σ
http://en.wikipedia.org/wiki/Six_Sigma
NORM.DIST()

NORM.DIST(x, mean, standard_dev, cumulative)

x: The value for which you want the distribution. With mean=0 and standard_dev=1, x is a z score.
mean: The arithmetic mean of the distribution.
standard_dev: The standard deviation of the distribution.
cumulative: A logical value that determines the form of the function. If cumulative is TRUE, NORM.DIST returns the cumulative distribution function (the probability of a value less than or equal to x); if FALSE, it returns the probability density function (the height of the curve at x, which is not itself a probability).
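
For readers working outside a spreadsheet, the same two quantities can be sketched in Python. The wrapper `norm_dist` below is a hypothetical stand-in written for illustration; it is not a spreadsheet or library function, and it relies on `statistics.NormalDist` from the Python 3.8+ standard library.

```python
from statistics import NormalDist

def norm_dist(x, mean, standard_dev, cumulative):
    """Hypothetical stand-in for a NORM.DIST-style call."""
    dist = NormalDist(mu=mean, sigma=standard_dev)
    # cumulative=True: area to the left of x; cumulative=False: curve height at x.
    return dist.cdf(x) if cumulative else dist.pdf(x)

area_below_110 = norm_dist(110, 100, 10, True)
height_at_110 = norm_dist(110, 100, 10, False)
print(f"P(score <= 110) = {area_below_110:.4f}")
print(f"density at 110  = {height_at_110:.4f}")
```

Only the cumulative form answers questions such as "what share of scores fall below 110"; the density is useful mainly for drawing the curve.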
Exercise

The probability associated with z=1.38
41.62% of all the cases in the distribution fall between the mean and 1.38 standard deviations above it
About 92% fall below 1.38 standard deviations above the mean
How and why?

Between two z scores

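What is the probability of falling between z scores of 1.5 and 2.5?
Z=1.5: 43.32% of cases lie between the mean and z
Z=2.5: 49.38% of cases lie between the mean and z
So around 6% (49.38% - 43.32% = 6.06%) of all the cases in the distribution fall between 1.5 and 2.5 standard deviations above the mean.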
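
The subtraction of the two areas can also be done with cumulative probabilities. A minimal sketch, assuming Python 3.8+ (`prob_between` is an illustrative name):

```python
from statistics import NormalDist

std_normal = NormalDist()  # mean 0, SD 1

def prob_between(z_low, z_high):
    """Probability of falling between two z scores."""
    return std_normal.cdf(z_high) - std_normal.cdf(z_low)

p = prob_between(1.5, 2.5)
print(f"P(1.5 < z < 2.5) = {p:.4f}")
```

The cumulative areas include the lower 50% of the curve on both sides of the subtraction, so it cancels and the result matches the table-based calculation.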

Exercise

What percentage of the data falls between 110 and 125 in a distribution with mean=100 and SD=10?

Exercise

What is the probability of a particular score occurring between a z score of +1 and a z score of +2.5?

Exercise

Compute the z scores where mean=50 and the standard deviation=5
55
50
60
57.5
46

Exercise

The math section of the SAT has a μ = 500
and σ = 100. If you selected a person at
random:
a) What is the probability he would have a score
greater than 650?
 b) What is the probability he would have a score
between 400 and 500?
 c) What is the probability he would have a score
between 630 and 700?
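
One way to check answers to this kind of exercise is to work directly with raw scores rather than converting to z scores first. A sketch under the stated μ = 500 and σ = 100, assuming Python 3.8+ (variable names are illustrative):

```python
from statistics import NormalDist

sat = NormalDist(mu=500, sigma=100)

p_above_650 = 1 - sat.cdf(650)               # a) upper tail beyond 650
p_400_to_500 = sat.cdf(500) - sat.cdf(400)   # b) between 400 and the mean
p_630_to_700 = sat.cdf(700) - sat.cdf(630)   # c) between two scores above the mean

for label, p in [("a", p_above_650), ("b", p_400_to_500), ("c", p_630_to_700)]:
    print(f"{label}) {p:.4f}")
```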

Determine sample size
$$\text{Sample size} = \frac{\text{Number of responses needed}}{\text{Expected response rate}}$$
Expected response rate: obtain based on
historical data
 Number of responses needed: use formula to
calculate

Number of responses needed
$$n = \frac{Z^2 \sigma_x^2}{e^2}$$

n = number of responses needed (sample size)
Z = the number of standard deviations that describe the precision of the results
e = accuracy or the error of the results
$\sigma_x^2$ = variance of the data
This formula applies for a large population size

Deciding $\sigma_x^2$
From previous surveys
Intentionally use a large number
Conservative estimation

e.g. a 10-point scale; assume that responses will be found across the entire 10-point scale
$3\sigma$ to the left and right of the mean describe virtually the entire area of the normal distribution curve
$\sigma = 10/6 = 1.67$; $\sigma^2 = 2.78$ (forcing the 10 points to lie within $\mu - 3\sigma$ and $\mu + 3\sigma$)

Example
$$n = \frac{Z^2 \sigma_x^2}{e^2}$$

Z=1.96 (usually rounded to 2)
$\sigma_x^2$=2.78
e=0.2
$n = (2^2 \times 2.78) / 0.2^2 = 278$ (responses needed)
Assume the response rate is 0.4
Sample size=278/0.4=695

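The two-step sample size calculation can be sketched as a pair of small functions. The function names are illustrative; the numbers are those of the worked example, with Z rounded to 2.

```python
def responses_needed(z, variance, error):
    """n = Z^2 * variance / e^2, for a large population."""
    return z ** 2 * variance / error ** 2

def sample_size(n_responses, response_rate):
    """Number of people to contact so that enough of them respond."""
    return n_responses / response_rate

n = responses_needed(z=2, variance=2.78, error=0.2)
total = sample_size(n, response_rate=0.4)
print(f"responses needed: {round(n)}, sample size: {round(total)}")
```

Using Z=1.96 instead of the rounded value 2 gives a slightly smaller n, which is why rounding up to 2 is the more conservative choice.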
Exercise
$$n = \frac{Z^2 \sigma_x^2}{e^2}$$

Z=1.96 (usually rounded to 2)
5-point scale (suppose most of the responses are distributed from 1-5)
Error tolerance=0.4
Assume the response rate is 0.6
What is the sample size?

Sampling





How to collect data so that conclusions based on our observations can be
generalized to a larger group of observations.
Population: A group that includes all the cases (individuals, objects, or
groups) in which the researcher is interested.
Sample: A subset of cases selected from a population
Parameter: A measure (e.g., mean or standard deviation) used to describe
the population distribution.
Statistic: A measure (e.g., mean or standard deviation) used to describe the
sample distribution
Probability sampling



A method of sampling that enables the researchers to specify
for each case in the population the probability of its inclusion
in the sample.
The purpose of probability sampling is to select a sample that
is as representative as possible of the population.
It enables the researcher to estimate the extent to which the
findings based on one sample are likely to differ from what
would be found by studying the entire population.
Simple Random Sample


A sample designed in such a way as to ensure that 1) every member of the population has an equal chance of being chosen, and 2) every combination of N members has an equal chance of being chosen.
Example: Suppose we are conducting a cost-containment study of 10 hospitals in our region, and we want to draw a sample of two hospitals to study intensively.
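
A minimal sketch of drawing such a sample in Python. The hospital labels are hypothetical placeholders, and the seed is fixed only so that the sketch is repeatable. `random.sample` draws without replacement, so every pair of hospitals is equally likely.

```python
import math
import random

# Hypothetical labels standing in for the 10 hospitals in the region.
hospitals = [f"Hospital {i}" for i in range(1, 11)]

rng = random.Random(42)  # fixed seed: an assumption for repeatability only
sample = rng.sample(hospitals, k=2)

# Number of distinct two-hospital samples, each with the same chance.
n_pairs = math.comb(10, 2)
print(sample, f"(one of {n_pairs} equally likely pairs)")
```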
Systematic random sampling



A method of sampling in which every Kth member of the population is chosen for inclusion in the sample.
K is a ratio obtained by dividing the population size by the desired sample size.
Example: we have a population of 15,000 commuting students and our sample is limited to 500, so K=15,000/500=30. We first choose one student at random from the first 30 students, then select every 30th student after that until we reach 500.
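
The procedure can be sketched as follows, assuming the students are numbered 1 to 15,000 in a list (the numbering, the function name, and the fixed seed are assumptions made for illustration):

```python
import random

def systematic_sample(population, sample_size, rng):
    """Pick a random start among the first K cases, then take every Kth case."""
    k = len(population) // sample_size   # sampling interval K
    start = rng.randrange(k)             # index 0..K-1, i.e. one of the first K cases
    return population[start::k][:sample_size]

students = list(range(1, 15001))         # hypothetical student numbers
sample = systematic_sample(students, 500, random.Random(0))
print(len(sample), sample[:3])
```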
Stratified Random Sample
A method of sampling obtained by 1) dividing
the population into subgroups based on one
or more variables central to our analysis, and
2) then drawing a simple random sample from
each of the subgroups.
 Proportionate stratified sample: the size of the
sample selected from each subgroup is
proportional to the size of that subgroup in
the entire population.

Disproportionate stratified sample

The size of the sample selected from each
subgroup is deliberately made disproportional
to the size of that subgroup in the population

A sample (N=180), with 90 whites (50%), 45
blacks (25%) and 45 Latinos (25%).
Sampling distribution
Helps estimate the likelihood of our sample statistics and enables us to generalize from the sample to the population.
But the population is unknown most of the time
The sampling distribution is a theoretical probability distribution (which is never really observed) of all possible sample values for the statistic in which we are interested.

Sampling distribution
If we select 3 of them, how will the sample mean and standard deviation differ from those of the whole group?
Sampling distribution of the mean

A theoretical probability distribution of sample
means that would be obtained by drawing from the
population all possible samples of the same size
[Figure: mean income of 50 samples of size 3 drawn from 20 individuals]
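
A minimal sketch of a sampling distribution of the mean, using made-up incomes for 20 individuals (the values are hypothetical, not data from the slides). Instead of 50 samples, it enumerates every possible sample of size 3, which is the theoretical distribution the definition describes.

```python
from itertools import combinations
from statistics import mean, pstdev

# Hypothetical incomes (in $1,000s) for a population of 20 individuals.
incomes = [12, 15, 18, 21, 22, 25, 27, 30, 31, 34,
           36, 40, 42, 45, 50, 55, 61, 70, 85, 120]

# Mean of every possible sample of size 3 drawn without replacement.
sample_means = [mean(s) for s in combinations(incomes, 3)]

print("number of possible samples:", len(sample_means))
print("population mean:", mean(incomes))
print("mean of sample means:", mean(sample_means))
print("population SD:", pstdev(incomes))
print("SD of sample means:", pstdev(sample_means))
```

The mean of the sample means equals the population mean, while the sample means vary less than the individual incomes do.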
Standard error of the mean


It describes how much variability there is in the value of the mean from sample to sample.
It equals the standard deviation of the population divided by the square root of the sample size:

$$\sigma_{\bar{X}} = \frac{\sigma}{\sqrt{N}}$$

Central Limit Theorem

If all possible random samples of size N are drawn from a population with a mean $\mu$ and a standard deviation $\sigma$, then as N becomes larger, the sampling distribution of sample means becomes approximately normal, with mean equal to the population mean $\mu$ and a standard deviation equal to $\sigma_{\bar{X}} = \sigma / \sqrt{N}$.

As a rule of thumb, when N is larger than 50 (some texts use 30), the sampling distribution of the mean will be approximately normal.
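
The theorem can be illustrated with a small simulation. This sketch assumes a strongly skewed population (exponential, with mean 1 and SD 1), a sample size of N=50, and 5,000 repeated samples; all of these are illustrative choices, and the seed is fixed only for repeatability.

```python
import math
import random
from statistics import mean, stdev

rng = random.Random(1)   # fixed seed: an assumption for repeatability
N = 50                   # size of each sample
n_samples = 5000         # number of repeated samples

# Draw repeated samples from a skewed population and keep each sample mean.
sample_means = [
    sum(rng.expovariate(1.0) for _ in range(N)) / N
    for _ in range(n_samples)
]

observed_se = stdev(sample_means)
theoretical_se = 1.0 / math.sqrt(N)   # sigma / sqrt(N) with sigma = 1

print(f"mean of sample means: {mean(sample_means):.3f} (population mean 1.0)")
print(f"SD of sample means:   {observed_se:.3f} (theory {theoretical_se:.3f})")
```

Even though the population is far from normal, the sample means cluster around the population mean with a spread close to the standard error.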
Estimation





A process whereby we select a random sample from a population and use a sample statistic to estimate a population parameter.
Point estimate: A sample statistic used to estimate the exact value of a population parameter. A point estimate usually carries some sampling error, so it rarely equals the parameter exactly.
Confidence interval (CI): A range of values, defined by the confidence level, within which the population parameter is estimated to fall. Confidence intervals are sometimes reported as an estimate plus or minus a margin of error.
Confidence level: the likelihood, expressed as a percentage or probability, that a specified interval will contain the population parameter.
Margin of error: the radius (half the width) of a confidence interval.
Estimation




Confidence intervals are defined in terms of confidence levels.
At a 95% confidence level, there is a 0.95 probability (95 chances out of 100) that a specified interval will contain the population mean.
The most common confidence levels are 90%, 95%, and 99%.
Margin of error is the radius of a confidence interval. If we select a 95% confidence level, we accept a 5% chance that our interval does not contain the population mean.
Notation
                                     Mean               Standard deviation
Sample distribution                  $\bar{X}$          $s$
Population distribution              $\mu$              $\sigma$
Sampling distribution of $\bar{X}$   $\mu_{\bar{X}}$    $\sigma_{\bar{X}}$

$$CI = \bar{X} \pm Z(\sigma_{\bar{X}})$$

• A total of 68% of all random sample means will fall within ±1 standard error (the standard deviation of the sampling distribution) of the true population mean. (Z=±1)
• A total of 95% of all random sample means will fall within ±1.96 standard errors of the true population mean. (Z=±1.96)
• A total of 99% of all random sample means will fall within ±2.58 standard errors of the true population mean. (Z=±2.58)
Determining the confidence interval

Follow these steps
Calculate the standard error (standard deviation)
of the mean
 Decide on the level of confidence, and find the
corresponding Z value
 Calculate the confidence interval
 Interpret the results

Example


To estimate the average commuting time of all 15,000 commuters on our campus (the population parameter), we survey a random sample of 500 students, and the sample mean ($\bar{X}$) is 7.5 hrs/week.
Step 1: Calculate the standard error (standard deviation) of the mean
Let's suppose the standard deviation for the population is $\sigma = 1.5$

$$\sigma_{\bar{X}} = \frac{\sigma}{\sqrt{N}} = \frac{1.5}{\sqrt{500}} = 0.07$$
Example


Step 2: Decide on the level of confidence, and find the corresponding Z value
Let's take the 95% confidence level, so Z=±1.96
Step 3: Calculate the confidence interval

$$CI = \bar{X} \pm Z(\sigma_{\bar{X}}) = 7.5 \pm 1.96(0.07) = 7.5 \pm 0.14 = 7.36 \text{ to } 7.64$$

Step 4: Interpret the results
We can be 95% confident that the actual mean commuting time (the true population mean) is no less than 7.36 and no greater than 7.64 hrs per week.
There is a 5% risk that we are wrong: if we collected a large number of samples of size N=500 and built an interval from each, about 5 out of every 100 intervals would not include the true population mean.
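
The four steps can be sketched in Python as follows. The function name is illustrative, and Python 3.8+ is assumed for `statistics.NormalDist`. Because the code does not round the standard error to 0.07 before multiplying, its interval (about 7.37 to 7.63) is slightly narrower than the hand-calculated 7.36 to 7.64.

```python
import math
from statistics import NormalDist

def confidence_interval(sample_mean, sd, n, confidence=0.95):
    """Interval for the mean: sample_mean ± Z * (sd / sqrt(n))."""
    standard_error = sd / math.sqrt(n)                    # step 1
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)    # step 2 (about 1.96 for 95%)
    margin = z * standard_error                           # margin of error
    return sample_mean - margin, sample_mean + margin     # step 3

low, high = confidence_interval(sample_mean=7.5, sd=1.5, n=500)
print(f"95% CI: {low:.2f} to {high:.2f} hrs/week")
```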
Example
If we draw 10 different samples and compute a 95% confidence interval for each, only 1 of the 10 confidence intervals in this illustration does not intersect the vertical line that marks the true population mean.

Estimating Sigma


Both the mean ($\mu$) and the standard deviation ($\sigma$) of the population are unknown.
When N is greater than or equal to 50, the sample standard deviation ($s$) is a good estimate of the standard deviation of the population ($\sigma$).

$$s_{\bar{X}} = \frac{s}{\sqrt{N}}$$

$$CI = \bar{X} \pm Z(s_{\bar{X}})$$
Example






We will estimate the mean hours per day that Americans spend watching TV based on the 2010 GSS. The mean hours per day spent watching TV for a sample of N=1013 is $\bar{X}$=3.01, and the standard deviation is $s$=2.65 hrs.
Let's use the 95% confidence interval

$$s_{\bar{X}} = \frac{s}{\sqrt{N}} = \frac{2.65}{\sqrt{1013}} = 0.08$$

The Z value for the 95% confidence interval is 1.96

$$CI = \bar{X} \pm Z(s_{\bar{X}}) = 3.01 \pm 1.96(0.08) = 3.01 \pm 0.16 = 2.85 \text{ to } 3.17$$

We are 95% confident that the actual mean hours spent watching TV by Americans, the population from which the GSS sample was taken, is not less than 2.85 and not greater than 3.17.
What affects confidence interval width

If other factors do not change:
If the sample size goes up, the width gets smaller
If the sample size goes down, the width gets bigger
If the value of the sample standard deviation goes up, the width gets bigger
If the value of the sample standard deviation goes down, the width gets smaller
If the level of confidence goes up (95% to 99%), the width gets bigger
If the level of confidence goes down (99% to 95%), the width gets smaller.
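
These relationships follow from the width being 2 × Z × s / √N. A minimal sketch that varies one factor at a time, starting from s=2.65 and N=1013 at the 95% level (the function name and the alternative values 2000, 3.5, and 99% are illustrative choices):

```python
import math
from statistics import NormalDist

def ci_width(sd, n, confidence):
    """Full width of the confidence interval: 2 * Z * sd / sqrt(n)."""
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    return 2 * z * sd / math.sqrt(n)

base = ci_width(sd=2.65, n=1013, confidence=0.95)
larger_n = ci_width(sd=2.65, n=2000, confidence=0.95)          # bigger sample
larger_sd = ci_width(sd=3.5, n=1013, confidence=0.95)          # more spread
higher_confidence = ci_width(sd=2.65, n=1013, confidence=0.99) # 95% to 99%

print(f"baseline width:      {base:.3f}")
print(f"larger sample size:  {larger_n:.3f}")
print(f"larger std. dev.:    {larger_sd:.3f}")
print(f"higher confidence:   {higher_confidence:.3f}")
```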