
===[[EBook_copyright | Copyrights]]===

The Probability and Statistics EBook is a freely and openly accessible electronic book developed by SOCR and the general community.

Revision as of 19:20, 31 August 2010

This is a General Statistics Curriculum E-Book, which includes Advanced-Placement (AP) materials.

There are 4 novel features of this specific Statistics EBook. It is community-built, completely open-access (in terms of use and contributions), blends information technology, scientific techniques and modern pedagogical concepts, and is multilingual.

This section describes the means of traversing, searching, discovering and utilizing the SOCR Statistics EBook resources in both formal and informal learning settings. The [[EBook_Problems |problems of each section in the E-Book]] are shown here.

===Chapter I: Introduction to Statistics===

Although natural phenomena in real life are unpredictable, the designs of experiments are bound to generate data that varies because of intrinsic (internal to the system) or extrinsic (due to the ambient environment) effects.
How many natural processes or phenomena in real life have an exact mathematical closed-form description and are completely deterministic? How do we model the rest of the processes, which are unpredictable and have random characteristics?

Statistics is the science of variation, randomness and chance. As such, statistics is different from other sciences, where the processes being studied obey exact deterministic mathematical laws. Statistics provides quantitative inference represented as long-run probability values, confidence or prediction intervals, odds, chances, etc., which may ultimately be subjected to various interpretations. The phrase Uses and Abuses of Statistics refers to the notion that in some cases statistical results may be used as evidence for seemingly opposite theses. However, most of the time, common principles of logic allow us to disambiguate the obtained statistical inference.

Design of experiments is the blueprint for planning a study or experiment, performing the data collection protocol and controlling the study parameters for accuracy and consistency. Data, or information, is typically collected in regard to a specific process or phenomenon being studied to investigate the effects of some controlled variables (independent variables or predictors) on other observed measurements (responses or dependent variables). Both types of variables are associated with specific observational units (living beings, components, objects, materials, etc.).

All methods for data analysis, understanding or visualizing are based on models that often have compact analytical representations (e.g., formulas, symbolic equations, etc.). Models are used to study processes theoretically. Empirical validations of the utility of models are achieved by inputting data and executing tests of the models. This validation step may be done manually, by computing the model prediction or model inference from recorded measurements. This process may be done by hand, but only for small numbers of observations (<10). In practice, we write (or use existing) algorithms and computer programs that automate these calculations for greater efficiency, accuracy and consistency in applying models to larger datasets.

There are many different ways to display and graphically visualize data. These graphical techniques facilitate the understanding of the dataset and enable the selection of an appropriate statistical methodology for the analysis of the data.

There are three main features of populations (or sample data) that are always critical in understanding and interpreting their distributions - Center, Spread and Shape. The main measures of centrality are Mean, Median and Mode(s).

There are many measures of (population or sample) spread, e.g., the range, the variance, the standard deviation, mean absolute deviation, etc. These are used to assess the dispersion or variation in the population.
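As a quick illustration, the centrality and spread measures above can be computed with Python's standard library (the data values below are hypothetical, chosen only for illustration):

```python
import statistics

sample = [2, 4, 4, 4, 5, 5, 7, 9]  # hypothetical observations

# Measures of centrality
mean = statistics.mean(sample)      # 5
median = statistics.median(sample)  # 4.5 (middle of the sorted data)
mode = statistics.mode(sample)      # 4 (most frequent value)

# Measures of spread
value_range = max(sample) - min(sample)                        # 7
pop_variance = statistics.pvariance(sample)                    # 4
pop_sd = statistics.pstdev(sample)                             # 2.0
mean_abs_dev = statistics.mean(abs(x - mean) for x in sample)  # 1.5
```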

Graphical visualization and interrogation of data are critical components of any reliable method for statistical modeling, analysis and interpretation of data.

===Chapter III: Probability===

Probability is important in many studies and disciplines because measurements, observations and findings are often influenced by variation. In addition, probability theory provides the theoretical groundwork for statistical inference.

There are many important rules for computing probabilities of composite events. These include conditional probability, statistical independence, multiplication and addition rules, the law of total probability and the Bayesian Rule.
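As a minimal sketch of the total probability and Bayes rules, consider a hypothetical diagnostic-test setting (all numbers below are made up purely for illustration):

```python
# Hypothetical diagnostic-test setting; all numbers are illustrative.
p_d = 0.01               # prior P(Disease)
p_pos_given_d = 0.95     # sensitivity, P(+ | Disease)
p_pos_given_nd = 0.05    # false-positive rate, P(+ | no Disease)

# Law of total probability: P(+) = P(+|D)P(D) + P(+|~D)P(~D)
p_pos = p_pos_given_d * p_d + p_pos_given_nd * (1 - p_d)

# Bayes rule: P(D|+) = P(+|D)P(D) / P(+)
p_d_given_pos = p_pos_given_d * p_d / p_pos
```

Even with a sensitive test, the posterior probability of disease given a positive result stays modest here (about 0.16) because the prior prevalence is low.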

There are many useful counting principles (including permutations and combinations) to compute the number of ways that certain arrangements of objects can be formed. This allows counting-based estimation of complex events' probabilities.
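The basic counting principles are available directly in Python's standard library; the card-hand probability below is a standard counting-based example:

```python
import math

perms = math.perm(5, 2)   # 5*4 = 20 ordered arrangements of 2 of 5 objects
combs = math.comb(5, 2)   # 10 unordered selections of 2 of 5 objects

# Counting-based probability: chance a 5-card hand is all hearts
p_all_hearts = math.comb(13, 5) / math.comb(52, 5)   # 1287 / 2598960
```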

===Chapter IV: Probability Distributions===

There are two basic types of processes that we observe in nature - Discrete and Continuous. We begin by discussing several important discrete random processes, emphasizing the different distributions, expectations, variances and applications. In the next chapter, we will discuss their continuous counterparts. The complete list of all SOCR Distributions is available here.

To simplify the calculations of probabilities, we will define the concept of a random variable which will allow us to study uniformly various processes with the same mathematical and computational techniques.

The expectation and the variance for any discrete random variable or process are important measures of Centrality and Dispersion. This section also presents the definitions of some common population- or sample-based moments.
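For a concrete case, take X = the number of heads in two fair coin tosses; E[X] = Σ x·p(x) and Var(X) = E[X²] − (E[X])² can be computed directly from the pmf:

```python
# pmf of X = number of heads in two fair coin tosses
pmf = {0: 0.25, 1: 0.5, 2: 0.25}

expectation = sum(x * p for x, p in pmf.items())        # E[X] = 1.0
second_moment = sum(x**2 * p for x, p in pmf.items())   # E[X^2] = 1.5
variance = second_moment - expectation**2               # Var(X) = 0.5
```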

The Geometric, Hypergeometric, Negative Binomial, and Negative Multinomial distributions provide computational models for calculating probabilities for a large number of experiments and random variables. This section presents the theoretical foundations and the applications of each of these discrete distributions.

The Poisson distribution models many different discrete processes where the probability of the observed phenomenon is constant in time or space. The Poisson distribution may also be used as an approximation to the Binomial distribution.
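The Binomial-to-Poisson approximation can be checked numerically with a short stdlib-only sketch (the parameter values below are arbitrary, chosen so that n is large, p is small, and λ = np = 3):

```python
import math

def binom_pmf(k, n, p):
    # P(X = k) for X ~ Binomial(n, p)
    return math.comb(n, k) * p**k * (1 - p)**(n - k)

def poisson_pmf(k, lam):
    # P(X = k) for X ~ Poisson(lam)
    return lam**k * math.exp(-lam) / math.factorial(k)

# With n large and p small, Binomial(n, p) is close to Poisson(n*p)
n, p = 1000, 0.003   # lambda = n*p = 3
gap = max(abs(binom_pmf(k, n, p) - poisson_pmf(k, n * p)) for k in range(20))
# the largest pointwise gap is only a few times 1e-4
```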

===Chapter V: Normal Probability Distribution===

The Normal Distribution is perhaps the most important model for studying quantitative phenomena in the natural and behavioral sciences - this is due to the Central Limit Theorem. Many numerical measurements (e.g., weight, time, etc.) can be well approximated by the normal distribution.

The Standard Normal Distribution is the simplest version (zero-mean, unit-standard-deviation) of the (General) Normal Distribution. Yet, it is perhaps the most frequently used version because many tables and computational resources are explicitly available for calculating probabilities.

In practice, the mechanisms underlying natural phenomena may be unknown, yet the use of the normal model can be theoretically justified in many situations to compute critical and probability values for various processes.

In addition to being able to compute probability (p) values, we often need to estimate the critical values of the Normal Distribution for a given p-value.
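Both directions — probability values from the normal cdf, and critical values from its inverse — are available in Python's standard library; the IQ-style parameters (mean 100, sd 15) below are just a common illustrative choice:

```python
from statistics import NormalDist

z = NormalDist()              # standard normal: mean 0, sd 1

p = z.cdf(1.96)               # P(Z <= 1.96), about 0.975
crit = z.inv_cdf(0.975)       # critical value for p = 0.975, about 1.96

# Standardization links a general Normal(mu, sigma) to the standard normal
iq = NormalDist(mu=100, sigma=15)
z_score = (130 - 100) / 15    # 2.0
# iq.cdf(130) equals z.cdf(2.0) up to rounding
```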

===Chapter VI: Relations Between Distributions===

In this chapter, we will explore the relationships between different distributions. This knowledge will help us to compute difficult probabilities using reasonable approximations and identify appropriate probability models, graphical and statistical analysis tools for data interpretation.
The complete list of all SOCR Distributions is available here and the Distributome applet provides an interactive graphical interface for exploring the relations between different distributions.

The exploration of the relations between different distributions begins with the study of the sampling distribution of the sample average. This will demonstrate the universally important role of normal distribution.
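A small simulation sketches this: averaging draws from a skewed exponential population (chosen purely for illustration) produces sample means that cluster around the population mean with spread close to σ/√n:

```python
import random
import statistics

random.seed(0)  # fixed seed so the sketch is reproducible

n, reps = 50, 2000
# Average of n draws from an Exponential(1) population (mean 1, sd 1)
means = [statistics.mean(random.expovariate(1.0) for _ in range(n))
         for _ in range(reps)]

center = statistics.mean(means)   # close to the population mean, 1.0
spread = statistics.stdev(means)  # close to sigma/sqrt(n) = 1/sqrt(50) ~ 0.14
```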

Consider an event whose probability of being observed at each experiment is p. If we repeat the same experiment over and over, the ratio of the observed frequency of that event to the total number of repetitions converges towards p as the number of experiments increases. Why is that, and why is this important?
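This convergence of relative frequency toward p can be sketched with a short simulation (p = 0.3 is an arbitrary illustrative choice):

```python
import random

random.seed(1)  # fixed seed for reproducibility
p = 0.3         # probability of the event at each trial (hypothetical)

def relative_frequency(trials):
    # simulate `trials` independent experiments and count occurrences
    hits = sum(random.random() < p for _ in range(trials))
    return hits / trials

f_small = relative_frequency(100)      # still fairly noisy
f_large = relative_frequency(100_000)  # much closer to p
```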

The Binomial Distribution is much simpler to compute than the Hypergeometric Distribution, and can be used as an approximation when the population size is large (relative to the sample size) and the probability of success is not close to zero.
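The quality of this approximation can be checked directly (population and sample sizes below are arbitrary, with a sampling fraction of only n/N = 0.001):

```python
import math

def hypergeom_pmf(k, N, K, n):
    # k successes in a sample of size n drawn WITHOUT replacement
    # from a population of N units containing K successes
    return math.comb(K, k) * math.comb(N - K, n - k) / math.comb(N, n)

def binom_pmf(k, n, p):
    return math.comb(n, k) * p**k * (1 - p)**(n - k)

# Large population, small sampling fraction: the two pmfs nearly coincide
N, K, n = 10_000, 3_000, 10
gap = max(abs(hypergeom_pmf(k, N, K, n) - binom_pmf(k, n, K / N))
          for k in range(n + 1))
```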

The Poisson Distribution can be approximated fairly well by the Normal Distribution when λ is large.

===Chapter VII: Point and Interval Estimates===

Estimation of population parameters is critical in many applications. Estimation is most frequently carried out in terms of point estimates or interval (range) estimates for the population parameters of interest.

There are many ways to obtain point (value) estimates of various population parameters of interest, using observed data from the specific process we study. The method of moments and the maximum likelihood estimation are among the most popular ones frequently used in practice.
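As a minimal sketch (with made-up data): for an assumed Exponential(rate) model, the method of moments matches E[X] = 1/rate to the sample mean, and for this particular model the maximum likelihood estimate happens to coincide with it:

```python
import statistics

sample = [0.5, 1.2, 0.3, 2.0, 1.0]   # hypothetical measurements

# Method of moments for Exponential(rate): set 1/rate = sample mean.
# For the exponential model, maximum likelihood gives the same estimate.
rate_hat = 1 / statistics.mean(sample)
```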

Next, we discuss point and interval estimates when the sample-sizes are small. Naturally, the point estimates are less precise and the interval estimates produce wider intervals, compared to the case of large-samples.

In many processes and experiments, controlling the amount of variance is of critical importance. Thus the ability to assess variation, using point and interval estimates, facilitates our ability to make inference, revise manufacturing protocols, improve clinical trials, etc.

===Chapter VIII: Hypothesis Testing===

Hypothesis Testing is a statistical technique for making decisions about populations or processes based on experimental data. It quantitatively assesses the possibility that chance alone might be responsible for the observed discrepancies between a theoretical model and the empirical observations.

When the sample size is large, the sampling distribution of the sample proportion is approximately Normal, by CLT. This helps us formulate hypothesis testing protocols and compute the appropriate statistics and p-values to assess significance.
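A sketch of the resulting large-sample z-test for a single proportion (the counts below are hypothetical):

```python
import math
from statistics import NormalDist

# Hypothetical data: 60 successes in 100 trials; test H0: p = 0.5
n, successes, p0 = 100, 60, 0.5
p_hat = successes / n

# z statistic uses the null standard error sqrt(p0(1-p0)/n)
z = (p_hat - p0) / math.sqrt(p0 * (1 - p0) / n)   # 2.0
p_value = 2 * (1 - NormalDist().cdf(abs(z)))      # two-sided, ~0.0455
```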

The significance testing for the variation or the standard deviation of a process, a natural phenomenon or an experiment is of paramount importance in many fields. This chapter provides the details for formulating testable hypotheses, computation, and inference on assessing variation.

===Chapter IX: Inferences From Two Samples===

In this chapter, we continue our pursuit and study of significance testing in the case of having two populations. This expands the possible applications of one-sample hypothesis testing we saw in the previous chapter.

Independent Samples designs refer to experiments or observations where all measurements are individually independent from each other within their groups and the groups are independent. In this section, we discuss inference based on independent samples.

This section presents the significance testing and inference on equality of proportions from two independent populations.
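A minimal sketch of the pooled two-proportion z-test (the group counts below are hypothetical):

```python
import math
from statistics import NormalDist

# Hypothetical counts: 45/100 successes in group 1 vs 30/100 in group 2
x1, n1, x2, n2 = 45, 100, 30, 100
p1, p2 = x1 / n1, x2 / n2
pooled = (x1 + x2) / (n1 + n2)   # pooled proportion under H0: p1 = p2

se = math.sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
z = (p1 - p2) / se                             # about 2.19
p_value = 2 * (1 - NormalDist().cdf(abs(z)))   # about 0.028
```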

===Chapter X: Correlation and Regression===

Many scientific applications involve the analysis of relationships between two or more variables involved in a process of interest. We begin with the simplest of all situations where Bivariate Data (X and Y) are measured for a process and we are interested in determining the association, relation or an appropriate model for these observations (e.g., fitting a straight line to the pairs of (X,Y) data).
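The least-squares line for a bivariate sample follows directly from the textbook formulas; the data below are made up, approximately following y = 2x:

```python
from statistics import mean

xs = [1, 2, 3, 4, 5]
ys = [2.1, 3.9, 6.2, 7.8, 10.1]   # hypothetical data, roughly y = 2x

mx, my = mean(xs), mean(ys)
# slope = Sxy / Sxx; the intercept makes the line pass through (mx, my)
slope = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
         / sum((x - mx) ** 2 for x in xs))
intercept = my - slope * mx
```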

Now we focus on decomposing the variance of a dataset into (independent/orthogonal) components when we have two (grouping) factors. This procedure is called Two-Way Analysis of Variance.

===Chapter XII: Non-Parametric Inference===

To be valid, many statistical methods impose (parametric) requirements about the format, parameters and distributions of the data to be analyzed. For instance, the Independent T-Test requires the distributions of the two samples to be Normal. Non-Parametric (distribution-free) statistical methods relax these requirements and are often useful in practice, although they are typically less powerful than their parametric counterparts.

The Sign Test and the Wilcoxon Signed Rank Test are the simplest non-parametric tests which are also alternatives to the One-Sample and Paired T-Test. These tests are applicable for paired designs where the data is not required to be normally distributed.
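A sketch of the exact Sign Test for paired differences (the data are hypothetical): under H0 the count of positive differences is Binomial(n, 1/2), so the p-value comes from an exact binomial tail:

```python
import math

# Hypothetical paired differences (after - before); H0: median difference = 0
diffs = [1.2, 0.8, -0.4, 2.1, 0.3, 1.7, -0.2, 0.9]

nonzero = [d for d in diffs if d != 0]
n = len(nonzero)                   # 8 non-zero differences
pos = sum(d > 0 for d in nonzero)  # 6 positive signs

# Exact two-sided p-value: double the upper Binomial(n, 1/2) tail
tail = sum(math.comb(n, k) for k in range(pos, n + 1)) / 2**n
p_value = min(1.0, 2 * tail)       # 2 * 37/256, about 0.289
```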

The Wilcoxon-Mann-Whitney (WMW) Test (also known as Mann-Whitney U Test, Mann-Whitney-Wilcoxon Test, or Wilcoxon rank-sum Test) is a non-parametric test for assessing whether two samples come from the same distribution.
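The U statistic itself has a simple pairwise-comparison definition, sketched below with hypothetical samples (ties count one half):

```python
def mann_whitney_u(a, b):
    # U = number of pairs (x, y), x from a and y from b, with x > y;
    # ties contribute 1/2.  Note U1 + U2 = len(a) * len(b).
    return sum((x > y) + 0.5 * (x == y) for x in a for y in b)

a = [1.1, 2.5, 3.0, 4.2]   # hypothetical sample 1
b = [0.9, 1.8, 2.2]        # hypothetical sample 2

u1 = mann_whitney_u(a, b)  # 10.0
u2 = mann_whitney_u(b, a)  # 2.0
```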

In this section, we will provide the basic framework for Bayesian statistical inference. Generally, we take some prior beliefs about some hypothesis and then modify these prior beliefs, based on some data that we collect, in order to arrive at posterior beliefs. Another way to think about Bayesian Inference is that we are using new evidence or observations to update the probability that a hypothesis is true.
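This prior-to-posterior update can be sketched with a discrete prior over two hypothetical coin biases (all numbers below are illustrative):

```python
# Two competing hypotheses about a coin's P(heads), with a flat prior
prior = {0.5: 0.5, 0.8: 0.5}   # hypothetical prior beliefs
heads, tails = 8, 2            # hypothetical observed data

def likelihood(p):
    # probability of the observed sequence under bias p
    return p**heads * (1 - p)**tails

# Posterior is proportional to prior * likelihood, renormalized
unnormalized = {p: prior[p] * likelihood(p) for p in prior}
total = sum(unnormalized.values())
posterior = {p: w / total for p, w in unnormalized.items()}
# seeing mostly heads shifts belief strongly toward p = 0.8
```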

Hierarchical linear models are statistical models of parameters that vary at more than one level. These models are seen as generalizations of linear models and may extend to non-linear models. Any underlying correlations in the particular model must be accounted for in the analysis for correct inference to be drawn.