Week 3: Descriptive Stats & Functions

The goal of this module is to introduce descriptive statistics, which summarize your data, and inferential statistics, which draw conclusions about a population from a sample.

Descriptive Statistics

Data Distributions

Writing functions

Descriptive Statistics

A descriptive statistic is a summary statistic that quantitatively describes or summarizes features of the data collected.

Two primary means of describing data:
1. Central tendency: a central or typical value for a distribution
2. Spread or Variance: the extent to which a distribution is stretched or squeezed.

Descriptive Statistics: Central tendency

Central tendency is a central or typical value for a distribution, also called its center or location.

The most common measures of central tendency are:
- arithmetic mean: the numerical average of all values
- median: the middle value of the data set once it is sorted
- mode: the most frequent value in the data set
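
The three measures above can be sketched in R as follows. The scores vector is made-up illustration data, and stat_mode() is a hypothetical helper written for this example, since base R's mode() reports the storage type of an object rather than the most frequent value.

```r
# Made-up example data (not from the course data set)
scores <- c(70, 85, 85, 90, 100)

# Arithmetic mean and median are built into base R
avg <- mean(scores)
mid <- median(scores)

# Hypothetical helper: return the most frequent value(s) in x
stat_mode <- function(x) {
  ux <- unique(x)                      # distinct values
  counts <- tabulate(match(x, ux))     # how often each distinct value occurs
  ux[counts == max(counts)]            # keep the value(s) with the top count
}

most_common <- stat_mode(scores)

print(c(mean = avg, median = mid, mode = most_common))
```

If several values tie for the highest count, this helper returns all of them, which is one reasonable convention among several.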

Descriptive Statistics: Spread or Variance

Spread (dispersion or variability) is the extent to which a distribution is stretched or squeezed.

The most common measures of statistical dispersion are:
- variance: the average of the squared differences from the mean
- standard deviation: the square root of the variance
- inter-quartile range (IQR): the distance between the 1st and 3rd quartiles, which gives the range of the middle 50% of the data
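
A minimal sketch of these measures in R, using a made-up scores vector. One detail worth knowing: var() and sd() compute the sample versions, dividing by n - 1, so they differ slightly from the plain "average of the squared differences", which is computed by hand below for comparison.

```r
# Made-up example data (not from the course data set)
scores <- c(70, 85, 85, 90, 100)

# Sample variance and standard deviation (divide by n - 1)
sample_var <- var(scores)
sample_sd  <- sd(scores)

# Population variance: the literal average of squared differences from the mean
pop_var <- mean((scores - mean(scores))^2)

# Inter-quartile range: 3rd quartile minus 1st quartile
spread_iqr <- IQR(scores)

print(c(sample_var = sample_var, sample_sd = sample_sd,
        pop_var = pop_var, iqr = spread_iqr))
```

For large data sets the difference between dividing by n and by n - 1 becomes negligible.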

Data distributions

A distribution contains information about the probabilities associated with the data points.
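
As a small illustration of what "probabilities associated with the data points" means, the sketch below asks a normal distribution how likely certain ranges of values are. The mean of 85 and standard deviation of 1 are assumptions chosen only for this example.

```r
# pnorm() gives the probability of observing a value at or below q
# for a normal distribution with the given mean and standard deviation.

# Probability of a value at or below the mean
p_below_mean <- pnorm(85, mean = 85, sd = 1)

# Probability of a value within one standard deviation of the mean
p_within_one_sd <- pnorm(86, mean = 85, sd = 1) - pnorm(84, mean = 85, sd = 1)

print(p_below_mean)
print(round(p_within_one_sd, 4))
```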

Thousands of data distributions

Visualizing data distributions in R

Why is knowing the distribution of your data helpful?

Example: Simulating a normal distribution in R

R lets you simulate different distributions with built-in functions; each function's arguments set the parameters of the distribution.

Task: Generate 1000 values from a normal distribution with a mean of 85

Normal distribution: rnorm()

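# 1000 draws with mean 85; the standard deviation defaults to 1
testdatasim <- rnorm(1000, 85)

# show the first six simulated values
head(testdatasim)

## [1] 83.74365 84.18087 84.86929 84.68837 85.30112 85.88301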

mean(testdatasim)

## [1] 84.94085

Example: Visualizing a normal distribution in R

hist(testdatasim)
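
A slightly fuller version of the same plot is sketched below. The seed, the number of breaks, the labels, and the reference line at the sample mean are choices made for this example, not part of the original slide; set.seed() is included only so the simulation is reproducible.

```r
# Fix the random seed so the same values are drawn on every run
set.seed(42)

# Simulate 1000 values from a normal distribution with mean 85 and sd 1
testdatasim <- rnorm(1000, mean = 85, sd = 1)

# Histogram with more bins and readable labels
hist(testdatasim,
     breaks = 30,
     main = "Simulated normal distribution",
     xlab = "Simulated value")

# Mark the sample mean with a vertical line
abline(v = mean(testdatasim), col = "red", lwd = 2)
```

With more bins the bell shape becomes easier to see, and the vertical line should sit close to 85.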

Lab 3: Simulating and visualizing a Pareto distribution

A common kind of distribution in data science is the "long-tail" distribution, e.g., Pareto

In lab, you will simulate a Pareto distribution: rpareto(n, m, s)

Install VGAM: install.packages("VGAM")

Read about rpareto using help: ??rpareto

Set m to 560000 (about the population of Wyoming) and experiment with the s parameter
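
To build intuition for what the two parameters do before using rpareto, here is a minimal base-R sketch that draws Pareto values by inverse-transform sampling instead of calling VGAM. The function sim_pareto() is a hypothetical helper written for this example, and it assumes that m is the minimum possible value (the scale) and s is the shape; check ??rpareto to confirm how the lab's function names and orders its arguments.

```r
# Hypothetical helper (not part of VGAM): draw n Pareto values by
# inverse-transform sampling. Assumes m = minimum value, s = shape.
sim_pareto <- function(n, m, s) {
  u <- runif(n)       # uniform draws strictly between 0 and 1
  m / u^(1 / s)       # inverting the Pareto CDF 1 - (m / x)^s
}

set.seed(42)
paretosim <- sim_pareto(1000, m = 560000, s = 2)

# Every simulated value is at least m; a few are far larger (the long tail)
summary(paretosim)

# The long tail is easier to see on a log scale
hist(log10(paretosim), breaks = 30,
     main = "Simulated Pareto distribution (log10 scale)",
     xlab = "log10(simulated value)")
```

Smaller values of s make the tail heavier, so trying a few values of s and re-plotting is a quick way to see its effect.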