I am excited to present at the next Vancouver SAS User Group (VanSUG) meeting on Wednesday, November 4, 2015. I will illustrate data transposition and ANOVA in SAS and JMP using potato chips and analytical chemistry. Come and check it out! The following agenda contains all of the presentations, and you can register for this meeting on the SAS Canada web site. This meeting is free, and a free breakfast will be served in the morning.

In analytical chemistry, the quantity of interest is often estimated from a calibration line. A technique or instrument generates the analytical response for the quantity of interest, so a calibration line is constructed by generating multiple responses from multiple standard samples of known quantities. Linearity refers to how well a plot of the analytical response versus the quantity of interest follows a straight line. If this relationship holds, then an analytical response can be generated from a sample containing an unknown quantity, and the calibration line can be used to estimate the unknown quantity with a confidence interval.
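As a rough sketch of this workflow in R, using hypothetical standard concentrations and responses (the numbers below are invented for illustration):

```r
# Hypothetical calibration standards: known concentrations (ppm)
# and the analytical responses they generate (arbitrary units).
concentration <- c(1, 2, 4, 8, 16)
response <- c(2.1, 4.0, 8.3, 15.9, 32.2)

# Fit the calibration line by least squares.
calibration <- lm(response ~ concentration)

# Assess linearity informally via R-squared.
summary(calibration)$r.squared

# Estimate an unknown quantity from a new response of 10.5
# by inverting the fitted line: x = (y - intercept) / slope.
b <- coef(calibration)
unknown.estimate <- (10.5 - b[1]) / b[2]
```

In practice, the confidence interval for the unknown quantity comes from inverse prediction (calibration) methods rather than from simply inverting the point estimates, as above.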

In pharmaceutical chemistry, one of the requirements for method validation is specificity, the ability of an analytical method to distinguish the analyte from other chemicals in the sample. The specificity of the method may be assessed by deliberately adding impurities into a sample containing the analyte and testing how well the method can identify the analyte.

This series of posts introduced various methods of exploratory data analysis, providing theoretical backgrounds and practical examples. Fully commented and readily usable R scripts are available for all topics for you to copy and paste for your own analysis! Most of these posts involve data visualization and plotting, and I include a lot of detail and comments on how to invoke specific plotting commands in R in these examples.

I will return to this blog post to add new links as I write more tutorials.

Introduction

My statistics education focused a lot on normal linear least-squares regression, and I was even told by a professor in an introductory statistics class that 95% of statistical consulting can be done with knowledge learned up to and including a course in linear regression. Unfortunately, that advice has turned out to vastly underestimate the variety and depth of problems that I have encountered in statistical consulting, and the emphasis on linear regression has not paid dividends in my statistics career so far. Wisdom from veteran statisticians and my own experience combine to suggest that logistic regression is actually much more commonly used in industry than linear regression. I have already started a series of short lessons on binary classification in my Statistics Lesson of the Day and Machine Learning Lesson of the Day. In this post, I will show how to perform logistic regression in both R and SAS. I will discuss how to interpret the results in a later post.

This data set has 3 variables (I have renamed them for convenience in my R programming).

ha2 – Whether or not a patient had a second heart attack. If ha2 = 1, then the patient had a second heart attack; otherwise, if ha2 = 0, then the patient did not have a second heart attack. This is the response variable.

treatment – Whether or not the patient completed an anger control treatment program.
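A minimal sketch of fitting a logistic regression in R with glm(); the data below are simulated stand-ins for the real data set (the effect sizes are arbitrary), and only the 2 variables listed above are used.

```r
# Simulated stand-in data: a binary response (ha2) and a binary
# predictor (treatment); the simulated effect sizes are arbitrary.
set.seed(1)
treatment <- rbinom(100, 1, 0.5)
ha2 <- rbinom(100, 1, plogis(0.5 - 1.2 * treatment))

# Logistic regression: glm() with the binomial family and logit link.
model <- glm(ha2 ~ treatment, family = binomial(link = "logit"))
summary(model)

# The exponentiated coefficient is the estimated odds ratio for treatment.
exp(coef(model)["treatment"])
```

In SAS, the analogous fit can be done with PROC LOGISTIC.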

Chebyshev’s inequality is just a special version of Markov’s inequality; thus, their motivations and intuitions are similar.

Markov’s inequality roughly says that a random variable X is most frequently observed near its expected value, μ. Remarkably, it quantifies just how often X is far away from μ. Chebyshev’s inequality goes one step further and quantifies that distance between X and μ in terms of the number of standard deviations away from μ. It roughly says that the probability of X being k standard deviations away from μ is at most 1/k². Notice that this upper bound decreases as k increases – confirming our intuition that it is highly improbable for X to be far away from μ.

As with Markov’s inequality, Chebyshev’s inequality applies to any random variable X, as long as μ and σ are finite. (Markov’s inequality requires only μ to be finite.) This is quite a marvelous result!
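As a quick empirical check in R (a sketch with simulated skewed data), the observed proportion of observations far from the mean should never exceed Chebyshev's bound of 1/k²:

```r
# Simulate from a skewed, non-normal distribution: exponential with mean 1.
set.seed(42)
x <- rexp(100000, rate = 1)

# Proportion of observations at least k = 2 standard deviations from the mean.
k <- 2
empirical.tail <- mean(abs(x - mean(x)) >= k * sd(x))

# Chebyshev's upper bound on that proportion.
chebyshev.bound <- 1 / k^2

c(empirical = empirical.tail, bound = chebyshev.bound)
```

For the exponential distribution the true tail probability is about 0.05, well under the bound of 0.25 – Chebyshev's bound holds for any distribution, but it is often quite loose.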

Introduction

The chi-squared test of independence is one of the most basic and common hypothesis tests in the statistical analysis of categorical data. Given 2 categorical random variables, X and Y, the chi-squared test of independence determines whether or not there exists a statistical dependence between them. Formally, it is a hypothesis test with the following null and alternative hypotheses:

H0: X and Y are statistically independent
Ha: X and Y are statistically dependent

If you’re not familiar with probabilistic independence and how it manifests in categorical random variables, watch my video on calculating expected counts in contingency tables using joint and marginal probabilities. For your convenience, here is another video that gives a gentler and more practical understanding of calculating expected counts using marginal proportions and marginal totals.

Today, I will continue from those 2 videos and illustrate how the chi-squared test of independence can be implemented in both R and SAS with the same example.
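Here is a minimal sketch in R with a hypothetical 2 × 2 contingency table; the counts are invented for illustration.

```r
# Hypothetical contingency table: 2 categorical variables, 2 levels each.
counts <- matrix(c(30, 10,
                   20, 40),
                 nrow = 2, byrow = TRUE,
                 dimnames = list(group = c("A", "B"),
                                 outcome = c("success", "failure")))

# Chi-squared test of independence.
test <- chisq.test(counts)
test$expected   # expected counts under the null hypothesis of independence
test$statistic  # the chi-squared test statistic
test$p.value    # a small p-value is evidence of dependence
```

In SAS, the same test is available through the CHISQ option in PROC FREQ.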

In my statistics classes, I learned to use the variance or the standard deviation to measure the variability or dispersion of a data set. However, consider the following 2 hypothetical cases:

the standard deviation for the incomes of households in Canada is $2,000

the standard deviation for the incomes of the 5 major banks in Canada is $2,000

Even though this measure of dispersion has the same value for both sets of income data, $2,000 is a significant amount for a household, whereas $2,000 is not a lot of money for one of the “Big Five” banks. Thus, the standard deviation alone does not give a fully accurate sense of the relative variability between the 2 data sets. One way to overcome this limitation is to take the mean of the data sets into account.

A useful statistic for measuring the variability of a data set while scaling by the mean is the sample coefficient of variation:

c_v = s/x̄,

where s is the sample standard deviation and x̄ is the sample mean.

Analogously, the coefficient of variation for a random variable is

CV = σ/μ,

where σ is the random variable’s standard deviation and μ is the random variable’s expected value.

The coefficient of variation is a very useful statistic that I, unfortunately, never learned in my introductory statistics classes. I hope that all new statistics students get to learn this alternative measure of dispersion.
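A quick sketch in R, using invented income figures that match the example above (a standard deviation of $2,000 in both data sets):

```r
# Sample coefficient of variation: sample standard deviation over sample mean.
coef.of.variation <- function(x) sd(x) / mean(x)

# Hypothetical incomes: both data sets have a standard deviation of $2,000.
households <- c(48000, 50000, 52000)   # mean = $50,000
banks <- c(4998000, 5000000, 5002000)  # mean = $5,000,000

coef.of.variation(households)  # 0.04
coef.of.variation(banks)       # 4e-04
```

The coefficient of variation immediately shows that, relative to the mean, the household incomes are 100 times more variable than the bank incomes.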

with y being called the dependent variable and x being called the independent variable. This terminology holds true for more complicated functions with multiple variables, such as those in polynomial regression.

I highly discourage the use of “independent” and “dependent” in the context of statistics and regression, because these terms have other meanings in statistics. In probability, 2 random variables X and Y are independent if their joint distribution is simply the product of their marginal distributions, and they are dependent otherwise. Thus, the usage of “independent variable” for a regression model with 2 predictors becomes problematic if the model assumes that the predictors are random variables; a random effects model is an example with such an assumption. An obvious question for such models is whether or not the independent variables are independent, which is a rather confusing question with 2 uses of the word “independent”. A better way to phrase that question is whether or not the predictors are independent.

Thus, in a statistical regression model, I strongly encourage the use of the terms “response variable” or “target variable” (or just “response” and “target”) for y and the terms “explanatory variables”, “predictor variables”, “predictors”, “covariates”, or “factors” for the x variables.

(I have encountered some statisticians who prefer to reserve “covariate” for continuous predictors and “factor” for categorical predictors.)

However, this model is still a linear regression model, because the response variable is still a linear combination of the regression coefficients. The regression coefficients would still be estimated using linear algebra through the method of least squares.

Remember: the “linear” in linear regression refers to the linearity between the response variable and the regression coefficients, NOT between the response variable and the explanatory variable(s).
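A short sketch in R (with simulated data) of a quadratic model that is nonetheless fitted by ordinary linear least squares:

```r
# Simulate data from a quadratic relationship with some noise.
set.seed(1)
x <- seq(-3, 3, length.out = 50)
y <- 1 + 2 * x + 0.5 * x^2 + rnorm(50, sd = 0.3)

# Still linear regression: the mean response is linear in the coefficients,
# even though it is quadratic in x. I() protects x^2 inside the formula.
quadratic.fit <- lm(y ~ x + I(x^2))
coef(quadratic.fit)  # 3 least-squares estimates: intercept, x, x^2
```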

The harmonic mean is useful for measuring the average of rates. For example, consider a car travelling for 240 kilometres at 2 different speeds:

60 km/hr for 120 km

40 km/hr for another 120 km

Then its average speed for this trip is the harmonic mean of the 2 speeds:

2 / (1/60 + 1/40) = 48 km/hr

Notice that the speeds for the 2 legs have equal weight in the calculation of the harmonic mean – this is valid because of the equal distance travelled at the 2 speeds. If the distances were not equal, then use a weighted harmonic mean instead – I will cover this in a later lesson.

To confirm the formulaic calculation above, let’s use the definition of average speed from physics. The average speed is defined as

average speed = distance elapsed / time elapsed

We already have the elapsed distance – it’s 240 km. Let’s find the time elapsed for this trip.

time elapsed = 120 km / (60 km/hr) + 120 km / (40 km/hr) = 2 hr + 3 hr = 5 hr

Thus,

average speed = 240 km / 5 hr = 48 km/hr

Notice that this explicit calculation of the average speed by the definition from kinematics is the same as the average speed that we calculated from the harmonic mean!
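The arithmetic above can be sketched in R:

```r
# Harmonic mean of the 2 speeds (equal weights, since distances are equal).
speeds <- c(60, 40)
harmonic.mean <- length(speeds) / sum(1 / speeds)
harmonic.mean  # 48 km/hr

# Kinematic check: average speed = total distance / total time.
total.distance <- 240              # km
total.time <- 120 / 60 + 120 / 40  # 2 hr + 3 hr = 5 hr
total.distance / total.time        # also 48 km/hr
```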

To evaluate the predictive accuracy of a binary classifier, two useful (but imperfect) criteria are sensitivity and specificity.

Sensitivity is the proportion of truly positive cases that were classified as positive; thus, it is a measure of how well your classifier identifies positive cases. It is also known as the true positive rate. Formally,

sensitivity = (number of true positives) / (number of true positives + number of false negatives)

Specificity is the proportion of truly negative cases that were classified as negative; thus, it is a measure of how well your classifier identifies negative cases. It is also known as the true negative rate. Formally,

specificity = (number of true negatives) / (number of true negatives + number of false positives)
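A small sketch in R with hypothetical confusion-matrix counts:

```r
# Hypothetical counts from a binary classifier's confusion matrix.
true.positives  <- 40
false.negatives <- 10  # positive cases classified as negative
true.negatives  <- 45
false.positives <- 5   # negative cases classified as positive

sensitivity <- true.positives / (true.positives + false.negatives)
specificity <- true.negatives / (true.negatives + false.positives)

sensitivity  # 0.8 - the true positive rate
specificity  # 0.9 - the true negative rate
```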

l is the number of levels in each factor; note that this notation assumes that all factors have the same number of levels.

If a factor has 2 levels, then the levels are usually coded as +1 and -1.

If a factor has 3 levels, then the levels are usually coded as -1, 0, and +1.

k is the number of factors in the experiment

p is the number of times that the full factorial design is fractionated by l. This number is badly explained by most textbooks and web sites that I have seen, because they simply say that p is the fraction – this is confusing, because a fraction has a numerator and a denominator, and p is just 1 number. To clarify,

the fraction is 1/(l^p)

the number of treatments in the fractional factorial design is 1/(l^p) multiplied by the total possible number of treatments in the full factorial design, which is l^k.

If all possible treatments are used in the experiment, then a full factorial design is used. If a fractional factorial design is used instead, then 1/(l^p) denotes the fraction of the treatments that is used.

However, most sources that I have read do not bother to mention that l can be greater than 2; experiments with 3-level factors are less frequent but still common. Thus, the terms half-fraction design and quarter-fraction design only apply to designs with 2-level factors. If l = 2, then a design with p = 1 uses half of the treatments, and a design with p = 2 uses a quarter of the treatments.
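The counting rules above can be sketched in R for a hypothetical design with l = 2 levels, k = 5 factors, and p = 2 fractionations:

```r
l <- 2  # levels per factor
k <- 5  # number of factors
p <- 2  # number of times the full design is fractionated by l

full.treatments <- l^k              # 32 treatments in the full design
fraction <- 1 / l^p                 # 1/4 of the treatments are run
fractional.treatments <- l^(k - p)  # 8 treatments in the 2^(5-2) design

c(full.treatments, fraction, fractional.treatments)
```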

Consider again an experiment that seeks to determine the causal relationships between k factors and the response, where k ≥ 2. Ideally, the sample size is large enough for a full factorial design to be used. However, if the sample size is small and the number of possible treatments is large, then a fractional factorial design can be used instead. Such a design assigns the experimental units to a select fraction of the treatments; these treatments are chosen carefully to investigate the most significant causal relationships, while leaving aside the insignificant ones.

What, then, are the significant causal relationships? According to the sparsity-of-effects principle, complex, higher-order effects are unlikely to exist, and the most important effects are likely the lower-order ones. Thus, assign the experimental units so that the main (1st-order) effects and the 2nd-order interaction effects can be investigated. This may forgo the discovery of a few significant higher-order effects, but that is the compromise that a fractional factorial design makes when the available sample size is small and the number of possible treatments is large.

An experimenter may seek to determine the causal relationships between k factors and the response, where k ≥ 2. On first instinct, you may be tempted to conduct k separate experiments, each using the completely randomized design with 1 factor. Often, however, it is possible to conduct 1 experiment with all k factors at the same time. This is better than the first approach because

it is faster

it uses fewer resources to answer the same questions

the interactions between the factors can be examined

Such an experiment requires the full factorial design; in this design, the treatments are all possible combinations of all levels of all factors. After controlling for confounding variables and choosing the appropriate range and number of levels of the factor, the different treatments are applied to the different groups, and data on the resulting responses are collected.

The simplest full factorial experiment consists of 2 factors, each with 2 levels. Such an experiment would result in 2 × 2 = 4 treatments, each being a combination of 1 level from the first factor and 1 level from the second factor. Since this is a full factorial design, experimental units are independently assigned to all treatments. The 2-factor ANOVA model is commonly used to analyze data from such designs.
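In R, expand.grid() enumerates the treatments of a full factorial design; here is a sketch with 2 hypothetical factors (the factor names are invented for illustration):

```r
# All combinations of 2 factors with 2 levels each: 2 x 2 = 4 treatments.
treatments <- expand.grid(temperature = c("low", "high"),
                          catalyst = c("A", "B"))
treatments
nrow(treatments)  # 4
```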

In later lessons, I will discuss interactions and 2-factor ANOVA in more detail.

In my recent lesson on controlling for confounders in experimental design, the control group was described as one that received a neutral or standard treatment, and the standard treatment may simply be nothing. This is a negative control group. Not all experiments require a negative control group; some experiments instead have a positive control group.

A positive control group is a group of experimental units that receive a treatment that is known to cause an effect on the response. Such a causal relationship would have been previously established, and its inclusion in the experiment allows a new treatment to be compared to this existing treatment. Again, both the positive control group and the experimental group experience the same experimental procedures and conditions except for the treatment. The existing treatment with the known effect on the response is applied to the positive control group, and the new treatment with the unknown effect on the response is applied to the experimental group. If the new treatment has a causal relationship with the response, both the positive control group and the experimental group should have the same responses. (This assumes, of course, that the response can only be changed in 1 direction. If the response can increase or decrease in value – or, more generally, change in more than 1 way – then it is possible for the positive control group and the experimental group to have different responses.)

In short, in an experiment with a positive control group, an existing treatment is known to “work”, and the new treatment is being tested to see if it can “work” just as well or even better. Experiments to test the effectiveness of a new medical therapy or a new disease detector often have positive controls; there are existing therapies or detectors that work well, and the new therapy or detector is being evaluated for its effectiveness.

Experiments with positive controls are useful for ensuring that the experimental procedures and conditions proceed as planned. If the positive control does not show the expected response, then something is wrong with the experimental procedures or conditions, and any “good” result from the new treatment should be considered with skepticism.

The word “experiment” can mean many different things in various contexts. In science and statistics, it has a very particular and subtle definition, one that is not immediately familiar to many people who work outside of the field of experimental design. This is the first of a series of blog posts to clarify what an experiment is, how it is conducted, and why it is so central to science and statistics.

Experiment: A procedure to determine the causal relationship between 2 variables – an explanatory variable and a response variable. The value of the explanatory variable is changed, and the value of the response variable is observed for each value of the explanatory variable.

An experiment can have 2 or more explanatory variables and 2 or more response variables.

In my experience, I find that most experiments have 1 response variable, but many experiments have 2 or more explanatory variables. The interactions between the multiple explanatory variables are often of interest.

All other variables are held constant in this process to avoid confounding.

Response Variable: The variable whose values are observed by the experimenter as the explanatory variable’s value is changed. This variable is the effect in the hypothesis. (*Many people call this the dependent variable. Further to my previous point about “independent variables”, dependence means something very different in statistics, and I discourage this usage.)

Factor Level: Each possible value of the factor (explanatory variable). A factor must have at least 2 levels.

Treatment: Each possible combination of factor levels.

If the experiment has only 1 explanatory variable, then each treatment is simply each factor level.

If the experiment has 2 explanatory variables, X and Y, then each treatment is a combination of 1 factor level from X and 1 factor level from Y. Such combining of factor levels generalizes to experiments with more than 2 explanatory variables.

Experimental Unit: The object on which a treatment is applied. This can be anything – person, group of people, animal, plant, chemical, guitar, baseball, etc.

Thanks to Harlan Nelson for noting on AnalyticBridge that the ozone concentrations for both New York and Ozonopolis are non-negative quantities, so their kernel density plot should have non-negative support sets. This has been corrected in this post by

– defining new variables called max.ozone and max.ozone2

– using the options “from = 0” and “to = max.ozone” or “to = max.ozone2” in the density() function when defining density.ozone and density.ozone2 in the R code.

Update on February 2, 2014:

Harlan also noted in the above comment that any truncated kernel density estimator (KDE) from density() in R does not integrate to 1 over its support set. Thanks to Julian Richer Daily for suggesting on AnalyticBridge to scale any truncated KDE from density() by its integral to get a KDE that integrates to 1 over its support set. I have used my own function for trapezoidal integration to do so, and this has been added below.
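A sketch of the corrected procedure; the trapezoidal integration function below is a simple stand-in for my own, and the ozone data are simulated rather than taken from the original post:

```r
# Trapezoidal rule: numerically integrate y over the grid x.
trapezoidal.integration <- function(x, y) {
  sum(diff(x) * (head(y, -1) + tail(y, -1)) / 2)
}

# Simulated non-negative ozone concentrations.
set.seed(1)
ozone <- rexp(500, rate = 0.05)
max.ozone <- max(ozone)

# KDE truncated to a non-negative support set.
density.ozone <- density(ozone, from = 0, to = max.ozone)

# Rescale the truncated KDE by its integral so that it integrates to 1.
area <- trapezoidal.integration(density.ozone$x, density.ozone$y)
density.ozone$y <- density.ozone$y / area
```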

I thank everyone for your patience while I took the time to write a post about numerical integration before posting this correction. I was in the process of moving between jobs and cities when Harlan first brought this issue to my attention, and I had also been planning a major expansion of this blog since then. I am glad that I have finally started a series on numerical integration to provide the conceptual background for the correction of this error, and I hope that they are helpful. I recognize that this is a rather late correction, and I apologize for any confusion.

Introduction

This post follows the recent introduction of the conceptual foundations of kernel density estimation. It uses the “Ozone” data from the built-in “airquality” data set in R and the previously simulated ozone data for the fictitious city of “Ozonopolis” to illustrate how to construct kernel density plots in R. It also introduces rug plots, shows how they can complement kernel density plots, and shows how to construct them in R.