Working with a business audience, I am frequently called upon to send analytic results to clients in the form of Excel workbooks. The xlsx package facilitates exporting tables and datasets to Excel, but I wanted a very simple function that would let me easily export an arbitrary number of R objects to an Excel workbook in a single call. Each object should appear in its own worksheet, and the worksheets should be named after their objects.

A single call should save the R objects mtcars (a data frame), Titanic (a table), AirPassengers (a time series), and state.x77 (a matrix) to the workbook myworkbook.xlsx. Each object should be in its own worksheet, and each worksheet should take on the name of its object.

One solution was to write a wrapper for the write.xlsx() function in the xlsx package.
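A minimal sketch of such a wrapper (the function name save.xlsx is my own; the idea is to capture the names of the objects passed in and write each one to its own sheet, appending after the first):

```r
# Sketch of a wrapper around write.xlsx(); the name save.xlsx is assumed
save.xlsx <- function(file, ...) {
  require(xlsx, quietly = TRUE)
  objects <- list(...)

  # Recover the names of the objects supplied in ...
  fargs <- as.list(match.call(expand.dots = TRUE))
  objnames <- as.character(fargs)[-c(1, 2)]

  for (i in seq_along(objects)) {
    write.xlsx(objects[[i]], file, sheetName = objnames[i],
               append = (i > 1))   # the first sheet creates the file
  }
  invisible(file)
}

# Usage: one worksheet per object, named after the object
save.xlsx("myworkbook.xlsx", mtcars, Titanic, AirPassengers, state.x77)
```

Because match.call() captures the unevaluated call, the worksheet names come directly from the object names supplied by the caller.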

I will be presenting a one-day professional development workshop on modern data visualization with R, sponsored by the ACM San Francisco Bay Area Professional Chapter. The course will cover a wide range of topics including base graphics, lattice graphs, ggplot2, and the use of interactive graphics (including iPlots, rggobi, and googleVis).

Permutation tests (also called randomization or re-randomization tests) have been around for a long time, but it took the advent of high-speed computers to make them practically available. They can be particularly useful when your data are sampled from unknown distributions, when sample sizes are small, or when outliers are present.

R has two powerful packages for permutation tests – the coin package and the lmPerm package. In this post, we will take a look at the latter.

The lmPerm package provides permutation tests for linear models and is particularly easy to implement. You can use it for all manner of ANOVA/ANCOVA designs, as well as simple, polynomial, and multiple regression. Simply use lmp() and aovp() where you would have used lm() and aov().

Example

Consider the following analysis of covariance scenario. Seventy-five pregnant mice are divided into four groups and each group receives a different drug dosage (0, 5, 50, or 500) during pregnancy. Does the dosage of the drug affect the birthweight of the resulting litters, after controlling for gestation time and litter size?

The data are contained in the litter dataframe available in the multcomp package. The dependent variable is weight (average post-birth weights for each litter). The independent variable is dose, and gestation time and litter size are covariates contained in the variables gesttime and number, respectively.

If we were going to carry out a traditional ANCOVA on this data, it would look something like this:
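Sketched in R (the litter data come from the multcomp package; covariates are listed before dose so that the dose effect is adjusted for them):

```r
library(multcomp)   # provides the litter data

# Covariates first, then dose: with Type I (sequential) sums of squares,
# each effect is adjusted for the effects listed before it
fit <- aov(weight ~ gesttime + number + dose, data = litter)
summary(fit)
```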

It appears that while litter size and gestation time are significantly related to average birthweight for a litter, the drug dosage is not (p = 0.062). (Note that the order of effects in the aov statement is important. Effects later in the list are adjusted for effects earlier in the list. This is the sequential or Type I sums of squares approach.)

One of the assumptions of the ANCOVA model is that residuals are normally distributed. Let’s take a look.
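One quick check is a Q-Q plot of the model residuals (the same ANCOVA model is refit here so the snippet is self-contained):

```r
library(multcomp)   # provides the litter data

fit <- aov(weight ~ gesttime + number + dose, data = litter)

# If the residuals are normally distributed, the points should
# fall close to the reference line
qqnorm(residuals(fit))
qqline(residuals(fit))
```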

From the graph, we have to question the normality assumption here. Note the deviations from the line.

An alternative to the traditional analysis of covariance is a permutation version of the test. The test is valid even if we violate the normality assumption. To perform the test, simply replace the aov() function with aovp().
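A sketch of the permutation version, swapping aovp() in for aov() with the same model formula:

```r
library(lmPerm)
library(multcomp)   # provides the litter data

# aovp() builds a permutation test rather than relying on
# F distributions; by default it reports unique (Type III) SS
fitp <- aovp(weight ~ gesttime + number + dose, data = litter)
summary(fitp)
```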

There are two things to note here. First, the aovp() function calculates unique sums of squares (also called Type III SS). Each effect is adjusted for all other effects, so order does not matter. Second, the dose effect is now significant (p = 0.038), suggesting that drug dose impacts birth weight after controlling for litter size and gestation period.

There are many situations where traditional linear model significance tests are not optimal (including when data are notably non-normal, when outliers are present, and when sample sizes are too small to trust asymptotic results). In these cases, permutation tests may be viable alternatives.

To learn more about permutation tests in R, see chapter 12 of R in Action.

An introduction to R for software developers and data analysts
Saturday March 10th, 2012
8:30-5:00pm
eBay
2161 North 1st Street
San Jose, California

I will be presenting a one-day professional development workshop on R programming for software developers and data scientists, sponsored by the ACM San Francisco Bay Area Professional Chapter and Revolution Analytics.

R has some great functions for generating scatterplots in 3 dimensions. Two of the best are the scatter3d() function in John Fox's car package, and the scatterplot3d() function in Uwe Ligges' scatterplot3d package. In this post, we will focus on the latter.

Let’s say that we want to plot automobile mileage vs. engine displacement vs. car weight using the data in the mtcars dataframe.
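A minimal version might look like this (the axis labels are my own additions):

```r
library(scatterplot3d)

with(mtcars,
     scatterplot3d(disp, wt, mpg,
                   main = "Mileage vs. Displacement vs. Weight",
                   xlab = "Displacement (cu. in.)",
                   ylab = "Weight (1000 lbs)",
                   zlab = "Miles per Gallon"))
```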

Next, let’s label the points. We can do this by saving the results of the scatterplot3d() function to an object, using the xyz.convert() function to convert coordinates from 3D (x, y, z) to 2D-projections (x, y), and apply the text() function to add labels to the graph.
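Sketched out, the three steps look like this:

```r
library(scatterplot3d)

# Save the plot object so we can use its coordinate converter
s3d <- with(mtcars, scatterplot3d(disp, wt, mpg))

# Project the 3D point coordinates onto the 2D plotting plane
coords <- with(mtcars, s3d$xyz.convert(disp, wt, mpg))

# Label each point with the car name
text(coords$x, coords$y, labels = rownames(mtcars),
     cex = 0.6, pos = 4)
```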

Almost there. As a final step, we will add information on the number of cylinders each car has. To do this, we will add a column to the mtcars dataframe indicating the color for each point. For good measure, we will shorten the y axis, change the drop lines to dashed lines, and add a legend.
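One way to put these pieces together (the color assignments for 4, 6, and 8 cylinders are my own choices):

```r
library(scatterplot3d)

# Map cylinder counts (4, 6, 8) to colors; the choices are arbitrary
mtcars$pcolor <- c("darkblue", "darkgreen", "red")[factor(mtcars$cyl)]

with(mtcars, {
  scatterplot3d(disp, wt, mpg,
                color = pcolor, pch = 19,
                type = "h",          # vertical drop lines
                lty.hplot = 2,       # make the drop lines dashed
                scale.y = 0.75,      # shorten the y axis
                main = "Mileage vs. Displacement vs. Weight")
  legend("topleft", inset = 0.05, title = "Cylinders",
         legend = c("4", "6", "8"), pch = 19,
         col = c("darkblue", "darkgreen", "red"))
})
```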

One of R‘s most attractive features is that it allows us to manipulate output and deeply customize graphs. This article has just touched the surface. Since colors and text labels can be input as vectors, you could programmatically use them to represent almost anything. For example, point colors and/or labels could be used to highlight observations that are outliers, have high leverage, or are unusual in some other way. Simply create a vector that has colors or labels for notable observations and missing (NA) values otherwise.

No, I don’t mean late night coding. R is constantly changing – both as a language and a platform. Updates containing new functionality are frequent. New and revised packages appear several times a week. Staying current with these myriad changes can be a challenge.

In this post, I thought that I would share some of the online resources that I have found to be most useful for keeping current with what is happening in the world of R.

Planet R (planetr.stderr.org) is a great site aggregator, and includes information from a wide range of sources (including CRANberries). This is my first stop for staying up on new packages.

R Bloggers (www.r-bloggers.com) is a central hub (blog aggregator) for collecting content from bloggers writing about R. It contains several new articles each day and I am addicted to it. It is a great place to learn new analytic and programming techniques.

The R Journal (journal.r-project.org) is a freely accessible refereed journal containing articles on the R project and contributed packages. This is a great way to gain deeper insight into what specific packages can do.

The Journal of Statistical Software (www.jstatsoft.org) is also a freely accessible refereed journal and contains articles, book reviews, and code snippets on statistical computing topics. There are frequent articles about R.

Finally, R-Help, the main R mailing list (stat.ethz.ch/mailman/listinfo/r-help), is the best place to ask questions about R. Be sure to read the FAQ before posting or you may get flamed by veteran programmers. The archives are searchable and contain a wealth of information.

These are my favorites – the ones I go back to again and again. What are yours?

A common task when analyzing multi-group designs is obtaining descriptive statistics for various cells and cell combinations.

There are many functions that can help you accomplish this, including aggregate() and by() in the base installation, summaryBy() in the doBy package, and describe.by() in the psych package. However, I find it easiest to use the melt() and cast() functions in the reshape package.

As an example, consider the mtcars dataframe (included in the base installation) containing road test information on automobiles assessed in 1974. Suppose that you want to obtain the means, standard deviations, and sample sizes for the variables miles per gallon (mpg), horsepower (hp), and weight (wt). You want these statistics for all cars in the dataset, separately by transmission type (am) and number of gears (gear), and for the cells formed by crossing these two variables.
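A sketch with the reshape package: melt the data into long format, then cast it with the statistic you want.

```r
library(reshape)

# Long format: one row per car-by-variable combination,
# keeping the grouping variables am and gear as identifiers
md <- melt(mtcars, id.vars = c("am", "gear"),
           measure.vars = c("mpg", "hp", "wt"))

# Means for the cells formed by crossing transmission type and gears
cast(md, am + gear ~ variable, mean)

# Swap in sd or length for standard deviations and sample sizes
cast(md, am + gear ~ variable, sd)
cast(md, am + gear ~ variable, length)
```

Dropping a grouping variable from the casting formula (e.g., am ~ variable) yields marginal summaries, and ~ variable alone summarizes all cars together.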