Crosstabs using RevoScaleR

03/17/2016

Crosstabs, also known as contingency tables or crosstabulations, are a convenient way to summarize cross-classified categorical data—that is, data that can be tabulated according to multiple levels of two or more factors. If only two factors are involved, the table is sometimes called a two-way table. If three factors are involved, the table is sometimes called a three-way table.

For large data sets, cross-tabulations of binned numeric data, that is, data that has been converted to a factor where the levels represent ranges of values, can be a fast way to get insight into the relationships among variables. In RevoScaleR, the rxCube function is the primary tool to create contingency tables.

For example, the built-in data set UCBAdmissions includes information on admissions by gender to various departments at the University of California at Berkeley. We can look at the contingency table as follows:
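As a minimal base R sketch of that aggregated table (a stand-in, not RevoScaleR output), xtabs can sum the Freq counts over departments. Base R's xtabs joins factors with "+", and the rxCube call quoted in the comment is an assumption about the intended usage rather than a listing taken from this article.

```r
# Base R stand-in. The RevoScaleR call it mirrors (an assumption) would be:
#   rxCube(Freq ~ Gender:Admit, data = UCBADF)
UCBADF <- as.data.frame(UCBAdmissions)  # long format: Admit, Gender, Dept, Freq

# Sum applicant counts over departments for each Gender x Admit cell.
admitByGender <- xtabs(Freq ~ Gender + Admit, data = UCBADF)
print(admitByGender)

# Share of applicants admitted within each gender (rows sum to 1).
admitRate <- prop.table(admitByGender, margin = 1)[, "Admitted"]
print(round(admitRate, 3))
```

The counts should match the figures used later in this article (1198 men and 557 women admitted).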

(Because cross-tabulations are explicitly about exploring interactions between variables, multiple predictors must always be specified using the interaction operator ":", and not the terms operator "+".)

This data set is widely used in statistics texts because it illustrates Simpson’s paradox: a comparison that holds true in a number of groups can be reversed when those groups are aggregated to form a single group. From the preceding table, in which admissions data is aggregated across all departments, it would appear that men are admitted at a higher rate than women. However, if we look at the more granular analysis by department, we find that in four of the six departments, women are admitted at a higher rate than men:
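The by-department comparison can be sketched in base R as follows. This is an illustrative stand-in using xtabs and prop.table, not the RevoScaleR listing; the variable names are chosen here for illustration.

```r
UCBADF <- as.data.frame(UCBAdmissions)

# Three-way table of counts: Gender x Admit x Dept.
byDept <- xtabs(Freq ~ Gender + Admit + Dept, data = UCBADF)

# Admission rate per gender within each department: normalise over Admit
# inside every Gender x Dept combination, then keep the "Admitted" slice.
rateByDept <- prop.table(byDept, margin = c(1, 3))[, "Admitted", ]
print(round(rateByDept, 2))

# Departments in which women have the higher admission rate.
womenHigher <- names(which(rateByDept["Female", ] > rateByDept["Male", ]))
print(womenHigher)
```

The last line should list departments A, B, D, and F, the four departments referred to above.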

Letting the Data Speak Example 1: Analyzing U.S. 2000 Census Data

The CensusWorkers.xdf data set contains a subset of the U.S. 2000 5% Census for individuals from three states, aged 20 to 65, who worked at least 20 weeks during the year. Let’s examine the relationship between wage income (represented in the data set by the variable incwage) and age.

A useful way to observe the relationship between numeric variables is to bin the predictor variable (in our case, age), and then plot the mean of the response for each bin. The simplest way to bin age is to use the F() wrapper within our initial formula; it creates a separate bin for each distinct value of age. (More precisely, it creates a bin of length one from the low value of age to the high value of age—if some ages are missing in the original data set, bins are created for them anyway.)
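The binning behaviour described above can be imitated in base R. The sketch below uses a small made-up data frame (hypothetical values, not CensusWorkers.xdf) and builds the factor by hand; it only approximates what F() does inside a RevoScaleR formula.

```r
# Hypothetical toy data (not CensusWorkers.xdf): nobody is aged 23.
workers <- data.frame(
  age     = c(20L, 20L, 21L, 22L, 22L, 24L, 25L, 25L),
  incwage = c(10000, 14000, 15000, 18000, 22000, 30000, 26000, 34000)
)

# One level per integer from the lowest to the highest age, so the gap at 23
# still gets its own (empty) bin.
ageLevels <- seq(min(workers$age), max(workers$age))
workers$F_age <- factor(workers$age, levels = ageLevels)

# Mean of the response in each bin; empty bins come back as NA.
meanWage <- tapply(workers$incwage, workers$F_age, mean)
print(meanWage)
```

Six bins result (ages 20 through 25), with the bin for age 23 present but empty.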

As we wanted, the table contains average values of incwage for each level of age. If we want to create a plot of the results, we can use the rxResultsDF function to conveniently convert the output into a data frame. The F_age factor variable will automatically be converted back to an integer age variable. Then we can plot the data using rxLinePlot:

Transforming Data

Because crosstabs require categorical data for the predictors, you have to do some work to crosstabulate continuous data. In the previous section, we saw that the F() wrapper can do a transformation within a formula. The transforms argument to rxCrossTabs can be used to give you greater control over such transformations.

For example, the kyphosis data from the rpart package consists of one categorical variable, Kyphosis, and three continuous variables: Age, Number, and Start. The Start variable indicates the topmost vertebra involved in a certain type of spinal surgery, and has a range of 1 to 18. Since there are 7 cervical vertebrae and 12 thoracic vertebrae, we can specify a transform that classifies the Start variable as either cervical or thoracic as follows:
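A transform of this kind can be written with base R's cut function. The sketch below applies it to a made-up vector of Start values (hypothetical, not the real kyphosis data); in RevoScaleR, an expression like this would presumably be supplied through the transforms argument, which is an assumption about the lost listing rather than a quotation of it.

```r
# Hypothetical Start values in the 1-18 range (not the real kyphosis data).
Start <- c(1, 3, 5, 7, 8, 10, 13, 15, 16, 18)

# Vertebrae 1-7 are cervical, 8 and above thoracic. Breaks at half-integers
# keep every integer value strictly inside one interval.
StartRegion <- cut(Start,
                   breaks = c(0, 7.5, 19.5),
                   labels = c("cervical", "thoracic"))
regionCounts <- table(StartRegion)
print(regionCounts)
```

The resulting factor has exactly two levels, so it can serve as a categorical predictor in a cross-tabulation.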

From these, we see that the probability of the post-operative complication Kyphosis seems to be greater if the Start is a cervical vertebra and as more vertebrae are involved in the surgery. Similarly, it appears that the dependence on age is non-linear: it first increases with age, peaks in the range 5-9, and then decreases again.

Cross-Tabulation with rxCrossTabs

The rxCrossTabs function is an alternative to the rxCube function. It performs the same calculations, but displays its results in a format similar to that of the standard R xtabs function. For some purposes, this format can be more informative than the matrix-like display of rxCube, and in some situations can be more compact as well.

You can see, for example, that in Department A, 62 percent of male applicants are admitted, but 82 percent of female applicants are admitted, and in Department B, 63 percent of male applicants are admitted, while 68 percent of female applicants are admitted.

A Large Data Example

The power of rxCrossTabs is most evident when you need to tabulate a data set that won’t fit into memory. For example, in the large airline data set AirOnTime87to12.xdf, you can obtain the mean arrival delay by carrier and day of week as follows (if you have downloaded the data set, modify the first line as follows to reflect your local path):

The blocksPerRead argument is ignored if run locally using R Client.

Using Sparse Cubes

An additional tool that may be useful when using rxCube and rxCrossTabs with large data is the useSparseCube parameter. Compiling cross-tabulations of categorical data can sometimes result in a large number of cells with zero counts, yielding at its core a “sparse matrix”. In the usual case, memory is allocated for every cell in the cube, but large cubes may overwhelm memory resources. If we instead allocate space only for cells with positive counts, such operations may often proceed successfully.
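To make the dense versus sparse distinction concrete, the base R sketch below counts the cells of a small cube built from made-up flight records (hypothetical data, not the airline data set). It illustrates the idea only and says nothing about how useSparseCube is implemented internally.

```r
# Hypothetical toy flights: 3 carriers x 7 days, with one cancelled flight.
flights <- data.frame(
  Carrier   = factor(c(rep(c("AA", "DL", "UA"), times = 7), "AA")),
  DayOfWeek = factor(c(rep(1:7, each = 3), 1)),
  Cancelled = c(rep(FALSE, 21), TRUE)
)
# Convert the logical to a factor before tabulating.
flights$F_Cancelled <- factor(flights$Cancelled)

# A dense cube reserves a cell for every combination of factor levels.
cube <- xtabs(~ Carrier + DayOfWeek + F_Cancelled, data = flights)
denseCells <- length(cube)      # 3 * 7 * 2 cells
zeroCells  <- sum(cube == 0)    # combinations that never occur

# A sparse representation keeps only the combinations that do occur.
sparse <- as.data.frame(cube)
sparse <- sparse[sparse$Freq > 0, ]
print(c(dense = denseCells, zero = zeroCells, sparse = nrow(sparse)))
```

Here 20 of the 42 cells are empty; with more factors and more levels, the empty share typically grows much faster than the number of occupied cells.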

As an example, let’s look at the airline data again and construct a case where the cross-tabulation yields many zero entries. As the overwhelming majority of flights in the data set were not canceled, by appending the Cancelled predictor in the formula, we would expect a large number of categorical predictor combinations to have zero observations. Because the Cancelled predictor is a logical rather than a factor variable, we need to use the F() wrapper to convert it.

While this particular example will likely run successfully to completion even on a minimally equipped modern computer without setting the useSparseCube flag to TRUE, it illustrates how one can quickly start to see the number of zero entries accumulate in an rxCube computation. With larger data sets and a larger number of categorical variable combinations, however, this setting may allow computations of cubes that would not otherwise fit in memory.

For the rxCrossTabs function, the useSparseCube option works exactly the same internally. However, because rxCrossTabs always returns a table, it may require more memory to format its result than rxCube. If you have an extremely large contingency table, we recommend rxCube with useSparseCube=TRUE for the greatest chance of completing the computation. The useSparseCube flag may also be used with rxSummary.

Tests of Independence on Cross-Tabulated Data

One common use of contingency tables is to test whether the tabulated variables are independent. RevoScaleR includes several tests of independence, all of which expect data in the standard R xtabs format. You can get data in this format from the rxCrossTabs function by using the argument returnXtabs=TRUE:

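rxChiSquaredTest: performs Pearson’s chi-squared test of independence.

rxFisherTest: performs Fisher’s exact test of independence.

rxKendallCor: performs a Kendall tau test of independence. There are three flavors of test, a, b, and c; by default, the b flavor, which accounts for ties, is used.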

(In fact, regular rxCrossTabs or rxCube output can be used as input to these functions, but they are converted to xtabs format first, so it is somewhat more efficient to have rxCrossTabs return the xtabs format directly.)

Here we use the arrDelayXTab data created earlier and perform a Pearson’s chi-squared test of independence on it:

For large contingency tables such as this one, the chi-squared test is the tool of choice. For smaller tables, particularly those with cells with expected counts fewer than five, Fisher’s exact test is useful. On a large table, however, Fisher’s exact test may not be an option. For example, if we try it on our airline table, it returns an error:

Both tests can be run on a small table such as the aggregated admissions data, and there each gives strong evidence that the two factors are not independent. For this example, we could have as easily used the standard R functions chisq.test and fisher.test. The RevoScaleR enhancements, however, permit rxChiSquaredTest and rxFisherTest to work on xtabs objects with multiple tables. For example, if we expand our examination of the admissions data to include the department info, we obtain a multi-way contingency table:
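As a rough base R counterpart, the sketch below runs chisq.test and fisher.test on the aggregated admissions table and then loops the chi-squared test over the per-department 2 x 2 slices. The explicit loop is a stand-in chosen here for illustration; it is not how rxChiSquaredTest handles multiple tables, and its output format differs.

```r
UCBADF <- as.data.frame(UCBAdmissions)

# Aggregated 2 x 2 table: Gender x Admit.
admit2x2 <- xtabs(Freq ~ Gender + Admit, data = UCBADF)
chi  <- chisq.test(admit2x2)
fish <- fisher.test(admit2x2)
print(c(chisq = chi$p.value, fisher = fish$p.value))

# Multi-way case: one chi-squared test per department slice.
byDept  <- xtabs(Freq ~ Gender + Admit + Dept, data = UCBADF)
pByDept <- apply(byDept, 3, function(tab) chisq.test(tab)$p.value)
print(signif(pByDept, 3))
```

Small p-values count against independence, so the aggregated table shows a strong association between gender and admission, while the per-department p-values show how much of that association remains within each department.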

Like Fisher’s exact test, the Kendall tau correlation test works best on smaller contingency tables. Here is an example of what it returns when applied to our admissions data (the results differ from run to run as the underlying algorithm relies on sampling):

Odds Ratios and Risk Ratios

Another common task associated with 2 x 2 contingency tables is the calculation of odds ratios and risk ratios (also known as relative risk). The two functions rxOddsRatio and rxRiskRatio in RevoScaleR can be used to compute these quantities. The odds ratio and the risk ratio are closely related: the odds ratio computes the relative odds of an event among two or more groups, while the risk ratio computes the relative probabilities of an event. Consider again the contingency table admissCTabs:

In this example, the odds of being admitted as a man are 1198/1493, or about 4 to 5 (that is, 5 to 4 against). The odds of being admitted as a woman are 557/1278, or about 4 to 9 (9 to 4 against). The odds ratio is (1198/1493)/(557/1278), or about 1.84; that is, the odds of admission are about 1.84 times as great for a man as for a woman.

The risk ratio, by contrast, compares the probabilities of being rejected, that is, 1493/(1198+1493) for a man versus 1278/(557+1278) for a woman. So here the risk ratio is 0.696 (the probability of a woman being rejected) divided by 0.555 (the probability of a man being rejected), or 1.255:
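The arithmetic behind both ratios can be checked directly in base R. This sketch uses only the four counts quoted above and plain division; it does not call rxOddsRatio or rxRiskRatio, whose output also includes further detail not reproduced here.

```r
# Counts from the aggregated admissions table quoted in the text.
admitted <- c(Male = 1198, Female = 557)
rejected <- c(Male = 1493, Female = 1278)

# Odds of admission within each group, then their ratio (men vs. women).
odds      <- admitted / rejected
oddsRatio <- odds[["Male"]] / odds[["Female"]]

# Probability of rejection within each group, then their ratio (women vs. men).
pReject   <- rejected / (admitted + rejected)
riskRatio <- pReject[["Female"]] / pReject[["Male"]]

print(round(c(oddsRatio = oddsRatio, riskRatio = riskRatio), 3))
```

The two values printed are approximately 1.841 and 1.255, in line with the figures given in the text.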