Saturday, July 21, 2012

Factor Analysis for Likert/Ordinal/Non-normal Data

My friend Jeremy Miles sent me this article by Basto and Periera (2012) this morning with the subject line ‘this is kind of cool’. Last time I saw Jeremy, my wife and I gatecrashed his house in LA for 10 days to discuss writing the R book that’s about to come out. During that stay we talked about lots of things, none of which had anything to do with statistics, or R for that matter. It’s strange then that with the comforting blanket of the Atlantic ocean between us, we only ever talk about statistics, or rant at each other about statistics, or R, or SPSS, or each other.

Nevertheless, I’m always excited to see a message from Jeremy because it’s usually interesting, frequently funny, and only occasionally insulting about me. Anyway, J was right, this article was actually kind of cool (in a geeky stats sort of way). The reason that the article is kind of cool is because it describes an SPSS interface for doing various cool factor analysis (FA) or principal components analysis (PCA) things in SPSS such as analysis of correlation matrices other than those containing Pearson’s r and parallel analysis/MAP. It pretty much addresses two questions that I get asked a lot:

My data are Likert/not normal, can I do a PCA/FA on them?

I’ve heard about Velicer’s minimum average partial (MAP) criteria and Parallel analysis, can you do them in SPSS.

PCA/FA is not something I use, and the sum total of my knowledge is in my SPSS/SAS/R book. Some of that isn’t even my knowledge, it’s Jeremy’s, because he likes to read my PCA chapter and get annoyed about how I’ve written it. The two questions are briefly answered in the book, sort of.

The answer to question 1 is apply the PCA to the correlation matrix of polychoric correlations (for Likert/ordinal/skewed data) or tetrachoric correlations (for dichotomous data) rather than the matrix of Pearson’s r. This is mentioned so briefly that you might miss it on p. 650 of the SPSS book (3rd ed) and 772 (in the proofs at least) of the R book.

The answer to question 2 is in Jane Superbrain 17.2 in the books, in which I very briefly explain parallel analysis and point to some syntax to do it that someone else wrote, and I don’t talk about MAP at all.

I cleverly don’t elaborate on how you would compute polychoric correlations, or indeed tetrachoric ones, and certainly don’t show anyone anything about MAP. In part this is because the books are already very large, but in the case of the SPSS book it’s because SPSS won’t let you do PCA on any correlation matrix other than one containing Pearson’s r and MAP/parallel analysis let’s just say have been overlooked in the software. Until now that is.

Basto and Periera (2012) have written an interface for doing PCA on correlation matrices containing things other than Pearson’sr, and you can do MAP parallel analysis and a host of other things. I recommend the article highly if PCA is your kind of thing.

However, the interesting thing is that underneath Basto and Periera’s interface SPSS isn’t doing anything – all of the stats are computed using R. In the third edition of the SPSS book I excitedly mentioned the R plugin for SPSS a few times. I was mainly excited because at the time I’d never used it, and I stupidly thought it was some highly intuitive interface that enabled you to access R from within SPSS without knowing anything about R. My excitement dwindled when I actually used it. It basically involves installing the plugin which may or may not work. Even if you get it working you simply type:

Input Program R

End Program

and stick a bunch of R code in between. It seemed to me that I might as well just use R and save myself the pain of trying to locate the plugin and actually get it working (it may be better now – I haven’t tried it recently). Basto and Periera’s interface puts a user-friendly dialog box around a bunch of R commands.

I’m absolutely not knocking Basto and Periera’s interface – I think it will be incredibly useful to a huge number of people who don’t want to use R commands, and very neatly provides a considerable amount of extra functionality to SPSS that would otherwise be unavailable. I’m merely making the point that it’s shame that having installed the interface, SPSS will get the credit for R’s work.

Admittedly it will be handy to have your data in SPSS, but you can do this in R with a line of code:

data<-read.spss(file.choose(), to.data.frame = TRUE)

Which opens a dialog box for you to select an SPSS file, and then puts it in a data frame that I have unimaginatively called ‘data’. Let’s imagine we opened Example1.sav from Basto and Periera’s paper. These data are now installed in an object calleddata.

Installing the SPSS interface involves installing R, installing the R plugin for SPSS, then installing the widget itself through the utilities menu. Not in itself hard, but by the time you have done all of this this I reckon you could type and execute this command in R:

rMatrix<-hetcor(data)$correlations

This creates a matrix (called rMatrix) containing polychoric correlations of the variables in the data frame (which, remember, was called data).

You probably also have time to run these command:

parallelAnalysis<-nScree(rMatrix)

parallelAnalysis

That’s your parallel analysis done on the polychoric correlation matrix (first command) and displayed on the screen (second command). The results will mimic the values in Figure 4 in Basto and Periera. If you want to generate Figure 3 from their paper as well then execute:

plotnScree(parallelAnalysis)

It's not entirely implausible that the R plugin for SPSS will still be downloading or installing at this point, so to relieve the tedium you could execute this command:

mapResults<-VSS(rMatrix, n.obs = 590, fm = "pc" )

mapResults

That’s the MAP analysis done on the polychoric correlation matrix using the VSS() function in R. n.obs is just the number of observations in the dataframe, and fm = “pc” tells it to do PCA rather than FA. The results will mimic the values in Figures 5 and 6 of Basto and Periera.

The R plugin isn't working so you're frantically googling for a solution. While you do that, a small gerbil called Horace marches through the cat flap that you didn't realise you had, jumps on the table and types this into the R console:

PCA<-principal(rMatrix, nfactors = 4, rotate = "varimax")

print.psych(PCA, cut = 0.3, sort = TRUE)

Which will create an object called OPCA which contains the results of a PCA on your polychoric correlation matrix, extracting 4 factors and rotating using varimax (as Basto and Periera do for Example 1). You’ll get basically the same results as Figure 13.

You probably also have time to make some tea.

Like I said, my intention isn’t to diss Basto and Periera’s interface. I think it’s great, incredibly useful and opens up the doors to lots of useful things that I haven’t mentioned. My intention is instead to show you that you can do some really complex things in R with little effort (apart from the 6 month learning curve of how to use it obviously). Even a gerbil called Horace can do it. Parallel analysis was done using 33 characters: a line of code. PCA on polychoric correlations sounds like the access code to a secure unit in a mental institution, but it’s three short lines of code. Now that is pretty cool.