I apologize if my phrasing is uneducated; I don't have much of a stats background. I'm struggling to find information that answers my question because I lack the terminology to describe it accurately.

Let's say that I was reviewing data on mp3 player sales, and wanted to see if certain combinations of categorical data resulted in returns.

Things like color, memory size, brand. I don't simply want to see the relationship between returns and the color red, but rather which combinations indicate a likely return.

For example, imagine that a certain manufacturer uses a different factory to produce each color of mp3 player, and the factory that produces blue ones has a defect that shows up 3 months later and is causing returns. In that example, I'd be able to see a relationship between that color-and-brand combination and returns. Does that make sense?

What test could I leverage for this? My jumping-off point is a chi-square test, except that it doesn't let me see the interplay between the different variables. I haven't been able to figure out where to go from there.

2 Answers
I recommend first getting a good display of the results, so that you can be oriented. To that end, I think a mosaic plot is quite effective. You take a rectangle and divide it into panels in proportion to one variable. Then subdivide each panel in proportion to the frequencies of a second variable within the first one; and so forth. Here is an example with data on survival on the Titanic by sex, class, and age. (In this diagram, green is "survived" and blue is "died".)
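To make the construction concrete, here is a minimal sketch in plain Python of how the panels of a two-variable mosaic could be computed. The counts are made up for illustration (they are not the Titanic data), `mosaic_panels` is a hypothetical helper rather than a library function, and it assumes every level of the first variable has a positive count.

```python
from collections import defaultdict

def mosaic_panels(counts):
    """Return {(a, b): (x, y, width, height)} for a two-way mosaic in the unit square.

    counts maps (level_of_first_variable, level_of_second_variable) -> frequency.
    """
    total = sum(counts.values())
    # Marginal totals of the first variable set the column widths.
    first_totals = defaultdict(int)
    for (a, _b), n in counts.items():
        first_totals[a] += n

    panels = {}
    x = 0.0
    for a, a_total in first_totals.items():
        width = a_total / total
        y = 0.0
        # Within each column, split the height by the second variable.
        for (a2, b), n in counts.items():
            if a2 != a:
                continue
            height = n / a_total
            panels[(a, b)] = (x, y, width, height)
            y += height
        x += width
    return panels

# Hypothetical counts: (color, outcome) -> number of units.
counts = {
    ("red", "kept"): 90, ("red", "returned"): 10,
    ("blue", "kept"): 20, ("blue", "returned"): 80,
}
panels = mosaic_panels(counts)
for key, (x, y, w, h) in panels.items():
    print(key, f"x={x:.2f} y={y:.2f} width={w:.2f} height={h:.2f}")
```

Each panel's area (width times height) equals that cell's share of all observations, while the height alone shows the proportion within a column, which is what makes an unusual subgroup stand out.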

This looks good. I'm a little concerned about exploring the frequency rather than the return rate, though - in my example, the results could be drowned out by higher-selling models. For instance, perhaps I only sell one of those defective blue ones for every 100 red ones I sell - but let's say 90% of the blue ones are returned. How can I better explore that ratio?
– Aaron Contreras, Aug 13 '14 at 14:16
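One way to look at the ratio directly is to compute the return rate within each combination, so that sales volume does not enter the comparison. The following is only a sketch on invented numbers chosen to mirror the scenario above (roughly 100 red units per blue unit, 90% of blue returned); the brand name and counts are hypothetical.

```python
from collections import Counter

# Hypothetical records: (brand, color, returned) with returned = 1 or 0.
sales = (
    [("Acme", "red", 0)] * 950 + [("Acme", "red", 1)] * 50
    + [("Acme", "blue", 1)] * 9 + [("Acme", "blue", 0)] * 1
)

sold = Counter()
returned = Counter()
for brand, color, was_returned in sales:
    sold[(brand, color)] += 1
    returned[(brand, color)] += was_returned

# Return rate = returns / units sold, computed within each combination.
rates = {combo: returned[combo] / sold[combo] for combo in sold}
for combo, rate in sorted(rates.items(), key=lambda kv: kv[1], reverse=True):
    print(combo, f"sold={sold[combo]} rate={rate:.0%}")
```

Printing the units sold next to each rate is deliberate: a rate based on only 10 units is much less certain than one based on 1000, which is part of why a model-based analysis is still worth doing.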

They might still be visible - note the small proportion of children in my example (but also that children of crew members are NOT visible). For actual analysis, you probably should use logistic regression. Are you familiar with that? Unfortunately, software varies quite a bit on what you need to do to set it up. We can try to help though.
– rvl, Aug 13 '14 at 22:03

Can I use logistic regression with nominal data? I've looked into it a bit - would my approach be to code each color and brand as a separate binary variable? Red 1/0, Blue 1/0, Green 1/0, Brand_1 1/0, etc.?
– Aaron Contreras, Aug 14 '14 at 13:14

Yes. In regression, including logistic, the predictors can be either factors or measurements (but you need to designate the factors as class variables or make indicators out of them). It'd be more convenient to tabulate the frequencies of 0s and 1s, rather than having zillions of observations, one for each customer. That's where software differs on how you set it up. Some use counts of successes and failures, some use successes and totals, etc., so you have to make sure you're giving it the right information.
– rvl, Aug 14 '14 at 15:27
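As a rough illustration of the two points above (making indicators, and tabulating instead of keeping one row per customer), here is a sketch in plain Python. The records, level names, and the `indicators` helper are all hypothetical. It drops the first level of each factor as a reference level, on the assumption that the model will include an intercept, in which case a full set of indicators per factor would be redundant.

```python
from collections import Counter

# Hypothetical per-customer records: (color, brand, returned).
records = [
    ("red", "Brand_1", 0), ("red", "Brand_1", 0), ("red", "Brand_1", 1),
    ("blue", "Brand_1", 1), ("blue", "Brand_1", 1), ("blue", "Brand_2", 0),
    ("green", "Brand_2", 0), ("green", "Brand_2", 1),
]

def indicators(value, levels):
    """0/1 indicators for every level except the first, which is the reference."""
    return [1 if value == level else 0 for level in levels[1:]]

colors = ["red", "blue", "green"]      # "red" is the reference level
brands = ["Brand_1", "Brand_2"]        # "Brand_1" is the reference level

# Collapse the data to one row per (color, brand) combination.
successes = Counter()   # number of returns
totals = Counter()      # number of units sold
for color, brand, returned in records:
    totals[(color, brand)] += 1
    successes[(color, brand)] += returned

rows = []
for (color, brand), n in sorted(totals.items()):
    x = indicators(color, colors) + indicators(brand, brands)
    k = successes[(color, brand)]
    # Keep both layouts: (successes, failures) and (successes, total).
    rows.append({"x": x, "successes": k, "failures": n - k, "total": n})

for row in rows:
    print(row)
```

Each row carries both layouts, so whichever one a given package expects can be passed along. To capture a color-by-brand effect of the kind described in the question, interaction columns (products of a color indicator and a brand indicator) would be added alongside these main-effect columns.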

Not sure what you mean - is it that there are some cases that don't occur in the data? Those do cause some concern for the asymptotics of the tests, just as in a chi-square situation. You kind of have to grit your teeth and not get too literal in interpreting the results.
– rvl, Aug 14 '14 at 23:46

Decision trees are a good classification and feature selection model for categorical data. They are very simple to understand and provide a good view of the relationship between attributes, as you can see in the image. Each node applies a test on a selected attribute, and attributes can be combined along a path from the root to a leaf. I believe this is a good starting point for you.
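To show the idea of "a test on a selected attribute" at each node, here is a small sketch of a tree builder in plain Python. It is an assumption-laden toy, not a substitute for a real library: the data are invented, the function names are hypothetical, the split criterion is Gini impurity (one common choice among several), and each attribute is used at most once per path with one branch per category.

```python
from collections import Counter, defaultdict

def gini(labels):
    """Gini impurity of a list of 0/1 labels."""
    n = len(labels)
    if n == 0:
        return 0.0
    counts = Counter(labels)
    return 1.0 - sum((c / n) ** 2 for c in counts.values())

def split_by(rows, attribute):
    """Group rows by the value they take on one attribute."""
    groups = defaultdict(list)
    for row in rows:
        groups[row[attribute]].append(row)
    return groups

def best_attribute(rows, attributes, target="returned"):
    """Pick the attribute whose split gives the lowest weighted Gini impurity."""
    def weighted_impurity(attribute):
        groups = split_by(rows, attribute)
        return sum(
            len(g) / len(rows) * gini([r[target] for r in g])
            for g in groups.values()
        )
    return min(attributes, key=weighted_impurity)

def build_tree(rows, attributes, target="returned"):
    """Recursively split until the node is pure or no attributes remain."""
    labels = [r[target] for r in rows]
    if not attributes or gini(labels) == 0.0:
        return {"return_rate": sum(labels) / len(labels), "n": len(labels)}
    attribute = best_attribute(rows, attributes, target)
    remaining = [a for a in attributes if a != attribute]
    return {
        "split_on": attribute,
        "children": {
            value: build_tree(group, remaining, target)
            for value, group in split_by(rows, attribute).items()
        },
    }

def make_rows(brand, color, sold, returns):
    """Expand hypothetical counts into one row per unit sold."""
    return [
        {"brand": brand, "color": color, "returned": 1 if i < returns else 0}
        for i in range(sold)
    ]

# Invented data: only brand A's blue units have a high return rate.
rows = (
    make_rows("A", "red", 20, 3) + make_rows("A", "blue", 20, 18)
    + make_rows("B", "red", 20, 1) + make_rows("B", "blue", 20, 1)
)
tree = build_tree(rows, ["brand", "color"])
print(tree)
```

Reading a path from the root to a leaf gives a combination of attribute values together with its return rate, which is the kind of "which combination predicts a return" view the question asks for. In this invented data set the path brand A, then blue, ends in a leaf with a 90% return rate.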