Patterns, vocab and practice, practice, practice

An important part of statistical analysis is being able to look at graphical representations of data, extract meaning and make comments about them, particularly in relation to the context. Graph interpretation is a difficult skill to teach as there is no clear algorithm, such as mathematics teachers are used to teaching, and the answers are far from clear-cut. This post is about the challenges of teaching scatterplot interpretation, with some suggestions.

When undertaking an investigation of bivariate measurement data, a scatterplot is the graph to use. On a scatterplot we can see what shape the data seem to have, what direction the relationship goes in, how close the points are to a fitted line, whether there are clear groups and whether there are unusual observations.

The problem is that when you know what to look for, spurious effects don’t get in the way, but when you don’t know what to look for, you don’t know what is spurious. This can be likened to a master chess player who can look at a game in play and see at a glance what is happening, whereas the novice sees only the individual pieces, and cannot easily tell where the action is taking place. What is needed is pattern recognition.

In addition, there is considerable room for argument in interpreting scatterplots. What one person sees as a non-linear relationship, another person might see as a line with some unusual observations. My experience is that people tend to try for more complicated models than is sensible. A few unusual observations can affect how we see the graph.

There is also a contextual element to the discussion. The nature of the individual observations, and of the sample, can make a big difference to the meaning drawn from the graph. For example, a scatterplot of sodium content against energy content in food should not really show a strong relationship. However, if the sample of food taken is predominantly fast food, high sodium content is related to high fat content (salt on fries!) and this can appear to be a relationship. In the graph below, is there really a linear relationship, or is it just because of the choice of sample?

In a set of data about fast food, there appears to be a relationship between sodium content and energy.

Students need to be exposed to a large number of different scatterplots. Fortunately this is now possible, thanks to computers. Students should not be drawing graphs by hand.

So how do we teach this? I think about how I learned to interpret graphs, and the answer is practice, practice, practice. This is actually quite tricky for teachers to arrange, as you need to have lots of sets of data for students to look at, and you need to make sure they are giving correct answers. Practice without feedback and correction can lead to entrenched mistakes.

Because graph interpretation is about pattern recognition, we need to have patterns that students can try to match the new graphs to. It helps to have some examples that aren’t beautifully behaved. The reality of data is that quite often the nature of measurement and rounding means that the graph appears quite different from the classic scatterplot. The following graph has a strangely ordered look to it because the x-axis variable takes only whole numbers, and the prices are nearly always close to the nearest thousand.

The asking price of used Toyota sedans against the year of manufacture.
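The gridded look described above can be reproduced with a small simulation. This is only a sketch: the listings are invented, and the year range, price slope and amount of scatter are assumptions chosen for illustration, not values taken from the Toyota data.

```python
import random

rng = random.Random(2)  # fixed seed so the sketch is repeatable

# Hypothetical listings: none of these numbers come from the data in the post.
n = 300
years = [rng.randint(1995, 2010) for _ in range(n)]  # whole-number x values

# An underlying price that rises with year, plus random scatter.
raw_prices = [2000 + 900 * (year - 1995) + rng.gauss(0, 1500) for year in years]

# Sellers tend to quote round figures, so snap each price to the nearest thousand.
asking_prices = [max(1000, round(p / 1000) * 1000) for p in raw_prices]

# Count how many distinct positions the points occupy on each version of the plot.
raw_points = set(zip(years, raw_prices))
rounded_points = set(zip(years, asking_prices))

print(f"{n} listings")
print(f"distinct plotted positions before rounding: {len(raw_points)}")
print(f"distinct plotted positions after rounding:  {len(rounded_points)}")
```

After rounding, many listings share exactly the same position, so the dots line up in rows and columns even though the underlying relationship is unchanged.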

Students also need examples of the different aspects that you would comment on in a graph, using appropriate vocabulary. Just as musicians need to label different types of scales in order to communicate their musical ideas to each other, there is a specific vocabulary for describing graphs. Unfortunately the art of describing scatterplots is not as developed as music, and at times the terms are unclear and even used in different ways by different people. Materials produced for teacher development, available on Census @ School, suggest the following things to comment on: trend, association, strength, groups and unusual observations. The following uses the framework provided by R. Kaniuk and R. Parsonage.

Trend covers the idea of whether the graph is linear or non-linear. I don’t really like the use of the word “trend” here, as to me it should be used for time-series data only. I would use the word “shape” in preference. It means a general tendency.

Association is about the direction. Is the relationship positive or negative? For example, “as the distance a car has travelled increases, the asking price tends to decrease.” The term “tends to” is very useful here.

Strength is about how close the dots are to the fitted line. In a linear model we can use correlation to quantify the strength. My experience is that students often confuse strength with slope.

Groups can appear in the data, and it is much more relevant if the appearance of groups is related to an attribute of the observations. For example, in some data about food values in fast food, the dessert and salad items were quite separate from the other menu items. You can see that in the graph above of food items.

Unusual observations are a challenging feature of real-world data. Is it a mistake? Is it someone being silly, or misinterpreting a question? Is it not really from this population? Is it the result of a one-off rare occurrence (such as my redundancy payment earlier this year)? And what should you do with unusual observations? I’ve written a bit more about this in my post on dirty data.

And there is uneven scatter, or heteroscedasticity, which affects prediction intervals more than it affects the definition of the model.
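The point that strength and slope are different things can be checked numerically. The sketch below uses made-up data (the slopes and scatter values are assumptions chosen for illustration) to compare a shallow line with little scatter against a steep line with a lot of scatter. The helper functions pearson_r and fitted_slope are written out by hand here so that no library is needed.

```python
import random


def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / (sxx * syy) ** 0.5


def fitted_slope(xs, ys):
    """Least-squares slope of the line for y fitted on x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    return sxy / sxx


rng = random.Random(1)  # fixed seed so the sketch is repeatable
xs = [rng.uniform(0, 10) for _ in range(200)]

# Made-up data: a shallow line with little scatter around it...
shallow_tight = [0.5 * x + rng.gauss(0, 0.2) for x in xs]
# ...and a steep line with a lot of scatter around it.
steep_loose = [5.0 * x + rng.gauss(0, 20.0) for x in xs]

for label, ys in [("shallow, tight", shallow_tight), ("steep, loose", steep_loose)]:
    print(f"{label}: slope = {fitted_slope(xs, ys):.2f}, r = {pearson_r(xs, ys):.2f}")
```

The steeper line has the larger slope but the weaker correlation, which is the confusion students tend to fall into: a steep graph is not necessarily a strong one.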

On-line practice works

An effective way to give students practice, with timely feedback, is through on-line materials. Graphs take up a lot of room on paper, so textbooks cannot easily provide the number of examples that are needed to develop fluency. With our on-line materials we provide many examples of graphs, both standard, and not so well-behaved. Students choose from statements about the graphs. Most of the questions provide two graphs, as pattern recognition is easier to develop when looking at comparisons. For example if you give one graph and say “How strong is this relationship?”, it can be difficult to quantify. This is made easier when you ask which of two graphs has a stronger relationship. Students get immediate feedback in a “low-jeopardy” situation. When a tutor is working one-on-one with a student, it can be worrying to the student if they get wrong answers. The computer is infinitely patient and the student can keep trying over and over until they get their answers correct, thus reinforcing correct understanding. This system and set of questions is part of our on-line resources for teaching Bivariate investigations, which occurs within the NZ Stats 3 course. You can find out more about our resources at www.statslc.com, and any teachers who wish to explore the materials for free should email me at n.petty(at)statslc.com.

6 Comments

In general I like the framework you summarize here. However, I think your description of “Trend” and “Association” is confusing, maybe misleading, and does not necessarily match the intent of K and P, as far as I can tell from their slides. Or maybe I just don’t understand the distinction that K and P are trying to draw here between “Trend” and “Association”. To me, with 2 continuous-valued variables, “Trend” and “Association” are pretty much synonymous in this context. You can have a positive linear association, a positive non-linear association, no association, a negative linear association, or a negative non-linear association. And in some cases, such as a “U shaped” association, it is difficult to know whether to call it positive or negative; it might be positive over part of the domain and negative (or indeed flat) over a different part.

To say that “association” is about “direction” just doesn’t make sense to me. A direction word, such as positive or negative, can modify “association”, but that doesn’t make “association” and “direction” synonyms. You can also have “no association”. If you want a word for direction, why not use “direction”? Likewise (as you suggest), if you want to emphasize the distinction between linear and non-linear associations, “shape” makes a lot more sense than “trend”. Yes, you can have a “linear trend” or a “non-linear trend”. But that doesn’t mean you can define “trend” to mean the linearity or non-linearity of something. This is simply not logical.

In my experience, “trend” and “association” part company – that is, cease being interchangeable – when 1 or both of your variables is categorical. In this case, we still can talk about associations (though “direction” of association may or may not be a meaningful concept), but we never use the word “trend” in such contexts.

Don’t get me wrong. I agree completely with the importance of teaching people how to interpret scatterplots, and appreciate your blog entry. And for that matter, it’s also important to teach people when to use scatterplots in the first place. I work with many professional scientists who muddle through their data looking at bar plots of one variable at a time, never even thinking to create some scatterplots. And the framework is great. But using “association” and “trend” in such confusing ways is not going to be helpful. My modest proposal for a framework: association, shape, direction, strength, groups, unusual observations.

Thank you for that, Scott. I had found the trend/association distinction difficult to get my head around too, and you have put your finger on the problem. So do we need ‘association’ at all, or would a better framework be shape, direction, strength, groups, unusual observations? What is association talking about that isn’t covered in shape and direction?

I do think that “association” is a key concept worthy of an identifier. It gets at the general question of whether or not 2 variables are related – that is, whether or not the value of x tells you anything about the distribution of y (or vice versa) – regardless of the particular form of the potential relationship. It involves, of course, more precise ideas such as statistical independence, conditional distributions, correlation, maybe even causation. But more simply put, you can indeed think about whether or not one variable “tends to” do anything at all as the other one changes. Only once you have an (at least tentative) answer to that question does it make sense to start thinking about shape, direction, and strength of the association.
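A small numerical sketch may help show why association is a separate question from direction or linear strength. The data below are invented (a U-shaped relationship with an assumed amount of noise), and pearson_r and band_mean are helper functions written for this illustration only.

```python
import random


def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / (sxx * syy) ** 0.5


rng = random.Random(3)  # fixed seed so the sketch is repeatable
xs = [rng.uniform(-3, 3) for _ in range(500)]

# Made-up U-shaped relationship: y depends on x, but not in one direction.
ys = [x ** 2 + rng.gauss(0, 0.5) for x in xs]


def band_mean(low, high):
    """Mean of y for the observations whose x lies in [low, high)."""
    values = [y for x, y in zip(xs, ys) if low <= x < high]
    return sum(values) / len(values)


r = pearson_r(xs, ys)
left, middle, right = band_mean(-3, -1), band_mean(-1, 1), band_mean(1, 3)

print(f"correlation r = {r:.2f}")
print(f"mean of y for low x:    {left:.2f}")
print(f"mean of y for middle x: {middle:.2f}")
print(f"mean of y for high x:   {right:.2f}")
```

The correlation is close to zero, yet the typical value of y clearly changes with x: knowing x tells you something about the distribution of y. In this sense there is an association even though no single direction word fits it.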
