Why We Care about the Shape of the Data

For the last post in this three-part series, we are going to revisit the standard from the first post:

6.SP.A.1: Develop understanding of statistical variability: Recognize a statistical question as one that anticipates variability in the data related to the question and accounts for it in the answers.

In the first post, I talked about how to support kids in thinking about and describing variability, and also about the ways kids commonly represent these ideas. In the second post, I argued that kids can engage with inventing a data display that would help their readers better understand the variability in the data, and that teachers can use the sensible choices students make while inventing to help them understand the rationale behind conventional data displays, like the histogram.

This last post is focused on the crux of these early activities. It is a theme that runs throughout the discussions on variability and the data display activities. It is the answer to the question I posed in the first post: How can we say anything about the world when everything around us is always changing?

These are just guesses

Remember that the students collected their data with a seemingly simple question in mind: what is the distance around the fountain? However, when they see the large amount of variability in their individual measurements, many students determine that they don’t really know anything about the true distance around the fountain. Since there is so much uncertainty, they are unwilling to make a claim about the “true” distance.

In fact, some students abandon the concept of data as measurements altogether. Students sometimes begin to call the measurements “opinions” or “guesses” and claim that we can’t make any claims based on the data, but instead need to “find a better measuring tool and go measure the fountain again.” To these students, the uncertainty caused by the variability in “their” data/measurements undermines any claim about the fountain, because you can never be certain.

It is important for teachers to let students wrestle with this idea for an extended amount of time. We are too often tempted to “correct” an idea that is unconventional. However, if students are to really understand the game of making knowledge claims about variable data, they have to understand that claims made from data will always have a degree of uncertainty. They have to embrace the uncertainty for themselves. They have to see the epistemic nature of tools like data displays (see the second post in this series for more on the use of the word “epistemic”). And it might take them more than a few minutes, hours, or even days to see that the “we don’t know anything” position is not particularly helpful in a world of ubiquitous variability. Let them wrestle with it so they can see.

The data set has shape

As students share their invented displays, the teacher should help them to see that their display choices create a “shape” for the data. In the case of measurement data, students often describe the shape using words like “hill” or “mountain” or “skyscraper” or “tower.” They talk about the data points going “up, up, up” and then “down, down, down.” They talk about some values being “clumped” together while also describing the extreme values using words like “outlier” or “bad mistakes.” Developing students’ thoughts about data shape is crucial to developing their thoughts about the claims we can make.

The shape of the data subtly suggests that there is an underlying probability structure to the measurement process. Yep, probability. If you are thinking, “wait, probability is supposed to be all about flipping coins and spinning spinners and not measurements and data displays,” then think for a moment about what you see in the data display above. Most will see this and determine that the measurements in the middle “hill” are more likely close to the true circumference of the fountain than the extreme measurements far away from the center. If you imagine that we send another group of 6th graders to measure the same fountain, you’ll almost certainly expect to see more measurements between 1000 cm and 1400 cm, with very few measurements that are 500 cm or below. In fact, it seems possible that the new group of students would not produce any measurements that are less than 500 cm, but it seems very unlikely that the group wouldn’t produce a measurement between 1000 cm and 1400 cm.

What leads us to make these claims? The shape of the data and the probability structure it suggests. It suggests that the values in the “hill” are more likely than values far from the center to be observed over and over again if many groups of students measured the same fountain. Although students don’t calculate probability at this point, they do make informal claims about it. Their claims are very similar to the comments that I made in the previous paragraph about things being more or less “likely” based on the data. Have you ever wondered why we value the center of this shape so much? It is because we expect to observe these values more often if the process is repeated, and this higher probability of observing particular values is what motivates our claims about the distance around the fountain.
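For teachers who want to try the “send another group to measure” thought experiment for themselves, here is a minimal Python sketch. It is not something the students do, and it rests on loud assumptions: the true distance is taken to be 1200 cm, each measurement is modeled as that value plus bell-shaped (normal) error with a made-up spread of 150 cm, and every name in it (simulate_class, NOISE_SD_CM, and so on) is hypothetical. Real measurement error need not look like this.

```python
import random

# All numbers below are illustrative assumptions, not the students' data.
TRUE_CM = 1200        # assumed "true" distance around the fountain
NOISE_SD_CM = 150     # assumed spread of measurement error (made up)
CLASS_SIZE = 30       # assumed number of measurers per group
NUM_CLASSES = 200     # how many imaginary groups we send to the fountain


def simulate_class(n_students, true_cm, noise_sd_cm, rng):
    """Return one group's simulated measurements, in cm."""
    return [rng.gauss(true_cm, noise_sd_cm) for _ in range(n_students)]


rng = random.Random(0)  # fixed seed so the sketch is repeatable
classes = [
    simulate_class(CLASS_SIZE, TRUE_CM, NOISE_SD_CM, rng)
    for _ in range(NUM_CLASSES)
]

# How many groups produced at least one measurement in the "hill"?
classes_with_hill_value = sum(
    any(1000 <= v <= 1400 for v in group) for group in classes
)

# How many groups produced at least one measurement of 500 cm or below?
classes_with_low_value = sum(
    any(v <= 500 for v in group) for group in classes
)

print("groups with a measurement between 1000 and 1400 cm:",
      classes_with_hill_value, "of", NUM_CLASSES)
print("groups with a measurement of 500 cm or below:",
      classes_with_low_value, "of", NUM_CLASSES)
```

Under these assumptions, nearly every imaginary group lands measurements in the hill, while measurements of 500 cm or below are rare. Changing NOISE_SD_CM shows how much that conclusion depends on the assumed spread.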

Shape is important. Seeing shape is critical for students to begin to look at the data set as a collection, rather than a set of individual values, or “guesses.” This road from “we don’t know anything” to “maybe we know something” must go through shape.

Almost a certain fact

After students developed competencies in displaying data, we asked them to re-measure the fountain using better tools and being as careful as possible. The top data set above is a display of the first measurements, and the bottom one is a display of the second. Many of the students will be SHOCKED that they still have variability in their measurements! They were so careful! However, they also use their newly developed eye for looking at the shape of a set of data to make some very important observations.

First, the shape of the data is much less spread out, indicating that they were much more precise as a group the second time they measured. Second, even though they still had variability in their measurements, both distributions are centered around 1200 cm. The accumulating experience with variability, data shape, and implicit probability structures motivates many students at this point to make very strong claims about the “real” circumference of the fountain. Some will say that it is “almost a certain fact” that the fountain has a circumference very close to 1200 cm. Rarely will students still claim that we don’t know anything. However, there still might be significant disagreement among students about the degree of certainty associated with the claims. You might have quite an argument on your hands about the use of the word “fact”!

At this point, students will have developed a very strong foundation of thinking about statistical variability and looking for shape in data to make claims about the question that motivated the data. They will be poised to learn about statistics as epistemic tools. For example, you can easily nudge them toward measures of center by asking students to provide a method for calculating their “very best guess” of the true distance around the fountain. Just as important is the easy nudge toward measures of variability by asking them to calculate a number that would describe “how precise” the group of measurers was.
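As one possible picture of where those two nudges can lead, here is a short Python sketch that computes a “very best guess” (the median and the mean) and a “how precise” number (the mean absolute deviation) for two sets of measurements. The measurement lists are invented for illustration and are not the students’ data, and the function name mean_absolute_deviation is my own. Students may well invent different methods, and the mean absolute deviation is only one reasonable choice for describing spread.

```python
from statistics import mean, median


def mean_absolute_deviation(values):
    """Average distance of each measurement from the mean, in the same units."""
    center = mean(values)
    return mean([abs(v - center) for v in values])


# Hypothetical measurements in cm, made up for this sketch.
first_try = [520, 980, 1050, 1120, 1180, 1200, 1230, 1290, 1350, 1400, 1900]
second_try = [1150, 1170, 1185, 1195, 1200, 1200, 1205, 1215, 1230, 1250]

for label, data in [("first try", first_try), ("second try", second_try)]:
    print(label)
    print("  best guess (median):", median(data))
    print("  best guess (mean):", round(mean(data), 1))
    print("  how precise (mean absolute deviation):",
          round(mean_absolute_deviation(data), 1))
```

In this made-up example both sets share the same median, but the second set has a much smaller mean absolute deviation, which matches the students’ observation that the careful re-measurement was more precise while staying centered in the same place.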

The point behind all of this is that instruction in modeling data with mathematical tools should be rooted in giving kids a chance to taste, describe, and represent variability so they can see statistics and probability as tools to make claims about questions in the midst of uncertainty.