Wednesday, January 23, 2013

In 2010, I wrote a short blog item about Florence Nightingale the statistician, solely because of its novelty value. I didn't even bother to look closely at the associated graphic she designed, but that's what I intend to do here. In this first installment, I reflect on her famous data visualization by reconstructing it with the modern tools available in R. In part two, I will use the insight gained from that exercise to go beyond data presentation to potentially more revealing data modeling. Interestingly, I suspect that much of what I will present could also have been accomplished in Florence Nightingale's day, more than 150 years ago, albeit not as easily and not by her alone.

Figure 1. Nightingale and her data visualization

Although Florence Nightingale was not formally trained as a statistician, she apparently had a natural aptitude for mathematical concepts and evidently put a lot of thought into presenting the import of her medical findings in a visual way. As a consequence, she was elected the first female member of the Royal Statistical Society in 1859 and later became an honorary member of the American Statistical Association.

Why Wedges?

Why did FN bother to construct the data visualization in Figure 1? If you read her accompanying text, you see that she refers to the sectors as wedges. In a nutshell, her point in devising Figure 1 was to try to convince a male-dominated British bureaucracy that better sanitary methods could seriously diminish the adverse impact of preventable disease amongst military troops on the battlefield. The relative size of the wedges is intended to convey that effect. Later on, she promoted the application of the same sanitation methodologies to public hospitals. In her text, she used the established term of the day, zymotic disease, to refer to epidemic, endemic, and contagious diseases.

Today, it is hard for us to fully appreciate how innovative her ideas were at that time, and the resistance with which they were met. Her methods for disease prevention were highly contentious in the mid-nineteenth century (think: germ theory). The social historian Hugh Small tells us in "Did Nightingale’s ‘Rose Diagram’ or ‘Coxcomb’ save millions of lives?" that even with her famous wedge visualization, conveying the message of sanitary reform was not smooth sailing. In fact, it seems that something akin to modern blog wars broke out, with many pro and con pamphlets being published about FN's methods. She created Fig. 1 in that milieu.

The visual message of FN's diagram was essentially this. Based on data collected during the Crimean War, the large, outer, gray wedges on the right circular diagram in Fig. 1 represent the increasing level of disease before the introduction of her sanitary methods. The circular diagram on the left side represents the diminishing level of disease in the following year, after the introduction of FN's methods. In her visualization, bigger is badder (i.e., smaller is better).

Of course, you actually have to read the annotations on her diagrams to understand that she has positioned the data as though they were on a clock face, that her wedges should therefore be read clockwise, that she has related the two diagrams in a certain way, and so forth. It's all very compact and has to be read carefully. But, as the FN story demonstrates, even if you have created what you think is a good visualization (usually PerfViz, in our case), it may not be received in the way you were expecting. If that happens, you will need to keep rethinking your visual paradigm and possibly go even further. How to go further is what I shall consider in this and the subsequent blog post.

Hereafter, I'm going to take the liberty of calling Fig. 1 a Cam Diagram because her term "wedge" is now applied to pie charts, which FN's diagrams most definitely are not. A pie chart is a circular diagram with fixed radius and wedges with different angles representing the relative magnitudes of the data. Instead, FN's diagrams are made up of equiangular sectors with variable radii representing the magnitudes. They remind me of a cam: a kind of irregular gear wheel with teeth of varying radii that move an associated lever differently as it rotates. See Fig. 2.
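To make the geometric distinction concrete, here is a minimal sketch in plain Python (standard library only, independent of Plotrix). The data values and the function names `pie_angles` and `cam_radii` are hypothetical, chosen only for illustration. The cam radii are computed on the assumption, discussed later in this post, that the area of each equal-angle sector carries the datum.

```python
import math

def pie_angles(values):
    """Pie chart: fixed radius, wedge angle (degrees) proportional to value."""
    total = sum(values)
    return [360.0 * v / total for v in values]

def cam_radii(values):
    """Cam diagram: fixed angle per sector, sector AREA equal to the value.

    From A = (1/2) * r^2 * theta, the radius is r = sqrt(2 * A / theta).
    """
    theta = 2.0 * math.pi / len(values)  # equal angle for every sector
    return [math.sqrt(2.0 * v / theta) for v in values]

# Hypothetical data, not FN's figures.
values = [1.0, 4.0, 9.0, 16.0]

angles = pie_angles(values)  # angles differ, radius would be constant
radii = cam_radii(values)    # radii differ, angle is constant (90 degrees here)

for v, a, r in zip(values, angles, radii):
    print(f"value={v:5.1f}  pie angle={a:6.1f} deg  cam radius={r:5.2f}")
```

In the pie chart the largest value takes up 16/30 of the circle, whereas in the cam layout it merely reaches four times as far from the center as the smallest value.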

Figure 2. Cams

Calling it a Cam diagram seems to me to be no worse than the now ambiguous Wedge diagram, the even more implausible Rose diagram (with non-overlapping petals?) or the ghastly Coxcomb diagram (which isn't even round on a bird's head). Moreover, as far as I can determine, no one has previously used the term "cam" for any similar diagrams. But I won't be putting any money on my term sticking.

I ran into a number of issues with Plotrix while putting the above R code together:

The main argument does not work. It never appears in the R-function code body.

Rotation of the zymotic data to visually match the original FN diagram orientation can be accomplished with the start argument. It certainly rotates the data but not the radial axis labels (i.e. the dates). So, the data can end up out of alignment with the axes. This lack of coordination could easily escape your attention.

There is a strange 'O' character that appears on the left-hand side of individual plots. It may be an attempt to indicate the origin axis.

Using mfrow=c(1,2) to get the cam plots side-by-side like Fig. 1, there seem to be two labels denoted by 'index' in a second undefined row. This may be related to the 'O' strangeness.

If you use par(cex.lab=0.5) for the axis labels, you are advised to call dev.new() to reset the viewport correctly when outputting successive plots. Otherwise, you may find the labels messed up on successive plots.

I managed to hack around these problems to produce the result in Fig. 3, which is sufficient for what I want to do.

Figure 3. Cam plots corresponding to Figure 1

Several important differences emerge as a consequence of applying modern data visualization tools, like Plotrix, to FN's data:

The BEFORE data are on the left, the AFTER data are on the right of Fig. 3. This is consistent with the universal convention of time flowing from left to right, but opposite from the order used by FN in Fig. 1. FN's choice is no longer an option today. Your audience will very likely be thrown off your point if you violate that visual expectation.

The circular paradigm is that of a 12-hour clock. However, instead of starting in the 12 o'clock high position, for example, FN starts at the 9 o'clock position (denoted by an X in Fig. 3). The data is then read clockwise in both Figs. 1 and 3.

Why did FN start in the 9 o'clock position? In fact, there are two 9 o'clock positions, one on each cam plot, corresponding to the X marks in Fig. 3. However, these are really the same point. But since FN chose to employ two 12-month clocks, as it were, she is left with the problem of tying them together. Having the wedges overlap on a single clock would be too visually disruptive, so she introduces a horizontal elbow-line (not shown in Fig. 3). My guess is that her choice had to do with using landscape rather than portrait layout. In other words, tying the two clocks together in a horizontal sequence. This choice might have been made as a convenience for printing or simple readability: we read horizontal rows of characters in English.

FN probably positioned her BEFORE cam diagram on the right-hand side because when you've gone around once, you are back in the 9 o'clock position. If she had used the relative positioning in Fig. 3, that would put you to the far west of the 9 o'clock position of the AFTER data. By reversing the BEFORE and AFTER positions of the two "clocks" in Fig. 1, FN only needed to add a thin elbow-line to join the two 9 o'clock positions without crossing over any wedges. It's the least visually disruptive solution for two 12-hour cam plots.

The AFTER cam plot in Fig. 3 shows the data sectors at the same visual scale as the BEFORE data. The "smaller is better" cue of Fig. 1 is lost. It's a consequence of sometimes having less convenient control with modern tools. The fact that the magnitudes are actually smaller is indicated by the numerical radius scale, which has to be consciously read, just as one has to read the dates in Fig. 1.

The radial magnitude is in proportion to the square root of the sector area. This is not apparent in FN's cam plots and there is no way to tell without numerical indicators. From a purely visual standpoint, you don't need to know. But more on that below.

In principle, it would be possible to rescale the AFTER sectors in Fig. 3 to match the AFTER sectors in Fig. 1, but it's non-trivial to do using radial.pie() because of the different operating assumptions in modern tools.
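The scaling issue can be illustrated without any plotting library. The following sketch in plain Python uses hypothetical monthly values (not FN's data) and hypothetical helper names. It assumes that radii are proportional to the square root of the data, and it compares scaling each panel to its own maximum against scaling both panels to one shared radial limit.

```python
import math

def radii_from_values(values):
    """For a fixed sector angle, radius is proportional to sqrt(area)."""
    return [math.sqrt(v) for v in values]

def scale_to_limit(radii, limit):
    """Express radii as a fraction of a chosen radial limit."""
    return [r / limit for r in radii]

# Hypothetical 12-month series, for illustration only.
before = [1, 2, 4, 12, 36, 64, 100, 81, 49, 36, 25, 16]
after = [16, 9, 9, 4, 4, 4, 1, 1, 1, 1, 1, 1]

r_before = radii_from_values(before)
r_after = radii_from_values(after)

# Each panel scaled to its own maximum: AFTER fills its panel,
# so the "smaller is better" cue disappears.
independent_after = scale_to_limit(r_after, max(r_after))

# Both panels scaled to one shared limit: AFTER is visibly smaller.
shared_limit = max(max(r_before), max(r_after))
shared_before = scale_to_limit(r_before, shared_limit)
shared_after = scale_to_limit(r_after, shared_limit)

print(max(independent_after))  # largest AFTER radius, own scale
print(max(shared_after))       # largest AFTER radius, shared scale
```

With these made-up numbers, the largest AFTER sector spans the full panel radius under independent scaling, but only 40% of it under the shared limit.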

Having noted all those differences, a more important question would seem to be: why do we need two plots? A better solution for comparing the relative magnitudes would seem to be to plot both data sets (over 24 months) on a single cam diagram like Figure 4. This alternative is easily achieved with radial.pie().

Figure 4. Combined 24 month cam diagram

No doubt FN had her reasons for presenting her data the way she did and she certainly deserves all the credit she gets for doing that a century and a half ago. But it's clear from Fig. 4 that both 12 month periods can be plotted on a single 24-month cam diagram with the same or better visual effect, especially when accented with colors (in a manner only slightly different from Fig. 1). And since it then resembles a 24-hour clock (military time), we can start at the top in Fig. 4 (denoted by an X), rather than the more arbitrary 9 o'clock position in Figs. 1 and 3. I also tried using clock24.plot() but I found radial.pie() had more flexibility.
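The two clock layouts differ only in the number of sectors and the starting position. As a rough sketch in plain Python, the hypothetical helpers below compute where each sector begins, measured in degrees clockwise from the 12 o'clock position, and convert that to the counterclockwise-from-3-o'clock angle that polar plotting routines commonly expect. The clockwise reading order for the combined diagram is an assumption carried over from Figs. 1 and 3.

```python
def sector_starts(n_sectors, first_clock_deg):
    """Start angle of each sector, in degrees clockwise from 12 o'clock."""
    width = 360.0 / n_sectors
    return [(first_clock_deg + i * width) % 360.0 for i in range(n_sectors)]

def clock_to_math_deg(clock_deg):
    """Convert clockwise-from-12 degrees to counterclockwise-from-3 degrees."""
    return (90.0 - clock_deg) % 360.0

# Two 12-month panels, each starting at 9 o'clock (270 degrees clockwise).
two_panel = sector_starts(12, 270.0)

# One combined 24-month diagram starting at the top (0 degrees).
combined = sector_starts(24, 0.0)

print(two_panel[:4])             # first four months of a 12-sector panel
print(combined[12])              # first month of the second year
print(clock_to_math_deg(270.0))  # 9 o'clock in the mathematical convention
```

In the combined layout, each sector is 15 degrees wide and the second year begins at the 6 o'clock position, directly opposite the start.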

Personally, I now have to wonder if the single 24-hour cam diagram in Figure 4 might not have helped FN make her point even more simply and directly. But, I'll leave that for the historians to argue over.

From the discussion so far, we now see that there are several choices available for clock-style layouts, of which FN chose one. As mentioned earlier, without a numerical scale in Fig. 1, it is not obvious that the radial lengths of FN's wedges are not equal to the disease intensity. Understanding this point is important for appreciating the visual message FN wanted to convey, and it will also provide us with a convenient segue for going beyond FN's choice of visualization.

Square Root of a Cam

The radial lengths in Fig. 1 (and Fig. 3) are not proportional to the data magnitudes, as they would be in a modern polar plot, for example. Instead, the data magnitudes determine the area of the fixed-angle wedges. In this section, I will explain why FN did that.

In Fig. 5, all the shapes have the same area, viz., $A =16$ square units. The red column has a width $w_{col} = 1$ unit and its height is $h_{col} = 16$ units. The corresponding area is therefore $A_{col} = w_{col} \times h_{col}$ or 16 square units. This is the situation with all bar charts and histograms where the columns each have the same default width. The height of a column is proportional to its area: bigger datum, taller column. The problem with columns is that data with large variations in height tend to swamp those columns belonging to more moderately varying data.

One way to combat that bias is to represent the areas by squares. In Fig. 5, the green square has exactly the same area as the red column. However, since the square is broader than the column, $w_{sq} = 4$ units, its height is only $h_{sq} = 4$ units. In other words, the height of the green square is proportional to the square root of its area (16 square units). Being more squat, this solves the problem of excessively high columns. On the other hand, it introduces the new problem of displaying columns with very different widths. Visual comparison along the x-axis now becomes difficult.

A compromise between these two cases, and the one that FN chose, is to use sectoral areas. In Fig. 5, the blue sector has the same area as both the red column and the green square. Clearly, the sector is taller than the square, but it is much shorter than the column. In Fig. 1, each of FN's wedges represents one twelfth of the circle (like one hour on a clock face) in order to accommodate 12 months of data. Each sector has the same fixed angle: $\theta = 360/12$ or 30 degrees. For a given area (datum magnitude) and a fixed angle of 30 degrees, all that remains to be determined is, how high should the sector be?

You can see from Fig. 5 that the arc width at the top of the blue sector is about the same as the width of the green square, viz., 4 units. From there, the sector has to taper down to zero width at the x-axis. The arc width is $r \theta$ for a radius $r$ and angle $\theta$ measured in radians (30 degrees is $\pi/6 \approx 0.5236$ radians). The area of the sector is given by the radial distance from the x-axis to the arc, i.e., the height, multiplied by the width of the arc:
\begin{equation}
A_{sec} = \dfrac{1}{2} \, r \times r \theta \label{eqn:asec}
\end{equation}
Wait! Where did that factor of a half come from?

Assuming an arc width of 4 units and positioning it at a height of 8 units above the x-axis would produce a rectangular area of $8 \times 4 = 32$ square units. But, because the sector tapers at the bottom, the blue area is actually closer to a triangle with half the area of the assumed rectangle. Based on this approximation, the area is:
\begin{equation}
A_{sec} \simeq \dfrac{1}{2} \times 8 \times 4 = 16 ~\text{sq. units}
\end{equation}
Using eqn. \eqref{eqn:asec}, the exact area is calculated as:
\begin{equation}
A_{sec} = \dfrac{1}{2} \times 7.81764 \times 4.09331 = 16.0 ~\text{sq. units}
\end{equation}
The actual value of the radius, $r = 7.82$, is less than 8 units because the arc width, $r\theta = 4.09$, is slightly longer than the width of the green square—it's curved.
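The numbers quoted for the three shapes in Fig. 5 can be checked with a few lines of plain Python. This is only a sketch of the arithmetic, using eqn. \eqref{eqn:asec} solved for $r$. The variable names are mine.

```python
import math

AREA = 16.0  # common area of all three shapes, in square units

# Red column: fixed width of 1 unit, so height equals the area.
col_width = 1.0
col_height = AREA / col_width

# Green square: height equals the square root of the area.
square_side = math.sqrt(AREA)

# Blue sector: fixed angle of 30 degrees, converted to radians.
theta = math.radians(30.0)
radius = math.sqrt(2.0 * AREA / theta)  # from A = (1/2) * r * (r * theta)
arc_width = radius * theta
area_check = 0.5 * radius * arc_width

print(f"column height = {col_height:.2f}")
print(f"square height = {square_side:.2f}")
print(f"sector radius = {radius:.5f}, arc width = {arc_width:.5f}")
print(f"sector area   = {area_check:.1f}")
```

The sector radius falls between the square's height of 4 units and the column's height of 16 units, which is the compromise described above.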

Whereas the height of a square is given by the square root of its area, eqn. \eqref{eqn:asec} tells us that, for a fixed angle $\theta$, the radial height of a sector is proportional to the square root of its area.
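To spell out that step, eqn. \eqref{eqn:asec} can be solved for the radius, with $\theta$ in radians:
\begin{equation}
r = \sqrt{\dfrac{2 \, A_{sec}}{\theta}}
\end{equation}
For the 30 degree sectors used here, $\theta = \pi/6$, so that
\begin{equation}
r = \sqrt{\dfrac{12 \, A_{sec}}{\pi}} \simeq 1.95 \, \sqrt{A_{sec}}
\end{equation}
which gives $r \simeq 1.95 \times 4 = 7.82$ units for $A_{sec} = 16$ square units, in agreement with the value above.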

A sector (an FN wedge) provides a nice compromise between variable-height columns and variable-width squares. However, it's also clear from Fig. 5 that a linear array of sectors would present problems when it comes to registering them numerically with the x-axis. This problem is easily overcome by simply abutting all the sectors into a circle. Hence, the circular distribution of wedges in Figs. 1, 3 and 4.

Summary

The point of all this is to convince ourselves that FN applied sectoral areas, instead of classic pie chart wedges (pie charts indeed did exist in her day) or histogram columns, so as to reduce the visual impact of high variance in her data. In particular, she wanted to counter any criticism that diminishing zymotic disease was due to other things, like seasonal effects (e.g., the onset of spring weather) and not her sanitation methodologies. The square-root attenuation derived from employing sectoral areas in Fig. 1 accomplished that. This could also have been accomplished with a single cam diagram like Fig. 4.

However, as I plan to show in the next installment, there are other ways to reduce the visual impact of large data-variance that also lead to a deeper insight into the underlying dynamics of FN's data.