In my last post, I described a bar-density chart to show paired data of proportions with an 80/20-type rule. The following example illustrates that a small proportion of Youtubers generate a large proportion of views.

In all these examples, the message of the data is the importance of a small number of people (top earners, superstars, bandwidth hogs). A good visual should call out this message.

The bar-density plot consists of two components:

The bar chart, which shows the distribution of the data (views, wealth, income, bandwidth) among segments of people;

The embedded Voronoi diagram within each bar, which encodes the relative importance of each people segment, as measured by the (inverse) density of the population among these segments. A people segment is more important if each individual accounts for more of the data or, in other words, if the density of people within the group is lower.

The bar chart can adopt a more conventional horizontal layout.

Voronoi tessellation

To understand the Voronoi diagram, think of a fixed number (say, 100) of randomly placed points inside a bar. Every location inside the bar area has a nearest neighbor among those 100 fixed points. Assign every location to its nearest fixed point. From this, one can draw a boundary around each of the 100 points that encloses all the locations assigned to it. The resulting tessellation is the Voronoi diagram. (The following illustration comes from this AMS column.)
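To make the nearest-neighbor idea concrete, here is a minimal base-R sketch (not the code used for the charts). It scatters 100 seed points in a bar and labels each location on a grid with the index of its nearest seed. The bar dimensions, grid resolution, and variable names are all assumptions made for illustration.

```r
set.seed(1)                      # arbitrary seed so the sketch is repeatable
bar_w <- 4; bar_h <- 1           # hypothetical bar dimensions
n_seeds <- 100
seeds <- cbind(x = runif(n_seeds, 0, bar_w),
               y = runif(n_seeds, 0, bar_h))

# a regular grid of locations covering the bar
grid <- expand.grid(x = seq(0, bar_w, length.out = 200),
                    y = seq(0, bar_h, length.out = 50))

# squared distance from every grid location to every seed,
# then keep the index of the closest seed
d2 <- outer(grid$x, seeds[, "x"], "-")^2 +
      outer(grid$y, seeds[, "y"], "-")^2
grid$cell <- apply(d2, 1, which.min)

# grid locations that share a label approximate one Voronoi cell;
# uncomment to view them:
# plot(grid$x, grid$y, col = grid$cell, pch = 15, cex = 0.4, asp = 1)
```

A proper tessellation routine computes the exact polygon boundaries instead of labeling a grid, but the assignment rule is the same.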

The density of points in the respective bars encodes the relative proportions of people within those groups. For my example, I placed 6 points in the red bar, 66 points in the yellow bar, and ~2000 points in the gray bar, which represent the relative proportions of creators in the three segments.

Density is represented statistically

Notice that the density is represented statistically, not empirically. According to the annotation on the original chart, the red bar represents 14,000 super-creators. Correspondingly, there are 4.5 million creators in the gray bar. Any attempt to plot those as individual pieces will result in a much less impactful graphic. If the representation is interpreted statistically, as relative densities within each people segment, the message of relative importance of the units within each group is appropriately conveyed.

A more sophisticated way of deciding how many points to place in the red bar is to be developed. Here, I just used the convenient number of 6.
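The arithmetic behind the point counts is just a rescaling of the creator proportions. A small sketch, using the per-thousand figures and the scaling factor of 2 that appear in the code later in this post; the variable names are mine:

```r
# creators per 1,000 in each segment (red, yellow, gray)
creators_per_1000 <- c(red = 3, yellow = 33, gray = 965)

# any convenient multiplier works, as long as it is the same for all bars
scale_factor <- 2
n_points <- creators_per_1000 * scale_factor
n_points                        # 6, 66, 1930

# number of points in each bar relative to the red bar
n_points / n_points["red"]
```

Only the ratios between bars matter for reading the densities, which is why a convenient multiplier is good enough for a prototype.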

The color shades are randomly applied to the tessellation pieces, and used to facilitate reading of densities.

***

In this section, I provide R code for those who want to explore this some more. This is code used for prototyping, and you're welcome to improve it. The general strategy is as follows:

Set the rectangular area (bar) in which the Voronoi diagram is to be embedded. The length of the bar is set to the proportion of views, appropriately scaled. The code uses the dirichlet function in the spatstat package to compute the tessellation around the fixed points; this requires setting up an owin object to represent the rectangle.

Set the number of points (n) to be embedded in the bar, determined by the relative proportion of creators, appropriately scaled. Generate a data frame containing the x-y coordinates of n randomly placed points, within the rectangle defined above.

Use the ppp function to turn the points into a point pattern, from which the Voronoi data are generated

Set up a colormap for plotting the Voronoi diagram

Plot the Voronoi diagram; assign shades at random to the pieces (in production code, these random numbers should be set as marks in the ppp, but it's easier to play around with the shades if they are placed here)

The code generates separate charts for each bar segment. A post-processing step is currently required to align the bars to attain equal height. I haven't figured out whether the multiplot option helps here.

library(spatstat)

# enter the scaled proportions of creators and views
# the Youtube example has three creator segments

# number of randomly generated points should be proportional to proportion of creators. Multiply nc by a scaling factor if desired

nc = c(3, 33, 965)*2

# bar widths should be proportional to proportion of views
# total width should be set based on the width of your page

# because of random points, the tessellation looks different each time
# post-processing: make each bar the same height when aligned side by side
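The listing above is only a fragment. As a hedged sketch of how the five steps could fit together for a single bar, the following uses spatstat's owin, ppp, dirichlet and plot functions. The bar dimensions, the seed, the palette, and all variable names are my assumptions, not values from the original prototype.

```r
library(spatstat)

set.seed(2019)                       # arbitrary; fixes the random layout

# ASSUMPTION: bar width and height are made-up values for illustration
bar_width  <- 3.8
bar_height <- 1
n <- 66                              # number of points for this bar

# 1. rectangle in which the diagram is embedded
win <- owin(xrange = c(0, bar_width), yrange = c(0, bar_height))

# 2. n randomly placed points inside the rectangle
pts <- data.frame(x = runif(n, 0, bar_width),
                  y = runif(n, 0, bar_height))

# 3. point pattern, then its Dirichlet (Voronoi) tessellation
pp   <- ppp(pts$x, pts$y, window = win)
tess <- dirichlet(pp)

# 4. colormap: a handful of shades of one hue (hypothetical palette)
shades <- colorRampPalette(c("gold", "goldenrod4"))(5)

# 5. plot, with a random number per piece deciding its shade
plot(tess, do.col = TRUE, values = runif(n), col = shades, main = "")
```

Repeating this for each segment, with the same bar_height and a width proportional to that segment's share of views, gives one chart per bar. Keeping the height identical across calls is one way to reduce the alignment work in post-processing.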

***

A cousin of the bar-density plot is the pie-density plot. Since I'm using only three creator segments, which each account for about 30-40% of the total views, it is natural to use a pie chart. In this case, we embed the Voronoi diagrams into the pie sectors.

If the distribution were more even, that is to say, if the creators were more or less equally important, the pie-density plot would look like this:

***

Something that is more like 80/20

The original chart shows the top 0.3 percent generating almost 40 percent of the views. A more typical insight is that the top X percent generates 80 percent of the data. For the YouTube data, X is 11 percent. What does the pie-density chart look like if top 11 percent <-> 80 percent, middle 33 percent <-> 11 percent, bottom 56 percent <-> 8 percent?

Roughly speaking, the second segment includes 3 times as many people as the top segment, and the third has 5 times as many.
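These ratios, and the per-person importance they imply, can be checked with a few lines of R. This is a sketch using the rounded percentages quoted above (which is why the views add up to 99 rather than 100); the variable names are mine.

```r
people <- c(top = 11, middle = 33, bottom = 56)   # percent of creators
views  <- c(top = 80, middle = 11, bottom = 8)    # percent of views

# size of each segment relative to the top segment
round(people / people["top"], 1)

# percent of views per percent of creators:
# a rough measure of how important each individual in the segment is
round(views / people, 2)
```

The second ratio is the quantity the density encodes: the higher it is, the fewer and larger the pieces inside that sector.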

Through Twitter, Danny H. submitted the following chart that shows a tiny 0.3 percent of Youtube creators generate almost 40 percent of all viewing on the platform. He asks for ideas about how to present lop-sided data that follow the "80/20" rule.

In the classic 80/20 rule, 20 percent of the units account for 80 percent of the data. The percentages vary, so long as the first number is small relative to the second. In the Youtube example, 0.3 percent is compared to 40 percent. The underlying reason for such lop-sidedness is the differential importance of the units. The top units are much more important than the bottom units, as measured by their contribution to the data.

I sense a bit of "loss aversion" on this chart (explained here). The designer color-coded the views data into blue, brown and gray but didn't have it in him/her to throw out the sub-categories, which slow down cognition and add little to our understanding.

I like the chart title that explains what it is about.

Turning to the D corner of the Trifecta Checkup for a moment, I suspect that this chart only counts videos that have at least one play. (Zero-play videos do not show up in a play log.) For a site like Youtube, a large proportion of uploaded videos have no views and thus, many creators also have no views.

***

My initial reaction on Twitter was to use a mirrored bar chart, like this:

I ended up spending quite a bit of time exploring other concepts. In particular, I wanted to find an integrated way to present this information. Most charts, such as the mirrored bar chart, a Bumps chart (slopegraph), and Lorenz chart, keep the two series of percentages separate.

Also, the biggest bar (the gray bar showing 97% of all creators) highlights the least important Youtubers while the top creators ("super-creators") are cramped inside a sliver of a bar, which is invisible in the original chart.

What I came up with is a bar-density plot, where I use density to encode the importance of creators, and bar lengths to encode the distribution of views.

Each bar is divided into pieces, with the number of pieces proportional to the number of creators in each segment. This has the happy result that the super-creators are represented by large (red) pieces while the least important creators by little (gray) pieces.

The embedded tessellation shows the structure of the data: the bottom third of the views are generated by a huge number of creators, producing a few views each - resulting in a high density. The top 38% of the views correspond to a small number of super-creators - appropriately shown by a bar of low density.

For those interested in technicalities, I embed a Voronoi diagram inside each bar, with randomly placed points. (There will be a companion post later this week with some more details, and R code.)

Here is what the bar-density plot looks like when the distribution is essentially uniform:

The density inside each bar is roughly the same, indicating that the creators are roughly equally important.

P.S.

1) The next post on the bar-density plot, with some experimental R code, will be available here.

If you’re like me, your first exposure to data visualization was as a consumer. You may have run across a pie chart, or a bar chart, perhaps in a newspaper or a textbook. Thanks to the power of the visual language, you got the message quickly, and moved on. Few of us learned how to create charts from first principles. No one taught us about axes, tick marks, gridlines, or color coding in science or math class. There is a famous book in our field called The Grammar of Graphics, by Leland Wilkinson, but it’s not a For Dummies book. This void is now filled by Alberto Cairo’s soon-to-appear new book, titled How Charts Lie: Getting Smarter about Visual Information.

As a long-time fan of Cairo’s work, I was given a preview of the book, and I thoroughly enjoyed it and recommend it as an entry point to our vibrant discipline.

In the first few chapters of the book, Cairo describes how to read a chart. Some may feel that there is not much to it but if you’re here at Junk Charts, you probably agree with Cairo’s goal. Indeed, it is easy to mis-read a chart. It’s also easy to miss the subtle and brilliant design decisions when one doesn’t pay close attention. These early chapters cover all the fundamentals to become a wiser consumer of data graphics.

***

How Charts Lie will open your eyes to how everyone uses visuals to push agendas. The book is an offshoot of a lecture tour Cairo took during the last year or so, which has drawn large crowds. He collected plenty of examples of politicians and others playing fast and loose with their visual designs. After reading this book, you can’t look at charts with a straight face!

***

In the second half of his book, Cairo moves beyond purely visual matters into analytical substance. In particular, I like the example on movie box office from Chapter 4, titled “How Charts Lie by Displaying Insufficient Data”. Visual analytics of box office receipts seems to be a perennial favorite of job-seekers in data-related fields.

The movie data is a great demonstration of why one needs to statistically adjust data. Cairo explains why Marvel’s Black Panther is not the third highest-grossing film of all time in the U.S., as reported in the media. That is because gross receipts should be inflation-adjusted. A ticket worth $15 today cost $5 some time ago.
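A minimal sketch of the adjustment, using the two ticket prices mentioned above. The gross figure for the older film is hypothetical, chosen only to show the arithmetic, and a real analysis would use a proper price index instead of a single pair of ticket prices.

```r
ticket_then  <- 5      # dollars, from the example above
ticket_today <- 15

# how many of today's dollars one dollar of older receipts is worth
inflation_factor <- ticket_today / ticket_then

nominal_gross_then <- 100e6   # HYPOTHETICAL gross of an older film
adjusted_gross <- nominal_gross_then * inflation_factor
adjusted_gross
```

Under these assumed prices, an older film's receipts triple once they are restated in today's dollars, which is enough to reshuffle an all-time ranking.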

This discussion features a nice-looking graphic, which is a staircase chart showing how much time a #1 movie has stayed in the top position until it is replaced by the next higher grossing film.

Cairo’s discussion went further, exploring the number of theaters as a “lurking” variable. For example, Jaws opened in about 400 theaters while Star Wars: The Force Awakens debuted in 10 times as many. A chart showing per-screen inflation-adjusted gross receipts looks much different from the original chart shown above.

***

Another highlight is Cairo’s analysis of the “cone of uncertainty” chart frequently referenced in anticipation of impending hurricanes in Florida.

Cairo and his colleagues have found that “nearly everybody who sees this map reads it wrongly.” The casual reader interprets the “cone” as a sphere of influence, showing which parts of the country will suffer damage from the impending hurricane. In other words, every part of the shaded cone will be impacted to a larger or smaller extent.

That isn’t the designer’s intention! The cone embodies uncertainty, showing which parts of the country have what chance of being hit by the impending hurricane. In the aftermath, the hurricane would have traced one specific path, and that path would have run through the cone if the predictive models were accurate. Most of the shaded cone would have escaped damage.

Even experienced data analysts are likely to mis-read this chart: as Cairo explained, the cone has a “confidence level” of 68%, not the more conventional 95%. Areas outside the cone still have a chance of being hit.

This map clinches the case for why you need to learn how to read charts. And Alberto Cairo, who is a master visual designer himself, is a sure-handed guide for the start of this rewarding journey.

I came across this older chart in the Financial Times, which is a place to find some nice graphics:

The key to success here is having a good story to tell. Blackpool is an outlier when it comes to improvement in life expectancy since 1993. Its average life expectancy has improved, but the magnitude of improvement lags other areas by quite a margin.

The design then illustrates this story in two ways.

On the right side, one sees Blackpool occupying a lone spot on the left side of the histogram. On the left chart, the gap between Blackpool and the national average is plotted over time. The gap is clearly widening; the size of the gap is labeled so the reader immediately knows it went from 1.8 to 4.9.

Although they're not labeled, the reader understands that the other two lines are the best and worst areas. The comparison between Glasgow City and Blackpool is also informative. Glasgow City, which has the worst life expectancy in the U.K., is fast catching up with Blackpool, the second worst.

I also like the color-coded titles. They draw attention to Blackpool and link the conclusion to both charts in an efficient manner.

At first glance, this graphic's message seems clear: what proportion of Americans are exceeding or lagging guidelines for consumption of different food groups. Blue for exceeding; orange for lagging. The stacked bars are lined up at the central divider - the point of meeting recommended volumes - to make it easy to compare relative proportions.

The little icons illustrating the food groups are cute and unobtrusive.

It's when you read further that things start to get complicated. The last three rows display a flipping of the color scheme, with orange on the right, blue on the left. Up to this point, you may understand blue to mean over the recommended value, and orange is under. Suddenly, the orange is shown on the right side.

The designer was wrestling with a structural issue in the data. The last three food groups - sugars, fats and sodium - are things to eat less of. So, having long bars on the right side is not good. The orange/blue colors should be interpreted as bad/good and not as under/over.

***

The problem with this design is that it draws attention to this color flip - that is to say, it draws attention to which food groups are favored and which ones are to be avoided. This insight is actually in the metadata, not what this dataset is about.

In the following chart, I enforce the bad/good color scheme while ignoring the direction of good. The text is adjusted to use words that do not suggest direction.

Dieticians are probably distressed by this chart, given that most Americans are lagging on almost all of the recommendations.

Knife stabbings are in the news in the U.K. and the Economist has a quartet of charts to illustrate what's going on.

I'm going to focus on the chart on the bottom right. This shows the trend in hospital admissions due to stabbings in England from 2000 to 2018. The three lines show all ages, and two specific age groups: under 16 and 16-18.

The first edit I made was to spell out all years in four digits. For this chart, numbers like 15 and 18 can be confused with ages.

The next edit corrects an error in the subtitle. The reference year is not 2010 as those three lines don't cross 100. It appears that the reference year is 2000. Another reason to use four-digit years on the horizontal axis is to be consistent with the subtitle.

The next edit removes the black dot, which draws attention to itself. The chart, though, is not about the year 2000, which has the least information since all data have been forced to 100.

The next edit makes the vertical axis easier to interpret. The indices 150 and 200 are much better stated as +50% and +100%. The red line can be labeled "at 2000 level". One can even remove the subtitle 2000=100 if desired.
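The relabeling is a simple conversion from an index to a percent change relative to the base year. A small sketch in R, where the tick values and variable names are my assumptions:

```r
base  <- 100                    # the index is set to 100 in the year 2000
index <- c(100, 150, 200)       # assumed tick values on the vertical axis

# percent change relative to the base year
pct_change <- as.integer(round((index / base - 1) * 100))

# axis labels: the base gets a text label, the rest a signed percentage
tick_labels <- ifelse(pct_change == 0,
                      "at 2000 level",
                      sprintf("%+d%%", pct_change))
tick_labels
```

The same vector can then be passed as the labels of a custom axis in whatever plotting system is used.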

Finally, I surmise the message the designer wants to get across is the above-average jump in hospital admissions among children under 16 and those aged 16 to 18. Therefore, the "All" line exists to provide context. Thus, I made it a dashed line, pushing it to the background.

Now, look at the bubble chart at the bottom. Here it is - with all the data except the first number removed:

It is impossible to know how fast the four other train systems run after I removed the numbers. The only way a reader can comprehend this chart is to read the data inside the bubbles. This chart fails the "self-sufficiency test". The self-sufficiency test asks how much work the visual elements on the chart are doing to communicate the data; in this case, the bubbles do nothing at all.

Another problem: this chart buries its lede. The message is in the caption: how California's bullet train rates against other fast train systems. California's train speed of 220 mph is only mentioned in the text but not found in the visual.

Here is a chart that draws attention to the key message:

In a Trifecta Checkup, we improved this chart by bringing the visual in sync with the central question of the chart.