This blog post discusses the use of a black background in a graph. But before we get started, I invite you to have a listen to one of my favorite songs - "Paint it Black" by the Rolling Stones. Perhaps this song subliminally persuaded people to use black backgrounds in their graphs? (just one of my conspiracy theories!)

A Twitter post recently took me to a Tumblr blog about a map that Seth Kadish had created. The coloring of the counties in the map represented the percent of the workers in each county who commute to another state to work. As expected, counties closer to the state borders generally have more of their workers commuting to another state.

Here's a screen capture of his map (below). As you can see, it's very difficult to see the map against the black background:

Hahaha - ok, that's not actually it, but I invite you to click this link to have a look at his map! It's visually captivating (it certainly caught my attention), but once I started really trying to understand the data, I found the map somewhat lacking. The main problem is that the black background is visually very similar to the color representing the maximum commuting values, and therefore it is very difficult to tell the two apart. For example, that round dark area in southern Florida might at first look like a county with a very high commuting value ... but it is actually a 'hole' in the map for Lake Okeechobee (showing the black background behind the map). It's very difficult to tell whether the dark areas in the map are counties with high values, or holes (lakes, bays, etc.) in the map.

So, naturally, I downloaded the data (table S0801) from the Census website, imported the CSV file into SAS, and plotted it on my own map ... with a white background. Now the dark/black counties stand out more, and you can easily tell the difference between the map background and the data. And if you click the thumbnail below, you will even see that my SAS version has html hover-text for each of the counties.

Technical Discussion:

Why do people use black backgrounds behind graphs? What are some of the arguments for and against? I'm no expert in this area, and have not done any controlled experiments or surveys, but here are my personal thoughts...

For Black Backgrounds:

One argument for black backgrounds is that they look better than white ones when projected on a screen during a presentation. In my experience, this could be true, especially if bright/vibrant colors are used along with the black background. But this probably depends on what kind of presentation you're giving, and who your audience is.

Along those same lines, sometimes you might use a black background in a graph in order to make it stand out from all the typical (white background) graphs, for example when creating a 'headline' graph that's meant to catch people's attention, rather than being used for nitty-gritty analytics. Even then, I would guesstimate that fewer than 1 out of 20 of my fancy presentation graphs have a black background.

Another argument is that with some display devices (maybe some CRTs and maybe some LED screens), it consumes less energy to display 'black' pixels. Therefore using a black background could save energy, which could be an advantage in mobile devices. I did some web searches on this topic, and couldn't find any definitive studies in this area. At the very least, from what I gather, this is not the general case (especially with displays that use a 'backlight'), and should probably not be a factor for your graphics design.

Against Black Backgrounds:

Many people find graphs with black backgrounds annoying, and difficult to read. For example, how did you like the black background in the text items above? :-)

It has also been my personal experience that when using a dark background and light text in a graph, I have to make the text bolder/thicker in order for it to be readable. And if you resize/shrink the graph, sometimes the text becomes unreadable. This problem seldom comes up when using dark text on a white background.

A black background also makes a graph very 'expensive' to print. Whereas a white background uses no ink when printed on white paper, a black background uses a lot of ink.

My personal opinion is that black backgrounds are occasionally good when you want a certain artistic effect, but should not be used for graphs in general. Or in layman's terms ... black backgrounds are for velvet paintings.

Having spent many years in graduate school, and living in the Research Triangle Park (RTP) in North Carolina, I have a lot of friends from other countries. Therefore when I recently saw some stories & graphs about EB-5 visas (where you invest a cool half-million US $ to bypass the long lines) they caught my attention...

The EB-5 Immigrant Investor Program was created in 1990 to stimulate the U.S. economy through job creation and capital investment by foreign investors. EB-5 investors must invest in a new commercial enterprise, which creates at least 10 full-time jobs, and make a capital investment of $1,000,000 ... or $500,000 in a high-unemployment or rural area (which is usually the case).

And now for the big question - who paid $500k for these EB-5 visas last year?!?

I thought I had found the answer in an infographic on dadaviz.com where they showed a custom chart made up of bubbles (see below). It was an interesting graphic, but a little confusing. Were all the blue dots sub-regions within China? I guess Taiwan and Hong Kong might be, but definitely not Japan, etc. What countries were represented by the 'Other' bubble? Did China really have an order of magnitude more than any other country, or was that just a quirk of the way I was interpreting the graphic? I had more questions than answers...

I did some digging and found the raw data on the Department of State's web page, and finagled it into a SAS dataset (it was in a pdf file rather than simple text, therefore I couldn't import it directly). I then experimented with several different ways to plot the data, such as a bar chart - but there were just too many countries, and too big a spread from the minimum to the maximum values, for traditional graphs to work well and show all the data.

What I finally came up with was a SAS bubble map, followed by a simple sorted table. The map allows me to represent both the quantities and the geographic locations, and the table allows me to quickly see the actual values and determine which countries have higher/lower values than the others.

Click the map below to see the full-size version, with html hover-text for each country. With this map, you can easily see which countries the people getting EB-5 visas were from, and that China had way more than any other country (so many, that they actually reached the limit before the end of FY-2014).

If you were going to invest $500,000 for a US EB-5 visa, what enterprise would you invest in, and where would you locate?

Everyone loves a good conspiracy theory - hopefully you'll enjoy mine about the number of US E1 visas!

I was perusing some of the US government charts, and found one on US immigration visas that caught my attention. It was a 3D bar chart, and since I always mistrust 3D charts, I immediately assumed there was something misleading about it. I noticed that the number of E1 visas was very close to the 40,000 reference line (see circled in red below), and I wondered whether it was in fact above or below the line. It looked like it was below the line, but you know how difficult 3D graphs are to read when you're looking at them from a 3D perspective angle.

Luckily their chart had a table below it, so in theory I could just glance at the table to see the exact number of E1 visas, and know whether it was above or below 40,000. But I was thwarted again! The table showed the number of E1 visas broken down into 2 groups (corresponding to the red & blue bar segments), but not the total! See the E1 row marked in the table below.

The best graphs are both beautiful and informative - a smooth blend of art and analytics. But more often than not, the two collide rather than blending smoothly...

Here is a link to an artistic infographic I recently saw posted by Vendavo on Twitter. Their message (80% of your profit is generated by 20% of your customers) seemed 'plausible' ... but something just didn't seem quite right about their infographic. Upon closer scrutiny, I noticed that the slices in their pie charts did not seem to accurately represent the numbers (80% and 20%) in the text.

So, of course, I decided to make a SAS version that was both beautiful and informative (... with correctly sized pie slices!) Here's what I came up with, to show that an infographic can be both artistic and accurate!
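For reference, a correctly sized slice is simply its share of the full 360 degrees. The following is a minimal Python sketch of that arithmetic, not the SAS code used for the chart; the function name is made up for illustration.

```python
def slice_angles(percentages):
    """Convert pie-slice percentages (which should sum to 100) into degrees."""
    if sum(percentages) != 100:
        raise ValueError("percentages must sum to 100")
    # Each slice gets its share of the full 360-degree circle
    return [pct * 360 / 100 for pct in percentages]

# The 80/20 split from the infographic's text
angles = slice_angles([80, 20])
print(angles)  # [288.0, 72.0]
```

So an 80% slice should sweep four times the angle of a 20% slice; anything visibly different from that is an inaccurate pie.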

I recently read a Washington Post article about the euro versus the dollar, and I wanted to analyze the data myself to see whether the article was simply stating the facts, or "sensationalizing" things.

The washingtonpost.com article started with the headline, "This is historic: The dollar will soon be worth more than the euro." And the article had the following graph showing the value of the euro dropping:

Based solely on the title and the graph (which is probably all that most people look at), I assumed that the exchange rate had always been about 1.25, and had recently started dropping towards 1.00, and that this was an unprecedented historic event. But as I read the details in the article, I started to become a bit more skeptical, and decided to find the actual data, and plot it myself.

Is daylight saving time the ultimate in efficiency, or is it living a lie? Here are some graphs that might help facilitate a discussion on this topic ...

With daylight saving time (DST), a whole geo/political area (such as a country) decides to set their clocks forward an hour during the 'summer' months (when the sun rises earlier and sets later) so that they can take advantage of the extra sunlight hours, without all the factories/stores/etc having to change their hours of operation.

Not all countries honor DST, and in some cases not even all the areas within a country agree to honor it. Here is a world map I created with SAS (similar to one I saw on dadaviz) that shows which areas do and don't honor DST. In general, it looks like most of North America and Europe honor DST, and countries that are close to the equator or in Asia tend not to.

Of course, it's not as easy as saying a country does or doesn't use DST -- different countries can also choose when they want to start and stop DST! For this part of my graphical analysis, I've created some graphs only for the US DST. But even plotting the data for just the US is a bit tricky, because when we start & stop DST has changed over the years. For example, in 2007 the date to go on DST moved from the first Sunday in April to the second Sunday in March, and the date to go off DST moved from the last Sunday in October to the first Sunday in November. Here's the calendar for the current year (2015) showing which days are/aren't DST days:

Looking at that calendar chart, it appears that we (in the US) are now spending over 50% of the year with our clocks adjusted forward in DST. Let's use a different chart that will make it even easier to see the percentages. We could use a bar chart with 2 bars, but I think a pie chart is more intuitive (a lot of people like to bash pie charts, but I think they are a good/intuitive way to show the data when comparing part-to-whole with a 2-slice pie). From this chart, it's evident that we're spending almost 2/3 of the year in DST!
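The 'almost 2/3' figure can be checked with a little date arithmetic. The sketch below is in Python rather than SAS, and it assumes the post-2007 US rules described above (second Sunday in March through the first Sunday in November); the helper names are illustrative only.

```python
from datetime import date, timedelta

def nth_sunday(year, month, n):
    """Return the date of the n-th Sunday of the given month."""
    first = date(year, month, 1)
    # date.weekday(): Monday=0 ... Sunday=6
    days_to_sunday = (6 - first.weekday()) % 7
    return first + timedelta(days=days_to_sunday + 7 * (n - 1))

def us_dst_days(year):
    """Number of days between the US DST start and end dates (post-2007 rules)."""
    start = nth_sunday(year, 3, 2)   # second Sunday in March
    end = nth_sunday(year, 11, 1)    # first Sunday in November
    return (end - start).days

def days_in_year(year):
    return (date(year + 1, 1, 1) - date(year, 1, 1)).days

dst = us_dst_days(2015)
total = days_in_year(2015)
print(f"{dst} of {total} days ({dst / total:.1%})")  # 238 of 365 days (65.2%)
```

For 2015 that works out to March 8 through November 1, or roughly 65% of the year on DST.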

Personally, I have mixed feelings about DST. I can see the advantages of using it in the summer when days are very long, but I think the US might have gone a bit overboard if we're spending over 1/2 of the year (actually about 2/3 of the year) living a lie and adhering to a fake time.

So, what's your opinion on DST? Does your area honor it? If your area doesn't honor DST, do your factories/stores/etc change their hours in summer and winter?

“Dear Cat,

I got an email from my IT department that says:

[We are nearing capacity on the Flotsam Drive. Please clear data from any folders you are no longer using so we can save disk space.

Thanks,
The IT Department]

Doesn’t this strike you as a bit old-fashioned? I mean, isn’t disk space practically free now?

Signed,
DataLover”

Dear DataLover,

My first reaction to this is that yes! You’re right! Disk space is practically free. Why are we worried about storing some extra files? I am a bit of a data hoarder, though, so perhaps my views require some analysis.

Certainly, the message coming out of providers of software and services for the Hadoop ecosystem is that a good data science citizen keeps everything. Long gone are the days when we had to carefully scrub the data, roll the files up to something compact, and get rid of the excess to free up storage space. There might be untapped value in unstructured logs, transactional databases, and other “clutter” files.

So, when is it better to keep versus eliminate data? I have a few thoughts about this.

If you can gain more from using the data than you spend to keep the data, then by all means, keep the data. The sources of value might be surprising. Data scientists make billions of dollars for their companies annually by making data products out of log files and other data that have historically been considered garbage or exhaust.

If the data get no use, and are old enough that data products would not benefit from them, then it is best to delete them. But I would ask, if the data are not used, should they be? There could be value there.

If there is historical information about your company’s performance that can be tied to specific initiatives, then keep the data. There is something to learn here. As an example, if you can track marketing campaigns, staffing decisions, acquisitions and merger information, etc. then you can see which activities were followed by changes in revenue, customer reach, profit, market share, etc. This is not causal information, but it can direct you to your next business experiment in a hurry.

If the data can place your organization at risk, then it is prudent to eliminate them. This is the case with personally identifiable information (PII), financial records that are no longer needed for audit trails, email records that may contain proprietary conversations with clients, and so on. In this case, there is more to lose from keeping the data than there is to gain from them.

And finally, if the data include pictures of your boss at the last company picnic wearing that Hello Kitty costume and dancing the electric slide, it’s probably best to just let it go. Nobody needs to see that.

Our experiences are different, so I’d love to hear your thoughts about this in the comments below. And, if you would like to spend some quality time talking data hoarding with my colleagues and me, consider coming to one of our data scientist training courses. See you in class!

Which is more important - having beautiful graphs, or accurate graphs? Let's explore this question using the locations of the world's richest billionaires...

I recently saw a beautiful map on dadaviz.com that purported to show the cities with the most billionaires. Here's a screen-capture of that map:

I decided to try creating a similar map using SAS. I found the data source on the forbes.com website, entered the data into a SAS dataset, programmatically looked up the lat/long of the cities, and plotted them on a map.

I was happy & satisfied that I had reproduced their map, until I noticed that "one of these things is not like the other..." My Shenzhen city bubble was near Hong Kong, whereas in the dadaviz map it was up north of Seoul and Beijing. I researched this (and even confirmed it with my co-workers in China), and determined that the position of Shenzhen in the SAS map is correct ... and the beautiful map on dadaviz was not accurate!

I'm not sure how the dadaviz map ended up wrong (maybe the bubbles were positioned by hand, or using hard-coded values that contained a typo?). But this is one of the reasons I prefer to use SAS and position my markers in a data-driven way - which allows me to create output that is both beautiful and accurate!

Technical Details:

To create the background map, I created a grid of gray dots, across all the possible lat/long locations of the world, and then used Proc Ginside to see which dots fell within a country (and discarded the dots that weren't in a country). I looked up the lat/long positions of all the listed cities in the mapsgfk.world_cities dataset (which ships with SAS/Graph), and created annotated pies at those lat/long positions, with the area of the pie proportional to the number of billionaires. I got a little 'tricky' with annotate pie commands to draw the line from the pie and then write a label at the end of the line. Here's a link to the full SAS code.
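To make the two main ideas above concrete (keeping only the grid dots that fall inside a country, and sizing each pie so its area tracks the count), here is a rough Python sketch. It is not the SAS code linked above, and it does not claim to show how Proc Ginside works internally: the ray-casting test is a standard stand-in, and the square 'country', the grid spacing, and all function names are assumptions made up for illustration.

```python
import math

def point_in_polygon(lon, lat, polygon):
    """Ray-casting test: True if (lon, lat) lies inside the polygon.

    polygon is a list of (lon, lat) vertices; the last vertex connects
    back to the first. Points exactly on an edge are not handled specially.
    """
    inside = False
    n = len(polygon)
    for i in range(n):
        x1, y1 = polygon[i]
        x2, y2 = polygon[(i + 1) % n]
        # Does this edge straddle the horizontal line through the point?
        if (y1 > lat) != (y2 > lat):
            # Longitude where the edge crosses that horizontal line
            x_cross = x1 + (lat - y1) * (x2 - x1) / (y2 - y1)
            if lon < x_cross:
                inside = not inside
    return inside

def grid_dots(step):
    """All (lon, lat) grid points at the given spacing in whole degrees."""
    return [(lon, lat)
            for lat in range(-90, 91, step)
            for lon in range(-180, 181, step)]

def pie_radius(count, max_count, max_radius):
    """Radius chosen so the pie's AREA (not its radius) is proportional to count."""
    return max_radius * math.sqrt(count / max_count)

# Hypothetical square "country", for illustration only
toy_country = [(-2, -2), (12, -2), (12, 12), (-2, 12)]

# Keep only the dots that fall inside the country; discard the rest
kept = [p for p in grid_dots(5) if point_in_polygon(p[0], p[1], toy_country)]
print(len(kept))  # 9

# A city with 1/4 of the largest count gets 1/2 of the largest radius
print(pie_radius(25, 100, 10))  # 5.0
```

Scaling the radius by the square root of the value is what keeps the pie areas proportional; scaling the radius directly would exaggerate the larger cities.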

With any software program, there are always new tips and tricks to learn, and nobody can know them all. Sometimes I even pick up tips or techniques from my students while they’re learning broader programming tips from me.

Like fine wine, instructors only get better with age. Every customer interaction we encounter, every software upgrade that happens, every SAS training course we get sent on, all add to the length and breadth of our SAS teaching knowledge.

But the absolute best learning experience hands down has got to be the SAS classroom setting. Customers come up with real life problems, and ask probing and insightful questions. It makes us instructors pull deep down into the recesses of our minds and see how far we can extend the use of SAS in creative ways.

Here’s an Enterprise Guide tip I recently learned from a student.

Open up the advanced expression builder to write an expression for a computed column. Hold down the CTRL key while you use the scroll wheel on the mouse. Scroll up to increase the font size, scroll down to reduce it.

What tips do you know that you think might be new to me or those in my class? Send them along or comment here, and I’ll spread them around!

With Pi Day coming up on 3/14, I wanted to make sure all you SAS programmers know how to use the pi constant in your SAS code...

All you have to do is use constant("pi") in a data step, and you've got the value of pi out to a good many decimal places (probably enough for most any practical scenario). In this example, I let the user specify the value for a circle's radius as a macro variable, and then get the value of the pi constant in a data step, and use that value to calculate the circumference and area of a circle with that radius. I convert those calculated dataset values into macro variables, and use the calculated macro variables in various ways in a GPlot and annotated circle.
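The circle arithmetic itself is the same in any language. As a minimal sketch, written in Python rather than SAS and therefore not the SAS code this post refers to, the calculation might look like the following; `math.pi` stands in for SAS's constant("pi"), and the function name and radius value are assumptions for illustration.

```python
import math

def circle_measures(radius):
    """Return (circumference, area) of a circle with the given radius."""
    circumference = 2 * math.pi * radius
    area = math.pi * radius ** 2
    return circumference, area

# Hypothetical user-specified radius, playing the role of the macro variable
radius = 2
circumference, area = circle_measures(radius)
print(f"radius={radius} circumference={circumference:.4f} area={area:.4f}")
```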

Here's the code for doing the calculations and creating the macro variables - hopefully a light bulb is going off in your head right now, and you are thinking of all kinds of ways you could reuse this code!