The Perfect Climate Index, Take I

What is the ideal climate? After having recently experienced the silky sensual feeling of low day-night temperature difference (here), I’m adding it together with a bunch of usual suspects – high T max, long sunshine hours, litle precipitation – to form a composite measure of good weather: the Perfect Climate Index (PCI).

I’m testing the PCI on Switzerland, a country not exactly known as a weather paradise. All the more important to find the occasional dry and warm spot!

As you will see later, the PCI really sucks as a general (i.e. worldwide) measure of good climate, but it works for Switzerland. There is a challenge to make it globally applicable, and I will outline some ideas for how to do that. I was too lazy to do it myself this time, but who knows, maybe you’ll get inspired! Meanwhile, there is enough challenge in visualizing the data from over a hundred weather stations in my little home country. You can follow along, as I post the data and code for this example (see ‘Materials’ at the end).

The data from IDAweb cover the years 1968-2013, but I decided to use only the years since 1979, partly for compatibility with my previous post, but also because earlier data is sparse (see missing data plot below).

For each measure, IDAweb gives you a separate ASCII file, with lines corresponding to days. Here’s part of the T max data:

You also get a list of station names, their three-letter codes, longitudes, latitudes and altitudes.

Plotting missing data

First, I want to get an idea of the data coverage. You can see that the T max file starts with jan 01, 1971, and not 1968. This is generally the case: the time series only include years for with there are at least some non-missing values. This leads to a situation of “hidden missings”, because years with all missings don’t show up anywhere.

To avoid this sort of thing, I’m reading the data into a prefab, initially empty file containing the full temporal grid of values from jan 01, 1968 to jan 01, 2014 for all stations. The file contains N stations x N years x N days = 2’184’390 observations – the largest data set I’ve ever dealt with (shame on me)!

To tackle the missings, let’s create a missing indicator (e.g mtmax) and then sum up the number of missings per station (nmisstmax).

The variable containing the N missings is created by the -collapse- command, which, at the same time, collapses the data over stations. The resulting file has only one record per station containing its N missings-variable. In other words, we have gotten rid of all unnecessary baggage by reducing 2 million observations to 130! This data set can now be used for graphing the missings.

It might not look like much, but this graph was a killer! It took me hours of tinkering and it’s still not perfect; but it taught me quite a bit about graphing.

What it shows is easy enough to understand: the number of missings by station and year. The red lines indicate years with more than 360 days of data missing, which turns out in all cases to mean that the whole year is missing. The orange lines denote years with over 50% days missing. I didn’t add more categories since they made the display look crowded and most of these years had no missings, anyway.

You can see what I mentioned above: the data is sparse for the years up to about 1980, when ANETZ, a network of automated stations, became operational, and then there’s another improvement happening in the early 90s, for which I don’t know the reason.

Second, there are very few years with 180-360 days missing, indicating again that it’s either all or nothing: by and large, the data is missing by whole years or not at all.

What’s tricky about this graph? Although it looks like some sort of line graph, it is actually a scatter plot! And the red lines are not made of graphing symbols but of the ASCII underscore character stored in a special variable. So what’s graphed are not markers but marker labels, and here’s where the difficulties begin. You cannot control the plotting position of marker labels very finely, and that makes it hard to align them with the grid lines. In fact, after positioning the labels as well as I could, I shifted the grid lines up a tiny bit to make them meet the labels. This, in turn, caused a slight displacement of the y-axis labels relative to the grid lines. At this point, no amount of tinkering would help to fix this.

Look at the graph code (note that all my graphs use my own LF3 graph scheme, which you can get in the ‘Materials’ section of my first post):

The string variable ‘marker’ is generated to hold the plotting character. The strange curly braces you can see inside the double quotes contain a Stata SCML directive to output the superscript of the character. I did this merely because it turned out that the superscripted version could be more precisely aligned with the grid lines.

The character is positioned using the -mlabposition()- option. The argument is the plotting position given as a clock-time. Thus, the value I chose, 4, corresponds to the position of the number 4 on a clock-face that is centered at the position of the data point (where usually the graphing symbol would be placed). That’s as precise as you can get.

Grid lines in Stata originate from tick marks, of which there are major and minor ones. To shift the grid up you can simply reposition the corresponding tick marks (minor ones, in this case) by a tiny fraction. That’s what -ymt(1.25(2)120.25, noticks grid- does: it adds 0.25 to the previous position of the ticks. Note that display of the ticks themselves is suppressed by the -noticks- option. Now the station names that label the major tick marks are slightly offset with respect to the grid lines. You might think you could fix this problem in the same way, by shifting the major ticks upward. However, you would then lose the station labels, because they are attached to the particular values of the ticks. If you shift the ticks up by placing them at higher values, those new values won’t be associated with labels anymore, and the station names will disappear. Now – why not just adjust the value label of the variable holding the station names? – Because value labels can contain only integers, no reals! And that’s why this story ends there.

The fact that the plot uses no graphing symbols means there can be no legend. At least none using Stata’s -legend- option. The legend you see is constructed using the -text- command that allows you to place text on a graph. I didn’t fully figure out its inner workings, but if you sit down for an hour with the help file, I’m sure you can. I found the text positions by trial-and-error.

Finally, the graph’s scatter plot nature is hidden behind -zmap-, a user-written program by seasoned Stata expert Nick Cox, which I adapted to my purposes. It lets you plot categories of a z-variable by y and x, and color-code them differently, by layering seperate scatter plots for each category over each other. [UPDATE: Brendan Halpin has suggested an alternative, easier way to construct this graph. See first comment below and my reply -13.05.14]

The Perfect Climate Index

When you get data from somewhere, it is usually messy. Not only can it be poorly documented and labeled, there can also be weird values and missing data codes that you need to straighten out. I’ve done this for the present data, but I’m much too unfamiliar with weather data to know what kinds of values for which measures are implausible. That’s one of the reasons I chose to use the median to construct the PCI, because it is not so dependent on outliers as the mean.

After averaging the single weather measures for each calendar day across the 35 years from 1979-2013, I construct the PCI as:

My definition of a perfect climate means it’s hot, the difference between day and night temperature is small, more sunshine is better, and so is less rain. I weighted sunshine duration by 0.5 so that two more hours correspond to the effect of one more degree Celsius. Rain fall has a large spread of values that I compressed using a log-transformation.

So where are Switzerland’s dry and sunny spots? And where is it wet and cold? Let’s look at the top and bottom 10 contenders:

Lugano and Locarno, both located south of the Alpes, come out on top, followed by some places in the western, French-speaking part of Switzerland (called “La Romandie”). The bottom 10 are spread throughout the German-speaking part, where, unfortunately, I myself live. By the way, Lugano and vicinity get quite a bit of rain, but sunshine, high T’s and low temperature differences drive up the PCI.

The code again uses -glabsort- to sort panels by PCI (see my first post). The graph is a panel-data plot (-xtline-), recast as -twoway spike-, because I like the display of the values as heights as opposed to lines. In contrast to temperature values, however, the zero line is arbitrary and has no special meaning, which militates against using spikes. Decide for yourself.

Let’s look at the distribution of the PCI on a map (I have excluded locations above 1000m altitude, because otherwise they would obviously occupy the lowest ranks):

I know you expected nothing less than a chloropleth, but I don’t think I have enough data points to warrant such a display. Well, maybe one day I’ll do it anyway. In the meantime, this plot is also quite pretty. High PCIs are represented both by more saturated red hues as well as larger circles. Locations in the two highest caregories of PCI are additionally labelled. I had to play around with circle sizes, color schemes, background colors and labeling a while to find a combination that was not too busy and gave acceptable contrasts. I know there is some over-printing, but that’s fine for present purposes.

The color scheme is put together from two sequential color brewer schemes (colorbrewer2.org). Color values are given in RGB. In fact, Stata accepts not only its own named colors, but also RGB, CMYK, and HSV notation.

Not surprisingly, the PCI distribution confirms the results of the analysis in my previous post: Lugano & co. are the place to be when one is looking for some mediterranean mildness in Switzerland. Thank God for the Ticino! one is tempted to say.

The PCI, however, sucks as a global measure of pleasant climate. The main reason is its use of temperature – the hotter, the better – but this logic must break down at some point. Certainly, not even the most ardent sun worshipper will thrive at 45°C in the shade, a scorching sun, no rain, and not a single degree of cooling down in the night. Not only should T max be encoded with an optimum, e.g. 30°, but the day-night difference should also depend on it: the higher T max gets, the more nightly cooling is acceptable.

In an improved version, I’d also like to include wind speed and humidity. Personally, I hate anything that’s stronger than a mild summer breeze, and I love humid air. I would penalize wind speed relatively linearly, but perhaps set an optimum for humidity at 80%. Setting optima for variables presents an interesting challenge for mathematical implementation. You need a function that decreases (or increases) on both sides of the maximum. One way to achieve this is by using periodic functions such as the sine and cosine. But this is a task for another day – unless one of you beats me to it!

Materials

The data from IDAweb is copyright-protected. In order to let you reproduce this example I have added a small amount of random noise to the data. Please be aware that this renders the data factually invalid. Drop the data into a directory, point Stata to it, and run the example code below to produce all the figures. Download the data here.

Post navigation

2 thoughts on “The Perfect Climate Index, Take I”

For your first graph, an alternative would have been to use -twoway rbar- to draw vectors (with the horizontal option), which might give better control over vertical placement, and more control over linewidth. See the -sqindexplot- command of the SQ package (ssc install sq) for an example (without gaps between the lines). There’s an example at http://teaching.sociology.ul.ie/bhalpin/sqind1.png