Monday, April 17, 2017

Was the Judgment of Paris repeatable?

A few months ago I wrote a blog post for the Academic Wino, discussing the 1976 wine tasting that has become known as the Judgment of Paris, organized by Steven Spurrier and Patricia Gallagher. Here, wines from France were tasted along with some wines from California, and the latter acquitted themselves very well in the opinions of the tasters.

Given the outcome of this tasting, it is possibly the third most important event in the social and economic history of wine in the USA, after the imposition and then repeal of Prohibition. It was certainly made much of by the media during the Bicentennial; and this has been repeated every 10 years since. New World wine was henceforth taken seriously, not just the European wines.

However, one of the things that struck me most strongly about this tasting was just how variable the results were amongst the tasters — hardly any of the tasters agreed closely with each other about the quality scoring of the wines, and especially about which wines were the best among the 10 reds (bordeaux grapes) and the 10 whites (chardonnays).

This immediately calls the repeatability of the results into question. After all, only one bottle of each wine was tasted, on one occasion, by one group of people. What would happen under other circumstances?

This is particularly important to me as a scientist, because it is the ability to independently repeat an experimental result that is considered to be the only really good evidence in science. For example, if no-one else can replicate my experiments for themselves, then my results will not be widely accepted in the scientific community.

So, given that it is common knowledge that the results of wine tastings are often barely repeatable, why was the Judgment of Paris tasting not widely repeated by other people at other places? The results were widely reported, but apparently only Frank J. Prial, writing in the New York Times (June 16 1976, p. 39), warned against taking the unreplicated wine-tasting results too seriously: "One would be foolish to take Mr Spurrier's little tasting as definitive." And yet, this is what the media very much did.

A first attempt at replication

However, Robert Lawrence Balzer did partly replicate the tasting, later in the same year. Balzer was among the earliest of the wine journalists in the USA, specializing in California wines. He was the wine columnist for the Los Angeles Times, and he also wrote his own newsletter, Robert Lawrence Balzer’s Private Guide to Food and Wine. More importantly, he had previously (in 1973) organized an important tasting of French and US wines, in New York (see Wikipedia).

So, if anyone was going to try replicating the Judgment of Paris, and publish the results, it was likely to be Balzer. The resulting tasting was discussed on pages 77-84 of Volume 6 Number 8 of his newsletter. [Thanks to Christine Graham for kindly sending me a copy of this article.]

Unfortunately, Balzer explicitly stated that his tasting was inspired by the Judgment "without any attempt at exact duplication". This is a pity, because an attempt at exact duplication is what we require. So, Balzer had only 9 of the 20 wines duplicated exactly, while some of the others differed either as to vintage or producer, and some were completely different.

For the red wines, 6 wines were identical to the Paris tasting (4 from the US, 2 French), 2 had different vintages (both French), and 2 of the Paris wines were not tasted (both US). For the white wines, 3 were identical (all US), 4 had different vintages (2 US, 2 French), 1 differed as to producer (French), 1 differed as to both vintage and producer (French), and 1 was not re-tasted (US). For the French wines, it was at that time recognized that there could be big differences between wines from different producers even when harvesting grapes from the same vineyard, and also between vintages from the same producers; and so, these differences prevent those wines from being treated as repeats of the Paris tasting.

The results for the 9 repeated wines, averaged across the 9 tasters' scores, are shown in the first graph, with the red wines in blue and the whites in green. If the scores of the two tastings were identical, then the points should lie along the pink line.

The scores for the American tasting are considerably higher than those of the Paris tasting. The Americans presumably were using the UC Davis 20-point scoring system, which the French tasters were definitely not. The Davis system does not use very much of the 20-point range, as it reserves a large part of the range for faulty wines, which was an important part of its development as a teaching tool (see Steve De Long's comparison of wine scoring systems). Even today, French tasters still often use much more of the 20-point range than do Americans (e.g. La Revue du Vin de France).

In spite of this, the scores from the two tastings are correlated — indeed, 67% of the variation in the Balzer scores is directly related to the Paris scores. This is quite a good degree of repeatability. However, it is not the complete picture.
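For readers who want to try this sort of comparison themselves, here is a minimal sketch of one common way to get a "percentage of shared variation" figure: the squared Pearson correlation between the average scores from two tastings. This is an assumption about the kind of calculation involved, not a record of how the figures in this post were produced, and the scores in the example are made up purely for illustration.

```python
def r_squared(x, y):
    """Squared Pearson correlation between two equal-length lists of scores."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two equal-length lists with at least 2 values")
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    # Sums of cross-products and squares around the means
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        raise ValueError("scores must vary in both tastings")
    return (sxy * sxy) / (sxx * syy)

# Hypothetical average scores for the same five wines at two tastings
# (invented numbers, NOT the Paris or Balzer data).
paris_scores = [14.1, 12.6, 11.5, 10.2, 14.7]
repeat_scores = [16.8, 16.1, 15.9, 15.0, 16.5]

share = r_squared(paris_scores, repeat_scores)
print(f"{share:.0%} of the variation is shared between the two tastings")
```

Note that this measure ignores any overall shift in the scores, so two tastings can be strongly correlated even when one group scores every wine several points higher than the other.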

First, note that the rank order of the white wines is not the same in the two tastings — the Chateau Montelena 1973 Chardonnay was ranked first in the Paris tasting, while the Chalone Vineyard 1974 Chardonnay was ranked first in the Balzer tasting. Second, the red wines form two score groups in the Paris tasting, whereas they do not in the later tasting — indeed, the Château Montrose 1970 and the Mayacamas Vineyards 1971 Cabernet had the same average score in the Balzer tasting, whereas they had very different scores in Paris.

Perhaps more importantly, however, the erratic nature of the wine preferences among tasters was repeated in the American tasting. For example, among the red wines, only one person actually chose the Stag's Leap Wine Cellars 1973 Cabernet as their top-scoring wine, in spite of the wine getting the highest average score — and even that person scored it joint top with Château Léoville-Las-Cases 1970. In fact, the 9 tasters chose 7 different wines as their top-rank! The whites were no different, with only one person recorded as picking the Montelena as their (joint) top wine.
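Counting how many different wines were somebody's favourite is simple enough to sketch in a few lines. The example below assumes the individual scores are available as one table per taster, and it treats tied top scores as joint first choices; the taster and wine names, and all of the scores, are hypothetical.

```python
def top_picks(scores_by_taster):
    """For each taster, return the set of wines sharing that taster's highest score."""
    picks = {}
    for taster, scores in scores_by_taster.items():
        best = max(scores.values())
        # Ties count as joint top picks
        picks[taster] = {wine for wine, score in scores.items() if score == best}
    return picks

# Hypothetical scores out of 20 (invented, not from any of the tastings).
scores_by_taster = {
    "taster_1": {"wine_A": 17, "wine_B": 17, "wine_C": 14},
    "taster_2": {"wine_A": 13, "wine_B": 15, "wine_C": 18},
    "taster_3": {"wine_A": 12, "wine_B": 16, "wine_C": 15},
}

picks = top_picks(scores_by_taster)
distinct_winners = set().union(*picks.values())
print(len(distinct_winners), "different wines were top-ranked by at least one taster")
```

In this toy example all three wines are top-ranked by someone, even though only three people tasted them.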

[Image caption: First, but only in one out of three tastings.]

So, the things that were repeatable at the repeated tasting were a lot of the "wrong" things. The unreliability of wine tastings was strongly in evidence, and the preference rankings varied (particularly the "winner" among the whites).

A second replication

The only other published tasting that was a serious attempt to evaluate the results of the Judgment tasting occurred nearly 2 years afterwards, in January 1978, at the Vintners Club. This club was formed in San Francisco in 1971, to organize weekly wine tastings (usually 12 wines). Remarkably, the club is still extant (having had only four presidents), although tastings are now monthly, instead of weekly. The early tastings are reported in the book Vintners Club: Fourteen Years of Wine Tastings 1973-1987 (edited by Mary-Ellen McNeil-Draper, 1988).

For the Judgment of Paris replication, 98-99 people tasted the wines over two evenings (white then red), "with Steven Spurrier himself in charge". The tasting allegedly "duplicated [the Paris] tasting to the last bottle", but in fact the vintage listed for the Bâtard-Montrachet Ramonet-Prudhon differs from the Paris event, leaving 19 duplicated wines. The Vintners Club has "always kept to the Davis point system" for its tastings; and so the scores were higher than for the Paris tasting, as discussed above.

The next graph shows the results for the 19 repeated wines, averaged across the 88 (red) and 55 (white) people who provided scores; once again, the red wines are in blue and the whites in green.

As before, variability of the results is the name of the game. Indeed, every red wine was placed first by at least one of the tasters, as well as being placed last by at least one of the tasters; and every white wine was placed first at least once, except for the David Bruce Winery 1973 Chardonnay, and every white was placed last, except for the Chalone Vineyard 1974 Chardonnay.

The Vintners book claims that "the results were very similar to the preceding tasting in Paris", but in fact the scores from the two tastings are not well correlated at all. For the white wines, only 35% of the variation in the Vintners scores is directly related to the Paris scores; and for the red wines it is a measly 10%. For tastings of the same wines under reasonably similar circumstances, these are very low values, and they indicate very poor repeatability.

For the red wines, the Stag's Leap Wine Cellars 1973 Cabernet was placed 1st, as it had been in the previous two tastings. However, the Heitz Wine Cellars Martha’s Vineyard 1970 Cabernet was placed 2nd, having been placed 9th in Paris. For the white wines, the Chalone Vineyard 1974 Chardonnay was placed 1st, as it had been in the Balzer tasting (3rd in Paris), with the Chateau Montelena 1973 Chardonnay placed 2nd (1st in Paris).

With one exception, the Balzer and Vintners tastings are reasonably well correlated (64% of the variation in common), although the Balzer group's scores were (on average) 1 point higher per wine than for the Vintners group. The exception is the Heitz Wine Cellars Martha’s Vineyard 1970 Cabernet, which the Balzer group scored as 14.6 and the Vintners group scored as 16.9. The Spurrier group's result is more in accord with the Balzer group, for this wine.
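One way to spot an exception like this is to look at the per-wine difference between the two groups' average scores, and then ask which wine sits furthest from the typical difference. The sketch below uses the median as the "typical" value, which is my choice for illustration because it is not dragged around by a single odd wine. Only the two Heitz scores come from the text above; the other wines and their scores are hypothetical.

```python
from statistics import median

def score_offsets(first, second):
    """Per-wine difference (first minus second) for wines scored at both tastings."""
    shared = first.keys() & second.keys()
    return {wine: first[wine] - second[wine] for wine in shared}

def most_deviant(offsets):
    """Return the wine furthest from the median difference, plus that median."""
    typical = median(offsets.values())
    wine = max(offsets, key=lambda w: abs(offsets[w] - typical))
    return wine, typical

# The Heitz scores are those quoted in the text; wine_X, wine_Y and wine_Z
# are invented placeholders with invented scores.
balzer = {"Heitz Martha's Vineyard 1970": 14.6, "wine_X": 17.0, "wine_Y": 16.2, "wine_Z": 15.5}
vintners = {"Heitz Martha's Vineyard 1970": 16.9, "wine_X": 16.0, "wine_Y": 15.1, "wine_Z": 14.6}

offsets = score_offsets(balzer, vintners)
wine, typical = most_deviant(offsets)
print(f"Typical difference: {typical:+.2f} points; largest departure: {wine}")
```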

Conclusion

Neither of these two tastings inspires much confidence in the replicability of wine tastings, let alone the repeatability of the Judgment of Paris in particular. Even to this day, I still read of people expressing the opinion that the difference between Californian and French wines is "obvious". Well, it wasn't obvious to the people at any of these three tastings.

As Mike Steinberger noted in Slate (Nov. 7 2007, In blindness Veritas?): "there is a tendency to overlook the fact that wines and palates are fickle, and to read more into the results than is justified. This was certainly true of history's most famous blind tasting, the 1976 Judgment of Paris".

You will, however, have noted, I am sure, that all three tastings produced a California wine as the "top" for both the reds and whites! There is simply some disagreement about which one it is.

There seems to be little here that supports the media hoopla that ensued in 1976, at least in terms of California versus France "winners". It was the California wine industry that was the big winner, not the individual wines.

"Over almost half a century after the original Paris Tasting, Napa Valley Cabernet Sauvignons have continued to hold top positions in repeat tastings around the world. Following the 1976 Paris Tasting, where Warren’s Stag’s Leap Wine Cellars won the red wine category with its 1973 Cabernet Sauvignon, the tasting was held informally several times again in the late 1970s. In 1986 for the 10th Anniversary it was conducted at the French Culinary Institute in New York City. In 2006 for the 30th Anniversary it was held simultaneously in London and at Copia in Napa Valley. Each time and with different judges, Napa Valley wines have been judged better than their French peers and continuously come in top placements.

"The latest retasting of the same red wines and original vintages, was in Tokyo, judged by two American, two French and five Japanese wine and food professionals, for an exact replica of the original Paris tasting judged on the 20 point scale. At the conclusion of the tasting, in first place was Freemark Abbey 1969 Cabernet Sauvignon, followed by Mayacamas 1971 Cabernet Sauvignon and Chateau Mouton Rothschild 1970 in third place."

[See exhibit labeled "Judgment of Paris in Tokyo, May 2017, Final Results"]
