Author: forecastingforskeptics

When I put the search term “of the year award” into Google it found 6,840,000 results. There were Museum of the Year, Business Analyst of the Year, MP of the Year, Pension Scheme of the Year and multiple awards for Book of the Year, Player of the Year and Employee of the Year. There were even awards for Loo of the Year and one for the year’s Oddest Book Title, for which books titled “Nipples on My Knee” and “Renniks Australian Pre-Decimal and Decimal Coin Errors: The Premier Guide for Australian Pre-Decimal and Decimal Coin Errors” were hot contenders in 2017.

Presumably, many of these awards are not intended to be taken too seriously – though it’s surprising how competitive people can be when there’s a potential title on offer. After all, there have been many reports of gardeners sabotaging their rivals’ prize plants in the dead of night. And the supposedly gentle hobby of birdwatching in Britain has been described as truly savage as twitchers battle to dominate the rankings (yes – there are countless birdwatching rankings).

In some cases, awards provide an incentive for the delivery of improved services or products, and they bring satisfaction and pride to the winners, though this may be countered by the collective demotivation and resentment of the many candidates who didn’t win.

Besides the elation or misery they may bring to the contenders, rankings and awards can have significant effects on our decisions. We might choose to study at a university that’s been crowned University of the Year, buy a 4 x 4 that’s Car of the Year, invest our savings on the say-so of a Financial Advisor of the Year or promote an academic who’s won the Journal Paper of the Year award.

So does choosing numero uno really mean we can predict that we’ll be getting the best of the best?

To start with, we have to assume that every candidate is in the frame for the award. But in many cases there are simply too many candidates for this to be possible. Take Book of the Year awards. In the USA, according to UNESCO, 328,259 new titles and editions were published in 2009; the UK figure was 206,000. Even though awards may specialize in particular types of books, or genres, such as romantic novels, there are still far too many for any judging panel to handle. So the judging process has a fundamental flaw before it has started.

Even when the list of candidates is complete there can be huge problems with the ranking process. Sometimes ranking is a matter of judging whether an apple is better than an orange. And if you can’t tell, then toss a coin. The award’s in place and someone has to win it, otherwise we’ll look foolish. Many years ago a relative of mine was asked at the last minute by a neighbour to join him as a judge in a competition of dance troupes of teenage girls. One of the regular judges was indisposed.

“But I know nothing at all about dancing,” my relative protested.

“Don’t worry, when you get the ranking form, just write anything down. That’s what I do,” chuckled the neighbour.

Somewhere a middle-aged woman might be fondly recalling the day she was in a group that won the “Dance Troupe of the Year” award and proudly showing photographs to her grandchildren.

Even if we try to do an honest job of judging ranks, we are likely to be stumped when there are several criteria to be considered. When trying to rank new cars to choose the Car of the Year, one model might be spacious, stylish, full of gadgets and reliable, but it might also consume lots of petrol and produce a bumpy and noisy ride even on immaculate road surfaces. Another might be quiet, economical and reliable, but it might look rather staid and feel a little cramped compared to the first car. Faced with a list of 20 new cars launched by manufacturers this year, all with different pros and cons like these, how do you rank them?

Psychologists have found that in situations like this we struggle to process all the information involved. In particular, we face the challenge of making trade-offs in our heads. How much more leg-room would compensate for a car that does ten miles less to the gallon? Would extra gadgets be sufficient to make up for a bumpier ride? To avoid a headache we resort to simplifications. One strategy is to rank the contenders on the one criterion we consider to be most important – reliability, perhaps – and forget the rest. If two contenders tie on this criterion, we rank them on the second most important – fuel economy, say – and so on.

This simple method glories under the incongruously lengthy title, lexicographic ranking, because it reflects the way in which words are ordered in a dictionary. The worry is that it might lead to a Car of the Year that’s highly reliable, but awful in every other respect. And who’s to say that reliability is the most important criterion anyway? It’s a matter of personal preferences.
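To see how this plays out, here is a minimal sketch of lexicographic ranking in Python. The three cars, their 1-to-10 scores and the priority order are all invented for illustration; only the method itself comes from the description above.

```python
# Hypothetical cars scored 1-10 on each criterion (illustrative numbers only).
cars = {
    "Car A": {"reliability": 9, "economy": 3, "comfort": 2},
    "Car B": {"reliability": 8, "economy": 8, "comfort": 8},
    "Car C": {"reliability": 9, "economy": 2, "comfort": 9},
}


def lexicographic_rank(options, priority):
    """Rank on the first criterion; later criteria are used only to break ties."""
    return sorted(
        options,
        key=lambda name: tuple(options[name][c] for c in priority),
        reverse=True,  # higher scores are better
    )


ranking = lexicographic_rank(cars, ["reliability", "economy", "comfort"])
print(ranking)
```

With these made-up scores, Car A comes top because it ties for the best reliability and wins the tie-break on economy, even though it is poor on comfort. Car B, which is good on everything, comes last.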

Choosing a Car of the Year in this way would not sound credible when we announce our decision to the media. We need a method that uses all the criteria, but still prevents that headache. So another strategy is to set a tolerable limit for each criterion and eliminate from our list all the cars that fail to meet each limit in turn. Get rid of the cars that do less than 45 miles per gallon, then those that have less than four feet of leg room and so on until, hopefully, there’s just one car left.

The trouble is you might be rejecting a car that does 44 miles to the gallon, but is brilliant in every other way. Some employers use this method, called elimination-by-aspects, when they need to whittle down a large pile of application forms from job candidates to a manageable short-list. One wonders how many excellent people have never made short-lists because their examination grades were marginally short of some arbitrary standard or because their experience in the relevant line of work was a month short of the number of years hastily determined by a harassed manager.
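A rough Python sketch of elimination-by-aspects shows the problem. The 45 miles per gallon and four feet (48 inches) of leg room cut-offs are the ones mentioned above; the cars and their figures are hypothetical, and stopping as soon as one car remains is my own simplifying assumption.

```python
# Hypothetical cars (illustrative figures only).
cars = {
    "Car A": {"mpg": 44, "leg_room_in": 52, "reliability": 10},
    "Car B": {"mpg": 46, "leg_room_in": 49, "reliability": 5},
    "Car C": {"mpg": 50, "leg_room_in": 45, "reliability": 7},
}

# Minimum acceptable value for each criterion, applied in this order.
cutoffs = [("mpg", 45), ("leg_room_in", 48)]


def eliminate_by_aspects(options, cutoffs):
    """Drop every option that fails each cut-off in turn."""
    remaining = dict(options)
    for criterion, minimum in cutoffs:
        remaining = {
            name: scores
            for name, scores in remaining.items()
            if scores[criterion] >= minimum
        }
        if len(remaining) <= 1:
            break  # stop once a single candidate (or none) is left
    return list(remaining)


survivors = eliminate_by_aspects(cars, cutoffs)
print(survivors)
```

Here Car A is thrown out at the first hurdle for missing the fuel economy limit by one mile per gallon, despite having the most leg room and the best reliability. Car B wins by default.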

In the end, I suspect that most of us resort to gut feel. Choose the car we like the best, or the one that will give the impression that we have sophisticated tastes. Choose the job candidate who emitted the right vibes and seemed the most convivial. After all, we can always retrospectively conjure up a rationale for our choice so that it looks rigorous.

And if that applies to those who judge and publish ‘of the Year’ awards – which seems likely – we would be well advised to treat these awards with skepticism, especially when it comes to making our own important choices and predicting which option will turn out to be the best for us.

It sounds like a nonsense question. We forecast in the hope of getting an accurate picture of what the future will be like. We’ll be tempted to tweet less-than-complimentary messages about weather forecasters if their predicted barbecue weekend turns into a drenching for guests at an outdoor party we’ve organized. Similarly, a forecast that tells us we can expect to sell 2000 units next week when we only sell 1500 will be galling if we wasted resources producing the surplus 500 units that now have to be dumped.

What I’m referring to is the use of accuracy to decide which forecasting method – or which human forecaster – we should employ in a particular circumstance. Typically, we look at the track record of the method or person. Or we provide them with some past data so any patterns can be detected. But then we keep the latest data hidden and see how accurately they can forecast these unseen observations, which are referred to as holdout data. This raises three practical problems.

First, to be confident in our choice of method or person we need to test a large number of their forecasts. One or two seemingly brilliant forecasts are not enough. Forecasters can be lucky: by chance an outrageous forecast coincides with what actually happens. A maverick analyst foresees a recession no one else had seen coming. Or a TV pundit risks ridicule to predict a stock market crash and a week later the market nosedives. In neither case can we conclude that the forecaster has some mystical powers of foresight. Research suggests that the opposite is more likely to be true.

But, if we need to assess accuracy over a large number of forecasts, where do we get them from? In many circumstances, there is a dearth of opportunities for evaluating forecast performance. Events like elections occur relatively infrequently so we have few chances to assess how skilled a politics expert is in identifying the most likely winner. Product life cycles are getting shorter so we usually have only a limited amount of past demand data. Once we’ve used some of this data to detect patterns there is not much left for the holdout observations. This means that, when comparing competing methods, it’s tempting to use just one, two or three unseen observations and then declare one method as the clear winner.

Of course, we can test the expert on lots of elections in different countries, just as we can test a statistical forecasting method on lots of different short life cycle products, if they are available. But if the expert only claims knowledge of the political landscape of one country, or if the products have different demand characteristics, our testing is likely to mislead us.

This leads to a related problem. As they warn in investment advertisements: past performance is no guarantee of future performance. In a rapidly changing world, what worked in the past may be a poor guide to what will work in the future. Forecasting has been compared to steering a ship by studying its wake. Similarly, to focus on past accuracy is to focus on history in an exercise that should be all about the future.

The third problem is how we measure accuracy. There are a host of different measures, ranging from mean absolute errors to Brier scores, depending on the type of forecast being made. These make different – and often undeclared – assumptions about the seriousness of differences between the forecast and the outcome. As a result, they can lead to contradictory findings. Method A is more accurate than Method B on one accuracy measure, but B is more accurate than A on another measure. Moreover, the assumptions about the consequences of forecast-outcome discrepancies rarely coincide with the true consequences in a given situation – such as a soaking for my party guests, loss of customer goodwill through underproduction or the costs of surplus stocks.
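A small Python sketch makes the contradiction concrete. The sales figures are invented, and I have paired the mean absolute error with the root mean squared error (a common alternative that penalises large misses more heavily) purely as an illustration; neither choice is being recommended here.

```python
import math

# Invented data: four periods of actual sales and two competing forecasts.
actual = [100, 100, 100, 100]
method_a = [110, 110, 110, 110]  # always 10 units out
method_b = [100, 100, 100, 132]  # spot on three times, one big miss


def mae(forecast, actual):
    """Mean absolute error: every unit of error counts the same."""
    return sum(abs(f - a) for f, a in zip(forecast, actual)) / len(actual)


def rmse(forecast, actual):
    """Root mean squared error: large errors are penalised disproportionately."""
    squared = [(f - a) ** 2 for f, a in zip(forecast, actual)]
    return math.sqrt(sum(squared) / len(actual))


for name, forecast in [("A", method_a), ("B", method_b)]:
    print(f"Method {name}: MAE={mae(forecast, actual):.1f}, "
          f"RMSE={rmse(forecast, actual):.1f}")
```

On these numbers Method B has the lower mean absolute error (8 against 10), while Method A has the lower root mean squared error (10 against 16). Which method "wins" depends entirely on how seriously you take one large miss compared with several small ones.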

So what’s the answer? In decision making it’s often said that you should not judge the quality of a decision by its outcome. I might decide to gamble everything I own on a 500 to 1 outsider in a horse race and, incredibly, I win. That’s a great outcome. But most people would agree that it was an awful, reckless decision. We should judge the quality of a decision by the process that underpinned it. Was accurate, cost-effective information gathered? Were all stakeholders consulted? Were risks assessed, and so on? The same should be true of forecasting.

Nearly twenty years ago, the Wharton professor Scott Armstrong led the Forecasting Principles project, which was designed to identify the characteristics of a good forecasting process. The M-Competitions led by Spyros Makridakis have provided further guidance. Later work, such as the Good Judgment project led by Philip Tetlock, has added to our knowledge of what makes a ‘good’ forecast – albeit in a more restricted range of contexts. Of course, the validity of the principles uncovered by these projects depends on their ability to improve the likelihood of an accurate or well-calibrated forecast. This validity is established by testing them on very large numbers of forecasts under different conditions and using a range of measures.

As we’ve seen, in many practical situations we don’t have access to this richness of data to test each of our candidates for the title of ‘Best Forecasting Method’. So we should spend more time comparing how well they adhere to principles of good forecasting and give less prominence to fortuitous short-term bursts of apparent accuracy or a few unlucky instances that seem to suggest poor performance.

A target is what we would like to achieve, even though we may think it’s unlikely. Companies often set sales targets for their staff to motivate them, not because they think the chosen level of sales is the most probable level that will ensue.

We make a decision when we choose a particular outcome from all those that might occur in the future because we think this choice will bring us the most benefit – not because we think it’s the most probable or the expected outcome. If I’m in a marketing department I might think that sales of 500 units are most likely next month, but I choose to present a ‘forecast’ of 400 units. By keeping the forecasts low it’s likely that I’ll be able to boast to senior managers that our brilliant marketing efforts have enabled us to exceed the forecast. If I do this, I’m not forecasting. I’m decision making.

If I’m an economic forecaster and circumstances are changing I might prefer to stick to my original ‘forecast’ of 2% growth, even though I think that 1.6% is now most likely. Changing my forecast too often might be seen as a sign of incompetence. Alternatively, I might play safe and stick to what others are forecasting – even though I think they are likely to be wrong. That way I won’t be exposed if I’m wrong. This is known as herding.

In other circumstances, it might pay me to deliberately make my ‘forecast’ different to others. If I’m the one person who says there is going to be a recession, when everyone else is forecasting growth, I’ll be seen as a brilliant prophet if the economy goes into a slump. I reason that my ‘forecast’ will soon be forgotten if I’m wrong – and anyway I have a catalogue of excuses ready to explain away the blunder.

Decisions masquerading as forecasts are particularly prevalent when forecasting gets mixed up with politics. In organisations, people often have a temptation to exaggerate their forecasts to obtain more funding for their departments. Even the International Monetary Fund (IMF) is not immune from political influence. There is evidence that governments of countries that are politically aligned with the US tend to receive favourable ‘forecasts’ of growth and inflation when they are coming up for re-election. The US is the major funder of the IMF.

Then there are those regular scary weather ‘forecasts’ in tabloid newspapers. ‘Forecasts’ of snowmageddons lasting for three months or summers that will be chillier than winter are outcomes chosen by editors to sell their papers. They know that readers will have long forgotten these headlines by the time the paper hits the recycling box.

The difference between forecasts, targets and decisions is more than a semantic quibble. It can cause confusion and inefficiency in organisations and mislead people in their decisions. Two eminent forecasters, Michael Clements and Sir David Hendry, define a forecast as simply “any statement about the future”. A more specific definition would be helpful. How about: “an honest expectation of what will occur at a specified time in the future based on information that is currently available”. I am sure this can be improved upon, but at least it’s a start.

The publicity surrounding a set of depressing growth forecasts for the UK economy last week has, inevitably, been accompanied by a chorus of commentators reminding us that forecasts are almost always wrong. In fact, you can seldom say that a ‘proper’ forecast is wrong. By a proper forecast I mean one that acknowledges the unavoidable uncertainty we face when trying to estimate what may prevail in the future.

The forecasts scattered across the headlines are invariably single numbers: 1.3% growth in 2020, 2.2% inflation in 2019, a 4.3% rate of unemployment in quarter 3 of 2018. But, underlying these figures, and less attractive to newspaper editors, there is usually a more detailed assessment of a range of possible futures and their associated probabilities. A forecaster’s model may suggest, for example, that there is a 10% chance that growth will be less than one percent, a 15% chance that it will be between one and two percent and so on. The single number that surfaces – a so-called point forecast – simply represents an average of all the possible outcomes that can be foreseen, after taking into account their chances of occurrence.

In theory, a point forecast of 1.3% growth for 2020 means that, if we could re-run the 2020 economy a large number of times allowing different combinations of chance events to occur each time then, on average, we would expect to have growth of 1.3%. Of course, we will only experience the 2020 economy once so we will never know what this true average would have been. To claim that a point forecast is wrong amounts to saying that an average is wrong when you’ve only seen one outcome. Imagine someone concluding that an estimate that the average height of American men is 69.3 inches must be wrong because they have just been speaking to an American man who is 73 inches tall. Condemning a single forecast as being wrong is no different.
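The idea can be sketched in a few lines of Python. The three growth scenarios and their probabilities below are invented so that the average works out at roughly 1.3%; they are not taken from any real forecaster’s model, and sampling from them is only a stand-in for "re-running the economy".

```python
import random

# Hypothetical scenarios: (growth in %, probability). Invented for illustration.
scenarios = [(0.5, 0.3), (1.5, 0.6), (2.5, 0.1)]

# The point forecast is the probability-weighted average of the outcomes.
point_forecast = sum(growth * prob for growth, prob in scenarios)

# "Re-run the economy" many times by sampling from the same distribution.
rng = random.Random(42)
outcomes = [growth for growth, _ in scenarios]
weights = [prob for _, prob in scenarios]
reruns = rng.choices(outcomes, weights=weights, k=100_000)
simulated_average = sum(reruns) / len(reruns)

print(f"Point forecast:       {point_forecast:.2f}%")
print(f"Average over re-runs: {simulated_average:.2f}%")
print(f"One single re-run:    {reruns[0]}%")
```

Notice that in this toy example no individual re-run can ever produce growth of 1.3%, because it is not one of the possible outcomes. The point forecast is only ever approached by the average over many re-runs, which is exactly what we never get to observe.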

The same applies to forecasts of events. If I forecast that you won’t win the jackpot in the National Lottery next week, I must mean that I think this is the most likely outcome of your gamble. Thinking otherwise would suggest that I have delusions that I can see the future with certainty – a trait usually reserved for astrologers, necromancers and their like. If you win the jackpot, you can’t say my forecast was wrong. Not winning was still the most likely outcome even though things didn’t turn out that way. Similarly, if I forecast that Manchester United will beat Arsenal when they next play soccer at Old Trafford and Arsenal win, this does not prove that a Manchester United win was not the most likely result. If the game was replayed a hundred times, Manchester United might win 75% of the time.

In a single result we just don’t have the luxury of being sure that the most likely event has revealed itself.

Inaccurate sales forecasts are the bane of many managers’ lives. And they can be costly too, with disappointed customers who will never do business with you again or warehouses stuffed with goods that no one wants. Back in 2001 Nike allegedly suffered $100 million in lost sales when their forecasts told them to order $90 million worth of shoes that sold badly and to cut back on orders for popular sneakers like the Air Force 1. We can seldom predict sales with perfect accuracy, but there are common mistakes that company forecasters can avoid.

If you feed computers with good quality software they will crunch through mountains of data, producing forecasts that make optimal use of information buried in sales histories and market research data. But, like Henry Ford, who believed history is ‘more or less bunk’, many managers cannot resist the temptation to jettison data they think is past its use-by date. ‘Back then the trends were different,’ I’ve heard them say when ‘back then’ is a mere couple of months or years ago. Modern statistical forecasting techniques can adapt to changing trends, while also exploiting relevant past patterns that have remained stable over time. Mistake 1 is to dump most of your data.

Despite the power of computers, there are times when judgmental intervention is justified. A forthcoming sales promotion perhaps, a hike in VAT or a new competitor – they are all things the computer might be unaware of so it makes sense to adjust its forecasts up or down. However, people can seem addicted to overriding computer forecasts. In a food company I visited, over 90% of the forecasts were changed for no apparent reason, except perhaps to justify the forecasters’ role. All this effort served only to damage accuracy. Mistake 2 is over-adjusting forecasts. Reserve your interventions for important future events.

Politics is a more insidious reason why people change forecasts. Managers may deliberately keep their forecasts low so they can proudly announce each month that they have exceeded their forecast yet again. Alternatively, a high sales forecast can bring kudos because it will please the boss or attract a chunkier budget to one’s department. These machinations have little to do with genuine expectations of future sales. It can be hard to avoid, but mistake 3 is mixing politics with forecasts.

Sometimes there is confusion about what a forecast actually is. It’s not a target – that’s a sales figure you set to motivate people, not necessarily what you think will happen. Nor is it a decision. Staff working for one major retailer were confusing the sales forecast – the computer’s estimate of the most likely level of sales – with their decision on how much stock to hold. If the computer forecast sales of 200 units they might decide to stock 220 units in case of unexpectedly high demand. But they then referred to the 220 as the forecast. Other managers thought that this was the most likely level of demand and complained how awful these apparent forecasts were. Mistake 4 is to confuse forecasts with targets and decisions.

In some companies, groups of managers meet to approve computer predictions or pool their judgments on the future prospects for sales. But psychologists tell us that group dynamics can result in strange outcomes. In cohesive groups, or where the boss is at the helm, members may be reluctant to ‘rock the boat’. As a result, judgments can coalesce around highly implausible forecasts, as everyone rushes to support the prevailing view. Allowing this phenomenon, known as groupthink, to distort your forecasts is mistake number five.

The final mistake is being deceived by randomness. Consumers’ whims, accidentally broken products that need replacing, advertisements that by chance catch a person’s eye and a host of other factors mean that a proportion of sales will be unpredictable. Despite this, we can be too willing to junk forecasts that are as accurate as possible because they have not been spot on. We see illusory patterns in sales graphs we think the computer has missed and wrongly think a freak sales figure is a sign of fundamental change. Even the best forecasts will usually differ from actual sales. But they will perform much better than forecasts that are constantly revised in a futile attempt to capture every random twist and turn in sales.

Avoiding these mistakes will not guarantee perfect forecasts. But you should see accuracy improve – and that’s a confident forecast.

Can ‘most liveable city rankings’ predict how pleasant it will be to live in different cities?

Pity the people of Vienna in 2016. They didn’t live in the most liveable city in the world. The city of Freud, Mahler, Schubert and Schrödinger, replete with the glorious buildings of the Habsburg empire, had only scored 97.4 in the Economist’s ranking of the liveability of world cities.

Robert Doyle, the Lord Mayor of Melbourne, which had pipped Vienna to the title, was tweeting that it was a ‘Great day to be a Melburnian’. Melbourne’s winning score was 97.5, 0.1 above Vienna’s. At least the disappointed denizens of Vienna didn’t have to put up with life in Hamburg, which scored a mere 95.0 and languished in tenth place. Worse still, they could wake up one day and find themselves dwelling in London –down in 53rd place.

What a difference that extra score of 0.1 made. As the American football coach Paul ‘Bear’ Bryant is credited with saying: “Winning isn’t everything, but it beats anything in second place”. The headlines around the world were all about Melbourne. Rowing crews were pictured as they lined up on the Yarra River against a background of grassy banks, trees and surging skyscrapers that shouted modernity and prosperity. And Melbourne’s affluent beach-lined suburb, Brighton, got in on the act. Its residents could be seen tanning themselves on golden sands, fringed by gaily-painted beach huts, palm trees and sprawling mansions (median price: £1.6 million).

But it seemed that all was not well in this urban paradise. A press conference called by the Deputy Lord Mayor to promote Melbourne’s ‘top of the world’ ranking was interrupted by a woman shouting: “It’s disgusting!” and “Melbourne should be ashamed of itself”. She was protesting the 74% increase, over two years, in the average number of people who were sleeping rough in the city’s central business district. The current estimate was 247 people.

Then there was a survey that had found that nearly half of Melburnians were frustrated by the city’s high cost of living. And, while Melbourne had scored a perfect 100 out of 100 for its education, healthcare and infrastructure, the local paper could not resist contrasting this with the experiences of rush-hour travellers on the city’s Punt Road or the thousands of people awaiting elective surgery.

OK, nowhere is perfect, but a score of 97.5 looks pretty close to perfection to me. It raises the question of whether you can represent liveability (whatever that may be) as experienced by millions of people by a single number measured to one decimal place.

It turns out that the Economist’s table isn’t primarily designed to represent the experience of ordinary citizens in the places it covers. Instead, its main role is to provide guidance to multi-national companies when calculating the relocation packages that should be awarded to employees to allow for the gloom or pleasure of moving to a new city. A score above 80, the Economist suggests, should attract no extra allowance, 70 to 80 is worth an extra 5% of salary, while 50 or less should earn mobile global talent an extra 20%. All of the criteria used in the ranking are focused on the needs of ex-pats. Its five main categories, with weights in brackets, are stability (25%), which covers factors such as threat of crime and terrorism, healthcare (20%), culture and environment (25%), education (10%) and infrastructure (20%). There’s no reference to cost of living, perhaps because this doesn’t concern global talent whose salaries already take this into account.

So the league table is actually intended as a ranking of ‘liveability of cities for people who work for multi-national companies and who are relocated there’. But that doesn’t make for crisp newspaper headlines or stories. As a result the distorted perception emerges that it’s a measure of liveability for ordinary folk. They might proudly announce ‘I live in the world’s most liveable city’, not knowing that the measure has little to do with them.

But are the rankings in the table even meaningful to ex-pats? What about aesthetics, friendliness and a sense of community? And look at those weights. Where do they come from? Who is to say that stability should have a weight of 25%, while infrastructure only gets 20%? I might be much more concerned about the awful road conditions on my daily commute than the remote threat of crime in the exclusive neighbourhood where I live.

Despite their scientific veneer the weights are arbitrary and, almost certainly, don’t reflect the differences between the best and worst performances on each criterion. Make a small change to those weights and the presses will be rolling triumphantly in Vienna, or even Hamburg, celebrating their city as the best in the world. In this case, there would doubtless be serious investigations and crisis meetings in Melbourne to establish why their city was suddenly in decline. And all because of a tweak in an anonymous league table compiler’s weights.
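A short Python sketch shows how fragile such a ranking can be. The five category weights are the ones quoted above; the two cities, their category scores and the "tweaked" weights are entirely hypothetical and are not the Economist’s figures for any real city.

```python
# Category weights as quoted in the text.
published_weights = {
    "stability": 0.25,
    "healthcare": 0.20,
    "culture_environment": 0.25,
    "education": 0.10,
    "infrastructure": 0.20,
}

# Hypothetical tweak: move five percentage points from infrastructure to stability.
tweaked_weights = dict(published_weights, stability=0.30, infrastructure=0.15)

# Invented category scores out of 100 for two imaginary cities.
cities = {
    "City X": {"stability": 95, "healthcare": 100, "culture_environment": 95,
               "education": 100, "infrastructure": 100},
    "City Y": {"stability": 100, "healthcare": 100, "culture_environment": 96,
               "education": 100, "infrastructure": 92},
}


def overall(scores, weights):
    """Weighted sum of the category scores."""
    return sum(scores[category] * weight for category, weight in weights.items())


def winner(cities, weights):
    """Name of the city with the highest overall score."""
    return max(cities, key=lambda name: overall(cities[name], weights))


for label, weights in [("published", published_weights), ("tweaked", tweaked_weights)]:
    totals = {name: round(overall(s, weights), 2) for name, s in cities.items()}
    print(label, totals, "winner:", winner(cities, weights))
```

With the published weights City X edges it, 97.5 to 97.4. Shift just five percentage points of weight from infrastructure to stability and City Y comes out on top, without a single thing having changed in either city.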

For many people the recent warmongering by North Korea, with its threat to dispatch inter-continental ballistic missiles to the U.S. island of Guam, brought back nervous memories of the Cuban missile crisis of 1962. Crises like these can incite brinksmanship at its most dangerous, where a single miscalculation by a volatile adversary could threaten the entire world. But why is the regime in Pyongyang taking such risks? In any conflict with the USA, it would doubtless be destroyed, along with the country that it governs so cruelly.

It is, of course, impossible to access the thinking of such a secretive regime, but we can be sure that it is autocratic and that its leader, Kim Jong-un, is unlikely to brook dissent. After all, in February 2017 he is believed to have ordered the killing of his half-brother, Kim Jong-nam, a critic of the regime, at a Malaysian airport. In 2013 he had his uncle, Jang Song-thaek, executed for alleged treachery. Intolerance of opposition is at the heart of groupthink – a phenomenon identified by the psychologist Irving Janis more than forty years ago. The worry is that groupthink can lead to irrational and dangerous risk taking.

Janis described groupthink after studying a number of calamitous decisions by the US government, such as the Bay of Pigs invasion of Cuba in 1961 –which, arguably, led directly to the Cuban missile crisis. He found that decision making groups, where there is pressure to conform to the prevailing view, or that of a directive leader, can develop illusions of invulnerability, excessive optimism and a belief in the group’s inherent morality, irrespective of what it plans to do. Rivals and enemies are stereotyped as evil, weak and stupid so any threat they seem to pose is minimised.

This departure from reality is fostered by members of the group collectively rationalising what is being proposed, rather than subjecting proposals to a critical evaluation. Any potential doubters are encouraged to minimise the importance of their concerns and counter arguments and remain silent. The group’s overconfidence is exacerbated by so-called mindguards who filter information to ensure that members only receive messages consistent with their favoured course of action. Groupthink thrives when the group is insulated and when it perceives itself to be under attack from an outside source. The sanguine illusions it creates can therefore act as a coping mechanism to assuage anxiety and tensions.

Many other examples of groupthink have been identified in the years since Janis’s first study. NASA’s hugely risky decision to launch the Challenger space shuttle on 28 January 1986, despite the fact that the vehicle’s rubber O-rings had not been tested in temperatures as low as those that prevailed that morning, has been blamed on the phenomenon. The shuttle exploded 73 seconds after launch when an O-ring failed.

Similarly, groupthink may have been responsible for the collapse of Swissair, and the troubles of British Airways and Marks and Spencer, in the 1990s. The last two had previously been considered ‘darlings of the stock exchange’. For Marks and Spencer’s senior managers this served only to intensify their perceptions of invulnerability and their disdain for dissenting voices.

Good decision-making involves searching for alternative courses of action and testing these to see how they fare when the arguments favouring them are challenged in open critical debate. It involves a willingness to search for new information and a careful examination of the risks that each option might carry.

While we have no direct knowledge of Pyongyang’s machinations, the classic precursors and symptoms of groupthink appear to be writ large. And that may explain the regime’s alarming and potentially suicidal threats.