
In the previous post, I described how some researchers found insights from a database of fatal car crashes. This dataset has all the markings of OCCAM data, a term I use to summarize the characteristics of today's data.

Observational

the data come from reports of crash fatalities, rather than experiments, surveys, or other data collection methods

No Controls

the database only contains the cases, i.e. fatalities but not controls, which in this case should be drivers who did not suffer fatalities. The study design creates a type of control but as discussed in the previous post, the "controls" are still fatalities, just ones that happened during different weeks. Such a study design requires the untested assumption that, under normal circumstances, the frequency of fatalities is constant within the three-week window of the study.

Seemingly Complete

it is assumed that all crashes involving fatalities are reported accurately in the database. This assumption is frequently discovered to be wrong when the analyst digs into the data. A recent example is the Tesla Autopilot analysis: even though in theory Tesla should have data on all its vehicles, the spreadsheet contains a large number of missing values.

Adapted

the fatality data are collected for a number of uses, none of which is to investigate the potential effect of 420 Cannabis Day. Adapted data are sometimes called found data or data exhaust.

Merged

For this analysis, the researchers did not merge datasets. Most of the time, they do. For example, one of the commenters suggests looking at the effect of temperature. To do that requires merging local temperature data with the fatality data. Merging data creates all kinds of potential data quality issues.

***

In this post, we shall set aside the conclusion of the previous post, that April 20 may not be extraordinary, and accept for the sake of argument that April 20 is an unusual day.

The first question to ask is: unusual in what way?

Let's look at the histogram again:

April 20 is unusual in having a higher number of fatal car crashes compared to the average of April 13 and 27.
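To make the comparison concrete, here is a minimal sketch of the computation, with made-up counts (the real numbers live in the fatality database):

```python
# A minimal sketch of the case-crossover comparison (made-up counts,
# not the actual database numbers). April 13 and 27 serve as "controls"
# for April 20 under the constant-frequency assumption.
fatal_crashes = {"apr13": 120, "apr20": 135, "apr27": 118}

control_avg = (fatal_crashes["apr13"] + fatal_crashes["apr27"]) / 2
crash_ratio = fatal_crashes["apr20"] / control_avg
print(f"crash ratio: {crash_ratio:.2f}")  # above 1 means April 20 looks worse
```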

That is what we learned from the data. Our next question is: why is April 20 worse?

According to the original study, the reason for the excess fatalities is excess cannabis consumption on April 20 because 420 is cannabis celebration day.

But at this point, we only have story time. Story time is the spinning of grand stories based on tiny morsels of data. The moment hits you in the second half of a newspaper article or research report, after the author presents the data analyses: you realize that story-telling has begun, and the report has strayed far from the evidence.

In this case, it's the link between excess fatalities and excess cannabis consumption that is tenuous. The problem goes back to OCCAM data, and the lack of proper controls. If we could perform an experiment, the evidence could be interpreted more directly.

The database of fatalities does not contain data on cannabis consumption. The original study reports some information from the "Drug police report" field, with over 60 percent of the cases listed as "not tested or not reported". This information was not used to argue one way or another about cannabis consumption.

The next step for this type of study is finding corroborating evidence to support the causal story. For example, are more of these accidents occurring around neighborhoods in which 420 Day is being celebrated? Can we find neighborhoods that only started celebrating 420 Day after a certain year, and look at whether a jump in crash fatalities occurred after that year? Do people drive more or less frequently after they smoke weed? Are there proxies for cannabis consumption? (For example, maybe cannabis users are more likely to drive certain cars.) And so on.

Harper and Palayew looked into whether the crash ratio worsened over time, as cannabis consumption may have increased over time. They did not find such a trend, which weakens the causal conclusion.

EPA Administrator Scott Pruitt sinks to a new low in his dogged campaign against evidence-based decision-making. (See my previous posts in which I dissected his other tactics here and here.) He directs his agency to propose new rules for disclosure of data and methods in climate science, placing himself in charge of deciding whether a given study has enough disclosure to be allowed into evidence.

Slate's Daniel Engber, someone I respect, weighs in on this debate with a long-winded, confusing piece that loses its way. Pruitt tries to wrap his anti-science crusade in the cloak of "open science/reproducibility," an otherwise seminal movement that is rattling certain research fields such as social psychology. I encourage you to read what Engber has to say.

Here are a few things to bear in mind when thinking about this issue.

The data analyses that underpin climate science are of a different nature from the studies being attacked in the social-psychology literature:

The replication crisis concerns the failure to reproduce results from randomized, controlled experiments, intended to prove a cause-effect relationship. Data are collected after researchers design the experiments. A different set of researchers can follow the design to amass additional data in a replicated study. In climate studies, the scientists do not and cannot run experiments. Climate models use observational data, much of which are reconstructed or indirect measurements. These models are validated using different means, and so the proposed guidelines appealing to replication cannot be effective.

The most controversial social-psychology studies make strong claims of small effects in the presence of high levels of uncertainty (for example, see this article about power pose research by Andrew Gelman and me). Andrew has a lot more to say about this research setting on his blog; in short, treat those studies with a huge grain of salt. By contrast, the key assertions made by climate scientists concern huge effects, e.g. the level of greenhouse gases in the atmosphere has never decreased since the start of the Industrial Revolution. More disclosure, or repeated requests for additional disclosure, will not alter that finding, but will delay the finding from reaching the public.

As I read Engber's article, the words "manufactured crisis" kept popping into my head. There isn't a crisis of climate scientists unwilling to explain their methods and reasoning. There isn't a crisis of climate researchers sacrificing "quality" in the name of "quick" results. Quite the opposite: there is already more disclosure in climate research than in other fields.

The policy proposal has little to do with promoting "open science." A huge number of scientists from diverse fields participate in the worldwide collaboration that produces the consensus climate models and results. If that is not "open," I don't know what is. The standard of disclosure is deplorable in many other fields: take, for example, all the grand claims about AI or machine learning coming out of industry labs, for which there has been only selective disclosure of data, models, code, or methods.

The policy proposal has little to do with increasing transparency in climate science. Year after year, thousands of scientists compile well-organized compendia of their models and results, making it easy to understand how they have come to their conclusions. By contrast, when there are commercial interests involved, data are often withheld or delayed – clinical trials data being a notorious example.

The new rule moves the judgment of what is "reasonable disclosure" from the community of scientists to the EPA Administrator, who has no expertise in research science. The status quo is obviously superior. Currently, a researcher who makes a bold claim must convince the community to accept his or her finding. If others ignore the study, due to inadequate disclosure, it will wither away. The proposal will make Scott Pruitt and/or other non-experts decide what's admissible, shoving the community of scientists to the curb. Engber argues that this problem can be solved by appointing a review board, but he fails to convince me that a board of a dozen people could do the work of an entire community.

***

The EPA proposal is a set of remedies for manufactured crises. The reasoning behind this proposal can be understood through the lens of change management within organizations. Climate researchers are advocating changes in our approach to managing the environment, using evidence from data analyses. The skeptics, including industries that fear lower profits from adopting such changes, want the status quo.

This playing field is uneven, and the same game is played out not just in climate science but in every business decision in which the data analyses suggest that the status-quo strategies are deficient. To play the skeptic's role, one doesn't need any data, or theory, or anything at all. To keep the status quo, one just needs to stall: ask questions, request more information, order more research, and so on.

It gets worse in the age of Big Data. By taking the data and conducting one's own analyses, a skeptic can stall better than ever before; it takes less time to generate useless analyses than to explain the flaws within such analyses.

If the EPA proposal is accepted, I’m afraid we are beginning to see the ill effects of “big data.” The nightmare scenario I outlined in my book, Numbersense, is taking shape. Here is what I predicted:

More data inevitably results in more time spent arguing, validating, reconciling, and replicating. All of these activities create doubt and confusion. There is a real danger that Big Data moves us backward, not forward. It threatens to take science back to the Dark Ages, as bad theories gain ground by gathering bad evidence and drowning out good theories. (p.13)

The controversy over Facebook user data is simmering. Here is a summary of the latest developments.

***

Buzzfeed published a leaked memo from one of Facebook's top executives, titled "The Ugly," that describes a disturbing philosophy, one that reflects both an understanding of the power of private data to transform lives and a reckless, intentional disregard for potential harm. The memo ought to be read in full but the most shocking sentences are:

Maybe someone dies in a terrorist attack coordinated on our tools. And still we connect people. The ugly truth is that we believe in connecting people so deeply that anything that allows us to connect more people more often is *de facto* good...That’s why all the work we do in growth is justified. All the questionable contact importing practices. All the subtle language that helps people stay searchable by friends...

Once the memo leaked, the author claimed that it was a "provocation." His boss said he disagreed with the memo.

***

Facebook has suspended two more organizations from accessing user data. One is AggregateIQ, a Canadian company thought to be an arm of SCL, the firm at the center of the Cambridge Analytica scandal. The other, Cubeyou, a California company, is alleged to have resold data collected by a University of Cambridge academic center.

These cases involve shady practices in which users are coaxed or tricked into providing personal data through "gamification" (e.g. quizzes) and consent language hidden inside terms and conditions that, as everyone knows, few people read. In each case, Facebook accused academic researchers of profiting from data collected under the pretense of academic research. The academics countered that they disclosed, in those same terms and conditions, that the data were collected for both research and commercial use.

So, everyone believes s/he is acting by the rules. Why is anyone apologizing?

***

Facebook supporters say that the social-media giant needs revenues from selling user data to compensate for the core product being free.

Strangely, the current controversy has nothing to do with Facebook selling user data for dollars. All three suspended organizations supposedly "stole" data from Facebook by posing as academics, so they took the data for free. In fact, Facebook itself drew attention to these cases by announcing the suspensions.

The real story, which hasn't yet been told, concerns the commercial deals in which Facebook earns revenues from selling user data to third parties.

As I indicated in my 7 Principles post, there is no need to hide data collection if the collected data benefit users; and businesses can derive value by analyzing user data without selling them to third-parties.

***

Another report confirms what I said here before: privacy is the civil rights issue of our day. We are quickly heading toward a world in which privacy is stripped by default, and purchased back by the privileged.

TechCrunch tells us that Facebook's CEO deleted messages he had sent from recipients' inboxes, yet ordinary Facebook users are not allowed to delete any messages they have ever sent.

Congress may have joined the circus for publicity but it is far from innocent. Congress has passed various laws that stifle our right to privacy. Most recently, in the omnibus spending bill signed by President Trump, Congress snuck in the CLOUD Act, which

is a surveillance bill that allows the US and foreign governments to obtain your online data directly from service providers like Facebook, Google, Slack, etc., without a warrant. The EFF has called it “a new backdoor around the Fourth Amendment.”

I have been contributing to Andrew's thread on how to get into the data science field. A recent college grad with a degree in environmental science and a minor in statistics wants a job. Andrew suggests getting a job in industry, which I think is an excellent suggestion.

Here is my advice:

Figure out what he enjoys doing: is it coding or is it problem solving? Those are two different jobs; one is software engineering, the other is more statistics and analysis. If he is in NYC, come to one of my public lectures at NYPL in which I explain how to pick a career path within this wide and exciting field. [The next one is on the May schedule.]

Once he has picked an area, and hopefully also an industry, then he needs to reach out and talk to as many people in industry as possible. Go to networking events and meetups.

Then apply to jobs. The job search is a job in itself; keep applying until someone gives you a chance. You will encounter lots of rejection but keep trying.

If nothing is working, consider going to a bootcamp. They are set up to give you practical skills that appeal to hiring managers. Talk to the bootcamp organizers to get a sense of what their vision is, and see if it’d help you make your case.

One reason I have organized a bootcamp is that for some, it will be very difficult to break into the field without extra help - both filling knowledge gaps and making industry connections. I give the above advice to my students as well. They need to find a job that matches their temperament, and then work hard at convincing hiring managers to take a chance.

***

Next Tuesday, we are hosting an Open House.

If you're interested in learning about our vision, drop by and say hello.

Rachel Thomas's article came across my Twitter feed. It caught my attention because of its click-baity title, "How (and why) to create a good validation set."

Or, I thought it was click bait but she is really serious about this. (For those not familiar with the literature, we don't use all historical data to build machine learning models. The historical data are split, typically at random, into training and validation sets. The validation set is supposed to simulate new data the algorithms haven't seen before, a sort of honest check of the model.) She makes some alarmist claims here:

there is such a thing as a "poorly chosen" validation set

random selection is not a good way to make a validation set, a "poor choice for many real-world problems"

the analyst should manufacture a validation set

the validation set should be representative of future, currently-unseen, data

Even though I don't like any of her advice, I can't disagree with her diagnosis:

An all-too-common scenario: a seemingly impressive machine learning model is a complete failure when implemented in production. The fallout includes leaders who are now skeptical of machine learning and reluctant to try it again.

***

One of the examples given is a response function that has a time trend.

If this model does not detect the trend, the prediction will indeed have poor accuracy on real-world data. She claims that a validation set based on a pre-post time split is better than a random selection.

Since this is a simple linear trend, either way of making the validation set will capture the trend. So what makes the model fail in production is not the presence of this trend but a shift in the trajectory after the model is deployed. And the choice of validation set won't prevent that problem.

The downside of the pre-post split shows up when there are many time-varying predictors. A naive example: if an on-off switch just happens to be flipped at the split point, then all your training examples have the "on" condition while all your validation examples have "off".
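To make the two designs concrete, here is a minimal sketch (the dataframe, column names, and cutoff are all hypothetical, not from Rachel's post):

```python
# A minimal sketch of the two ways to carve out a validation set
# (hypothetical data; not Rachel's example).
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

n = 1000
df = pd.DataFrame({
    "date": pd.date_range("2017-01-01", periods=n, freq="D"),
    "x": np.random.randn(n),
})
df["y"] = 0.05 * np.arange(n) + df["x"]  # response with a linear time trend

# Random split: training and validation sets cover the same time span
train_r, valid_r = train_test_split(df, test_size=0.2, random_state=42)

# Pre-post split: the validation set lies entirely after a cutoff date
cutoff = df["date"].iloc[int(0.8 * n)]
train_t = df[df["date"] < cutoff]
valid_t = df[df["date"] >= cutoff]
```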

Manufacturing the validation set to reflect unknown future trends creates a conceptual difficulty. The training set is now materially different from the validation set, so why would we expect the trained model to perform well on the validation set? And how much degradation in validation-set performance is the right price to pay for potentially better in-market performance? That question boils down to how far you want to generalize from the data, and it sits at the core of the statistical view of the modeling problem.

***

The subtext of the article is that if the model doesn't work, fix the data. I tend to want to fix the model. If it doesn't work in production because the nature of the time trend has shifted, then adjust the model to include the new time trend.

Diagnosing the difference between production data and historical data is part of good model hygiene. It's very hard to predict unexpected shifts in the data and even if you could, you wouldn't have any training data to support such shifts.

The "data fix" is not the solution. Refining one's model is.

PS. While I don't agree with designing your validation set, I do advise selecting your historical dataset carefully, and thinking about which units to include in or exclude from the modeling process, which Rachel discusses at the end of her post.

Some of you might be wondering why I haven't commented on the front-page feature in New York Times Magazine two weeks ago, titled "When the Revolution Came for Amy Cuddy." Susan Dominus, the author of the feature, cited an article about Cuddy's "power pose" research that Andrew Gelman and I co-authored in 2016 as an example of vicious personal criticism of Cuddy. In fact, it is one of the very few actual examples of the vicious personal criticism that, Dominus repeatedly told readers, has destroyed civility in the field of social psychology.

You can now read my response to the NYT Magazine article in the now-published Letter to the Editor. (Scroll to the second entry.) I didn’t say much before because I didn’t want to front-run the letter but I do have lots to say. Since the NYT letter is limited to 200 words, it is a much abridged version of what I had in mind.

I find Susan’s interpretation of the events to be unbalanced. If you are a curious, or serious, reader, I urge you to follow the links below (in addition to reading Dominus's feature article), and form your independent judgment of the movement to reform the use of data in psychological experiments.

Susan Fiske, Amy Cuddy's academic advisor at Princeton, is mentioned in the NYT Magazine piece. An important contribution of hers to this debate is a "Presidential Guest Column" in the APS Observer, titled "Mob Rule or Wisdom of Crowds?," in which she coined the term "methodological terrorism." Prof. Fiske is a past president of the Association for Psychological Science (APS). The entire column can be read here.

The NYTM article mentioned a “combative” response by John Bargh, Yale professor and father of the seminal priming study, to the failed replication study. He published his response on Psychology Today, a blogging platform. The blog post was subsequently deleted but you can find the post titled “Nothing in Their Heads” archived here.

Daniel Kahneman's letter to priming researchers is cited by Dominus, and deserves to be read in full. Dan Goldstein has a copy of it here. There are many good quotes, including "Your problem is not with the few people who have actively challenged the validity of some priming results. It is with the much larger population of colleagues who in the past accepted your surprising results as facts when they were published."

Andrew Gelman, professor of statistics at Columbia, regularly blogs about the validity of social science research at his popular blog. As Dominus noted, the power pose research started by Dana Carney, Amy Cuddy and Andy Yap is one recurring topic on the blog. Gelman has many other favorite, recurring targets, for example, Daryl Bem (ESP research, Cornell), Brian Wansink (Food and Brand Lab @ Cornell), Richard Tol (climate economics), Roy Baumeister (ego depletion), Satoshi Kanazawa (evolutionary psychology, LSE), etc. I can't help but notice that almost everyone on this list is male.

The Data Colada website, run by Uri Simonsohn and Joe Simmons, is only one click away. Given that a whole meal was made of the inner workings of a single blog post that criticized the Carney et al. study, the reader should absolutely check out this website, where serious scientists are discussing how to reform science.

The original paper by Carney, Cuddy and Yap (2010) that launched this power pose research program is found here. If you don't have time to read the whole paper, at least read the abstract, for what their original scientific claim was:

Humans and other animals express power through open, expansive postures, and they express powerlessness through closed, contractive postures. But can these postures actually cause power? The results of this study confirmed our prediction that posing in high-power nonverbal displays (as opposed to low-power nonverbal displays) would cause neuroendocrine and behavioral changes for both male and female participants: High-power posers experienced elevations in testosterone, decreases in cortisol, and increased feelings of power and tolerance for risk; low-power posers exhibited the opposite pattern. In short, posing in displays of power caused advantaged and adaptive psychological, physiological, and behavioral changes, and these findings suggest that embodiment extends beyond mere thinking and feeling, to physiology and subsequent behavioral choices. That a person can, by assuming two simple 1-min poses, embody power and instantly become more powerful has real-world, actionable implications.

Here is the Ranehill et al. replication study. Here is the full text of Dana Carney's renunciation of power pose research, in which she disclosed how the steps taken during the data analysis could have led to a false-positive finding. Here is the full press release sent by Amy Cuddy's publicist to New York Magazine, which was Cuddy's first public response to the Ranehill et al. study.

Here is the article Andrew and I co-authored about our general concern about popular press and social media promulgating peer-reviewed studies that have weak statistical foundations, using the power pose research as an example.

***

One thing to which the NYTM article did not do justice is the set of deep, underlying issues concerning how evidence from data is used in social science. In November 2016, I wrote a series of blog posts to explain the statistical issues being discussed. The agenda:

It's hard to keep up with Andrew Gelman, so let me point to some interesting recent posts from his blog.

Readings on philosophy of statistics (link): Andrew has a bunch of links to (mostly his own) writings about deep statistical issues. Science is about understanding how the world works, which involves questions of cause and effect, and of randomness and unexplained variability. Observable data are almost never sufficient to establish cause decisively, but statistical theories can be drawn upon to make careful, principled conjectures. These statistical methods are not infallible, and are subject to abuses, both malign and unintentional. Recent work has uncovered that lots of results from all kinds of fields (psychology, social psychology, evolutionary psychology, medicine, cancer studies, etc.) cannot be replicated, raising concerns about abuses. Andrew, as well as his commenters, compiles a list of readings for those interested in this ongoing controversy.

An elementary error showing up in JAMA (link): misinterpreting p-values - every elementary textbook warns against such erroneous claims

Another one on Gelman's favorite subject - the garden of forking paths leading to over-confident statistical conclusions. I once summarized his arguments in a series of posts: 1, 2, 3

Some commentary on Mechanical Turk and the general issue of measurement and data quality (link). This is an important topic in Big Data. I will be writing about a study that uses weather as the explanatory variable. Weather is derived by looking up someone's IP address and then the weather report for that location. One should ask how accurate this measurement of weather is for the study.

I don't speak often at marketing conferences, and that's because my message is not easy to take. For example, one of my talks is titled "The Accountability Paradox in Big Data Marketing." Google and other digital marketers claim that the ad-tech world is more measurable, and thus more accountable, than the old world of TV advertising; they claim that advertisers save money by going digital. The reality is not so. There has been some attention to this problem recently (for example, here), but far from enough.

Let me illustrate the problems by describing my recent experience running ads on Facebook for Principal Analytics Prep, the analytics bootcamp I recently launched. For a small-time advertiser like us, Facebook presents a channel to reach large numbers of people to build awareness of our new brand.

So far, the results from the ads have been satisfactory but not great. We are quite content with the effectiveness but wanted to run experiments to get a higher volume of "conversions". This past week, we ran an A/B test to see if different images result in more conversions. We designed a four-way split, so in reality an A/B/C/D test. One of the test cells (call it D) is the "champion," i.e. the image that had performed well prior to the test; the other images are new. We launched the test on a Friday.

Two days later, I checked the interim results. Only one of the test cells (A) had any responses. Surprisingly, test cell A had received about 90% of all "impressions." Said differently, test cell A received nine times as many impressions as the other three cells combined. The other test cells were getting such measly allocations that I lost all confidence in this test.

It turns out that an automated algorithm (what is now labeled A.I.) was behind this craziness. Apparently, this is a well-known problem among people who have tried to do so-called split testing on the Facebook Ads platform. See this paragraph from the AdEspresso blog:

This often results in an uneven distribution of the budget where some experiments will receive a lot of impressions and consume most of the budget leaving others under-tested. This is due to Facebook being over aggressive determining which ad is better and driving to it most of the Adset’s budget.

Then, one day later, I was shocked again when checking the interim report. Suddenly, test cell C was getting almost all the impressions, due to one conversion that showed up overnight for the C image. Clearly, anyone using this split-testing feature is just fooling themselves.

***

This is a great example of interesting math that looks good on paper but fails spectacularly in practice. The algorithm driving this crazy behavior is most likely something called a multi-armed bandit, a method whose name comes from the problem of a gambler choosing among slot machines in a casino. Some academics have recently written many papers arguing that bandit algorithms are suitable for A/B testing. The testing platform in Google Analytics used to do a similar thing; it might still, but I wouldn't know because I avoid that one like the plague as well.

The problem setup is not difficult to understand: in traditional testing as developed by statisticians, you need a certain sample size to be confident that any difference observed between the A and B cells is "statistically significant." The analyst waits for the entire sample to be collected before judging the results. No one wants to wait, especially when the interim results are trending in one's favor. This is as true in business as in medicine. A pharmaceutical company running a clinical trial on a new drug it spent gazillions to develop would love to declare the drug successful based on positive interim results. Why wait for the entire sample when the first part of the sample gives you the answer you want?

So people come up with justifications for why one should stop a test early. They like to call this a game of "exploration versus exploitation." They claim that the statistical way of running testing is too focused on exploration; they claim that there is "lost opportunity" because statistical testing does not "exploit" interim results.

They further claim that the multi-armed bandit algorithms solve this problem by optimally balancing exploration and exploitation (don't shoot me, I am only the messenger). In this setting, they allow the allocation of treatment in the A/B test to change continuously in response to interim results. Those cells with higher interim response rates will be allocated more future testing units while those cells with lower interim response rates will be allocated fewer testing units. The allocation of units to treatment continuously shifts throughout the test.
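To see how such an algorithm behaves, here is a minimal sketch of Thompson sampling, one common bandit allocation rule (my illustration; Facebook has not disclosed its exact algorithm):

```python
# A minimal Thompson-sampling sketch for a two-arm test (my illustration;
# not Facebook's actual algorithm). Each arm's conversion rate gets a Beta
# posterior; each impression goes to the arm whose sampled rate is highest,
# so early lucky conversions pull in traffic.
import random

successes = [0, 0]  # conversions per arm
failures = [0, 0]   # non-converting impressions per arm
TRUE_RATES = [0.01, 0.01]  # both arms identical: nothing to "exploit"

for impression in range(10_000):
    # sample a plausible conversion rate for each arm from its posterior
    draws = [random.betavariate(1 + s, 1 + f)
             for s, f in zip(successes, failures)]
    arm = draws.index(max(draws))  # allocate to the apparent winner
    if random.random() < TRUE_RATES[arm]:
        successes[arm] += 1
    else:
        failures[arm] += 1

print(successes, failures)  # allocation is typically far from 50/50
```

Run it a few times: even though the two arms are identical, the impression counts usually end up lopsided, because early lucky conversions drag the allocation along.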

***

When this paradigm is put into practice, it keeps running into all sorts of problems. One reality is that 80 to 90 percent of all test ideas make no difference, meaning that test version B on average performs just as well as test version A. There is nothing to "exploit." Any attempted exploitation amounts to swimming in the noise.

In practice, many tests using this automated algorithm produce absurd results. As AdEspresso pointed out, the algorithm is overly aggressive in shifting impressions to the current "winner." For my own test, which has very low impressions, it is simply absurd to start changing allocation proportions after one or two days. These shifts are driven by single-digit conversions off a small base of impressions, and the algorithm then swims around in the noise. Because of such aimless and wasteful "exploitation," it would have taken me much, much longer to collect enough samples on the other images to make a definitive call!

***

AdEspresso and others recommend a workaround. Instead of putting the four test images into one campaign, they recommend setting up four campaigns, each with one image, and splitting the advertising budget equally among them.

Since there is only one image in each campaign, you have effectively turned off the algorithm. When you split the budget equally, each campaign will get similar numbers of impressions.

However, this workaround is also flawed. If you can spot what the issue is, say so in the comments!

If you live in the States, and particularly in a blue state, it has been drilled into your head over the last year or two that Hillary Clinton was the overwhelming favorite to win the Presidential election. On the day before the election, when all the major media outlets finalized their "election forecasting models," they unanimously pronounced Clinton the clear winner, with a probability of winning of 70% to 99%. One should not sugarcoat these forecasts; they pointed to a clear-cut Clinton victory, even the least aggressive number, issued by trailblazer Nate Silver of FiveThirtyEight. In fact, on the eve of the election, Twitter was ablaze: Huffington Post threw a grenade at Nate, arguing that his prediction was too soft, not giving Clinton her fair due. Other hit jobs included a post on Daily Kos (link) and a comment from a respected Microsoft researcher on predictions (link).

One of the biggest and quickest stories to emanate from the shocking election result is the supposed de-legitimization of the election forecasting business. Many people have come up to me to mock these forecasters and pronounce the death of the polling business. I will leave the polling business for another day; I don't believe it is going away. The polling business is not the same as the election forecasting business. The two industries are being confounded because the election forecasters keep pointing their fingers at the polls when their big calls fall short.

tl;dr Citizens shouldn't care about this election forecasting business. It's there for the benefit of the politicians. The forecasters have been over-selling these predictive modeling technologies. This is especially true if they are merely aggregating polls. If you can't validate the accuracy of such models, how much time/money are you willing to spend on them? The prediction markets people are quiet but they too have no clothes. Journalists should spend less time writing code, and get on the road and talk to real voters.

***

Nate Silver was the pioneer of this election forecasting business. While some academics had developed models for forecasting elections before him, his FiveThirtyEight blog burst onto the scene and attracted a following. The New York Times took notice and licensed his blog; from that perch, he developed a mass-market brand. Eventually, he jumped ship, landing at ESPN, which funded an expanded data journalism venture that included sports forecasting, among other endeavors. (Disclosure: I have written features for FiveThirtyEight.)

After Nate moved out, the New York Times filled the void with a competing blog called The Upshot, where Nate Cohn took over the election forecasting beat. In the meantime, other outlets, such as Huffington Post and the Princeton Election Consortium (run by Princeton neuroscientist Sam Wang), jumped on the bandwagon, developing their own takes on forecasting elections. For the past year or so, you couldn't visit any of these sites without having the latest election odds glaring down at you.

So much happened so quickly that one may not realize there is precious little history here. Nate Silver's reputation rests on calling 49 out of 50 states correctly in the 2008 election, and then 50 out of 50 states in the 2012 election. Most of the other forecasters have only one election under their belts.

Therein lies the first problem: the election forecasting business has been dramatically oversold to the public. And the political forecasters are not alone; venture capitalists and the technology press have invested out-sized attention and untold dollars in the so-called predictive analytics "revolution," creating the myth that "big data" allow us to predict almost anything.

I have often reminded readers in my blog and books that all such models make errors, and frequently, errors that are significant and material. I have been alarmed by the lack of data to support the purported magic of these predictive models. Most articles that glorify this industry are heavy on hearsay and light on scientific evidence.

The 2016 election presented an emperor-has-no-clothes moment. Recall that the election forecasting business was built on top of (almost) 50-out-of-50 track records. In this election, most models got six or seven states wrong. By that count, their performance was roughly 44 out of 50 states. That might sound like an A- (88%). It is reasonable only if one ignores that the minimum grade given out is a B+.

We have been using the wrong metric (scale) all along. Everyone acknowledges that only about 10 states are truly competitive ("swing states"). The other night, New York was called for Clinton almost immediately after the polls closed, with about 3,000 votes counted. There is no glory in calling New York correctly. When put on the right scale, Nate called 10 of 10 in 2012 and about 4 of 10 in 2016. Oops.
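The arithmetic of the rescaling, using round numbers:

```python
# Rescaling the 2016 track record (round numbers, for illustration)
all_states = 44 / 50  # six misses out of 50 states looks like an A-
swing_only = 4 / 10   # but the misses were concentrated in ~10 swing states
print(f"{all_states:.0%} vs {swing_only:.0%}")  # 88% vs 40%
```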

The election forecasters prefer that we don't tally things up this way, although they didn't complain when supporters previously cited the 50-out-of-50 statistic. The reason they provide probabilistic forecasts is that no one can be certain of an election outcome. That is a nice soundbite, but the actions of Huffington Post and Daily Kos, among others, in calling out Nate Silver on the eve of the election suggest that they had become over-confident in their forecasting skill. They started to believe their own hype.

***

Probabilistic forecasts are very difficult to validate, especially for an event that happens only once every four years. By definition, swing states have close contests, with both parties roughly splitting the votes. Much larger samples are required to validate such calls.
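One standard yardstick for probabilistic forecasts is the Brier score, the average squared gap between the stated probability and the outcome. The sketch below uses made-up numbers (not anyone's published forecasts) to show how little data a single election provides:

```python
# Brier score on a handful of made-up swing-state forecasts (not actual
# published numbers). With so few binary outcomes, lucky and skillful
# forecasters are hard to tell apart.
probs = [0.70, 0.65, 0.55, 0.60, 0.75]  # forecast P(candidate wins state)
outcomes = [0, 0, 1, 0, 1]              # 1 = candidate won the state

brier = sum((p - o) ** 2 for p, o in zip(probs, outcomes)) / len(probs)
print(round(brier, 3))  # lower is better; always saying 50% scores 0.25
```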

Even for Nate Silver, we have a track record of only three elections, not enough to confirm his forecasting skill. The one big miss doesn't doom his work, just as the 2012 grand slam didn't make him a genius. However, I really liked his final post before the 2016 election, laying out the various factors that could upset his forecast. It is through this type of writing that many experts gained respect for his work.

For anyone invested in Big Data forecasting, you should ask yourself whether it is possible to measure the performance of the forecasting models. The U.S. Presidential election has a simple, (essentially) binary outcome, and that is already easier to validate, compared to many other domains. The other prominent failure - Google Flu Trends - also has the characteristic that a ground truth is available for a proper evaluation.

Take the Get Out the Vote predictions (another topic for a different day): if, heeding a model's prediction, Clinton never visited Wisconsin and then lost the state, does that show the prediction that going to Wisconsin would yield no benefit was wrong? Since she did not visit Wisconsin, we cannot know what would have happened if she had gone there! The world is filled with similar situations; most predictions are difficult to evaluate.

If you cannot properly measure the performance of a prediction model, how much money/time are you willing to invest in it?

***

As the following graph from Andrew Gelman shows, the polling errors at the state level were not that egregious, amounting to an average error of about two percent.

However, the errors are not evenly distributed across states. They are concentrated in red states, and they all erred in the same direction: the polls in red states consistently under-estimated the Trump vote.

This type of error is called "bias". Something systematic was skewing those red-state polls. It could be that Trump supporters tend not to respond to polls, perhaps out of distrust. It could be that women who intended to vote for Trump did not want to say it publicly - not that far-fetched if you recall Madeleine Albright's special reservation in hell for them. As Nate Silver pointed out, there were enough undecided voters to move the needle. The pollsters will be dissecting their sample populations to find the source(s) of the under-estimation.

Some people argue that one's faith in forecasting models should not be shaken, since the shocking Trump triumph is a scenario predicted by these models, described by the other side of the 70% or 90% probabilities. That's another way of saying all forecasts come with a margin of error. I have two issues with this argument. All of the media outlets presented their predictions as a single number, typically with the spurious precision of one decimal place. The uncertainty around their predictions was swept under the rug. And some forecasters were so smitten with their own confidence that they started flame wars on the eve of the election, faulting Nate Silver for not being sure enough!

The margin of error is supposed to capture polling variability. If variability were the issue, and not bias, the Gelman chart above would show a different pattern: we would expect the errors to be spread out between red and blue states, and to fall both above and below the diagonal.
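A quick simulation shows the difference between the two patterns (made-up numbers, purely illustrative):

```python
# Simulated state-level polling errors (made-up numbers, for illustration).
# Variability alone scatters errors symmetrically around zero;
# bias gives one group of states errors that share the same sign.
import random

random.seed(1)
noise_only = [random.gauss(0, 2) for _ in range(25)]   # unbiased polls
red_biased = [random.gauss(-2, 2) for _ in range(25)]  # systematically missing Trump votes

print(sum(noise_only) / 25)  # near 0: errors cancel out
print(sum(red_biased) / 25)  # near -2: errors pile up on one side
```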

***

The election forecasters also tell us everything is fine; they just need better data. Thus they deflect questions to the pollsters. The forecasters contend that their models amount to a sophisticated way of aggregating poll results; there is not much they can do about biased polls.

This leads to the critical question of the moment: why do we need an election forecasting business?

Having election forecasts does not advance our democracy. A citizen does not need to know the probability of Clinton or Trump winning to decide how he or she should vote. A citizen does not need to know how the prospect of a candidate's victory swings up or down with each passing poll weeks and months before the election takes place. If 70% or 99% were the wrong numbers to publish, what should have been the right ones - can we even answer this question after the fact?

If the forecasters are not "unskewing" the polls, and are merely aggregating them, what is their value add?

The answer may be political navel-gazing as a form of entertainment. The forecasters generate fodder for banter, such as which states are critical and what the potential paths to victory are. Maybe the following of Nate Silver and his imitators will stick around.

This election forecasting business is much more important to the politicians than to the rest of us. It helps them gauge their momentum, allocate resources, target their outreach, tailor their messaging, rally their troops, and so on. For all these reasons, they need data. They need quality data, which comes from repetitive polls, and smart analyses, including unskewing. They want our data.

This is one of the unspoken truths of the data business. Many entities want our data. They find ways to get us to hand it over, usually for free. Trickery and coercion are two popular strategies. Then they make a profit from this data. In some cases, the data benefit us directly, but in many cases, the data enrich them, and sometimes, the data we give up end up hurting us.

I don't think the election forecasting business hurts us but it isn't helping us either. This computing-intensive business is keeping people in front of their computers. Instead, the journalists should be criss-crossing the country, interviewing real voters, investigating, taking us beyond the talking points pumped out by the two parties.

***

One group of prognosticators is conspicuously silent this election cycle, probably lying low, hoping the storm will pass. We are talking about the "prediction markets" people. You know, the "wisdom of the crowds." The people who disparage "experts" and eulogize the "marketplace" where people bid real money. Where are the grandiose claims that these prediction markets can predict almost anything better?

Yes, here it is. This comes from the Election Betting Odds site, which aggregates data from the Betfair marketplace: