Gillian Brockell at the Washington Post published a heartfelt essay pointing out the inhumanity of algorithmic (i.e. programmatic) advertising on social-media platforms like Facebook and Instagram (owned by Facebook). It's getting some attention in tech circles, which is a good thing. (Thanks to reader Antonio R. for tweeting it to me.)

Here's a tl;dr of her story: she works at the Post as a video editor, and is very active on social media, especially Facebook. Sadly, she recently suffered a stillbirth. She found that Facebook continued to bombard her with advertising for pregnancy products, which kept bringing up sad memories. When she clicked on the "I don't want to see this ad" icon, the advertising did not stop but merely switched gears, now assuming that she needed post-childbirth products.

She's social-media savvy enough to realize that the advertising platforms perfected by Facebook and Google are inhuman, driven by algorithms, and based on the silent, pervasive collection of personal details that are either shared publicly, shared within semi-private communities, or transacted privately. Advertisers rely on this trove of data to auto-pilot their advertising campaigns, most of which define the goal as ad clicks.

***

Gillian is someone who totally buys into the tech industry's "big data" pitch - that the more you share, the more you gain. She writes tags that cue algorithms to send her relevant ads. Presumably, when she was pregnant, she was satisfied with the ads that were then selling her relevant products.

She's mad that the algorithm is not all-knowing, personalized, and omnipotent. She expects Facebook, Instagram, Amazon, etc. to track her every move and optimize her experience just for her. She's angry when they make mistakes.

And, if one reads between the lines, her proposed solution is for the tech industry to be even creepier, gather even more personal data, and be even more personalized. She wants ads, just not the ones she doesn't like. Her conclusion seems to be: more relevant ads are better than no ads at all.

This solution is not radical at all. In fact, it is exactly what tech firms have been doing for the past ten years. The "theory" is: data make ads more relevant, and if ads are not relevant enough, it is because the firms do not have enough personal data. In this sense, Gillian's column is a love letter to the tech industry.

***

The overlooked solution is to serve less-relevant ads, or no ads at all.

In the Charles Duhigg story about Target's pregnancy prediction model (see Numbersense), one of the curious nuggets we learned is that the data scientists deliberately mixed random products in between the pregnancy goods being marketed to the women predicted to be pregnant. The official explanation was to make the brochures appear less creepy.

In the book, I suggested a different explanation for that decision. In a predictive model like that, there are likely to be many times more false positives (i.e. women wrongly predicted to be pregnant and thus sent irrelevant materials) than true positives (i.e. women correctly predicted to be pregnant). I also speculated that many true positives would act like Gillian did - appreciating the pregnancy product ads as relevant rather than creepy. The false positives, however, would likely complain that the pregnancy product ads are irrelevant, maybe even somewhat offensive.
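
To see why the wrong predictions can outnumber the right ones, here is a back-of-the-envelope calculation. All the numbers (base rate, sensitivity, false-positive rate) are invented for illustration - they are not Target's actual figures:

```python
# Hypothetical base-rate arithmetic -- invented numbers, not Target's.
# Suppose 5% of targeted shoppers are actually pregnant, the model catches
# 80% of them (sensitivity), and wrongly flags 10% of everyone else.

n_shoppers = 100_000
base_rate = 0.05        # fraction of shoppers who are pregnant
sensitivity = 0.80      # P(flagged | pregnant)
false_pos_rate = 0.10   # P(flagged | not pregnant)

pregnant = n_shoppers * base_rate
not_pregnant = n_shoppers - pregnant

true_positives = pregnant * sensitivity          # 4,000 correctly flagged
false_positives = not_pregnant * false_pos_rate  # 9,500 wrongly flagged

print(f"true positives:  {true_positives:,.0f}")
print(f"false positives: {false_positives:,.0f}")
print(f"false per true:  {false_positives / true_positives:.1f}")
```

Even this decent-looking model sends pregnancy ads to more than two wrong women for every right one - which is why the reaction of the false positives matters so much.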

Mixing in other products lessens the harm of the wrong predictions - but it simultaneously softens the impact of the correct predictions. What hangs in the balance is consumer interests versus advertisers' business goals.


***

The zeal of the tech industry to sell more servers, more processors, more boxes, more tools has led to a level of over-confidence in the predictive algorithms that underlie Gillian's unsatisfactory experience. Such enthusiasm is matched by advertisers' hunger for more clicks and more sales, even as they ignore the tenuous link between those clicks and sales. As data scientists, we easily get caught up in the adulation and, to use Nassim Taleb's phrase, get fooled by randomness.

It is widely held that predictive algorithms are all-knowing, personalized, and omnipotent. The reality is not so clear-cut. Let's take a deeper look at how these algorithms work; from there, maybe the data-science community can come up with alterations that reduce the chance of dissatisfaction.

***

ALL-KNOWING

Algorithms are not all-knowing, despite our tendency to believe they are. Each algorithm is driven by a fixed set of data inputs. In her article, Gillian suggested some possibilities: the tags on her Instagram posts, the contents of her Facebook posts or her friends' posts, her Google searches, "metadata" on Amazon wishlists, etc.

All of those could be inputs to someone's algorithm, but each item has to be meticulously captured by code, and collecting every additional item requires more coding. How much is collected depends on the culture of the development team: some teams are extremely zealous; others draw the line sooner.

Every algorithm then weighs the importance of its different data inputs. These weights ultimately control the algorithm's actions, and their determination is a guessing game - like deciding which factors should determine who makes the Hall of Fame in sports. No two people agree on those, and neither do any two algorithms.
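
As a toy illustration of how much the weights matter, here are two hypothetical scorers that see the exact same data inputs but weigh them differently. Every signal, weight, and threshold below is invented for the example:

```python
# Two hypothetical ad-targeting scorers that read the same inputs
# but weigh them differently -- and so reach opposite decisions.

signals = {"searched_strollers": 1, "follows_parenting_tags": 1,
           "bought_prenatal_vitamins": 0, "age_25_to_34": 1}

weights_a = {"searched_strollers": 0.60, "follows_parenting_tags": 0.10,
             "bought_prenatal_vitamins": 0.90, "age_25_to_34": 0.05}
weights_b = {"searched_strollers": 0.20, "follows_parenting_tags": 0.50,
             "bought_prenatal_vitamins": 0.50, "age_25_to_34": 0.30}

def score(signals, weights):
    # Simple weighted sum of the observed signals.
    return sum(signals[k] * weights[k] for k in signals)

THRESHOLD = 0.8
for name, w in [("A", weights_a), ("B", weights_b)]:
    s = score(signals, w)
    print(f"algorithm {name}: score={s:.2f}, show pregnancy ads: {s >= THRESHOLD}")
# Algorithm A scores 0.75 (no ads); B scores 1.00 (ads).
# Same person, same data, different guesses about the weights.
```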

***

PERSONALIZED

No algorithm can be truly "personalized", as in one-to-one personalized. One-to-one is incompatible with Big Data. If an algorithm tailored recommendations to you based only on what it knows about you, it would likely make lots of mistakes: data available at the individual level are sparse and riddled with holes.

All algorithms leverage statistical averages. It's much easier to predict what movies the average teenager would watch than the preferences of each particular teenager. Most algorithms work like this: if a teenager discloses all the movies that s/he watched in the past, then that personal history forms the basis for the recommendations, which can be quite accurate; but for most people, that level of detail is not available, so the algorithm falls back on statistical averages - what the average teenager watches.
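
A minimal sketch of that fallback logic, with made-up titles and an arbitrary history threshold:

```python
# Sketch of the fallback: use personal history when it is rich enough,
# otherwise recommend what the "average teenager" watches.
# Titles and the threshold are invented for illustration.

population_favorites = ["Spider-Man", "Avengers", "Frozen"]  # the statistical average

def personalized_picks(history):
    # Stand-in for a real model; here we just echo the most recent tastes.
    return history[-3:]

def recommend(personal_history, min_history=20):
    if len(personal_history) >= min_history:
        # Enough individual data: tailor to this person's own record.
        return personalized_picks(personal_history)
    # Sparse individual data: fall back on the population average.
    return population_favorites

print(recommend(["Lady Bird", "Dune"]))              # sparse history -> averages
print(recommend([f"movie {i}" for i in range(25)]))  # rich history -> personal
```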

Even within individuals, taste changes over time. So even for the data-hoarding teen, the recommendation derives from a "moving" average of his or her past record - in other words, the average over time of his or her past viewings. Because of this, the longer the history, the less sensitive the algorithm is to recent data. The GPA is a good analogy: it's much harder to move your GPA in your senior year than in your sophomore year because of the accumulation of grades.

(PS. This is a call for the industry to take up the issue of the right to be forgotten. Deleting old data not only pays respect to your users but also removes a source of error from these algorithms!)
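
The GPA analogy can be made concrete with a little arithmetic. In the toy calculation below (invented credit counts and grades), the same strong semester barely moves a long record - and dropping old data, as a right-to-be-forgotten policy would, restores the sensitivity:

```python
# Toy GPA arithmetic: the identical 4.0 semester moves a short record
# far more than a long one, because it is averaged into more history.

def gpa_after(prior_gpa, prior_credits, new_gpa, new_credits):
    total_points = prior_gpa * prior_credits + new_gpa * new_credits
    return total_points / (prior_credits + new_credits)

# A sophomore (30 credits) and a senior (105 credits), both sitting at 3.0,
# both earn a 4.0 across a new 15-credit semester.
print(f"sophomore: {gpa_after(3.0, 30, 4.0, 15):.2f}")   # 3.33 -- a real jump
print(f"senior:    {gpa_after(3.0, 105, 4.0, 15):.2f}")  # 3.13 -- barely moves
```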

***

OMNIPOTENT

The most damaging myth about predictive algorithms is their supposed omnipotence. Almost every report in mainstream media about predictive algorithms makes assertions such as "Google can predict when you will die", and "IBM can predict who's a good employee". What do they mean by "can predict"? Readers are led to believe that "can predict" means "will predict", or "will predict accurately".

In reality, there is no black-or-white, no can or cannot, no accurate or inaccurate. Accuracy matters, it is much lower than advertised, and it is not a binary state but a progression.
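
One way to see accuracy as a progression: the same model produces different mixes of right and wrong calls depending on where you set its decision threshold. The scores below are simulated, not taken from any real system:

```python
# Accuracy is a dial, not a switch: moving the decision threshold trades
# one kind of error for another. Scores are simulated for illustration.

import random

random.seed(1)
# True positives tend to score higher than negatives, but they overlap.
positives = [random.gauss(0.7, 0.15) for _ in range(1_000)]
negatives = [random.gauss(0.4, 0.15) for _ in range(9_000)]

for threshold in (0.3, 0.5, 0.7, 0.9):
    tp = sum(s >= threshold for s in positives)
    fp = sum(s >= threshold for s in negatives)
    print(f"threshold {threshold}: {tp} true positives, {fp} false positives")
# No single threshold is "accurate" in an absolute sense; each one is a
# different compromise between missed cases and false alarms.
```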

Predictive algorithms have been around for a long, long time. Think weather forecasts. If I say "the weather channel can predict the weather", you do not presume they will predict every day accurately. Some of us might even think, "well, they can barely predict the weather - they have the ability to issue forecasts, but that doesn't mean the forecasts are accurate or useful."

***

There are several other important aspects of predictive algorithms, which I delve into in my two books. For example, accuracy is measured in several ways, and always involves a trade-off between false positives and false negatives. Algorithms also behave according to the incentives of their owners. The chapter on steroid testing, terrorist prediction, and polygraphs in NUMBERS RULE YOUR WORLD (link) tackles this.

Also, in NUMBERSENSE (link), the chapter on Target's model to predict pregnancy walks through how one measures the accuracy of predictive models, and how incentives play a role. Nate Silver's book The Signal and the Noise is also recommended - he has a section discussing the accuracy of weather forecasts, and how they are systematically biased (due to incentives).

***

This past weekend, I found my way to West Lafayette, Indiana, to speak at the Math, Data Science and Industry Conference, organized by math professor Aaron Yip and Drew Swartz. I was very impressed with the quality and diversity of the talks. The organizers struck a nice balance between academic talks and industry talks, and the BS quotient was minimal.

I will first outline my own talk, and then, in a subsequent post, I will highlight some things from the other talks I attended.

The goal of my talk was to paint a broad-brush picture of the scope of jobs that are part of the current Data Revolution, and to give a flavor of the nature of the work, so that graduate students can decide for themselves whether this "data science" industry is a good fit for them.

One key takeaway is the distinction between research jobs and industry jobs. Research jobs lead to innovative work that can be published in scholarly journals. Most industry jobs demand short-term results that impact the business, and it does not matter whether the methods used are innovative. The boom in data jobs, however, is in industry jobs. Only large corporations in cash-rich industries can afford research jobs, and even at those firms, there are hundreds if not thousands of industry jobs for each research position. Math graduates can absolutely get hired for industry positions if they put in a little effort to prepare for this career path.

Within industry jobs, I like to think of three job types.

Data science jobs - these are the headline-catching jobs because they are disproportionately found in the high-tech industry. Think of these as software developers with advanced database skills. The culture here is automation, removing human beings from the process.

Business analytics jobs - these jobs are tethered to business teams, such as marketing, finance, operations and customer service. They are the champions of embedding data analyses in the everyday decision-making processes. They interact constantly with business managers, providing a form of consulting service.

Data IT jobs - these people keep the data flowing in the organization, so to speak. They are also responsible for "data governance" and standardizing the formats, definitions, quality, etc. of the data. This sector is experiencing rip-roaring growth.

There is a huge need for scientific thinkers and data-savvy people in all three job types but at least half the open positions are in "business analytics." I discuss two particular gaps in skills that hiring managers often complain about in university graduates: (a) inability to develop the question and (b) not knowing how to question the data.

There is a clear reason why such gaps exist. A typical question we pose to students in a problem set first lays out the problem to be solved, then presents the set of data to be used, and finally challenges the students to plug the data into an appropriate method or framework so that the solution to the problem drops out.

The professor is not going to look kindly on the student if s/he criticizes or revamps the question or points out flaws in the data! (University classes teach theory, models and frameworks so this is not surprising.)

This brings me full circle to the distinction between research and industry jobs. In research, you can "choose your battles" by making certain assumptions to move past obstacles. For example, you assume that the (biased) dataset that you obtained is representative - lots of research papers that use observed social-media data do this. You just argue that bias correction is a separate problem to be tackled at some other time, perhaps by some other research team.

In industry, you don't have that luxury. A great solution to the biased problem may turn out to be a horrible solution to the unbiased problem. When I was at SiriusXM, we had some data on people's online listening patterns but almost nothing on their in-car listening. Building great models using the online data wasn't going to do much good, because most of the listening happens in the car, and people who listen online are quite different from those who listen in the car.
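
A small simulation of that trap, with entirely made-up numbers (not SiriusXM data): a rule fitted on online listeners looks fine on the sample it came from, then misfires on the in-car population, whose behavior is distributed differently.

```python
# Made-up illustration of training on a biased sample (online listeners)
# and deploying on the population that matters (in-car listeners).

import random

random.seed(0)
# Online listeners: long desk-bound sessions; in-car: short commutes.
online = [random.gauss(60, 10) for _ in range(5_000)]  # minutes per session
in_car = [random.gauss(15, 5) for _ in range(5_000)]

# "Model": call a listener highly engaged if a session exceeds a cutoff
# learned from the online data (here, its 25th percentile).
cutoff = sorted(online)[len(online) // 4]

flagged_online = sum(x > cutoff for x in online) / len(online)
flagged_in_car = sum(x > cutoff for x in in_car) / len(in_car)
print(f"cutoff learned online: {cutoff:.0f} minutes")
print(f"flagged engaged, online sample: {flagged_online:.0%}")  # ~75%
print(f"flagged engaged, in-car sample: {flagged_in_car:.0%}")  # ~0%
```

The rule isn't wrong on its own data; it's wrong for the population it is deployed on - the essence of the biased-sample problem.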

Towards the end of the talk, I pointed out that in order to do well in these data jobs, one must be comfortable living in the "gray" areas. There is the gray between science and social science, between models and heuristics, between data and intuition.

People were very friendly and we had some fine conversations at a bar after the day was over. I'm happy to report that at least a few people have indicated that they want to pursue these industry jobs.