Posts categorized "Politics"

Last week, I pointed out that the U.S. payroll survey uses the following accounting rule:

The Labor Department... determines the number of people on nonfarm payrolls by asking employers for a head count during the pay period including the 12th of the month. For many federal employees, that would be the pay period from Jan. 6 to Jan. 19.

Further, one only needs to have worked for one hour during that period to be counted as employed.

This weekend, we heard word that the government shutdown has no end in sight.

On Sunday (Jan 13), we heard more news - that the TSA has decided to give each TSA worker one day's pay plus a one-time $500 bonus. This means the TSA can now report all workers as having been paid during the pay period including the 12th of January.

Yesterday, on the sister blog, I posted a pair of charts that dig into how the unemployment rate is computed. This number has been in the news a lot because we supposedly have a super-tight labor market, the lowest unemployment rate since the peak of the tech boom in 2000! But the story is much more complex if one understands how unemployment levels are counted.

Another reason why counting unemployment is in the news is the government shutdown. The Wall Street Journal has a nice (but rather convoluted) article about what that means for the unemployment number. This is where you can see how politics enters the statistics - it's via rules that determine what gets counted and what doesn't.

***

As I pointed out in Numbersense (link), the Labor Department uses a specific definition of employment, and here is how the WSJ describes it:

The Labor Department... determines the number of people on nonfarm payrolls by asking employers for a head count during the pay period including the 12th of the month. For many federal employees, that would be the pay period from Jan. 6 to Jan. 19.

In other words, someone just needs to have worked at least one hour during that pay period to be considered "employed".

Based on that definition, if the government shutdown extends beyond Jan 19, then there should be a big negative impact on the unemployment rate.

***

But wait - politicians of neither party would want that to happen, especially if they are the ones in power at the moment.

So what do we learn from the WSJ reporter? This:

even if the shutdown extends beyond the pay period, workers would be counted if legislation is enacted that requires agencies to pay employees for what they would have earned had there been no shutdown, a spokesman for the Labor Department’s Bureau of Labor Statistics said. Such practice would be consistent with how the department accounted for workers during previously lengthy shutdowns.

The reason given for this is precedent. But that actually means that the unemployment rate in that scenario would not accurately portray the state of employment in the U.S.

An economist cited in the article further explained that the following could also happen:

some federal employees initially furloughed have been called back to work and some are being paid with supplemental funds.

Since it only takes one hour of work during that specific pay period to be counted as employed, it's not hard to come up with ideas about how to turn these not-working people into employed people.

As a data analyst, there is no getting around learning all these nuances if you want to interpret data properly. It's the nature of our work.

In a nation put together by immigrants, and as an immigrant, it's breathtaking to experience the current climate in which immigrants are demonized. Legal immigration is being targeted in the name of rooting out illegal immigration.

The latest attack is on so-called birthright citizenship. In this article, the author cites Fox News, which ran statistics on "births to unauthorized immigrants" to support abolishing birthright citizenship. Fox News cites Pew but does not link to the source of the data. I traced it back to this 2015 article.

How does anyone count these births? It turns out Pew did not do a survey on this topic. Pew cites surveys run by the Census Bureau, namely, American Community Survey and Current Population Survey.

***

Fox News will soon discover that those statistics will no longer be available due to actions being pushed by the President and Department of Commerce.

You may be aware of the other controversy: the Census Bureau is being forced to add a question on citizenship to the Census. Our Supreme Court just shielded Commerce Secretary Wilbur Ross from answering questions about this decision.

If a citizenship question exists on the survey, unauthorized immigrants will simply refuse to take the survey, or lie. Why would anyone acknowledge their illegal status, and risk being thrown out of the country? Suddenly, the number of unauthorized immigrants reported by these surveys will be severely underestimated, and births to unauthorized immigrants will plunge as a percent of total births. Not what the anti-birthright team wants!

Note that this comment is concerned about the accuracy of the data coming out of the Census, and I am not endorsing illegal immigration.

The media have finally started to write some really nice reports on data sludge. I like this Wall Street Journal article, which opens the black box on the science of secretly reading your emails.

If you use a smartphone, it is very likely that you have agreed to some app's terms and conditions, which allow them to download your emails en masse from one or more of your favorite cloud email providers, such as Gmail, Yahoo! Mail, Outlook, etc.

There was already an infamous example of this that came to light last year. Unroll.me is a service that helps you unsubscribe from unwanted mailing lists. When you set up this service, it requests access to your emails. That is the way they find out which services you can be unsubscribed from. It turns out that Unroll.me isn't really about helping you reduce email clutter - in fact, its main business is mining your inbox for shopping receipts, which can be sold to businesses that - you guessed it - want to sell you more stuff, which - you guessed it - probably means you'll receive more spam, net net. Oops. That was the sound when the company's management learned their little data sludge scheme had gone public.

For the Unroll.me story, I'm linking to this commentary in Venture Beat by someone who slammed Unroll.me management for audaciously claiming that their data sludge scheme was par for the course in the tech industry. This guy actually screamed: "The analogy between Unroll Me and Google or Facebook is audacious. Not to say haughty." He went on to claim that Google and Facebook keep all their data in-house. That was false in 2017, and looked even worse in light of recent revelations about data practices at those two companies.

Google, for example, has allowed, and recently expanded, access by third-party developers to Gmail emails, according to the Wall Street Journal. Just like Facebook, Google has no control over how these third parties use the data. It has some language requiring these developers to agree to certain standards, but those standards are unenforceable, and not enforced.

The WSJ article includes quotes from various participants in this data sludge industry that are false, intentionally or not.

First, they repeatedly claim they don't "read" our emails. Let's do a thought experiment here. I want to know if you are a racist. Unbeknownst to you, I got my hands on all the emails in your Gmail account stretching back 10 years. I write a computer program to look for various keywords like the N word. The program tabulates for me how many times you used each word, which days of the week you tend to say such words, which people you use those words with, the number of variations of each such word you have in your vocabulary, how many of your friends partake in such conversations and how many times they use racist terms, etc. Based on this report, I conclude that you are a racist. I may even conclude that certain friends of yours are also racist. According to various interviewees in the WSJ article, I drew that conclusion without "reading your emails."
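The thought experiment above can be made concrete with a few lines of code. This is a hypothetical sketch (the names, flagged words, and messages are all invented): a program tabulates word usage per sender and produces a behavioral profile, yet at no point does a human "read" any email.

```python
from collections import Counter

# Stand-ins for actual flagged keywords (hypothetical)
FLAGGED_WORDS = {"badword1", "badword2"}

def profile_mailbox(emails):
    """Tabulate flagged-word usage per sender, without displaying any message."""
    counts = Counter()
    for msg in emails:
        for word in msg["body"].lower().split():
            if word in FLAGGED_WORDS:
                counts[(msg["from"], word)] += 1
    return counts

# Invented sample data for illustration
emails = [
    {"from": "alice@example.com", "body": "hello badword1 world"},
    {"from": "alice@example.com", "body": "badword1 again and badword2"},
    {"from": "bob@example.com", "body": "nothing to see here"},
]

report = profile_mailbox(emails)
# report[("alice@example.com", "badword1")] == 2
```

From a report like this, one could "conclude" things about each sender and their correspondents, all while claiming, as the interviewees do, that no emails were read.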

(Lest you think the example is far-fetched, we recently heard that Facebook had tagged thousands of users with the label "treason," a segment which can be purchased by advertisers - or anyone willing to pay for this data.)

Second, the companies interviewed for the article (e.g. Return Path) basically claim that they have only had human beings read emails once or twice. That is simply a lie. You can't build any kind of predictive model without getting intimate with the data. Further, to understand how these models work, you have to review actual cases. Finally, when something unexpected happens, you have to look at the email contents to understand why.

These technologies have some possible benefits. If such benefits outweigh the potential harm, then consumers would gladly adopt them. The data industry should be much more transparent. Transparency ensures that developers maximize benefits while reducing the levels of harm.

***

We've been tracking data sludge for years. For more, read this thread.

Facebook and Google have now been hauled in front of the public, and rightfully reprimanded for their invasive data collection. (Notice the deafening silence of our politicians and government officials on this issue.) In reality, the entire industry has been condoning and abetting these practices. You can follow my Know Your Data series of posts all the way back to 2010, documenting what some of the big names have been doing in taking away every citizen's privacy.

For the author of the article, Google has over 5 gigs of data on him while Facebook has 600 megs. (We are making the assumption that we are being shown everything.)

Some of the key takeaways:

Deleting something merely disconnects you from your data. The data still exist on the corporate servers. (You should have known this when these companies told you that you can come back and restore your old contacts, data, etc.)

The era of "anonymity" is long gone. All of the data are traced to you. Some readers may remember the days when we were told that cookies are harmless files that are not personally identifiable.

Your mistakes, bloopers, etc. are all stored. If you accidentally click on a spam ad, it is stored, and likely used by some algorithm to profile you.

Google seems to have kept not just metadata (titles of images, sent dates and recipients of emails, etc.) but all the data (photos, emails, etc.)

No one cares whether the information about you is accurate or not.

If there is one copy of this data, you can bet there are lots of copies of the data. This is called "redundancy" - you just have to keep multiple copies to recover from inevitable data losses.

At the end of last week, Facebook scrambled to get in front of some unsavory press coverage, by “proactively” suspending Cambridge Analytica – the data analytics outfit credited with the unlikely successes of the Brexit campaign in the U.K. and the Trump campaign in the U.S. – from its social media platform. It knew that The Guardian, and The New York Times were poised to publish critical articles about how Cambridge Analytica exploited the Facebook platform in building its invasive database on 50 million Americans, data that form the foundation for psychological scoring algorithms used to target and sway voters during the 2016 Presidential election.

As explained in my previous posts (here and here), data sleaze is the practice of taking and trading consumer data surreptitiously. The third parties frequently utilize such data in ways that usurp the consumer’s self-interest.

Facebook’s response has led to an avalanche of negative publicity, and there is perhaps some hope that it, as well as other tech firms including Google, Twitter, etc., may finally take action to stop the data sleaze.

I’ve decided to split this post into two parts. Part 1 is about the inner workings of the data-sleaze operation. Part 2, to be published later this week, is a call for industry leaders to take proactive action to curb the excesses of data sleaze. Part 1 explains the underlying technologies because one can never fully trust information coming mostly from conflicted actors.

In a Nutshell, How Our Personal Data Got Taken

Cambridge Analytica has boasted frequently to the media that it has amassed an extensive database of millions of Americans, which it claims is used to predict the psychological states of voters in support of election candidates who want to target and sway likely voters. A key part of this database consists of data gleaned from Facebook accounts. Cambridge Analytica purchased Facebook data from an outfit called Global Scientific Research (GSR), which is run by Aleksandr Kogan, an assistant professor of psychology at the University of Cambridge. GSR is a for-profit entity that he manages, separate from his academic appointment.

Dr. Kogan amassed the data by means of an online survey – a psychometric test similar to the Big 5 Personality Test – advertised on Mechanical Turk, a service run by Amazon. Mechanical Turk workers (the “Turks”) are people willing to do “small tasks” for cents per task. In this case, the small task was completing the psychometric test. With a twist: Dr. Kogan also required the Turks to download an app and connect it to Facebook, and in so doing, grant him access to their Facebook data, as well as their network of Facebook friends.

Facebook data are valuable because they include real names and email addresses (only a minority of users block such snooping). These can then be used as match keys to connect with other sources of data, such as electoral rolls. See here for the kinds of data that Facebook allows partners to obtain.
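The match-key idea can be sketched in a few lines. This is an illustration only (the records and field names below are invented): once two datasets share an identifier such as an email address, records can be joined across sources.

```python
# Invented sample records for illustration
facebook_profiles = [
    {"email": "jane@example.com", "likes": ["gardening", "politics"]},
    {"email": "someone@example.com", "likes": ["cooking"]},
]
electoral_roll = [
    {"email": "jane@example.com", "name": "Jane Doe", "registered": True},
]

# Index the electoral roll by the match key (email address)
roll_by_email = {rec["email"]: rec for rec in electoral_roll}

# Join the two sources wherever the match key lines up
merged = [
    {**profile, **roll_by_email[profile["email"]]}
    for profile in facebook_profiles
    if profile["email"] in roll_by_email
]
```

The output combines what the social network knows (likes, friends) with what the public record knows (name, registration status), which is precisely what makes the match key so valuable.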

Back in 2015 and 2016, the media were already on to this story. Several articles were published by The Intercept and The Guardian. Facebook at the time was slow to react. It wasn’t until 2016 that Facebook lawyers informed Kogan and his associates that they had violated Facebook’s platform policies, and requested that these entities delete the data. Reporters tracked down some of the Facebook data, and a whistle-blower has come forward with documents, so it appears that Facebook, GSR, and Dr. Kogan have all been less than honest about the data sleaze. Indeed, at a hearing in the U.K., Facebook and Cambridge Analytica representatives denied that the political consulting firm had ever obtained or used Facebook data in its work.

The Enablers

The process of data sleaze outlined above is not unique to GSR or Cambridge Analytica. Many other companies in the social media ecosystem rely on variants. Let’s run down the list of enablers in this process.

Facebook – The popular social-media platform is at the center of this controversy, precisely because it has built such a powerful database. This database is hugely valuable to marketers who want to know what we like, and who we know. The social-media company has mastered the art of getting people to share their personal data by providing free, useful services or convenience via the platform. The same machinery that powers Facebook’s billions in revenues is driving data sleaze.

Terms and conditions and privacy policies – In every case including this one, tech firms expressly use terms and conditions as cover for invasion of privacy. They hide behind the façade of “if you don’t agree with our terms, then don’t use our service.” Later, many of these firms devolve to even slyer tactics, such as “if you continue to use our service, we assume you agree with our terms.” It’s a form of blackmail. Very few users read these terms and policies, but the businesses claim with a straight face that they have obtained permission from users to collect their data. Facebook and Cambridge Analytica argued that user permission was properly obtained. Dr. Kogan apparently disclosed to survey respondents that their data could be used for any reason. Because of his affiliation with the University of Cambridge, some of the Turks were misled into thinking they were taking part in an academic study.

Bait and switch – the psychological test is a front for collecting each respondent’s Facebook Graph. The Facebook data contain information about who knows whom. Similarly, every weather app is a front for a detailed database of user locations at all times. In my view, the most important dataset Dr. Kogan wanted is not the results of the psychological testing, as widely reported, but the names and emails of the network of friends and acquaintances of all those who signed up. A few hundred thousand self-selected responses to the survey are not sufficient to create an accurate model of every American’s psychological state.

Data sharing technologies – a typical app delivers a service to users by pulling in various sources of data and integrating them. In order to support real-time sharing of data between app developers and data collectors like Facebook, data collectors set up automated processes by which apps can pull down the data. There are usually costs associated with these interfaces, especially when a sizable amount of data is delivered, which is a source of revenues for the data collectors. It’s hard to control access given that these systems allow automated bots to interface with them. Dr. Kogan, for example, could create both a good bot collecting data for academic research and a bad bot siphoning data to Cambridge Analytica.

Data governance black hole – once data show up in one database, they are bound to show up in many databases – internally as well as at third parties. Once the data reach a third party, Facebook cannot know how many copies are made, or where those copies are. Even within Facebook, with so many employees having access to the data, it is almost impossible to monitor who has copied the data where. Facebook and other social-media outlets have community rules. For example, Facebook has “platform policies” that restrict friends’ data to noncommercial uses such as improving user experience. Talk about unenforceable! Facebook can only know what data have been sent to a third party; it has no way of knowing how the third party is utilizing the data.

Data deletion myth – just as it takes special skills to truly eradicate a file from one’s PC, it is basically impossible to remove all traces of data from existence. We have trouble even counting and locating all copies of a given dataset. Thus, Facebook didn’t even bother to check whether Cambridge Analytica, and selected third parties, truly destroyed the data they were supposed to. Facebook didn’t suspend the controversial company until the media unearthed evidence that the data hadn’t been destroyed.

Mechanical Turks – these bit players were exploited as pawns to sell out their “friends” for mere cents.

Weak regulation and enforcement – Europe might be finally ready to enact laws to regulate the data collection industry but the U.S. government sees no evil.

Anonymity – many businesses trot out this buzzword to justify their data collection operations. To put it bluntly, we are being lied to. Anonymity is declared every time an analyst replaces identifiers (such as emails) with encrypted or scrambled versions of those entities. However, in most cases, lookup tables are available to unmask these users.
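To see why this kind of "anonymity" is so fragile, consider a minimal sketch (the email addresses are invented): identifiers are replaced with hashes, but anyone who can enumerate likely identifiers can rebuild the lookup table and unmask the record.

```python
import hashlib

def pseudonymize(email: str) -> str:
    """The typical 'anonymization': replace an identifier with its hash."""
    return hashlib.sha256(email.encode()).hexdigest()

# A supposedly anonymous record, as it might be sold or shared
anonymized_record = {"user": pseudonymize("jane@example.com"), "purchases": 7}

# An analyst holding any list of candidate emails (a leaked list, a
# marketing database, ...) simply hashes each one and matches:
candidates = ["john@example.com", "jane@example.com", "pat@example.com"]
lookup = {pseudonymize(e): e for e in candidates}

unmasked = lookup.get(anonymized_record["user"])
# unmasked == "jane@example.com"
```

The hash function itself is not the weakness; the weakness is that the space of real-world identifiers is small enough to enumerate, so the scrambling is reversible in practice.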

I am starting the new year with a series of posts: each reviews a key issue covered on this blog in 2017, and I will discuss what might be in store for 2018.

First up is FAKE NEWS.

What happened in 2017

If I had a vote, I’d have chosen “fake news” as the phrase of 2017. Fake news is like that famous non-definition of pornography: you know it when you see it. It is this rare chameleon that pleases both parties equally. First, it was the Clinton supporters who claimed that voters were duped by “fake news” generated by Republicans or related entities. Then, when Trump became President, he seized the phrase, and made it his own – now, liberal-leaning news organizations are the “fake news media” who disseminate false information to the American public.

The term “fake news media” doesn’t make much sense – how does one “fake” a medium? But by adding the third word, the President has shifted the context of the issue from the contents of the news to the creators of the news. He portrays an awfully naïve, black-and-white world in which a media company is either “fake” or not “fake”, where all news reported by an alleged “fake” media company is “fake news.”

The emphasis on authorship runs against the recent current in media consumption. The invention of social media and user-generated content – much of which is written pseudonymously or by people without name recognition – has dulled our awareness and appreciation of who’s doing the writing. When we consume Yelp or Amazon reviews, we barely register who authored them. Mainstream name-brand media were withering before the President ironically saved them from the quicksand of online anonymity.

Suddenly, the technology giants also found themselves in the spotlight – for their alleged role in spreading “fake news.” In particular, Facebook and Google act as gatekeepers of online content. As startups, they cultivated a brand image of being objective, non-commercial, judgement-free “organizers” of information.

The “fake news” controversy has spotlighted that in later life, Google and Facebook morphed into gigantic advertising agencies, reliant on earning advertising revenues from commercial partners. The “eyeballs” of their users are the new oil. They must keep users coming back and sticking around, like drug addicts.

Advertisers want eyeballs, and Google and Facebook are extremely efficient at delivering them, through a number of tools such as “collaborative filtering” algorithms that control what individual users see when they search the Web or open their social media accounts, and A/B tests used to discover the conditions under which users interact with advertising messages. Needless to say, a lot of advertising messages can properly be called “fake news.”

Machine-learning algorithms fundamentally measure “success” based on view counts, which can easily amplify “fake news” through a well-known echo-chamber effect. Additionally, an army of “optimization experts” is available for hire to manipulate the output of those algorithms. The various futile attempts to “get rid of fake news” have awakened users to the amount of control Facebook, Google and similar companies exercise over one’s reading materials.

What to expect in 2018

The controversy over “fake news” and “fake news media” will not end in 2018. The President will continue to call out anyone daring to disagree with him as “fake news media.” Facebook and Google will roll out and roll back various “solutions” to combat “fake news” but no silver bullet will be discovered. The two tech giants will remain gargantuan advertising agencies seeking to monetize our attention: they won’t find enough paying customers to wean themselves off advertisers.

The good news: the failed attempts to control “fake news” will teach us a lot about the nature of the search/discovery algorithms that control our online lives. Developers of these algorithms will benefit greatly from the increased scrutiny and the need to pry open the black box. We will appreciate that algorithms are neither objective nor judgment-free.

Three dead bodies will be resurrected: fact-checking, authorship, and expertise, in that order. Media companies will come under pressure to reinvest in fact-checking; authorship will become a key weapon in the fight against lies; and consumers will re-discover the value of experts, hopefully before the end of the year.

Those who have read my books or taken my courses would not be the least bit surprised by the news that just came out exposing Chicago's much-ballyhooed predictive algorithm for child abuse as a naked emperor. The Chicago Tribune has the story (link).

Key points:

Data scientists (over-)sold an algorithm to predict children "at risk for serious injury or death."

The director of the government agency that purchased this product concluded after two years that "[the model] isn't predicting much."

At least $366K was spent, i.e. wasted, on this program. This number most likely does not account for the costs of support staff and infrastructure, or the actual waste described below.

"More than 4,100 Illinois children were assigned a 90 percent or greater probability of death or injury. And 369 youngsters, all under age 9, got a 100 percent chance of death or serious injury in the next two years" [a false positive problem - not to bring up a sore subject but several infamous outlets proclaimed that Hillary Clinton would win the Presidency with 90% or higher probability!]

Selling the dream: "If it is possible to use big data to spotlight a child in trouble and intervene before he or she is hurt, then doing so is government's moral obligation, advocates for the technology say."

The algorithm, despite being paid for by public funds, is hidden from public view but the Tribune report pieces this together: "Eckerd [the vendor] retrospectively analyzes thousands of closed abuse cases and from them draws data points that are highly correlated with serious harm. The parents' ages could be a factor — or their previous criminal records, evidence of substance abuse in the home, or the presence of a new boyfriend or girlfriend."

The story does not invalidate the practice of predictive analytics but points out pervasive problems within the industry. Most importantly, we are over-hyping this technology. I cover the statistics of such models in depth in the marketing chapters of Numbersense (link). Predictive models are impressive in a relative sense (in marketing, we use the term "lift") but not usually in an absolute sense. Every social-science-based predictive model I have seen misses a sizeable number of targets while falsely labeling lots of non-targets as "cases".

When a vendor does not report on accuracy metrics, or when such metrics are not interpretable in practice, or when the metrics are generated by third parties, or when metrics are computed purely from historical data, one has to be very careful about separating the good products from the scams.
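The gap between relative and absolute performance is worth making concrete. Here is a back-of-the-envelope sketch with made-up numbers (the base rate, sensitivity, and false-positive rate below are purely illustrative, not taken from the Illinois program): when the outcome is rare, even a model with strong lift produces mostly false alarms.

```python
# All numbers are invented for illustration
base_rate = 0.002           # 0.2% of cases are truly at risk
sensitivity = 0.80          # the model flags 80% of the true cases...
false_positive_rate = 0.05  # ...but also 5% of the safe cases

# Among all flagged cases, how many are real?
true_pos = base_rate * sensitivity
false_pos = (1 - base_rate) * false_positive_rate
precision = true_pos / (true_pos + false_pos)  # share of flags that are real
lift = precision / base_rate                   # improvement over random guessing

# precision is roughly 0.03: over 96% of flagged cases are false alarms,
# even though a lift of roughly 15x makes the model look impressive
# in relative terms.
```

This is why a vendor quoting only relative metrics (or no metrics at all) tells you almost nothing about how many false positives the model generates in practice.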

***

It bears repeating that the "big data" used in this type of analysis has multiple complications which I collectively describe as OCCAM.

Observational - we cannot run an experiment and assign people to have criminal records, or other characteristics, at random, so it is tough to get any read on causal mechanisms. Thus, there is a good chance of suffering from spurious correlations.

Seemingly Complete - but not really. Many cases are unreported and not in the training data. One of the excuses used by Eckerd when confronted with inaccuracy is that certain children are not even scored because they do not appear in the data. That is technically correct but does not change the fact that the model did not predict those deaths.

No Controls - many predictive algorithms do not use controls explicitly. Another issue is that the possible controls all come from cases reported to the authorities. Without a doubt, if we take the profiles of the at-risk children as determined by the algorithm, we will find many other children in the general population that do not show up in the database at all.

Adapted - the data collection process was not designed specifically to support predictive modeling.

Merged - various datasets get merged during the analysis, which introduces errors.

While I appreciate that the Chicago Tribune wrote and published this article, this is yet another media report on predictive modeling using "big data" that does not contain any data or quantified metric of predictive accuracy.

The two chapters most relevant to this post are Chapter 4 in Numbers Rule Your World (link) and Chapter 5 in Numbersense (link).

Seth Stephens-Davidowitz has written a fascinating book calling for social scientists to use data collected by Google or Facebook in their research. This is a controversial issue, and if it weren’t so, it wouldn’t warrant writing a full-length book about it. Google does not release publicly its search data, but provides some pre-processed and aggregated statistics through services such as Google Trends and Adwords. Researchers who use this data do not have control over its collection or processing. (Seth previously worked at Google, and has written some columns for the New York Times.) Facebook does not publish its data either, although social media users have a lower expectation of privacy.

Such big datasets come with a set of knotty problems, which I have previously summarized as OCCAM. In addition to having little control over its origin, the researcher’s purpose typically diverges from that of the data collector. The data is found or observed and not usually experimental. It is often treated as “complete” or essentially complete by the researcher, which is an assumption, not a fact. When merging in other data, the researcher inevitably introduces errors. Here is my previous post about OCCAM datasets (link). It is not surprising that classically trained scientists have reservations about such datasets, especially if they are interested in causal mechanisms. Nevertheless, I agree with Seth that we can make progress on solving these problems if we start taking them seriously.

In writing the book, Seth carried out a number of mini-studies using mostly Google Trends data. Here are 8 things I learned from reading Everybody Lies:

Some people use search engines as confessionals. They type complete sentences like “I am sad.” or open-ended questions like “Is my daughter ugly?”

People assume machines (like the Google search engine) will keep their secrets. For sensitive topics, Google may generate more honest data than surveys. There are many questions asked of Google that I’m sure people won’t pose to a librarian.

Google searches for “Obama” are frequently paired with “kkk” and the “n” word. The prevalence of racist searches does not exhibit a North-South divide – it’s East-West.

As President of Harvard, Larry Summers spent quite a bit of time brainstorming with Economics PhD students on how to beat the stock market using new data. (And they came up empty-handed, or so they say.)

Anthony Weiner got rejected from Stuyvesant High School (the famous NYC public school), missing the cutoff score on the admissions test by one point.

Some economists found that going to Stuyvesant conferred no meaningful benefit to one’s career – at least, this is the case for those who attain a score close to the cutoff in the admissions test.

There are 6,000 searches on Google a year for “how to kill your girlfriend” while there are 400 murders of girlfriends.

“Big data” does not provide any insights that surveys can’t at the aggregate level, so people slice and dice the data to examine “micro” segments, which means they are analyzing a huge collection of small data sets.

The research mentioned in Seth’s book comes primarily out of the Economics discipline, and can be considered in the tradition of Freakonomics (2005). There are examples of natural experiments, as popularized by Steven Levitt. Seth brings the coverage up to the current trends, describing regression discontinuity, field experiments, and other techniques favored by econometricians at the moment. For those interested in what happens after Steven Levitt, this is a good place to start.