
In a mailing list I subscribe to, some users were not happy about the academic research using Dropbox data collected on academics, as written up by the co-authors in Harvard Business Review (link).

In a nutshell, the researchers obtained "anonymized" "project-folder-related" data from Dropbox on university-affiliated accounts, ran some simplistic bivariate correlations, and proceeded to draw several conclusions about "best practices" for "successful team collaborations." This type of research is very common in this "Big Data" age, and I have already written extensively about its many challenges.

OCCAM Data

This Dropbox dataset has all five of the OCCAM characteristics of "Big Data". It is Observational, seemingly Complete, with no Controls, Adapted from its original use, and Merged (with data from Web of Science). These characteristics cause many problems with the analysis, which I describe below. For more on OCCAM data, check out this post, and my other posts on OCCAM data.

Found Data

Implicit in their analysis - and most other uses of "found data" - is the assumption of complete data. The authors believe that because their data consist of tens of thousands of researchers and 500,000 projects, they must have all the data. In this case, the authors knew that there are other platforms out there but they waved away the inconvenience. This implies they believe they have all of the "informative" data.

The authors also assumed that (a) all relevant collaborative research involves putting all relevant files on one of the major online platforms (e.g. nothing on Slack or in emails) and (b) all project collaborators use one and only one platform. Further, they assume that everyone keeps highly organized and structured folder directories within Dropbox, from which an external person who knows nothing about the projects, or a machine, can infer their contents. These problems arise because the researchers did not start with a research question and design the data collection. They chose to adapt "found data" to their own objectives.

Ecological Fallacy, and Story Time

A typical conclusion is "People at higher-performing universities seemed to share work more equally, based on the frequency of instances that collaborators accessed project folders." Top universities are ranked by the aggregate performance of all teams (that use Dropbox, and are identified correctly). It does not follow that every team at a top university is a top team.
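The ecological fallacy can be made concrete with a small hypothetical sketch (the scores and university names below are invented for illustration, not taken from the study): a university can lead the aggregate ranking while still housing the weakest teams in the data.

```python
# Hypothetical team-level performance scores (higher = better).
# These numbers are made up purely to illustrate the ecological fallacy.
teams = {
    "Univ A": [95, 90, 85, 40, 35],   # strong aggregate, but two weak teams
    "Univ B": [70, 68, 66, 64, 62],   # weaker aggregate, no weak teams
}

averages = {u: sum(s) / len(s) for u, s in teams.items()}
print(averages)   # Univ A: 69.0 beats Univ B: 66.0 in the aggregate...

# ...yet the lowest-scoring team in the entire dataset is at Univ A.
worst = min((score, u) for u, s in teams.items() for score in s)
print(worst)      # (35, 'Univ A')
```

The aggregate ranking ("Univ A is higher-performing") says nothing about any individual team, which is exactly the inference the article's conclusion requires.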

This is an instance of "story time." A piece of data is offered about something related, then while the reader is dozing off, it is linked to a conclusion that is not directly supported by the data. That conclusion is elaborated by a lot of words hypothesizing why it must be true. In this case, they say "It’s likely that more frequent collaborations led to positive spillover of information, insights, and team dynamics from one project to another." But they provide no evidence at all for this last statement. That's just a story.

Proxy Unmasking

Here's another one of the conclusions: "People at higher-performing universities seemed to share work more equally, based on the frequency of instances that collaborators accessed project folders." They draw a conclusion about work allocation between collaborators, but what they actually measured was the relative frequency at which collaborators accessed the project folders in Dropbox. That's a proxy measure: convenient given the "found data", but not a good proxy.

Xyopia

It does not appear that a multiple regression model was run. The presentation apparently walks through a series of bivariate analyses. The word "control" does not appear anywhere in the article. So this work suffers from xyopia: in each analysis, the one explanatory variable being analyzed is presumed to be the chief and only variable influencing the outcome.
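Why bivariate analyses can mislead is easy to demonstrate with a toy example (the numbers below are fabricated for illustration): a pooled slope can be positive even when the relationship is negative within every group, so ignoring a lurking variable reverses the conclusion.

```python
# Toy illustration of xyopia: a bivariate slope flips sign once a
# lurking group variable is accounted for. All data are made up.

def slope(xs, ys):
    """Ordinary least-squares slope of y on x."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var = sum((x - mx) ** 2 for x in xs)
    return cov / var

# Two groups; within each, y DECREASES as x increases.
g1_x, g1_y = [1, 2, 3], [5, 4, 3]
g2_x, g2_y = [6, 7, 8], [10, 9, 8]

# A bivariate analysis that pools the groups sees a POSITIVE slope...
print(slope(g1_x + g2_x, g1_y + g2_y))   # about +0.81

# ...but within each group the slope is -1.0.
print(slope(g1_x, g1_y), slope(g2_x, g2_y))
```

A series of such pooled bivariate correlations, with no controls, is exactly the analytical setup the article appears to rely on.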

Causation Creep, and More Story Time

The authors made no attempt to establish causality at all; they simply interpreted every correlation as causal. So every conclusion is "story time": they print one analysis of the data, then draw a causal conclusion that one would believe only when half-asleep.

***

People are also upset about data privacy.

It does not appear that the academic users understood that using Dropbox means they are part of research studies.

People don't believe the data are truly anonymized, and it's pretty clear that the anonymization can easily be reversed. Take the HBR article itself as an example. Suppose the authors' names were removed but the following descriptions retained: one junior faculty member at Northwestern Univ. Business School, one senior faculty member at Northwestern Univ. Business School, and one employee of Dropbox. I don't think you can find another article that fits those criteria. So is that anonymous?
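Re-identification of this kind is mechanical: if a combination of retained attributes (quasi-identifiers) matches exactly one record, the "anonymization" is undone for that person. A minimal sketch, using the HBR example above plus one invented decoy record:

```python
# Hypothetical "anonymized" author records: names removed, but role and
# affiliation retained as quasi-identifiers. The last record is invented
# as a decoy; the first three mirror the HBR example in the text.
records = [
    {"role": "junior faculty", "affil": "Northwestern Univ. Business School"},
    {"role": "senior faculty", "affil": "Northwestern Univ. Business School"},
    {"role": "employee",       "affil": "Dropbox"},
    {"role": "senior faculty", "affil": "Some Other School"},
]

def matches(query):
    """Records consistent with a set of known attributes."""
    return [r for r in records if all(r.get(k) == v for k, v in query.items())]

hits = matches({"role": "junior faculty",
                "affil": "Northwestern Univ. Business School"})
print(len(hits))   # 1 -> uniquely re-identified despite having no name
```

With rich metadata like project folders and collaborator lists, many records in such a dataset are likely to be unique in exactly this way.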

It's unclear how they "anonymized folders" or analyzed them. Some folders have highly descriptive names; some have partially descriptive names that only the collaborators may be able to decipher; and some have names that do not identify the project at all (e.g. old_work). If they converted all folder names to alphanumeric strings, then all information about the contents of the folders was lost. If they did not convert those names, then there are clear privacy concerns.

It's clear that some kind of IRB review is necessary to approve Big Data research projects to make sure privacy is protected.

Consider this paragraph from a FiveThirtyEight article about the small-schools movement (my italics):

Hanushek calculated the economic value of good and bad teachers, combining the “quality” of a teacher — based on student achievement on tests — with the lifetime earnings of an average American entering the workforce. He found that a very high-performing teacher with a class of 20 students could raise her pupils’ average lifetime earnings by as much as $400,000 compared to an average teacher. A very low-performing teacher, by contrast, could have the opposite effect, reducing her students’ lifetime earnings by $400,000.

If I had told you that students who performed higher on achievement tests have higher average lifetime earnings by as much as $400,000 compared to students who performed average on achievement tests, you'd not be surprised--unless you are a skeptic of achievement tests. This is evidence that achievement test scores predict lifetime earnings.

Now, this is not what the journalist wants you to believe. She said that high-performing teachers "could raise" pupils' average lifetime earnings. Two logical jumps are made in one breath here: the first is the use of student achievement scores as a proxy for "teacher quality"; the second is "causation creep" (i.e. letting a causal interpretation creep onto correlational evidence), which is signaled by the use of weasel words like "could" and "may".

The use of proxy measures is the source of many "statistical lies". One tool I use is "proxy unmasking": substitute the metric actually measured for the proxy metric. In this example, when I see "high-performing teacher," I substitute "high-performing student," since the observed data measured students, not teachers. The sound you hear is air rushing out of the hyperventilated argument.