Going bananas over this Seamless ad

Seamless, the online restaurant delivery service, has been running a series of fun ads on the New York subway that has a statistics theme. Here is a snapshot of one of them:

The text on the ad says:

The Most Potassium-Rich Neighborhood

MURRAY HILL

Based on the Number of Banana Orders

No One’s Cramping Here

***

This ad is tongue-in-cheek. But it's making a data-driven argument. So I started unpacking it.

The conclusion is “No one’s cramping here (in Murray Hill).” It’s an exaggeration, so I’m going to read this as “Most people don’t cramp here in Murray Hill.”

The data behind this conclusion is much harder to nail down. One would think it should be the proportion of orders containing bananas in Murray Hill relative to the same in other neighborhoods. The ad uses the phrase “number of banana orders.” What does that mean? Is it “orders with at least one banana”? Or “orders of bananas only”? Or “total number of bananas ordered (across all orders)”?
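To see how much the definition matters, here is a minimal sketch that computes each reading of “number of banana orders” on a handful of made-up orders. The order data, the item names, and the `banana_metrics` helper are all hypothetical; nothing here comes from Seamless.

```python
# Hypothetical orders: each order is a list of item names.
# This is made-up illustration data, not Seamless data.
orders = [
    ["banana", "banana", "coffee"],
    ["banana"],
    ["bagel", "coffee"],
    ["banana", "banana", "banana"],
    ["salad"],
]


def banana_metrics(orders):
    """Compute several candidate readings of 'number of banana orders'."""
    # Reading 1: orders with at least one banana.
    with_banana = sum(1 for order in orders if "banana" in order)
    # Reading 2: orders that contain bananas and nothing else.
    banana_only = sum(
        1 for order in orders if order and all(item == "banana" for item in order)
    )
    # Reading 3: total bananas ordered across all orders.
    total_bananas = sum(order.count("banana") for order in orders)
    return {
        "orders_with_banana": with_banana,
        "banana_only_orders": banana_only,
        "total_bananas": total_bananas,
        # The proportion, as opposed to the raw count.
        "share_of_orders_with_banana": with_banana / len(orders),
    }


print(banana_metrics(orders))
```

On this toy data the three readings give 3, 2, and 6, and the share of orders with a banana is 0.6. A neighborhood could rank first on one of these and not on another, which is why the wording on the ad matters.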

Between the data and the conclusion is a long, winding path. Let me draw this out:

Assumption 1: All the neighborhoods have similar total populations so that by proportion of banana orders, Murray Hill also ranks #1.

Assumption 2: “Banana orders” is defined meaningfully. For the sake of argument, we’ll assume a banana order is an order that contains at least one banana.

Assumption 3: The data analyst used the appropriate address data. For the sake of argument, we'll assume that the delivery address is the source of the neighborhood data.

Assumption 4: Everyone who has a “banana order” through Seamless lives in the neighborhood to which the banana(s) were delivered. This further requires

Assumption 5: Everyone who has a “banana order” through Seamless works in the same neighborhood as they live. This distinction is important for daytime orders.

Assumption 6: Murray Hill residents who have a “banana order” through Seamless are just like other Murray Hill residents.

Assumption 7: The name on each “banana order” is the one person who consumes the banana(s). No dogs ate the bananas, nor did a co-worker, family member, or anyone else not known to Seamless.

Assumption 9: Published scientific reports reach a strong consensus on the effect of bananas on cramping (highly unlikely); or, Seamless data show that those with a “banana order” report the absence of cramps (which requires primary research). The causal interpretation further requires

Assumption 10: Knowing that the people who made “banana orders” through Seamless would have suffered cramps had they not ordered and consumed those bananas. This counterfactual scenario is never observed, so instead, we accept

Assumption 10b: Knowing that the people who did not make a “banana order” through Seamless did suffer cramps. This requires

Assumption 11: The people who live in Murray Hill and did not make a “banana order” through Seamless also did not order bananas from a different shop, or otherwise consume bananas. In addition, we require

Assumption 12: No one who is part of this analysis benefited from any other anti-cramping remedy; or at the minimum,

Assumption 13: That people who have “banana orders” through Seamless, and those who don’t, are equally likely to have used other forms of anti-cramping remedy.

Assumption 14: One banana is effective at stopping cramps, meaning there is no dose-response effect, the presence of which would require us to define “banana order” differently under Assumption 2.

The above assumptions fall into three groups: obviously false (e.g. Assumption 1); possibly true; and most likely true. The probability of the conclusion depends on the probabilities of these individual assumptions.
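As a rough illustration of how quickly a chain of assumptions erodes a conclusion, here is a small sketch. It makes a strong simplification of my own: that the assumptions are independent, so the chance that all of them hold is the product of their individual probabilities. The probabilities below are invented for illustration and are not estimates for the Seamless ad.

```python
import math


def joint_probability(probs):
    """Probability that every assumption holds, IF they are independent.

    Independence is an illustrative simplification; real assumptions
    are often correlated.
    """
    return math.prod(probs)


# Hypothetical: ten assumptions, each judged 90% likely to be true.
mostly_plausible = [0.9] * 10
print(round(joint_probability(mostly_plausible), 3))  # about 0.349

# Hypothetical: the same chain with one "obviously false" link at 5%.
one_weak_link = [0.9] * 9 + [0.05]
print(round(joint_probability(one_weak_link), 3))  # about 0.019
```

Even when every link looks fairly safe, the chain as a whole is not, and a single weak assumption drags the whole argument down with it.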

***

tl;dr

Most data-driven arguments consist of one part data, and many parts assumptions. An analyst should not fear making assumptions. Assumptions should be supported as much as possible.