
I've talked about "fake data" before. A lot of fake data come from people trying to game algorithms or skew metrics, and oftentimes, automated bots are involved. Attempts to obscure these tactics typically involve creating layers of complexity so it is not easy to connect the dots.

I come across suspicious data all the time, and it's not always clear what's going on or why. So I thought I'd feature some of these cases here and see if anyone can figure them out.

***

This account on Yelp caught my attention because the user apparently uploaded a photo of her desktop to one of the restaurant pages. She has uploaded a total of four photos, all of them unrelated to food. The four photos were uploaded to two New York City restaurants, while her profile indicates that she lives in San Francisco. She did not review either of those NYC restaurants, but she has written one review, for a cafe in Long Island City. The review seems genuine (although it's hard to tell unless you've been to that cafe).

She has five friends. While she lives in San Francisco, these friends live in Manhattan, Brooklyn, Scottsdale and Oceanside. She has no friends in the Bay Area. None of these friends has ever written a single review, and none has any likes. However, each of them has 100-400 friends. It's not clear why one would befriend someone on Yelp who has no reviews or likes.

***

Is this account fake? If so, why was it created? How did those photos get uploaded? How did they get placed in those particular restaurants? Who are these friends? Are they fake as well? If the account is fake, was that review also fake? Is it possible to predict that the review is fake?

So many questions, and so hard to get answers. What do you think is going on?

Just recently, I made a short clip about how Grubhub is extracting lead-generation fees from unsuspecting restaurant owners by setting up a network of shadow websites (and phone numbers). Diners who thought they were ordering directly from restaurants were instead shuffled through the Grubhub toll booth. Click here to see how this works.

Now, Vice discovered yet another of Grubhub's toll booths. This time, the toll booth is set up on the Yelp superhighway. When a Yelp user clicks on the restaurant's phone number on its Yelp page, a pop-up shows up with two options, one of which redirects the user to a shadow number owned by Grubhub. The options are not labelled "direct" and "Grubhub" but "general questions" and "deliveries and orders". So now both Yelp and Grubhub are making lead-generation money off the restaurant owner.

***

This dispute is about causality! And because causality is tough to establish, it creates a gray zone of disagreement.

Ideally, the restaurant owner pays for orders caused by Grubhub's marketing activities. By cause, we mean the orders would not have materialized without Grubhub's marketing. In the examples we saw, it's highly unlikely that Grubhub did anything to cause those orders. That's because those diners thought they were ordering directly from the restaurants - they were quietly re-routed to the shadow websites and phone numbers set up by Grubhub, sometimes without even the restaurants' knowledge.

It is the secrecy that gives the game away. If Grubhub's causal value were clear, it could form open partnerships with the restaurants without resorting to trickery.

As I pointed out in the video, the majority of the digital marketing industry relies on similar tactics. Search engines do not typically send you directly to the webpage you clicked on - you are often rerouted through the search engine's server, so that the search engine can "track" the click and use the paper trail to receive lead-generation money. Anyone passing through this toll booth is counted as "causal" but in reality, many of these users would have found their way to the webpage even if the search engine didn't exist. (Consider going directly to Macys.com instead of typing Macys into a search engine.)
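To make the overcounting concrete, here is a toy simulation in Python. The 80% share of click-throughs who were heading to the brand anyway is an invented number, purely for illustration:

```python
import random

# Toy model: every visitor who passes through the toll booth is billed as
# a "caused" lead, but many would have reached the brand regardless.
random.seed(1)

N = 10_000                 # visitors arriving via the tracked link
p_would_come_anyway = 0.8  # assumed share who intended to visit anyway

would_come_anyway = sum(random.random() < p_would_come_anyway for _ in range(N))

billed_leads = N                           # what the toll booth charges for
incremental_leads = N - would_come_anyway  # what its marketing actually caused

print(f"billed leads:       {billed_leads}")
print(f"incremental leads:  {incremental_leads}")
print(f"overbilling factor: {billed_leads / incremental_leads:.1f}x")
```

Under these made-up numbers, the toll booth bills roughly five times as many leads as it actually causes.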

***

This is a great example of how data, algorithms and software are silently running our lives, often to the detriment of those who don't understand what's going on. Our video series is a small effort to help you stay ahead of these data-driven technologies.

The more you know, the more you can leverage their power and avoid their harms.

In my new 3-minute video, I discuss the controversy about the revelation that Grubhub has been playing a dirty trick on its restaurant customers.

As I explain in the video, what Grubhub did is similar to what Google has been doing with its search engine.

Most of these online businesses make money off "lead generation" - they bring prospective customers to brands. Restaurant owners hope Grubhub will bring them additional diners. These have to be incremental diners to justify the expense, which is a sizable 20-percent fee Grubhub assesses for each lead.

Grubhub has every incentive to count each order as an incremental lead. The dirty trick involves counting orders that should not be charged the lead-generation fee. Grubhub sets up shadow restaurant websites with similar URLs that pretend to be official websites. The diner usually does not know s/he is ordering from Grubhub's website and not the official restaurant website. This bit of deception qualifies the order for the lead-generation fee - if the same diner had ordered from the restaurant's own website, Grubhub would have collected a much lower fee.

Similarly, lots of us search for brand names on Google, instead of going directly to the brands' websites. This leads some brand managers to buy their own brand names from Google. Google over the years has made it harder and harder to see which result is a paid ad. If we click on those ads, the brands pay Google a lead generation fee. Nevertheless, if we are searching for a brand's name, we will be visiting that brand's website, whether or not we pass through the Google toll booth!

In Numbersense (link), I talk about the importance of counterfactual thinking, and measuring incremental rather than absolute metrics.

Seamless, the online restaurant delivery service, has been running a series of fun, statistics-themed ads on the New York subway. Here is a snapshot of one of them:

The text on the ad says:

The Most Potassium-Rich Neighborhood

MURRAY HILL

Based on the Number of Banana Orders

No One’s Cramping Here

***

This ad is tongue-in-cheek. But it's making a data-driven argument. So I started unpacking it.

The conclusion is “No one’s cramping here (in Murray Hill).” It’s an exaggeration so I’m going to read this as “Most people don’t cramp here in Murray Hill.”

The data behind this conclusion is much harder to nail down. One would think it should be the proportion of orders containing bananas in Murray Hill relative to the same in other neighborhoods. The ad uses the phrase “number of banana orders.” What does that mean? Is it “orders with at least one banana”? Or “orders of bananas only”? Or “total number of bananas ordered (across all orders)”?

Between the data and the conclusion is a long, winding path. Let me draw this out:

Assumption 1: All the neighborhoods have similar total populations, so that by proportion of banana orders, Murray Hill also ranks #1.

Assumption 2: "Banana orders" is defined meaningfully. For the sake of argument, we'll assume a banana order is an order that contains at least one banana.

Assumption 3: The data analyst used the appropriate address data. For the sake of argument, we'll assume that the delivery address is the source of the neighborhood data.

Assumption 4: Everyone who has a "banana order" through Seamless lives in the neighborhood to which the banana(s) were delivered. This further requires

Assumption 5: Everyone who has a "banana order" through Seamless works in the same neighborhood as they live. This distinction is important for daytime orders.

Assumption 6: Murray Hill residents who have a "banana order" through Seamless are just like other Murray Hill residents.

Assumption 7: The name on each "banana order" belongs to the one person who consumes the banana(s). No dogs ate the bananas, nor did a co-worker, family member, or anyone else not known to Seamless.

Assumption 8: Published scientific reports reach a strong consensus on the effect of bananas on cramping (highly unlikely); or, Seamless data show that those with a "banana order" report the absence of cramps (which requires primary research). The causal interpretation further requires

Assumption 9: Knowing that the people who made "banana orders" through Seamless would have suffered cramps had they not ordered and consumed those bananas. This counterfactual scenario is never observed, so instead, we accept

Assumption 9b: Knowing that the people who did not make a "banana order" through Seamless did suffer cramps. This requires

Assumption 10: The people who live in Murray Hill and did not make a "banana order" through Seamless also did not order bananas from a different shop, or otherwise consume bananas. In addition, we require

Assumption 11: No one who is part of this analysis benefited from any other anti-cramping remedy; or at the minimum,

Assumption 12: People who have "banana orders" through Seamless, and those who don't, are equally likely to have used other forms of anti-cramping remedy.

Assumption 13: One banana is effective at stopping cramps, meaning there is no dose-response effect, the presence of which would require us to define "banana order" differently under Assumption 2.

The above assumptions fall into three groups: obviously false (e.g. Assumption 1); possibly true; and most likely true. The probability of the conclusion depends on the probabilities of these individual assumptions.
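To see how quickly the conclusion's credibility erodes, here is a back-of-envelope sketch. If (simplifying greatly) the assumptions were independent, and the conclusion requires all of them to hold, then the probability of the conclusion is capped by the product of their individual probabilities. Every number below is invented:

```python
# Invented probabilities for a handful of the assumptions above; the rest
# are elided, but each one would shrink the product further.
assumption_probs = {
    "Assumption 1 (similar populations)":   0.1,  # obviously false -> low
    "Assumption 2 (meaningful definition)": 0.7,
    "Assumption 3 (delivery address used)": 0.9,
    "Assumption 7 (named person ate them)": 0.6,
    "Assumption 8 (bananas stop cramps)":   0.3,
}

p_conclusion = 1.0
for p in assumption_probs.values():
    p_conclusion *= p

print(f"upper bound on P(conclusion): {p_conclusion:.3f}")  # ~0.011
```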

***

tl;dr

Most data-driven arguments consist of one part data and many parts assumptions. An analyst should not fear making assumptions, but each assumption should be supported as well as possible.

Axios has an informative article about obesity, and the various remedies such as exercise, diets, and so on. Their headline is: "Health and wellness are booming, but we're fatter than ever." They have compiled some data, shown in a triplet of graphs:

The problem of obesity is complex, and fascinating from a data perspective. I devoted an entire chapter of Numbersense (link) to issues around measuring obesity.

There is much more underneath the surface than what is presented here. Let me unpack the layers of complexity.

Correlation is not Causation

The simplest issue to explain, if only because statisticians have been screaming about it forever. If you look at the obesity chart and the gym chart, it is entirely accurate to say that gym membership has been rising in lock step with the obesity rate during this decade. Both metrics rose by roughly 20%, so it is very tempting to argue that going to gyms makes you fatter.

Of course, if you draw that conclusion, you've just been disinvited from the party of statisticians.

Ecological Fallacy

Here's the disturbing bit: the charts are also compatible with the opposite conclusion - that gym membership reduces obesity. This is an example of why it's so hard to interpret observational data.

Note that the data analyst collapsed a 2x2 matrix into two aggregate rates. Imagine four types of people: those with or without gym membership, crossed with those who are obese or not obese. When you're aware of the four types, you should realize that the rate of obesity, aggregated across gym membership, is not a great metric. It's pretty obvious that the obesity rate of those who are gym members is lower than that of those who do not have membership. The average rate paints them with the same brush.

In the same way, gym membership, aggregated across obese and not obese people, is not a great metric.

You can reasonably assume that the obesity rate for gym members should be lower than the average obesity rate: for example, if the average is 25%, then perhaps the obesity rate for gym members is 15%.

It's possible that the 15% rate has not changed over time, but if the obesity rate of the non-gym-members increases, the overall obesity rate will increase (note that there are five times as many non-gym-members as there are gym members). The 15% rate for gym members could even have improved while the overall obesity rate still rose to 30% - it just requires the non-gym-members to get even more obese.

When aggregating the rates, some information is lost, and that weakens our ability to draw conclusions about individuals.
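Here is a minimal numeric sketch of that scenario. The rates are invented; the only structural fact is that non-members outnumber members five to one:

```python
# Overall obesity rate as a weighted average of the two subgroups,
# with gym members assumed to be 1/6 of the population.
def overall_rate(member_rate, nonmember_rate, member_share=1/6):
    return member_share * member_rate + (1 - member_share) * nonmember_rate

# Start of the decade: members at 15%, non-members at 27% -> overall 25%
print(overall_rate(0.15, 0.27))   # 0.25

# End of the decade: members *improve* to 13%, non-members worsen to 33.4%
print(overall_rate(0.13, 0.334))  # 0.30 - the aggregate rises anyway
```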

Indirect Metrics

Gym membership is not the same as gym usage. The gym's ability to influence obesity would require usage, not just membership.

CDC Diet Recommendation

The bit about the CDC complaining that people don't consume the recommended levels of fruits and vegetables makes me wonder if their problem formulation is overly simplistic. The dietary guidelines appear to be an optimization of nutritional benefits. But the real problem is to maximize nutritional benefits under a budget constraint. Each item in the basket of recommended foods delivers an amount of benefits at a level of cost. The total cost can't exceed the household budget.

For anyone taking a traditional class on optimization, "the diet problem" is often the first problem discussed. Here is one exposition of the diet problem.
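For the curious, here is a toy instance sketched with scipy's linear-programming solver. The foods, prices, and nutrient values are all invented; the budget-constrained variant described above would flip the objective to maximizing benefit and add a cost ceiling:

```python
from scipy.optimize import linprog

# Minimize food cost subject to nutrient floors.
# Columns: bread, milk, spinach (servings).
cost = [0.5, 1.0, 2.0]  # dollars per serving (invented)

# Nutrients per serving (rows: calories, protein g, vitamin A units),
# negated because linprog expects constraints of the form A_ub @ x <= b_ub.
A_ub = [[-300, -150, -25],
        [-8,   -8,   -3],
        [0,    -10,  -90]]
b_ub = [-2000, -55, -100]  # negated daily minimums

res = linprog(cost, A_ub=A_ub, b_ub=b_ub, bounds=[(0, None)] * 3)
print(res.x)    # servings of each food in the cheapest compliant diet
print(res.fun)  # total daily cost
```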

In the leadup to today's hearing, U.S. Supreme Court nominee Brett Kavanaugh produced what he claimed to be calendar entries from the summer of 1982 as evidence that he did not attend a specific party, and therefore could not have committed the unsavory acts alleged against him.

This story piqued my interest from the data perspective. What kind of information is contained in calendar datasets? And how can such data be used to support or invalidate hypotheses?

This discussion is of great relevance beyond the Kavanaugh case because the majority of the data being collected today - surveillance data, transaction data, clickstream, etc. - are all event data: they record when someone did something.

***

When I started this post, no actual pictures of the most famous calendar of the moment had been published. USA Today did publish a few pictures yesterday. Here is one page to give us a visual:

This post is not about the specifics of Kavanaugh's youthful activities but about the nature of calendar datasets.

A calendar dataset consists of calendar entries. When converting the above image into a spreadsheet, one would create a row for each event (so multiple rows for multiple events on the same day). Each row should contain the following data, laid out in columns:

(a) The date and day of week of an event

(b) The time of the event

(c) The location of the event, address and/or directions to it

(d) Other people involved in the event

(e) Duration of the event

(f) Miscellaneous notes relevant and specific to the event

(g) Other notes unrelated to the event

In industry parlance, the above list is called a “data dictionary.” It describes the structure of the data. Any data analyst who has worked with such things realizes that these definitions are imprecise. If one randomly selects any two calendar datasets, there will be variance in what one actually gets.
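To make that imprecision concrete, here is one way to sketch the schema in code. The field names are my own, and the Optional types flag the columns where missingness is the norm rather than the exception:

```python
from dataclasses import dataclass, field
from datetime import date, time
from typing import Optional

@dataclass
class CalendarEntry:
    event_date: date                        # (a) inferred from page layout
    day_of_week: str                        # (a) derivable from event_date
    event_time: Optional[time] = None       # (b) frequently left blank
    location: Optional[str] = None          # (c) address and/or directions
    attendees: list[str] = field(default_factory=list)  # (d) rarely complete
    duration_hours: Optional[float] = None  # (e)
    event_notes: Optional[str] = None       # (f) specific to the event
    other_notes: Optional[str] = None       # (g) e.g. a basketball score
```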

A lot depends on how the calendar owner uses the calendar! Here are some considerations:

Planner or diary. Most people I know use calendars for planning future events. Kavanaugh appeared to use his also as a diary of sorts. Some entries record past events, e.g. a basketball score. Other entries pertain to future events, e.g. going to a movie. This is a key issue in interpretation: if it's used as a planner, then the data become frozen in time once the event is over, but if it's a diary, the analyst should assume that edits or appends may change the data even after the event date.

Planner entries are tricky because the information may not hold true beyond the date of the event – an event might get cancelled, the calendar owner might decide to skip the event, times or venues or attendees may shift, etc. To prevent misinterpreting the data, the analyst should strive to annotate each entry with whether it was made in planning or diary mode.

How changes are made. If the calendar is used as a diary, the owner may revise the information after the event happened, e.g. correct the list of attendees. Even if the calendar is purely used for planning, edits will be inevitable, e.g. the venue of an event might change before the event happens. How are such changes enacted? Some owners may erase and overwrite; other owners may black out and append; still others may strike out and append. In the first case, we have no trace of the change at all; in the second case, we know something has changed but not the specifics; in the third, we have both the old and the new versions of the information.

How complete is the data? A fallacy is to assume WYSIWYG. There is no guarantee that every event in someone’s life is recorded on the calendar. We can assume that every event the calendar owner wants to record is found on the calendar - but even that assumption is not safe. A few times a month, I forget to put meetings on my own calendar (most of these I still attend; then there are the ones I forget about precisely because they are not recorded). Missing events appear as missing rows in the dataset.

Kavanaugh did not seem to care as much about the time of events. Many entries do not say when he was supposed to do something. So his calendar dataset contains a large number of missing values in the “time of event” column.

How consistent is the data? An analyst can’t even assume that a specific person follows consistent rules when filling out a given field in the data dictionary. Take Kavanaugh’s calendar as an example. On many entries, he did not record the attendees of the event. But on some, he did. When the dataset shows no attendee names for a particular event, does that mean he was alone, or does that mean he chose not to list the attendees? Even if there is a list of attendees, the analyst is hard-pressed to know if the list is complete.

How reliable is the data? Some people are sloppy about details, others are meticulous. Some people correct their entries, others don’t. Some data elements might be deemed not important enough to correct.

How the data was generated. Most of the data will be transcribed (by hand or by optical character recognition software) from the paper calendar to a database. Note, however, some of the data must be inferred. A good example is item (a), the date and day of week of an event. When we read the calendar, we know by the location of the text which date the event is supposed to occur but there is no handwriting of the date! So, in order to generate that column, the analyst must extract (using my own calendar as an example) the year from the cover, the month from the page title and the day from the column name.
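In code, the inference looks something like this (a hypothetical entry; the point is that none of these three numbers is written next to the entry itself):

```python
from datetime import date

year = 1982   # taken from the calendar's cover
month = 6     # taken from the page title, e.g. "June"
day = 17      # taken from the column in which the entry appears

event_date = date(year, month, day)
print(event_date.isoformat())     # the reconstructed date
print(event_date.strftime("%A"))  # the day of week comes along for free
```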

Wait – it gets more complicated. Imagine filling up the space allotted to a given day, so that additional entries have to be written in the margins, with a guiding line pointing back to the day. (Ouch!)

***

I can keep expanding the list of issues but you get the idea. Data is a dirty business. The worst thing that an analyst can do is to presume that the data in front of him/her is (a) complete and (b) accurate.

Now, let’s get to the logic of the Kavanaugh defense. The basic premise is that the calendar did not contain an entry showing that he was at the same party as the accuser, and therefore he was not at such a party, and therefore he could not have committed those alleged acts.

For that logic to hold, we have to believe a further set of assumptions that are not proven by the calendar dataset:

That the calendar contained every party that Kavanaugh attended during those months

That Kavanaugh primarily used the calendar as a diary recording past events, rather than as a planner for future events - that is to say, the information related to each event is true as written, or was corrected after the event

That for the specific event in question, he was in “diary” mode rather than “planning” mode so that the absence of the party can be inferred to mean he wasn't there.

That when he revised entries in his calendar, he never erased past writing.

That none of the blacked-out entries contained relevant information to the current situation.

That he always listed all attendees of parties he went to.

It’s really hard to use data to prove “absence,” and this is no exception.

Any shop that uses modern, digital, connected technologies is probably collecting, storing and selling your data to someone. The people receiving and analyzing the data form a much larger set than those collecting the data. These data analysts typically ingest the data as they are, and write software that controls this or that aspect of our lives. However, such data are riddled with inaccuracies, including bias, which is a form of inaccuracy.

On a trip to Vancouver last week, I encountered the following two scenarios that illustrate the fragility of data collection.

***

I purchased a drink at the Dunkin Donuts store at JFK Airport, and I noticed that the receipt said "Dine In" even though the shop is a simple counter with no tables or seats, which means 100% of orders are take-out.

There can be a number of reasons for this mis-coding:

the store manager doesn't care because it's obvious to anyone who works at the shop that every customer is take-out. Dine-in/take-out isn't a variable of any interest to management at this location since it is invariant.

the software defaults to "dine in" and the staff is too lazy to toggle it for each transaction

the employees may have been told to toggle the setting but they realize that no one cares and so do not follow the instructions

the employees are not trained to toggle the setting

there is a known bug that prevents the setting from being toggled

Now, imagine a data analyst who got a hold of this "data exhaust." This person likely has no expertise in fast food operations. Unlike the store manager or employees, s/he can't tell that when it says "Dine in" in that particular location, it actually means "Take out".

***

I went to a supermarket to buy a beverage for the road. The lady in front of me had only three items, so it should have been a short wait. Or so I thought!

The first item was scanned without issues. The second item was some Anjou pears, one of the most popular varieties in North America. See picture. Surprisingly, the worker didn't know what kind of pears those were, and had to ask around. That took a while.

The last item was a couple of zucchini. The worker scanned the bag, then cancelled the entry. She apparently noticed that the system recorded those as "gray squash $1.18" when it should have said zucchini. So she unfolded a lookup table, found the line for zucchini and re-entered the number. It came out as "gray squash $1.18" again. She cancelled the item again.

The third time was the charm. Maybe she read the wrong line again, but when the screen showed "zucchini," she was satisfied. The price? Exactly $1.18.

Hat #1: I'm the impatient customer waiting in line for my turn. It seems ridiculous that she spent five minutes figuring out the code for zucchini and ended up charging $1.18, the same amount as the "error." She should have just let the mistake live.

Hat #2: I'm the data analyst possibly getting this data as "data exhaust", or perhaps the data analyst who works for this supermarket doing product sales projections. Kudos to this worker! She corrected the error at source, so that when the data flowed through the system, there was no miscoding!

Which side are you on? I have to say, in that moment, I just wanted the line to move faster, and would rather she left the data error in the system.

***

In my OCCAM framework for thinking about contemporary datasets, these are problems related to "adapting" data collected by other people for other purposes. You are never sure whether the data measure what you think they are measuring. Miscoding can be caused by management decisions, laziness, inattention, expediency, etc.

Most importantly, in both these examples, the data would come in looking fine. It's a miscoding, not a typo, not missing. Further, collecting more data does not solve the problem. It may even reinforce existing errors by having more samples of the wrong things.

One of the key differentiators of a good data analyst is whether s/he will diagnose tricky problems like this.
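As a sketch, here is one such diagnostic: flag locations where a field that ought to vary never does. The column names and the tiny dataset are hypothetical:

```python
import pandas as pd

orders = pd.DataFrame({
    "store_id":   ["JFK-01"] * 4 + ["BOS-02"] * 4,
    "order_type": ["Dine In"] * 4
                + ["Dine In", "Take Out", "Take Out", "Dine In"],
})

# Count distinct order_type values per store; a store that only ever logs
# one value is a candidate for a default-setting artifact, not reality.
variety = orders.groupby("store_id")["order_type"].nunique()
print(variety[variety == 1])  # JFK-01 -> 1: suspiciously invariant
```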

Inside elevators in commercial buildings in New York, there are frequently screens that pump out commercials or infomercials all day long. I took a snapshot of this one that showed up last week:

These poll results annoy me in many ways. Let me explain.

The conclusion "free snacks is the key to happiness" is clearly wrong, even if the data were trustworthy (I'll talk about the data in a future post). I call this "xy myopia," or "xyopia" for short. The writer of the poll has narrowly framed the problem: in this xyopic world, happiness hinges on just one factor - whether one's employer offers free food. In the real world, the happiness of an employee depends on a host of factors, of which free food is likely to be a minor one.

Xyopia arises from the reasonable concept of ceteris paribus ("all others equal"). Ceteris paribus is of little value in data analysis - it is often implicitly assumed by the analyst, but the assumption is false, which leads to bad conclusions.

In this case, the analyst just assumes that the two groups - employees getting free food, and those not getting free food - are equal on all other dimensions. So not true! The employer who offers free food is more likely to treat employees better across the board, such as providing more vacation days, better benefits, allowing work from home, etc. Besides, free food is only relevant in office settings (think Starbucks baristas). Certain demographic groups may value free food more than others.

Unless it is established that the respondents who got free food are identical to those who didn't in all other ways, free food emerges as the key factor only because it is the only factor analyzed. This is xyopia.

(Not to mention, "free snacks" is not the same thing as "free food.")
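Here is a toy simulation of that confounding story. In it, free food has no effect whatsoever on happiness; it merely travels together with being a better employer. All numbers are invented:

```python
import random
random.seed(0)

def employee():
    good_employer = random.random() < 0.4
    free_food = random.random() < (0.7 if good_employer else 0.1)
    # Happiness depends only on the employer's overall quality:
    happy = random.random() < (0.67 if good_employer else 0.45)
    return free_food, happy

sample = [employee() for _ in range(100_000)]
for flag in (True, False):
    group = [happy for food, happy in sample if food == flag]
    print(f"free food={flag}: {sum(group) / len(group):.0%} happy")
# Employees with free food still poll as happier - pure confounding.
```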

***

Now, think about what question the poll writer asked that led to the headline "Free snacks is the key to happiness". It appears that they asked a pair of questions: "Does your employer offer free food?" and "Rate your happiness at work". Asking the first question frames your answer to the second question.

By the same logic, I could create my own poll and conclude that paying for personal development is the key to employee happiness.

***

If you extend your vision to the basket of factors that affect employee happiness, such as their pay, benefits, treatment, work-life balance, family life, etc., you'll recognize more problems with the data analysis.

The poll result of 67% versus 56% is a strong result only if the gap is wider than the gaps for other factors, such as the one in the poll I proposed above. I might find that 75% of the employees who get reimbursed for professional development answer extremely or very happy.

Also, consider a poll with three questions:

A) Does your employer provide free food?

B) If the answer to A) is yes, do you consume the free food provided by your employer?

C) Rate your happiness at work

Typing "health benefits" into the search box on Google News gives me a long list of activities and foods that are supposedly good for me. Scanning the headlines, I see fasting, running, garlic, honey, turmeric, amchur, drinking hydrogen peroxide, gardening, low-carb diets, green tea, hummus, etc.

I am a skeptic when it comes to these claims. Here's how I think about them.

Hypothetically, someone – hopefully a nutrition scientist – is making the claim that a substance X will lead to an improvement in health metric Y. The effect of X on Y has a direction and a magnitude. The direction of the effect can be positive or negative while its magnitude is either large or small. So, any effect of X on Y falls into one of four squares: (positive, large), (positive, small), (negative, large), (negative, small).

Most reported effects X on Y tend to be small and positive. As such:

1) We are talking about a small average effect. This is wrongly interpreted as meaning everyone who takes substance X will accrue a small benefit on Y. That's usually not how we obtain a small average effect. A better interpretation is that a small proportion of people who take substance X will accrue a benefit. Most people who take X won't.
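A quick sketch of this point: below, only 4% of the population responds to X at all (an invented figure), yet the reported average effect comes out positive and "small":

```python
import random
random.seed(2)

N = 100_000
# 4% of people get a real benefit of 5 units; everyone else gets nothing.
effects = [5.0 if random.random() < 0.04 else 0.0 for _ in range(N)]

print(f"average effect:    {sum(effects) / N:.2f}")  # ~0.20, i.e. "small"
print(f"share who benefit: {sum(e > 0 for e in effects) / N:.0%}")  # ~4%
```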

You might still argue with me in this way: I understand I am just buying a lottery ticket; the chance of winning is low but what's the harm? This takes me to:

2) When the effect size is small, it's possible that the direction of the effect is wrongly measured. The difference between a small positive effect and a small negative effect is not much! Instead of a benefit Y, you might end up with a harm Z.

The kicker: if a benefit of that small magnitude is enough for you to take seriously, then you should also worry about an effect of equal magnitude in the opposite direction.

Or, you just decide that you don't care about these little effects, which is why I don't pay much attention to those news stories.

Every year or so, another huge food recall makes the news, and it makes me sad.

The latest is the recall of 200 million eggs (link) due to salmonella "risk."

The identification of the cause of such disease outbreaks is a great case study of causal investigations, which I cover in detail in my book, Numbers Rule Your World. In that section, I also raise questions about the (in)sanity of these broad-based food recalls.

How does a food recall save a life?

All of the following have to happen:

Some proportion of the recalled eggs has to be contaminated

The contaminated eggs have to be purchased by consumers

The contaminated eggs have not already been consumed or discarded before the recall (while the investigation was happening; from this epi curve, it looks like the greatest concentration of reported cases occurred in Nov/Dec 2017, four months ago)

Those who bought the eggs learn of the outbreak and subsequent recall

Those who knew of the recall take action to return or dispose of their eggs

Someone would have eaten those tainted eggs if they weren't recalled

When eating the tainted eggs, the consumers do not cook them thoroughly

The immune systems of those people who ate the tainted eggs fail to fight off the illness on their own

The sick person decides to go to a hospital, thus identifying him/herself to CDC

The sick person does not recover, typically due to having some pre-existing conditions that weaken his/her defence

The short version: the saved life has to be someone who would have died from consuming the tainted eggs were they not subject to this massive recall.

Each one of the above steps happens with some probability. For all of these steps to occur together is highly unlikely.
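Here is that chain written out as arithmetic. Every probability below is invented, and the steps blur per-egg and per-person events, but the point survives: a long conjunction of modest probabilities collapses toward zero:

```python
# Step probabilities for one recalled egg (all numbers made up).
steps = {
    "egg is contaminated":        0.001,
    "purchased by a consumer":    0.9,
    "not yet eaten or discarded": 0.3,
    "buyer hears of the recall":  0.5,
    "buyer acts on the recall":   0.5,
    "egg would have been eaten":  0.8,
    "not cooked thoroughly":      0.2,
    "immune system fails":        0.1,
    "case reaches the CDC":       0.3,
    "illness proves fatal":       0.01,
}

p = 1.0
for prob in steps.values():
    p *= prob

print(f"P(recall averts a death, per egg): {p:.1e}")
print(f"expected deaths averted across 200 million eggs: {p * 200e6:.1f}")
```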

The economic fallout of dumping all those eggs is certain. The benefit of the recall in human lives is highly uncertain. So, it's not clear to me that these recalls are reasonable.