
It's always frustrating to see media outlets play fast and loose with facts. When I read a headline like "Shake Shack's New Cashier-Free Location Really Does Reduce Wait, So Far" (link), I expect to read some evidence that customers' wait times have decreased. Granted, this is a hard problem because you always need a "control" - something to compare against. And when this is a new restaurant, in a new location, with a new operating mode ("cashierless"), how does one find a comparison?

"Reduced wait" is not a vague assertion; it is a direct claim that the wait time (on average, one presumes) has decreased. What is the evidence the reviewer used to draw this conclusion?

It occupies the same building as IBM's Watson, the computer that won Jeopardy.

These devices do away with the double wait at most branches, whereby there’s a line to place an order and also a knot of customers waiting for their orders to be completed.

At noon on the first day, there were enough terminals that nobody had to wait. [He told readers that there were 8 such terminals.]

it took 15 minutes to get a hot chick’n sandwich, bacon cheese fries, and a bottle of branded water.

Yet, the place was not as crowded as might be suspected, given the hoopla over the new ordering system and seeming perfection of the location.

the lure of getting one with no Madison Square Park line will tempt me to get another one soon.

Those were all the sentences in the report that concern waiting or lines.

So, we learn that there was no wait to put the order in - but is this because the new cashier-free system is more efficient or, as he suggested, because there were few people in the restaurant? Is it a good thing or a bad thing that eight terminals were enough so that "nobody had to wait"? The fact that you have to order at the terminals and then wait for your food - how is that not a "double wait"? Oh - that's because there were fewer than eight people in the order queue.

The biggest sign that this article is a planted advertisement is that the reviewer disclosed in the second paragraph that the restaurant "opened today." A later sentence disclosed that the reviewer was in the store at noon. How is it possible to know whether wait time has increased or decreased on the same day the restaurant opens? Eater does not disclose that this article was paid for, nor does the reviewer disclose whether he was compensated, directly or indirectly, for writing it... but something smells fishy, and it's not the Shake Shack sauce.

***In Chapter 1 of Numbers Rule Your World, I discussed the mathematics and psychology of waiting. In particular, the congestion problem in road networks is impossible to solve: if you add capacity by building a new road, average travel times may drop at first, but more drivers then switch their routes onto the new road, and the congestion comes back. Similarly, if cashier-free restaurants process customers faster, reducing wait times, the shorter waits draw back the very customers who would have gone elsewhere, making the lines longer again.
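Here is a minimal sketch of that feedback loop, using a single service queue and invented numbers (the capacities, base demand, and demand sensitivity are all assumptions for illustration): adding capacity lowers the wait, the lower wait attracts more arrivals, and part of the improvement is eroded.

```python
# Toy model of induced demand: an M/M/1 queue whose arrival rate
# responds to the wait time. All numbers are hypothetical.

def wait_time(arrival_rate, service_rate):
    """Average time in an M/M/1 system, in hours (requires lambda < mu)."""
    assert arrival_rate < service_rate, "queue is unstable"
    return 1.0 / (service_rate - arrival_rate)

def equilibrium_arrivals(service_rate, base_demand=40.0, sensitivity=100.0):
    """Find the arrival rate at which demand and wait are consistent."""
    arrivals = base_demand
    for _ in range(200):
        w = wait_time(arrivals, service_rate)
        # Hypothetical demand curve: shorter waits attract extra customers.
        target = base_demand + sensitivity * max(0.0, 0.25 - w)
        arrivals += 0.1 * (target - arrivals)  # damped update toward consistency
    return arrivals

for mu in (50, 60, 80):  # customers served per hour, i.e., capacity additions
    lam = equilibrium_arrivals(mu)
    print(f"capacity {mu}/hr -> arrivals {lam:.1f}/hr, "
          f"wait {60 * wait_time(lam, mu):.1f} min")
```

In this toy model, waits do fall as capacity grows, but by less than they would if demand stayed fixed, because each improvement recruits new arrivals.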

In the meantime, customers who want to pay cash are turned away. Customers who want receipts have to find a printer elsewhere. Customers whose credit cards are temporarily blocked go hungry. Customers with special needs get no help from those machines. And customers are, in effect, used as free labor to serve themselves.

Here is a problem staring many digital/Web/social media analysts in the face today: what if you are told that the majority of the data you have been dutifully reporting, analyzing and (gasp!) modeling are fake data?

By fake data, I mean useless numbers that have no bearing on reality: visits to websites that never happened, clicks on ads by hired hands, clicks on ads by bots, clicks on ads buried layers deep and invisible to any human, video "views" generated by auto-playing clips, video "views" that last one second, ad reach (i.e., the number of people who have seen the ad) that exceeds Census counts, reviews planted by hired hands, and so on.
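Several of the patterns on that list can be caught with simple screening rules. Below is a minimal sketch assuming a hypothetical event log; the field names, sample events, and thresholds are mine for illustration, not any ad platform's actual schema.

```python
# Rule-based screening of a hypothetical ad event log for fake activity.

events = [
    {"type": "video_view", "duration_sec": 1,  "autoplay": True,  "user_agent": "Mozilla/5.0"},
    {"type": "ad_click",   "duration_sec": 0,  "autoplay": False, "user_agent": "python-requests/2.31"},
    {"type": "video_view", "duration_sec": 45, "autoplay": False, "user_agent": "Mozilla/5.0"},
]

BOT_AGENT_HINTS = ("bot", "spider", "crawl", "python-requests")  # crude heuristic

def looks_fake(event):
    # Auto-played clips and one-second "views" are not real viewing.
    if event["type"] == "video_view" and (event["autoplay"] or event["duration_sec"] < 3):
        return True
    # Traffic that self-identifies as automation.
    if any(hint in event["user_agent"].lower() for hint in BOT_AGENT_HINTS):
        return True
    return False

suspects = [e for e in events if looks_fake(e)]
print(f"{len(suspects)} of {len(events)} events flagged as likely fake")
```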

None of the above is fictional; each is the reality of the uncontrolled, unaudited, increasingly machine-driven, complex, and secretive world of digital advertising. All the major players - Google, Facebook, Microsoft, ad networks like AppNexus and Mediamath - are implicated.

I raised the alarm two years ago in an article at Harvard Business Review, featuring the work of leading ad fraud researcher Dr. Augustine Fou. Recently, there has been a tidal wave of news reports about all kinds of ad fraud and fake data.

I have invited Dr. Fou to comment on this fast-developing situation in the Principal Analytics Prep Webinar on Wednesday night. Learn more about the Webinar and register for free here.

***

Most news items are written from the perspective of brand advertisers, who are belatedly waking up to the huge amounts of money wasted. But a big story is being missed: such waste was enabled by massive amounts of data that we now know are fake.

What about the zillions of reports, analyses and models created over the last 20 years by countless data "scientists" and analysts, in which the data from Google, Facebook, and myriad digital marketing vendors are taken at face value as accurate?

In fact, the digital advertising industry was built on the promise of being more measurable, more accountable, and more cost-effective. What Dr. Fou shows is that basic statistics are enough to uncover such fraud.
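As an illustration of how basic the checks can be - this is my own toy example with made-up numbers, not Dr. Fou's analysis - a z-score on click-through rates by traffic source flags the aberrant source, and a simple population ceiling exposes impossible reach figures.

```python
# Two elementary fraud checks on hypothetical campaign data.

import statistics

ctr_by_source = {  # clicks / impressions, invented numbers
    "site_a": 0.012, "site_b": 0.009, "site_c": 0.011, "site_d": 0.010,
    "site_e": 0.008, "site_f": 0.013, "site_g": 0.010, "site_h": 0.094,
}

mean = statistics.mean(ctr_by_source.values())
sd = statistics.stdev(ctr_by_source.values())
for source, ctr in ctr_by_source.items():
    z = (ctr - mean) / sd
    if abs(z) > 2:  # illustrative threshold
        print(f"{source}: CTR {ctr:.3f} sits {z:.1f} sd from the mean -- investigate")

# Reach cannot exceed the number of people in the target population.
population_ceiling = 258_000_000  # rough count of US adults
reported_reach = 300_000_000
if reported_reach > population_ceiling:
    print("reported reach exceeds the population ceiling -- fake by definition")
```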

Data cleaning is a huge time sink even without fake data; now we must also wrestle with mountains of it. But that is the reality, and we have to rise to it.

After the NBA Hackathon (see report here), I caught up with the winning team in the business analytics competition, DataBucket, composed of Barbara Zhan and Harold Li.

Junkcharts: Congratulations for winning the business analytics competition at the NBA Hackathon. As a judge, I was very impressed by how much work you were able to do in 24 hours. Did you sleep or did you work all the way through?

DataBucket: We slept for around 5 hours in the early mornings, but also took breaks every few hours just to relax our minds and recalibrate.

JC: The problem you chose to tackle is to define "entertainment value" for any NBA game. That's a huge problem to tackle in 24 hours. How did you allocate your time?

DB: We spent the first few hours planning our course of action, and really debating how to evaluate "entertainment value." Without a good metric, any sort of analysis would be fruitless. We also decided on our methodology (a time-series regression approach) and the features we wanted for our model.

Afterwards, we divided and conquered, cleaning / scraping the various datasets to get the variables that we wanted. Once we had a clean dataset, we ran regressions and played with features to get the most accurate and most intuitive results.

Once we were confident with our model, we spent time building out the Tableau dashboard that visualized those entertainment values. It was important for us to come up with a tool that was engaging, interactive, flexible and informative, so we spent significant time designing our visualization.

JC: How did you allocate work between the two team members?

DB: It was a joint effort! We came up with the initial plan after an hour or so of joint brainstorming. Barbara took the lead on the feature engineering / data modeling sections, coming from a quantitative hedge fund background where she knew a ton about regressions and the assumptions behind them, but both of us were highly involved in data cleaning and modeling. Harold took the lead on the visualization / presentation component, since he comes from an analytics background where storytelling and communicating results in a business context is vital. He created a Tableau dashboard that showcased our resulting entertainment metric, which updated over time, and lent a crucial "cool" factor to our presentation.

JC: Tell me about your backgrounds.

DB: We both majored in Operations Research and Financial Engineering at Princeton. Harold is a data scientist at Blue Apron, and previously worked at Goldman Sachs as a quantitative strategist. Barbara is a quantitative researcher at Two Sigma.

JC: I heard you guys have a blog called DataBucket. What's the origin of the name?

DB: When we were both at Princeton, we thought it would be fun to use our data science skills to answer questions we were interested in. Our first article sought to quantify the clutchness of NBA players, so we called it DataBucket to honor the basketball-related heritage of the blog!

JC: Your team chose to work on the problem of defining entertainment value of an NBA game. You incorporated data from Instagram into your solution. Can you explain what data you pulled from Instagram and how you used them?

DB: The prompt suggested that we incorporate creativity into our project, so we decided to use alternative data. Barbara was familiar with the Instagram API, having used it before for the DataBucket blog, and scraped counts of hashtags related to each player's name as a proxy for player popularity. Harold was familiar with Google Trends, which he used to scrape timely data on search terms that would be most relevant to a blockbuster game (e.g., "NBA on TNT").

JC: One of the highlights of your presentation was the decision-making tool you created for the manager. What tools did you use to build it?

DB: We used Tableau to visualize our dataset. Given the time constraints, Tableau was the easiest tool to create something interactive and visually appealing without much effort.

The regressions and data cleaning were done in Python and R - we used whatever we were most comfortable with and went with it!

JC: If you had a chance to do one thing differently in the Hackathon, what would it be?

DB: We definitely wanted to explore Twitter data a bit more. While Instagram is a good indicator of player popularity, Twitter is more of a real-time platform that captures more accurate sentiment of a particular game, but we couldn't hack the API in time.

On another note, we would have loved to forecast game-level entertainment value for this upcoming season instead of validating our model for this past season.

The current blowback on the digital advertising giants, Google and Facebook, comes as a surprise to me. I have written about the many problems of digital advertising performance measurement over the years, sometimes on this blog, but also in my public talks. Still, I was skeptical that anyone had an incentive to point at the elephant in the room. My talks about accountability in digital marketing have had a shelf life of a single delivery.

So I was pleasantly surprised to learn that the CEO of Restoration Hardware, a high-end furniture retailer, recounted at a conference organized by Goldman Sachs an internal conversation he had with his marketing team about Google Adwords. Zerohedge (link) has a delectable excerpt from those remarks, but before I get to those words, let me provide some context for those not familiar with how Google Adwords functions.

Google makes tens of billions of dollars in revenue each year from its Adwords product. Advertisers pay Google to place their "search ads" adjacent to Google search results. These Adwords ads are shown above the so-called organic (i.e., unpaid) search results, or to the right of the main results section, or, increasingly, at the bottom of search results as well. The ads used to be clearly delineated with a yellow background, but over the years Google has deliberately reduced the visual difference between paid and unpaid results, driving up its revenues via unconscious clicks on those Adwords ads.

These pictures show how Google search result display has evolved over time:

For example, Pottery Barn, one of RH's competitors, buys the keyword "Pottery Barn." This means that when a Google user types "Pottery Barn" into the search box, an ad from Pottery Barn shows up as the top result on the page, in the Adwords section inserted above the unpaid results. The first unpaid result is Pottery Barn's home page. So the user can either click on the ad (an action that causes PB to pay Google) or click on the first organic result below the ads (an action for which Google collects no direct revenue). Whichever option is chosen, the user ends up on Pottery Barn's home page. See the following illustration:

The CEO of Restoration Hardware described how he decided to curb spending on Adwords. In the past, RH did exactly what PB does above: it bought the keyword "Restoration Hardware" and paid Google every time a user searched for "Restoration Hardware" and clicked on the first link on the page - the ad that is formatted to look like an organic, unpaid result. Today, RH no longer buys those ads.

The following is an amusing description of the exchange between the CEO and his marketing team, which eventually led to pulling the Adwords budget [bolding is due to Zerohedge, not me]:

We had our marketing meeting in the company several years ago and the online marketing team was pitching to double their budget, right, and at the time, say, look, nobody in the company is doubling their budget. But tell me why you believe that's the right thing to do. And they said, well, look, our customer acquisition cost and our ad cost is the lowest in the company. And I said, well, tell me about the data, show me how. And they said, well, people who click through the words that we buy on Google, the ad cost was lowest. And I said, how do you know that they're clicking on the word and going to the website because of the word you bought versus they saw a store or they received a source book? They said, oh, we know.

I said, well, how many words do you buy? They said 3,200. 3,200 words. I said, well, what are the top words? How are they ranked, the ranking of the words? Oh, we don't have that, right. And I was getting the look at like, oh, Gary is kind of one these old brick-and-mortar guys. He just doesn't get it.

And I said, well, what are the top 10 words? And they didn’t have the information. I said, why don't we cancel the meeting and come back next week when you have the data? I'm sure that Google sales representatives who are taking you to the expensive lunches and selling you the 3,200 words have that data. So why don't we get the data and then let, review the data?

And they came back the next week and we sat in a meeting and all of a sudden, I can tell you there's a little change in the faces. They had to wear it kind of down. Everybody kind of came in. I said, so what did we find out?

And they said, well, we've found out that 98% of our business was coming from 22 words. So, wait, we're buying 3,200 words and 98% of the business is coming from 22 words. What are the 22 words? And they said, well, it's the word Restoration Hardware and the 21 ways to spell it wrong, okay?

Immediately the next day, we cancelled all the words, including our own name. By the way, we are paying for the little shaded box above our words and said, oh no, we have to hang on to that because Pottery Barn might squat on top of us. I said, excuse me? I said, if someone goes to a mall or a shopping center and they're going to Restoration Hardware and there's a Pottery Bam there, they're already squatting, okay? It doesn't mean they're going to go into their store. If somebody wanted to buy a diamond from Tiffany and just because Zale's is sitting on top of them in a shaded box doesn't mean they're going to go to Zale's and buy a diamond.

I mean, I can't believe how many companies buy their own name and they're paying Google millions of dollars a year for their own name, like maybe if this is webcast, right, a lot of people are going to go, holy crap. They're going to look at their investments. They'd go, maybe we don't need to buy our own name. Google's market cap might go down...

***

What are the key lessons from this anecdote? Let me summarize them here for you:

1. This is a CEO I would want to work for, if I were in the job market. This is someone who asks the right questions, and uses data the right way to make good business decisions. Bravo!

We encounter several of the key selling points used by Google (and other digital advertising agencies), and amusingly, this CEO shot them all down.

2. The first selling point is that smirk: anyone who questions the effectiveness of digital ads is the old guy who doesn’t get digital marketing. Well, this CEO still wanted the marketing team to produce data, and so they went back to find him the numbers.

3. The second selling point used by digital advertising agencies is convenience: the vendor does all the work in delivering the ads, and analyzing and reporting performance; the user just needs to pay the invoices, and copy and paste the performance numbers to their presentations. You pay someone to do a job and you ask that person to tell you whether he/she did a good job or not – well, what answer do you expect to hear? In this case, when the CEO has questions about the data, the marketing team goes to the Google sales rep to get the answers.

4. Like almost everyone, the CEO learns that substantively all of the performance is due to "branded" keywords, that is, keywords containing their own brand name - Restoration Hardware, plus 21 different misspellings of it. Yes, that means many Google users wanting to visit RH's website, instead of going there directly, type Restoration Hardware into Google and click on one of the first links to get there. Google acts as a toll booth that these users pass through on the way to RH. (A sketch of the keyword-concentration analysis behind this finding appears after this list.) Advertising sales reps have a way to counter this argument.

5. They say that if you don't buy your own brand name, your competitor might scoop you. This is indeed true. The picture below shows that Pottery Barn purchased the Restoration Hardware keyword after Restoration Hardware stopped spending on Adwords. The first link, in the Adwords area, goes to Pottery Barn, while the first link in the organic search area goes to Restoration Hardware.

The CEO has a perfect response to this selling point. "If someone goes to a mall or a shopping center and they're going to Restoration Hardware and there's a Pottery Bam there… It doesn't mean they're going to go into their store." So true. If someone is specifically searching for "Restoration Hardware", what's the chance that this person sees a Pottery Barn ad and clicks there instead? Not high, I reckon.
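To make lesson 4 concrete, here is a sketch of the analysis the CEO forced his team to produce: rank keywords by attributed revenue, then count how many account for 98% of it. The revenue figures are invented; a real account would pull them from its Adwords reports.

```python
# Keyword-concentration analysis with invented revenue numbers.

brand = [("restoration hardware", 9_500_000)]
brand += [(f"brand misspelling {i}", 25_000) for i in range(1, 22)]  # 21 variants
tail = [(f"generic keyword {i}", 60) for i in range(1, 3_179)]       # 3,178 terms

keywords = sorted(brand + tail, key=lambda kw: kw[1], reverse=True)
total = sum(revenue for _, revenue in keywords)

running = 0
for count, (name, revenue) in enumerate(keywords, start=1):
    running += revenue
    if running / total >= 0.98:
        break

print(f"{count} of {len(keywords):,} keywords produce 98% of attributed revenue")
# -> 22 of 3,200 keywords produce 98% of attributed revenue
```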

This week, we are celebrating a company for not doing something outrageous.

There are reports (e.g., NPR) that Uber has decided to pull the plug on one of its more invasive data collection schemes. If you had the Uber app, it tracked your location for an additional five minutes after you finished your Uber trip, even if the app was closed. Now Uber says it will stop doing this.

Uber previously argued that data on your whereabouts five minutes after you get out of the Uber car can be used to "improve pickups, drop-offs, customer service, and to enhance safety."

This episode reveals a few things about mobile app technology. We learn that

Apps are not really off when they are turned off

Apps may be collecting all kinds of data about you and your phone 24/7/365, whether they are on or off (previously, Uber was caught tracking users even after they deleted the app)

Unless the app developer discloses the information, the app user does not know what data are being collected by the app at any given time. This is a similar situation to not knowing whether the camera on your laptop is filming you or not.

Won't do and can't do are two different things. For example, it was not a technical limitation that kept Uber from tracking you beyond those five minutes. If they had wanted to, they could have. So users can only take app developers at their word when they say they won't do something.

As app developers like to say, if you don't like their behavior, don't use their service. In the case of Uber, which is not an essential app, that statement is true.

The other day, I noticed that the Peet's Coffee near me added a footnote to their menu sign.

I'm talking about this note at the bottom:

Milk-based beverage calories calculated using 2% milk, except for Havana Cappuccino and Black Tie.... 2,000 calories a day is used for general nutrition advice, but calorie needs vary...

These few sentences really speak to why an analyst must know how metrics are defined before he or she can interpret them.

What I learned about measurement from this calorie estimation footnote:

The number of calories for the same beverage depends on assumptions, such as whether one uses 2% milk, no-fat milk, or whole milk (see the sketch after this list)

Definitions have exceptions: here, Havana Cappuccino and Black Tie are exceptions - the note doesn't say in what way, but I know those two drinks use condensed milk

There may be other hidden assumptions

The calorie metric only attains meaning when a reference level (daily calorie need) is provided

The reference level is provided as a one-size-fits-all number; even the writer of this note realizes it is too generalized

The reference level is defined with another pile of hidden assumptions

I wonder if the calorie counts come from formulas or some kind of lab measurement
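A toy calculation makes the first and fourth points concrete. The per-cup milk calories below are typical published figures, and the recipe (a shot of espresso plus a cup and a quarter of milk) is my guess at a generic latte, not Peet's actual recipe.

```python
# How the milk assumption moves a latte's calorie count, and how the
# count only gains meaning against the 2,000-calorie reference level.

CAL_PER_CUP = {"nonfat": 83, "2%": 122, "whole": 149}  # typical milk values
REFERENCE = 2000   # the one-size-fits-all daily figure from the footnote

espresso_cal = 5   # assumed
milk_cups = 1.25   # assumed milk content of a 12 oz latte

for milk, cal in CAL_PER_CUP.items():
    total = espresso_cal + milk_cups * cal
    print(f"latte with {milk:>6} milk: {total:4.0f} cal "
          f"({100 * total / REFERENCE:.0f}% of a {REFERENCE}-cal day)")
```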

Whether one believes the calorie estimation on restaurant menus depends on whether one knows the assumptions used to produce those estimates. Also, trusting a metric is just one step; one must then figure out how to use the metric for a purpose.

Something truly game-changing might have happened. According to TheNextWeb, a judge has told Linkedin (now owned by Microsoft) that it cannot stop people from using scripts to scrape public data off the Linkedin website.

I have mentioned this issue before - web scraping sits in a legal gray area. Web scrapers appear to a website like human visitors; these bots exist simply to collect data off the site. It's clear that the owners of many websites (Kayak, Linkedin, ESPN, etc.) do not like web scraping: they deploy technologies to block such bots.
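For the uninitiated, a scraper is just a short script. The sketch below uses the popular requests and BeautifulSoup libraries against a placeholder URL and a made-up page structure; with a browser-like User-Agent header, its requests can be hard to distinguish from a human visitor's.

```python
# Minimal scraper sketch; the URL and the CSS selector are placeholders.

import requests
from bs4 import BeautifulSoup

URL = "https://example.com/public-profiles"   # not a real target
headers = {"User-Agent": "Mozilla/5.0"}       # presents itself as a browser

response = requests.get(URL, headers=headers, timeout=10)
response.raise_for_status()

soup = BeautifulSoup(response.text, "html.parser")
names = [tag.get_text(strip=True) for tag in soup.select("h2.profile-name")]
print(f"scraped {len(names)} profile names")
```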

There are several legitimate business reasons for opposing web scraping. For example, a retailer like Amazon does not want competitors to know all of its prices. The downside of the convenience of 24/7 online shopping is that it erodes information asymmetry: in the pre-Internet days, a store could run special prices in some locations without word spreading to other locations, but this is no longer the case. Information asymmetry is what allows profits to materialize.

Besides, the data might have been purchased for real money from some vendor. In that case, the vendor will forbid the buyer from publishing all of the data; otherwise the vendor loses revenue whenever a potential customer scrapes it off the buyer's website. The buyer also has an incentive not to give away its data assets, say, to a potential competitor.

From the web scraper's perspective, the data are publicly displayed, so why should there be a restriction on usage? For social-media sites, the data are contributed by users, not generated by the site owners, so ownership of the data is muddled. When site owners bury a prohibition on web scraping in their terms and conditions (knowing full well that most people never read them), this creates at minimum an ethical issue, and possibly a legal one. With the Linkedin ruling, however, the clouds may be clearing.

I first mentioned "data sleaze" in reaction to the article disclosing that Uber was secretly collecting certain user data. In that post, I defined data sleaze as:

Data sleaze is the data about one's own customers that are obtained secretly by businesses, and then sold to the highest bidders, also in secret transactions. The production of data sleaze is frequently justified by giving services away for "free." However, running a business as a "free service" fronting a profitable espionage operation is a choice made by management teams, not an inevitability. Indeed, many businesses that have a proper revenue model also produce data sleaze.

Data sleaze is going to explode with the introduction of the so-called Internet of Things (in short, tech firms intend to have every device in your home beam data back to home base all day long). A warning of what's to come is in this Gizmodo article about Roomba, the vacuum-cleaning robot. Gone are the days when customers cough up their data in exchange for some "free" service: the Roomba is not free. The robot collects data about homes while it roams around performing its advertised function - cleaning the floors. The device creates a detailed map of its user's home. Roomba then turns the data over to third parties for a fee - that was what the CEO told the investment community, until he got some blowback and seemingly backtracked, claiming that he had been misunderstood and that Roomba merely "share[s] the maps for free with customer consent."

The CEO said the company will never sell the data, but its privacy policy allows it to do so. Consumers cannot verify his statement because business contracts are not shared publicly. Some models of Roomba have a camera, and the company states that "[the camera] is separated from any wireless or wired transmission." It did not say that video images or clips of people's homes are never transmitted beyond the device; video files can still be transmitted even if the camera itself is not directly connected to a transmitter. Meanwhile, the concept of "customer consent" has been irreparably compromised by the prevalence of forced consent ("if you don't want to consent, we won't serve you").

The article confirms a few things about Roomba:

The company has collected maps of people's homes through cameras and other means. These maps are beamed to the "cloud."

The maps that Roomba possesses are more detailed than those that are shown to users via the Roomba app

Roomba is talking to multiple third parties about selling or sharing data with them

The data will survive the dissolution of Roomba, as the company is free to sell the data to an acquiring firm

Another good article about this is in the San Francisco Chronicle (link).

Consider the widely circulated claim that CUNY propelled six times as many students from lower-income families into the middle class and beyond as the eight Ivy League schools combined. One would think that this is a fact-based statement that can be verified by all parties, but in reality, despite the many big words, the statement is imprecise. To truly understand it, we have to ask the writer to define many terms.

First, "6 times as many" is a number without a unit. Are they talking about 6 times as many students? Or is it that the rate of propulsion is 6 times higher? If the unit is student, then we need to know the relative student body sizes of CUNY campuses versus the eight Ivies combined.

It's clear that CUNY takes many more students from below the middle class than the Ivies do - there may not be that many students at Ivy League schools whose families are below the middle class. If the claim refers to the absolute number of students, then the Ivies are no match simply because they enroll far fewer students from lower-income families.
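A toy calculation shows why the unit matters. With invented enrollment numbers (not the study's actual figures), the same "6 times as many" headline can coexist with CUNY having a much lower success rate.

```python
# Counts versus rates, with invented numbers.

cuny  = {"low_income_students": 120_000, "propelled": 30_000}
ivies = {"low_income_students": 10_000,  "propelled": 5_000}

count_ratio = cuny["propelled"] / ivies["propelled"]
cuny_rate = cuny["propelled"] / cuny["low_income_students"]
ivy_rate  = ivies["propelled"] / ivies["low_income_students"]

print(f"absolute counts: CUNY propels {count_ratio:.0f}x as many students")
print(f"rates: CUNY {cuny_rate:.0%} vs Ivies {ivy_rate:.0%}")
# -> 6x as many students, yet a 25% rate vs the Ivies' 50%
```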

Are they counting all students who start first year, or just students who graduate? Do they include graduate and/or post-graduate students?

Second, "middle class and beyond" also needs a definition. There is no standard for the middle class. Are they looking at individual or household income?

Third, timing matters a lot. What is the observation window? Presumably they are not measuring the graduates' incomes one month or one year after graduation - but we don't know for sure.

Fourth, how accurate are the income estimates? Are the inaccuracies uniform across all these schools? Is the amount of missing data comparable across all these schools?

Fifth, one must ask if there are any shameless shenanigans such as data being removed because of "outliers" and so on. An example might be: we don't want to include a particular campus because it is too new, and would not be representative of future results.

***

What are some practical lessons from this?

If a data analyst is given this statement and free access to the database, he or she will have a hard time replicating the analysis that led to the conclusion. There are simply too many unspoken definitions.

Bringing such a statement to a meeting does not bring certainty. It may provoke many questions. People in the meeting may not agree on these definitions. Changing definitions may change the conclusion.

Harvard reportedly rescinded admission offers to a number of incoming students over offensive posts in a private Facebook group (link). Not reported is how Harvard admissions officers got hold of such information. Is it possible there were jealous classmates?

I am not condoning the bad behavior - I put this link up to remind people that: (a) no data are private, not even "deleted" data; (b) data collected for one purpose can be used for another; (c) there is a difference between what one does with the data and what one could do if one chose to; and (d) everyone has enemies, given the appropriate context.

P.S. Additional reporting indicated that the private Facebook group was splintered off of an official Facebook group that Harvard admissions set up for admitted students to make friends with each other. We still don't know how they knew what happened inside a private group.