Posts categorized "Business tip"

Over the last week, the recommendation engines kept prompting me to read the Slate article titled "Don’t Just Look at the Odds: The Math Says You Should Buy a Mega Millions Ticket." The claim is obviously false, but eventually I succumbed and clicked on the link.

I wanted to stop reading after I encountered this sentence, about halfway through the article: "One assumption in this model is that you’ll take the annual payment option and receive the full $1.6 billion."

The pertinent assumption isn't whether the winner takes the lump sum or the annual payments. The blockbuster assumption is that there would be only a single winner.

Earlier in the article, they stated a true but irrelevant point: that the odds of a winning ticket and the price of each ticket stay the same no matter how many tickets are sold.

I persisted in reading the entire article for the sake of writing this post. I wanted to know whether they would acknowledge the biggest factor here: the more tickets are purchased, the more likely there will be more than one winner. Even having two winners is devastating. The $1.6 billion suddenly turns into $0.8 billion. With three winners, it's $0.53 billion. If the winning ticket is purchased as part of an office pool with, say, 10 participants, then $1.6 billion becomes $160 million. Not a bad payday, but much, much worse than projected.
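The arithmetic is simple but worth writing down. A tiny sketch, using the figures from the paragraph above:

```python
# How the advertised jackpot shrinks as the number of winning tickets
# (and then pool participants) grows.
JACKPOT = 1.6e9  # advertised prize in dollars

def share_per_winner(jackpot, n_winners):
    """Each winning ticket splits the jackpot equally."""
    return jackpot / n_winners

print(share_per_winner(JACKPOT, 1))        # sole winner: $1.6 billion
print(share_per_winner(JACKPOT, 2))        # two winners: $0.8 billion
print(share_per_winner(JACKPOT, 3))        # three winners: ~$0.53 billion
# An office pool of 10 splits a sole winner's share ten ways:
print(share_per_winner(JACKPOT, 1) / 10)   # $160 million each
```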

PS. They updated the post after too many people pointed out the obvious about multiple winners. They continue to deny its importance.

Their math is wrong. Here is a good webpage that explains the math from a similar situation 10 years ago. The number of winners follows what is called a Poisson distribution. In the example given, there is a 60% chance of a single winner, a 30% chance of two winners, and a 10% chance of more than two winners. How can this be called "incredibly low odds"?
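To make the Poisson point concrete, here is a rough sketch in Python. The ticket count (300 million) and the 1-in-302.6-million odds are my own illustrative assumptions, not figures from the article; with inputs in this neighborhood, the conditional split comes out close to the 60/30/10 breakdown described above.

```python
import math

def winner_distribution(tickets_sold, p_win):
    """Under independent random picks, the number of winning tickets is
    approximately Poisson with mean = tickets_sold * p_win."""
    lam = tickets_sold * p_win
    pois = lambda k: math.exp(-lam) * lam**k / math.factorial(k)
    p0, p1, p2 = pois(0), pois(1), pois(2)
    # Condition on there being at least one winner:
    return {
        "no winner": p0,
        "single winner, given a winner": p1 / (1 - p0),
        "two winners, given a winner": p2 / (1 - p0),
        "3+ winners, given a winner": 1 - (p1 + p2) / (1 - p0),
    }

# Assumed figures for illustration only:
print(winner_distribution(300e6, 1 / 302.6e6))
```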

Another issue is that certain sets of numbers attract many more tickets than usual. If one of those sets wins, you're going to have lots of winners.

It is conventional wisdom that A/B testing (or in proper terms, randomized controlled experiments) is the gold standard for causal analysis, meaning if you run an A/B test, you know what caused an effect. In practice, this is not always true. Sometimes, the A/B test only provides a statistical understanding of causes but not an average Joe's understanding.

Let's start with a hypothetical example in which both definitions are aligned. The only difference between the test and control groups was (junk) mail sent to the test group that advertised Amazon's Prime Day. If a higher proportion of the customers who got the junk mail shopped at Amazon on Prime Day than the control group, then everyone agrees that the mailing caused an increase in shopping activities (assuming the difference was statistically significant). Of course, not everyone was influenced by the mail but on average, more people in the test group shopped.

This consensus falls apart in the canonical A/B testing example. Imagine that the only difference between the test and control groups is the background color of the home page. The test group saw a darker background than the control group. The test group had a lower bounce rate than the control group. Statistically, the background color is regarded as the cause of the drop in bounce rate.

But what does that last sentence mean to the average Joe? There seems to be a missing link. It really isn't the color that is influencing someone's browsing behavior. It may be that the darker color causes certain text or images on the home page to be more visible, which causes more visitors on average to browse more pages, sticking around for a longer period of time. It could be the opposite. The darker color may make certain text less visible, which causes visitors to stick around for a longer period of time because they were struggling to find what they were looking for!

In other words, the background color may only be an indicator of the cause, not the true cause itself.

Some people have argued that it doesn't matter. You don't need to know what the true cause is. All you care about is that this result can be replicated.

I disagree with that position because taking that stance frequently leads to misguided future actions. The belief that background color somehow caused lower bounce rates inevitably leads you to test a variety of other colors (red, green, yellow, orange, blue, purple, ...). Almost all of these tests would be a waste of time. If, on the other hand, you learn that making your text clearer to visitors is the true cause, then you may instead test larger fonts, different font types, placement of the text, etc.

In short, you should have some mental model behind the cause and effect mechanism.

In Part 1, I covered the logic behind recent changes to the statistical analysis used in standard reports by Optimizely.

In Part 2, I ponder what this change means for more sophisticated customers--those who are following the proper protocols for classical design of experiments, such as running tests of predetermined sample sizes, adjusting for multiple comparisons, and constructing and analyzing multivariate tests using regression with interactions.

For this segment, the choice of sticking with the existing protocol or not depends on many factors, such as the decision-making culture and corporate priorities. No matter what you do, it is important to realize that improved analysis tools do not obviate careful planning and execution.

***

Let me start with my advice. Initially, keep running your tests to the usual fixed sample sizes; in essence, ignore the stopping rule suggested by the Stats Engine. Over a series of tests, including some A/A tests, you can measure how often those stopping rules would have correctly ended the tests (relative to the fixed-size testing protocol). This allows you to estimate the "time saving" achieved from sequential testing.

***

As I pointed out in last year’s presentation at the Optimizely Experience, the testing team should be concerned about what proportion of significant findings are correctly called, and what proportion of non-significant findings are incorrectly called. The “false discovery rate” is the flip side of the first quantity.

A testing program using fixed samples may face one of several problems:

a) Too few tests are called significant.

b) Too many tests are called significant.

c) It takes too long to call a test.

You need to figure out which of these is your biggest problem.

Conceptually, relative to a fixed-size test, a sequential test saves time if the true response rate differs from the design assumption substantially. If you’re testing on a web page for which the response rate is well-known and relatively stable, then there should be hardly any time saving on average. This is why I don’t recommend watching tests like a horse race, minute by minute. (As I said in Part 1, if you are watching a horse race, the Stats Engine will provide some sanity.)

Assume that you underestimated the true effect by, say, 20 percent. The following stylized chart shows my expectation of how the new Stats Engine results compare to the classical results.

The horizontal axis shows the sample size (at which Optimizely calls an end to the sequential test) as a ratio of the fixed sample size (by design). When this is 100%, the sequential test has the same length as the fixed-sample test. Because the true effect is substantially larger than expected, for a large proportion of tests, the sequential procedure calls for an “early” exit. However, there will be a small number of tests for which the sequential test will end much later than a fixed-sample test.

On the other hand, if the design assumption is essentially correct, then I expect the behavior of the new Stats Engine will look something like this.

The line is mostly flat, meaning there is a roughly equal probability of the test ending at any sample size, including sample sizes that are multiples of the fixed-sample requirement. This is the "price to pay" for doing sequential testing, i.e. multiple peeking. At the lower end of sample sizes, I expect a slight positive curve, because the Bayesian prior (assuming it is a skeptical prior) will prevent tests from being stopped "too early".

[Thanks to Optimizely’s statistics team for entertaining my inquiries about this intuition.]

***

How important is saving time for your testing program? This depends on your readiness to move on. My experience is that unexpected time saving, say calling a winner one week before the test was supposed to end, frequently gets eaten up by the organization’s inability to move schedules around. Your IT or web developers may have other projects on their plates.

Further, if you tend to look at data by segments post-hoc, I don't think the current implementation supports that. If you know what segments you care about beforehand, then you can build those into the design.

Most importantly, please don’t fall into the trap of thinking that design and upfront planning become unimportant because of sequential testing and FDR. The design phase is very important in establishing expectations and facilitating communications within the organization.

I also recommend reading this post by Andrew Gelman on data-dependent stopping rules.

In my HBR article about A/B testing (link), I described one of the key managerial problems related to A/B testing--the surplus of “positive” results that don’t quite seem to add up. In particular, I mentioned this issue:

When managers are reading hour-by-hour results, they will sometimes find large gaps between Groups A and B, and demand prompt reaction. Almost all such fluctuations result from temporary imbalance between the two groups, which gets corrected as new samples arrive.

Over the holidays, I paid a visit to the Optimizely team, and learned that they have been developing a solution to this problem. (Optimizely is one of the leading platforms for online A/B testing. They just made an announcement this week about a new feature they are calling “the New Stats Engine”.)

Optimizely also recognizes that their clients face a credibility crisis when the A/B testing tool returns too many “significant” results. Their new tool promises to reduce this false-positive problem. They tackle specifically two sources of the problem:

a) Many clients monitor A/B tests like horse races, and run tests to significance. This is sometimes known as “sampling to a foregone conclusion”.

b) Many clients run many (dozens to hundreds, I imagine) tests simultaneously; here, a test is any pairwise comparison of variations, comparison of variations within segments, or any comparison using multiple goals. This is the “multiple comparisons” problem.

***

Let me first explain why those are bad practices.

The classical hypothesis test is designed to work with fixed sample sizes, which should be determined prior to the start of the test. The testing protocol then allows up to a 5-percent probability of falsely concluding that there is an effect. (That's the same value as the significance level. It is not the same as saying 5 percent of the positive results are false, but that's a different article.) However, if the analyst peeks at the result multiple times during a test, then the analyst incurs a 5-percent false-positive chance, not once, but for every such peek. Thus, at the end of the test (when significance is reached), the probability of a false positive is much, much higher than 5 percent. It can be shown that, in this setting, every A/A test will eventually reach significance.
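As a sanity check on this claim, here is a small simulation sketch of my own (not Optimizely's method): an A/A test with no true effect, analyzed once at a fixed sample versus re-tested after every batch of new data. The batch size, response rate, and number of peeks are arbitrary choices for illustration.

```python
import math, random

def z_test(successes_a, n_a, successes_b, n_b):
    """Two-proportion pooled z-test; True if significant at the 5% level."""
    p = (successes_a + successes_b) / (n_a + n_b)
    se = math.sqrt(p * (1 - p) * (1 / n_a + 1 / n_b))
    if se == 0:
        return False
    z = (successes_a / n_a - successes_b / n_b) / se
    return abs(z) > 1.96

def run_aa_test(n_batches=50, batch=100, rate=0.1, peek=True, rng=random):
    """Simulate an A/A test: both arms share the same true response rate."""
    sa = sb = 0
    for i in range(1, n_batches + 1):
        sa += sum(rng.random() < rate for _ in range(batch))
        sb += sum(rng.random() < rate for _ in range(batch))
        if peek and z_test(sa, i * batch, sb, i * batch):
            return True  # "significant" at some peek; test stopped early
    return z_test(sa, n_batches * batch, sb, n_batches * batch)

trials = 400
rng = random.Random(7)
peeky = sum(run_aa_test(peek=True, rng=rng) for _ in range(trials)) / trials
rng = random.Random(7)
fixed = sum(run_aa_test(peek=False, rng=rng) for _ in range(trials)) / trials
print(f"false-positive rate with peeking: {peeky:.0%}, fixed-sample: {fixed:.0%}")
```

With a single look, the false-positive rate hovers near the nominal 5 percent; peeking after every batch pushes it far higher, even though there is no effect to find.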

In a “multivariate” test, the analyst makes many pairwise comparisons, and each comparison is analogous to a peek of the data. Each comparison incurs a 5-percent false-positive chance so that across all of the comparisons within one test, the chance of seeing at least one false positive result is exponentially larger. There are many, many different ways to suffer a false positive (an error in comparison 1 only, in comparison 2 only, etc., in comparisons 1 and 2, in comparisons 1 and 3, etc.).
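The back-of-envelope arithmetic, assuming for simplicity that the comparisons are independent (real comparisons are typically correlated, so this is only a sketch): with m comparisons each at the 5-percent level, the chance of at least one false positive is 1 - 0.95^m.

```python
# Chance of at least one false positive across m independent comparisons,
# each run at the 5-percent significance level.
for m in (1, 5, 10, 20, 50):
    print(m, round(1 - 0.95**m, 3))
```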

Now, if the multivariate test is also being run to significance, you have a many-headed hydra of a problem.

***

The Optimizely solution uses two key results from statistics:

a) A sequential testing framework is adopted, in which the analyst is presumed to be peeking at the results. Because of the skeptical prior, the Bayesian analysis will in most cases not reach significance even as sampling continues. This line of research started in the 40s with Wald.

b) All solutions to the multiple comparisons problem involve tightening the threshold of significance at the individual comparison level. Optimizely adopts the Benjamini-Hochberg approach to controlling the "false discovery rate" (FDR), defined as the proportion of significant results that are in fact false. This line of research is from the 90s, and still very active. One advantage is that the FDR is an intuitive concept.
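For readers who want to see the mechanics, here is a minimal sketch of the textbook Benjamini-Hochberg step-up procedure (not Optimizely's implementation; the p-values are made up):

```python
def benjamini_hochberg(p_values, fdr=0.05):
    """Return indices of hypotheses rejected at the given FDR level.

    BH step-up rule: sort the m p-values; find the largest rank k with
    p_(k) <= (k/m) * fdr; reject the k hypotheses with the smallest p-values.
    """
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    k = 0
    for rank, idx in enumerate(order, start=1):
        if p_values[idx] <= rank / m * fdr:
            k = rank
    return sorted(order[:k])

# Ten hypothetical comparisons from one test:
pvals = [0.001, 0.008, 0.039, 0.041, 0.042, 0.06, 0.074, 0.205, 0.212, 0.4]
print(benjamini_hochberg(pvals))
```

Note that a naive per-comparison 5-percent threshold would have declared the first five comparisons significant; BH keeps only the first two, which is exactly the "tightening" described above.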

***

What this means for Optimizely clients is that your winning percentages (i.e., the proportion of tests returning significant results) will plunge! But before you despair: this is actually a great thing. Here's why: in many testing programs, as I pointed out in the HBR article (link), there are too many "positive" findings, which means there are too many false positives. This is fine until management starts asking why those positive findings don't show up in the corporate metrics.

If you currently rely on standard Optimizely reports to read test results, and run tests to significance, then the Stats Engine is surely a no-brainer.

In the next post, I have further thoughts for those customers who have more advanced protocols in place.

PS. This is Optimizely's official explanation of their changes on YouTube.

There have been few updates here lately because I was working on things for other people. One of those things showed up today. Here is an excerpt from the beginning of my new article on HBR:

For over 10 years and at three companies, I set up and ran A/B testing programs, in which we test a new offer with half a sample against a control group which doesn’t get a new offer. Executives quickly pick up on the headline benefit of testing: that A/B tests provide reliable answers to “why” questions. This comes as no surprise, as such testing has long been held up as the “gold standard” for learning cause-and-effect in scientific research, clinical studies and direct marketing. However, many executives eventually reach a mid-life crisis, developing doubts about the direction of the A/B testing program.

From my experience, here are three of the most common questions that arise from those doubts, and how managers should think about them.

Here are five amazing recommendations by Avinash Kaushik from a post about how to make Web analytics dashboards better by simplifying.

1. Dashboards are not reports. Don't data puke. Include insights. Include recommendations for actions. Include business impact.

2. NEVER leave data interpretation to the executives (let them opine on your recommendations for actions with benefit of their wisdom and awareness of business strategy).

3. When it comes to key performance indicators, segments and your recommendations make sure you cover the end-to-end acquisition, behavior and outcomes.

4. Context is everything. Great dashboards leverage targets, benchmarks and competitive intelligence to deliver context. (You'll see that in above examples.)

5. This will be controversial but let me say it anyway. The primary purpose of a dashboard is not to inform, and it is not to educate. The primary purpose is to drive action!

Somewhere along the way we've lost our way. Dashboards are no longer thoughtfully processed analysis of data relevant to business goals with an included summary of recommended actions. They are data pukes. And data pukes are not dashboards. They are data pukes.

For those in Boston/Cambridge, I will be speaking at the Chief Data Scientist meetup on Wednesday night. See you there.

***

Warning: this post may be hard to understand if you don't know SQL.

SQL is one of the most fundamental tools in data science. It is used to manipulate data. Its simplicity is a big reason for its popularity. There are lots of things it can't do, but the few tasks it does support cover the majority of what analysts need.

Over the years, I have noticed some bad habits of SQL coders. These habits tend to prevent the coders from “seeing” the imperfections in their data. Here are a few:

“Select top N” to “spot check” the data

Most analysts realize that they need to check the integrity of a data set. The easiest “check” is to eyeball the top N rows of data. In most cases, the data set is ordered in some way, not necessarily known to the analyst, so the top N rows do not form a representative sample of all rows.

Even if those top rows were effectively random, it's not clear what checks the analyst is performing mentally as he or she scrolls up and down a printed list of, say, 100 rows and 20 columns of data. Is the analyst looking for missing data? For extreme values? For discontinuity in the distribution? For out-of-range values? None of these tasks is simple enough to do in one's head.

Further, if there is a problem with the data, it usually comes from extreme or missing values, which are rare. Or if the data contain text, it may be that a few rows contain bad characters that will trip up SQL during a routine task.

Here’s the bottom line: if the data problem affects a huge chunk of the data, you will find it using a spot check, or any kind of checking. But most data problems affect a small corner of the data. A spot check will almost always miss these, leading to a false negative problem. The real trouble is when the analyst issues a bill of health after a spot check.
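Instead of eyeballing rows, you can compute aggregate checks that directly answer those questions: row counts, missing values, out-of-range values, extremes. Here is a sketch using Python's built-in sqlite3; the orders table and its columns are hypothetical.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE orders (customer_id INTEGER, amount REAL)")
# Hypothetical data, seeded with a missing value and a negative amount:
con.executemany("INSERT INTO orders VALUES (?, ?)",
                [(1, 25.0), (2, 40.0), (2, 40.0), (3, None), (4, -5.0)])

# One query covers the whole table, not just the "top N" rows.
row = con.execute("""
    SELECT COUNT(*)            AS n_rows,
           SUM(amount IS NULL) AS n_missing,
           SUM(amount < 0)     AS n_negative,
           MIN(amount),
           MAX(amount)
    FROM orders
""").fetchone()
print(row)
```

A spot check of the top few rows could easily miss both problem records; the aggregate query cannot.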

Assume that a data table has no duplicate rows

When merging data sets, it’s very easy to generate duplicate rows, if one or both of the data sets contain duplicate rows of the same match key. For instance, the analyst is merging the customer sales history data with the customer contact information data. The match key is the customer id number.

It is normal to assume that the contact database has only one row for each customer (who would design this table in any other way?), and nine times out of ten, this assumption will be correct.

The one time is when your CEO needs an updated sales number right now for a board meeting. Oops, the sales number is double the expected value. The culprit: duplicate customer ids made their way into the contact table, so that when it is merged with the transaction history, each sales record is replicated one or more times.

It may sound like a waste of time but before merging any data, check each table for duplicates.
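To see the failure mode concretely, here is a toy demonstration using Python's built-in sqlite3; the tables, columns, and the stray duplicate are all made up.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE sales (customer_id INTEGER, amount REAL)")
con.execute("CREATE TABLE contacts (customer_id INTEGER, email TEXT)")
con.executemany("INSERT INTO sales VALUES (?, ?)", [(1, 100.0), (2, 50.0)])
# One stray duplicate id in the contact table:
con.executemany("INSERT INTO contacts VALUES (?, ?)",
                [(1, "a@x.com"), (2, "b@x.com"), (2, "b2@x.com")])

true_total = con.execute("SELECT SUM(amount) FROM sales").fetchone()[0]
joined_total = con.execute("""
    SELECT SUM(s.amount)
    FROM sales s JOIN contacts c ON s.customer_id = c.customer_id
""").fetchone()[0]
print(true_total, joined_total)  # the join silently inflates the total

# The pre-merge check: does any key appear more than once?
dupes = con.execute("""
    SELECT customer_id, COUNT(*) FROM contacts
    GROUP BY customer_id HAVING COUNT(*) > 1
""").fetchall()
print(dupes)
```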

Use open-ended time windows

In business analytics, we are always counting events over time, be they sales transactions, clickthroughs, responses to offers, etc. I have a pet peeve: code that does not have explicit accounting windows, meaning a starting time and an ending time.

Such code is not auditable. Every time you run the code, it will generate a different count (unless your business has infrequent events). If you wrote the code yesterday, and I ran it today, the counts would be different. How will I know if the difference is entirely due to the longer accounting window or if there are problems with the underlying data?

The usual excuse for this coding practice is that the business wants the “most updated” number, up to the very last microsecond. Let me assure you: a day-old number that has been verified is preferred to a second-old number that cannot be audited.

The other excuse is that the code is harder to maintain since you have to hard-code the ending time. But that is letting the tool limit your analytical ability. There are plenty of tools for which this is not a limitation.
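Here is a minimal sketch of an auditable accounting window, with made-up dates. Both endpoints are pinned down, so re-running the same code always returns the same count.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE events (ts TEXT)")
con.executemany("INSERT INTO events VALUES (?)",
                [("2015-01-05",), ("2015-01-20",), ("2015-02-02",)])

# Not auditable: the answer changes every time new events arrive.
open_ended = "SELECT COUNT(*) FROM events WHERE ts >= '2015-01-01'"

# Auditable: both the starting and ending times are explicit.
windowed = """SELECT COUNT(*) FROM events
              WHERE ts >= '2015-01-01' AND ts < '2015-02-01'"""

print(con.execute(windowed).fetchone()[0])  # January events only
```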

***

When I talk about numbersense, I am also talking about the habits of the analysts. Bad habits doom many analyses before takeoff.

Have you encountered these issues? Do you have your own list of bad habits? Let me know!

It seems like Seth Kugel's article in the New York Times about "Crunching the Numbers to find the Best Airfare" is quite popular. In this article, he said things like this:

The overall take on the best day to book tickets turns out to be somewhat underwhelming, if you look at the country as a whole. Hopper’s data shows it’s actually Thursday, but don’t expect that fact to save you much money. Reserve a domestic flight on Thursday and you’ll spend, on average, $10 less than if you reserve on Saturday, the worst day to book domestic flights. With international flights, you’ll save, on average, $25 over Sunday, the worst day to book flights abroad. (Those are “maximum averages” that assume you would have booked on the worst day and are now booking on the best.)

This is meaningless navel-gazing.

As I explained in my notes to my Kayak article on 538, talking about best or worst fares is meaningless unless one can describe a strategy with which the traveler could attain those fares. This strategy must work in real time, before it is known that a particular fare would be the best or the worst.

Without such a strategy, we are talking about paper gains and losses.

Analysts who follow Kugel's logic, though, rarely realize that they are talking about paper money. So, later in the article, Kugel said this:

For the vast majority of routes,... avoid booking on weekends and try midweek; for the average American flier, those savings will add up in the long run.

What savings? Those would appear to be the "maximum averages" defined above, the difference between the best and the worst days for given routes based on a lot of historical data. But there is no strategy to reliably attain the best fare; in fact, there is no strategy to reliably buy the worst-priced tickets either.

As I said before, if the goal is to gain provable savings, you need to write down how you are making the purchase decision today, then you need to define what your new strategy is--whether it is using Kayak, or using Google (which doesn't do predictions)--and then you should compare the two methods.

I saw Joe N.'s tweet asking me about a study of how professors spend their time, reported by Lisa Wade at Sociological Images. This is an anthropological study, something that I am not at all familiar with although the people in the field seem to believe that they can make statistically valid observations.

I'm glad the author of the study, John Ziker, wrote a (really) long article describing what he was trying to accomplish. The key point is that the study is a preliminary exploration, with important limitations; a follow-up study is planned which may give generalizable conclusions.

Here are some issues with the first study that make a statistician nervous:

- the sample was between 14 and 30 professors (tiny): Wade reported it as 16; Ziker definitely started with 30.

- the selection was non-random, based on the first 30 people who responded to a school-wide announcement

- about half the initial respondents did not complete the study, and provided only partial data (one to six days)

- despite the tiny sample, some analysis required slicing the data further into four segments by grade level! I wonder how many department chairs were in that sample. (See chart on right)

- each professor was followed for a two-week period but interviewed only every other day; thus each professor contributed at most one observation per day of the week

- the interviews were every other day "so the time taken for the interview did not appear on the previous day’s report." This is a horrible problem to deal with! Because time allocation is the subject of the study, the measurement method (in-depth interviewing) interferes with the measured outcome. It seems to me impossible to believe that the time spent answering questions every other day did not affect time allocation on the non-interview days.

- Ziker reasoned: "While we cannot make a claim that all faculty have the same work patterns as our initial subject pool — they do not comprise a random sample — the results are highly suggestive because of the consistency across our subjects who did represent." In order not to fall prey to the law of small numbers, a better way to say this is: we assume that the small sample is representative in both mean value and dispersion, which then leads to the assumption that all faculty have work patterns consistent with those observed.

- "With our initial 30 Homo academicus subjects, we ended up with a 166-day sample with each day of the week well represented." I am assuming that Ziker did not drop the 16 professors with partial data and made charts like the one on the right by ignoring the identity of the professor and aggregating over days of the week. Let's review what lies behind this chart. Each respondent contributed at most one observation per day of week; about half of the respondents did not even contribute data for all seven days. So the time allocation on any particular day is averaged over anywhere from 14 to 30 professors. These professors span a variety of ranks, departments, tenure, backgrounds, etc. and were not randomly selected. It's hard for me to trust this chart at all.

***

In general, I am a big fan of shoe-leather research, in which the researcher goes out and gathers the relevant data needed to address a specific research question, rather than picking up whatever data can be found and then tailoring the research question to avoid the imperfections in the data. So I don't want to sound too negative. It's a difficult research problem they are dealing with. What they learned from this first study is useful for informing future explorations, but drawing conclusions at this stage is premature.

At the end of his article, Ziker described the "experience sampling" method that will form the next phase of this study. I am very excited about this methodology.

Roughly speaking, they will ask participants to install a mobile app, which pops questions from time to time asking them what they are doing at that moment. Instead of exhaustively tracking a small number of participants over the course of time, they will get little bits of data, incomplete schedules, for a large number of professors. If the sample is big enough and randomized appropriately, they can analyze the data ignoring the professor identity, and report results for the "average professor". This method also retains the other benefit of the original design, which is that the respondents report their activities close to the time in which they occurred.

Data scientists, pay attention! You don't have to collect complete data at the user level to do proper research. Designs like this "experience sampling" approach produce statistically valid findings without complete data. In fact, trying to collect complete data can be counterproductive, leading to shaky conclusions as shown above.