The debate over how and when to test for statistical significance comes up in nearly every engagement. Why wouldn’t we just test everything?

-M.O. in Chicago

Hi M.O.-

You’re not alone. Many clients want all sorts of things stat tested. Some things can be tested while others can’t. But for what can be tested, as market researchers we need to be mindful of two potential errors in hypothesis testing. A Type I error occurs when we reject a true null hypothesis. For example, if we conclude that Coke tastes better than Pepsi when in fact there is no real difference, we’ve committed a Type I error.

A Type II error occurs when we accept the null hypothesis when in fact it is false. Think of an inspector who concludes a part is safe to install, and then the plane crashes. We choose the probability of committing a Type I error when we choose alpha (say .05). The probability of a Type II error (beta) depends on the power of the test, since power = 1 - beta. We seldom take this side of the equation into account, for good reason: most decisions we make in market research don’t come with a huge price tag if we’re wrong. Hardly anyone dies if the results of the study are wrong. The goal in any research is to minimize both types of errors. The best way to do that is to use a larger sample.
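The alpha/power/sample-size trade-off can be sketched with the standard normal-approximation formula for comparing two means. A minimal sketch; the 0.5 effect size below is an illustrative assumption, not a value from the column:

```python
from math import ceil
from statistics import NormalDist

def n_per_group(effect_size, alpha=0.05, power=0.80):
    """Approximate sample size per group for a two-sided two-sample test,
    via the normal approximation: n = 2 * ((z_alpha/2 + z_beta) / d)^2."""
    z = NormalDist()
    z_alpha = z.inv_cdf(1 - alpha / 2)  # controls the Type I error rate
    z_beta = z.inv_cdf(power)           # controls the Type II error rate
    return ceil(2 * ((z_alpha + z_beta) / effect_size) ** 2)

# A medium effect (d = 0.5) at alpha = .05 and 80% power needs ~63 per group;
# demanding 90% power (a smaller Type II risk) pushes the sample higher.
print(n_per_group(0.5, 0.05, 0.80))  # -> 63
print(n_per_group(0.5, 0.05, 0.90))  # -> 85
```

Notice the asymmetry the column describes: alpha is simply chosen, while driving down the Type II error rate costs sample.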

This conundrum perfectly illustrates my “Life is a conjoint” mantra. While testing we’re always trading off between the accuracy of the results with the cost of executing a study with a larger sample. Further, we also tend to violate the true nature of hypothesis testing. More often than not, we don’t formally state a hypothesis. Rather, we statistically test everything and then report the statistical differences.

Consider this: when we compare two scores, we accept a 5% chance of flagging a statistical difference purely by chance (alpha = .05). This could be the difference in concept acceptance between males and females.

In fact, that’s not really what we do: we perform hundreds of tests in almost every study. Let’s say we have five segments and we want to test them for differences in concept acceptance. That’s 10 pairwise t-tests. Now we have about a 40% chance (1 - .95^10) of flagging at least one difference simply due to chance. And that’s in every row of our tables. The better approach would be to run an analysis of variance on the table to determine whether any cell might be different, then build hypotheses and test them one at a time. But we don’t do this because it takes too much time. I realize I’m not going to change the way our industry does things (I’ve been trying for years), but maybe, just maybe, you’ll pause for a moment when looking at your tables to decide whether this “statistical” significance is really worth reporting: are the results valid and are they useful?
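The inflation from running many tests is easy to check directly. A minimal sketch, assuming the tests are independent (table tests rarely are, exactly):

```python
from math import comb

def familywise_error(alpha, n_tests):
    """Chance of at least one false positive across n independent tests."""
    return 1 - (1 - alpha) ** n_tests

n_tests = comb(5, 2)  # 5 segments -> 10 pairwise t-tests
print(round(familywise_error(0.05, n_tests), 2))  # -> 0.4
```

At 100 tests per table, the same formula says a chance finding is all but guaranteed.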

We want to assess the importance of fixing some of our customer touchpoints. What would you recommend as a modeling tool?

-Alicia

Hi Alicia,

There are a variety of tools we use to determine the relative importance of key variables on an outcome (dependent variable). Here’s the first question we need to address: are we trying to predict the actual value of the dependent variable or just assess the importance of any given independent variable in the equation? Most of the time, the goal is the latter.

Once we know the primary objective, there are three key criteria we need to address. The first is the amount of multicollinearity in our data. The more independent variables we have, the bigger problem this presents. The second is the stability in the model over time. In tracking studies, we want to believe that the differences between waves are due to actual differences in the market and not artifacts of the algorithm used to compute the importance scores. Finally, we need to understand the impact of sample size on the models.

How big a sample do you need? Typically, in consumer research, we see results stabilize with n=200. Some tools will do a better job with smaller samples than others. You should also consider the number of parameters you are trying to model. A grad school rule of thumb is that you need 4 observations for each parameter in the model, so if you have 25 independent variables, you’d need at least 100 respondents in your sample.
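The rule of thumb above can be wrapped in a quick check (a sketch; the 4-observations-per-parameter and n = 200 figures are the ones quoted above, not universal constants):

```python
def sample_check(n, n_params, obs_per_param=4, stable_n=200):
    """True if a planned sample clears both the grad-school rule of thumb
    (4 observations per parameter) and the n = 200 stabilization point."""
    return n >= max(n_params * obs_per_param, stable_n)

print(sample_check(200, 25))  # 25 parameters need >= 100, and 200 >= 200 -> True
print(sample_check(250, 70))  # 70 parameters need >= 280 -> False
```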

There are several tools to consider using to estimate relative importance: Bivariate Correlations, OLS, Shapley Value Regression (or Kruskal’s Relative Importance), TreeNet, and Bayesian Networks are all options. All of these tools will let you understand the relative importance of the independent variables in predicting your key measure. One thing to note is that none of these tools specifically models causation. You would need some sort of experimental design to address that issue. Let’s break down the advantages and disadvantages of each.

Bivariate Correlations (measures the strength of the relationship between two variables)

Advantages: Works with small samples. Relatively stable wave to wave. Easy to execute. Unaffected by multicollinearity (each correlation is computed on its own).

Disadvantages: Only estimates the impact of one attribute at a time. Ignores any possible interactions. Doesn’t provide an “importance” score, but a “strength of relationship” value. Assumes a linear relationship among the attributes.

OLS Regression (ordinary least squares, a linear predictive model)

Advantages: Easy to execute. Provides an equation to predict the change in the dependent variable based on changes in the independent variables (predictive analytics).

Disadvantages: Highly susceptible to multicollinearity, causing changes in key drivers in tracking studies. If the goal is a predictive model, this isn’t a serious problem. If your goal is to prioritize areas of improvement, this is a challenge. Assumes a linear relationship among the attributes.

Shapley Value Regression or Kruskal’s Relative Importance

These are a couple of approaches that consider all possible combinations of explanatory variables. Unlike traditional regression tools, these techniques are not used for forecasting. In OLS, we predict the change in overall satisfaction for any given change in the independent variables. These tools are used to determine how much better the model is if we include any specific independent variable versus models that do not include that measure. The conclusions we draw from these models refer to the usefulness of including any measure in the model and not its specific impact on improving measures like overall satisfaction.

Advantages: Works with smaller samples. Does a better job of dealing with multicollinearity. Very stable in predicting the impact of attributes between waves.

Disadvantages: Ignores interactions. Assumes a linear relationship among the attributes.

TreeNet (a tree-based data mining tool)

Advantages: Does a better job of dealing with multicollinearity than most linear models. Very stable in predicting the impact of attributes between waves. Can identify interactions. Does not assume a linear relationship among the attributes.

Bayesian Networks (a probabilistic graphical model)

Advantages: Does a better job of dealing with multicollinearity than most linear models. Very stable in predicting the impact of attributes between waves. Can identify interactions. Does not assume a linear relationship among the attributes. Works with smaller samples. While a typical Bayes Net does not provide a system of equations, it is possible to simulate changes in the dependent variable based on changes to the independent variables.

Disadvantages: Can be more time-consuming and difficult to execute than the others listed here.

Dr. Jay Weiner is CMB’s senior methodologist and VP of Advanced Analytics. Jay earned his Ph.D. in Marketing/Research from the University of Texas at Arlington and regularly publishes and presents on topics including conjoint, choice, and pricing.

I’m interested in testing a large number of features for inclusion in the next version of my product. My team is suggesting that we need to cull the list down to a smaller set of items to run a choice model. Are there ways to test a large set of attributes in a choice model?

-Nick

Hi Nick –

There are a number of ways to test a large set of attributes in choice modeling. Most of the time, when we test a large number of features, many are simply binary attributes (included/not included). While this makes the experimental design larger, it’s not quite as bad as having ten six-level attributes. If the description is short enough, you might go ahead and just include all of them. If you’re concerned about how much reading a respondent will need to do—or you really wouldn’t offer a respondent 12 additional perks for choosing your credit card—you could put a cap on the number of additional features any specific offer includes. For example, you could test 15 new features in a single model, but respondents would only get up to 5 at any single time. This is actually better than using a partial profile design as all respondents would see all offers.
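The capping idea can be sketched as a profile generator that keeps all 15 binary features in the study but never shows more than 5 at once. A toy illustration; a real study would use a balanced, efficient experimental design rather than simple random draws:

```python
import random

def capped_profile(n_features=15, cap=5, rng=None):
    """Draw one concept profile: a random subset of binary features,
    never more than `cap` switched on at once."""
    rng = rng or random.Random()
    k = rng.randint(0, cap)                     # how many features this offer carries
    on = set(rng.sample(range(n_features), k))  # which specific features
    return [1 if i in on else 0 for i in range(n_features)]

rng = random.Random(7)
profiles = [capped_profile(rng=rng) for _ in range(200)]
# All 15 features stay in the design, but no single offer shows more than 5.
print(all(sum(p) <= 5 for p in profiles))  # -> True
```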

Another option is to do some sort of bridging study where you test all of the features using a max diff task. You can include a subset of the factors in a DCM and then use the max diff utilities to compute the utility for the full list of features in the DCM. This allows you to include the full set of features in your simulation tool.

The path to brand loyalty is often like the path to wedded bliss. You begin by evaluating tangible attributes to determine if the brand is the best fit for you. After repeated purchase occasions, you form an emotional bond to the brand that goes beyond those tangible attributes. As researchers, when we ask folks why they purchase a brand, they often reflect on performance attributes and mention those as drivers of purchase. But, to really understand the emotional bond, we need to ask how you feel when you interact with the brand.

We recently developed a way to measure this emotional bond (Net Positive Emotion Score - NPES). By asking folks how they felt on their most recent interaction, we’re able to determine respondents’ emotional bond with products. Typical regression tools indicate that the emotional attributes are about as predictive of future behavior as the functional benefits of the product. This leads us to believe that at some point in your pattern of consumption, you become bonded to the product and begin to act on emotion—rather than rational thoughts. Of course, that doesn’t mean you can’t rate the performance dimensions of the products you buy.

Loyalty is a behavior, and behaviors are often driven by underlying attitudinal measures. You might continue to purchase the same product over and over for a variety of reasons. In a perfect world, you not only create a behavioral commitment, but also an emotional bond with the brand and, ultimately, the company. Typically, we measure this path by looking at the various stages you go through when purchasing products. This path begins with awareness, evolves through familiarity and consideration, and ultimately ends with purchase. Once you’ve purchased a product, you begin to evaluate how well it delivers on the brand promise. At some point, the hope is that you become an advocate for the brand since advocacy is the pinnacle of the brand purchase hierarchy.

As part of our Consumer Pulse program, we used our EMPACT℠: Emotional Impact Analysis tool to measure consumers’ emotional bond (NPES) with 30 brands across 6 categories. How does this measure relate to other key metrics? On average, Net Promoters score almost 70 points higher on the NPES scale versus Net Detractors. We see similar increases in likelihood to continue (or try), proud to use, willingness to pay more, and “I love this brand.”

What does this mean? It means that measuring the emotional bond your customers have with your brand can provide key insights into the strength of that brand. Not only do you need to win on the performance attributes, but you also need to forge a deep bond with your buyers. That is a better way to brand loyalty, and it should positively influence your bottom line. You have to win their hearts—not just their minds.

Dr. Jay Weiner is CMB’s senior methodologist and VP of Advanced Analytics. He has a strong emotional bond with his wife of 25 years and several furry critters who let him sleep in their bed.

The city of Boston is trying to develop one key measure to help officials track and report how well the city is doing. We’d like to do that in-house. How would we go about it?

-Olivia

Hi Olivia,

This is the perfect tie-in for big data and the key performance indicator (KPI). Senior management doesn’t really have time to pore over tables of numbers to see how things are going. What they want is a nice barometer that can be used to summarize overall performance. So, how might one take data from each business unit and aggregate them into a composite score?

We begin the process by understanding all the measures we have. Once we have assembled all of the potential inputs to our key measure, we need to develop a weighting system to aggregate them into one measure. This is often the challenge when working with internal data. We need some key business metric to use as the dependent variable, and these data are often missing in the database.

For example, I might have sales by product by customer and maybe even total revenue. Companies often assume that the top revenue clients are the bread and butter for the company. But what if your number one account uses way more corporate resources than any other account? If you’re one of the lucky service companies, you probably charge hours to specific accounts and can easily determine the total cost of servicing each client. If you sell a tangible product, that may be more challenging. Instead of sales by product or total revenue, your business decision metric should be the total cost of doing business with the client or the net profit for each client. It’s unlikely that you capture this data, so let’s figure out how to compute it. Gross profit is easy (net sales – cost of goods sold), but what about other costs like sales calls, customer service calls, and product returns? Look at other internal databases and pull information on how many times your sales reps visited in person or called over the phone, and get an average cost for each of these activities. Then, you can subtract those costs from the gross profit number. Okay, that was an easy one.
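That cost-to-serve arithmetic can be sketched directly (the per-visit, per-call, and per-return costs below are illustrative assumptions, not benchmarks):

```python
def client_net_profit(net_sales, cogs, sales_visits, service_calls, returns,
                      visit_cost=250.0, call_cost=15.0, return_cost=40.0):
    """Gross profit minus an estimate of cost-to-serve.

    The per-activity costs are made-up defaults; pull real averages
    from your own CRM and support systems.
    """
    gross_profit = net_sales - cogs                    # net sales - COGS
    cost_to_serve = (sales_visits * visit_cost
                     + service_calls * call_cost
                     + returns * return_cost)
    return gross_profit - cost_to_serve

# A top-revenue client vs. a smaller, low-maintenance one.
print(client_net_profit(500_000, 350_000, sales_visits=120,
                        service_calls=900, returns=50))  # -> 104500.0
print(client_net_profit(200_000, 140_000, sales_visits=10,
                        service_calls=40, returns=2))    # -> 56820.0
```

The gap between gross profit and net profit is exactly the point: the biggest account is not automatically the most profitable one.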

Let’s look at the city of Boston case for a little more challenging exercise. What types of information is the city using? According to the article you referenced, the city hopes to “corral their data on issues like crime, housing for veterans and Wi-Fi availability and turn them into a single numerical score intended to reflect the city’s overall performance.” So, how do you do that? Let’s consider that some of these things have both income and expense implications. For example, as crime rates go up, the attractiveness of the city drops and it loses residents (income and property tax revenues drop). Adding to the lost revenue, the city has the added cost of providing public safety services. If you add up the net gains/losses from each measure, you would have a possible weighting matrix to aggregate all of the measures into a single score. This allows the mayor to quickly assess changes in how well the city is doing on an ongoing basis. The weights can be used by the resource planners to assess where future investments will offer the greatest payback.
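The weighting step can be sketched as a simple weighted average (the metrics and dollar-based weights below are hypothetical):

```python
def composite_score(metrics, weights):
    """Aggregate unit-level metrics into one KPI via a weighted average.

    The weights stand in for the net dollar gain/loss each measure implies;
    in practice they would come from the revenue/cost analysis described above.
    """
    total = sum(weights.values())
    return sum(weights[m] * v for m, v in metrics.items()) / total

# Hypothetical 0-100 metrics (higher is better) and dollar-based weights:
# crime swings the city's finances most, so it carries the largest weight.
metrics = {"crime": 62, "veteran_housing": 80, "wifi": 71}
weights = {"crime": 5.0, "veteran_housing": 2.0, "wifi": 1.0}
print(composite_score(metrics, weights))  # -> 67.625
```

Tracking this one number wave over wave, and re-running it with proposed investments plugged in, is what makes it useful for both the mayor and the resource planners.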

Dr. Jay is fascinated by all things data. Your data, our data, he doesn’t care what the source. The more data, the happier he is.