Sampling: A Primer

Though it doesn’t get a lot of buzz, sampling is fundamental to any field of science. Marketing scientist Kevin Gray asks Dr. Stas Kolenikov, Senior Scientist at Abt Associates, what marketing researchers and data scientists most need to know about it.

Kevin Gray: Sampling theory and methods are part of any introductory statistics or marketing research course. However, few study it in depth except those majoring in statistics and a handful of other fields. Just as a review, can you give us a layperson’s definition of sampling and tell us what it’s used for?

Stas Kolenikov: Sampling is used when you cannot reach every member of your target population. It’s used in virtually all marketing research, as well as most social, behavioral and biomedical research. Research projects have limited budgets but, by sampling, you can obtain the information you need with maybe 200 or 1,000 or 20,000 people – just a fraction of the target population.

So, sampling is about producing affordable research, and good sampling is about being smart in selecting the group of people to interview or observe. It turns out that the best methods are random sampling methods, in which the survey participants are selected by a random procedure rather than chosen by the researcher.

The exceptions – where sampling is avoided – are censuses, where each person in a country needs to be counted, and Big Data problems that use the entire population in a database (though we must keep in mind that the behavior and composition of a population are rarely static).

Sampling didn’t just appear out of thin air. Can you give us a very brief history of sampling theory and methods?

Sampling, like many other statistical methods, originated out of necessity. In the 1930s and 1940s, the Indian Agricultural Service worked on improving methods to assess the acreage and total agricultural output for the country, and statisticians such as Prasanta Chandra Mahalanobis invented what is now known as random sampling. The Indian Agricultural Service switched from complete enumeration to sampling, which was 500 times cheaper while producing a more accurate figure. Random sampling came to the United States in the 1940s and is associated with names such as Morris Hansen, Harold Hotelling and W. Edwards Deming.

At about the same time, the 1936 U.S. Presidential election marked the infamous failure of the very skewed sample used in the Literary Digest magazine poll, and the rise to fame of George Gallup, a pollster who first attempted to make his sample closer to the population by employing quota sampling. Quota sampling today is considered inferior to random sampling, and Gallup later failed in the 1948 election (“Dewey Defeats Truman”).

What are the sampling methods marketing researchers and data scientists most need to understand?

There are four major components in a sample survey. These “Big Four” are: 1) potentially unequal probabilities of selection; 2) stratification; 3) clustering; and 4) post-survey weight adjustments.

Unequal probabilities of selection may arise by design. For example, you may want to oversample certain ethnic, religious or linguistic minorities. In some surveys, unequal probabilities of selection are unavoidable – for instance, in a phone survey, people who have both landlines and cell phones have higher probabilities of selection than those who have only a landline or only a cell phone. Unequal probabilities of selection are typically associated with a reduction in precision compared to surveys that use an equal probability of selection method (EPSEM). At the analysis stage, ignoring unequal probabilities of selection results in biased summaries of the data.
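
A minimal Python sketch of the usual correction, with made-up data: each respondent is weighted by the inverse of his or her probability of selection. The function names, the respondents and the probabilities are all hypothetical, and the precision loss is gauged with Kish's approximate design effect, which is one common rule of thumb rather than the only option.

```python
def inverse_probability_weights(selection_probs):
    """Base weight for each respondent: 1 / probability of selection."""
    return [1.0 / p for p in selection_probs]


def weighted_mean(values, weights):
    """Weighted mean: weighted total divided by the sum of the weights."""
    return sum(w * y for w, y in zip(weights, values)) / sum(weights)


def kish_design_effect(weights):
    """Kish's approximate design effect due to unequal weights."""
    n = len(weights)
    return n * sum(w * w for w in weights) / sum(weights) ** 2


# Hypothetical phone survey: 1 = owns the product, 0 = does not.
# The first four respondents have both a landline and a cell phone, so
# they are assumed to be twice as likely to be selected as the last two.
owns = [1, 1, 1, 0, 0, 0]
probs = [0.02, 0.02, 0.02, 0.02, 0.01, 0.01]

weights = inverse_probability_weights(probs)
unweighted = sum(owns) / len(owns)       # ignores the design
weighted = weighted_mean(owns, weights)  # accounts for the design
deff = kish_design_effect(weights)       # values above 1 mean lost precision
```

In this toy data the unweighted share of owners is 0.5, while the weighted share is 0.375, because the over-represented dual-phone respondents count for less. The design effect of about 1.13 suggests the six interviews carry roughly the information of five or so equally weighted ones.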

Cluster, or multi-stage, sampling involves taking samples of units that are larger than the ultimate observation units – e.g., hospitals and then patients within hospitals, or geographic areas and then housing units and individuals within them. Clustering increases standard errors, but is often unavoidable or more economical when a full list of units is unavailable or expensive to assemble, while the data for some hierarchy of units is relatively easy to come by.
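
How much clustering inflates standard errors is often approximated by the design effect 1 + (m - 1) * rho, where m is the average number of interviews per cluster and rho is the intraclass correlation of the outcome. The sketch below assumes equal-sized clusters, and its numbers (20 patients per hospital, rho of 0.05) are purely illustrative.

```python
import math


def cluster_design_effect(avg_cluster_size, icc):
    """Approximate design effect for equal-sized clusters: 1 + (m - 1) * rho."""
    return 1.0 + (avg_cluster_size - 1.0) * icc


def effective_sample_size(n, deff):
    """Size of the simple random sample that would give the same precision."""
    return n / deff


# Hypothetical design: 1,000 patients, 20 sampled per hospital, rho = 0.05.
deff = cluster_design_effect(avg_cluster_size=20, icc=0.05)
n_eff = effective_sample_size(1000, deff)
se_inflation = math.sqrt(deff)  # factor by which standard errors grow
```

Under these assumptions the design effect is 1.95, so the 1,000 clustered interviews are worth a little over 500 independent ones, and standard errors are about 40% larger than a simple random sample formula would report.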

Stratification involves breaking down your target population into segments, or strata, before sampling, and then taking independent samples within strata. It is typically used to provide some degree of balance and to avoid outlier samples. For instance, in a simple random sample of U.S. residents, by chance, you might wind up with only Vermont residents in your sample. This is very unlikely, but it could happen. By stratifying the sample by geography, and allocating the sample proportionally to state populations, the sample designer rules out these odd samples.

Stratification is also used when information in the sampling frame allows you to identify target subpopulations. Many U.S. surveys oversample minorities, such as African Americans or Hispanics, to obtain higher precision for these subgroups than would be achieved under an EPSEM design. While there is no list with race/ethnicity to sample from, these samples utilize the geographic concentration of these minorities, with higher sampling rates used in the areas with higher population densities of these minority populations. Stratification typically decreases standard errors, and the effect depends on how strong the correlation is between the stratification variable(s) and the outcome (survey questions) of interest.
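
Proportional allocation, as in the geography example, can be sketched in a few lines of Python. The frame, the region names and the function names below are hypothetical; the sketch allocates the total sample across strata in proportion to their sizes and then draws an independent simple random sample within each stratum.

```python
import random


def proportional_allocation(stratum_sizes, n):
    """Split a total sample size n across strata in proportion to their sizes."""
    total = sum(stratum_sizes.values())
    raw = {s: n * size / total for s, size in stratum_sizes.items()}
    alloc = {s: int(r) for s, r in raw.items()}
    # Largest-remainder rounding so that the allocations add up to exactly n.
    leftover = n - sum(alloc.values())
    by_remainder = sorted(raw, key=lambda s: raw[s] - alloc[s], reverse=True)
    for s in by_remainder[:leftover]:
        alloc[s] += 1
    return alloc


def stratified_sample(frame, n, seed=0):
    """frame maps each stratum name to a list of unit ids; returns a sample per stratum."""
    rng = random.Random(seed)
    sizes = {s: len(units) for s, units in frame.items()}
    alloc = proportional_allocation(sizes, n)
    # Independent simple random sample, without replacement, in every stratum.
    return {s: rng.sample(units, alloc[s]) for s, units in frame.items()}


# Hypothetical frame: three regions with 6,000, 3,000 and 1,000 units.
frame = {
    "North": list(range(0, 6000)),
    "South": list(range(6000, 9000)),
    "West": list(range(9000, 10000)),
}
sample = stratified_sample(frame, n=100)
sizes_drawn = {s: len(units) for s, units in sample.items()}
```

With this frame the sample always contains 60, 30 and 10 units from the three regions, so a draw made up of only one region cannot occur. Oversampling a subgroup would simply replace the proportional allocation with larger sample sizes in the strata where that subgroup is concentrated.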

Post-survey weight adjustments (including adjustments for nonresponse and noncoverage) are aimed at making the actual sample represent the target population more closely. For example, if a survey ended up with 60% females and 40% males while the population is split 50-50, the sample would be weighted so that the weighted summaries of respondents' attitudes reflect the true population figures more closely.
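
The 60/40 example can be worked through with a simple post-stratification adjustment, in which each respondent's weight is the group's population share divided by its sample share. This is only a sketch: the responses are invented, the function name is hypothetical, and real surveys usually adjust on several variables at once.

```python
from collections import Counter


def poststratification_weights(sample_groups, population_shares):
    """Weight = population share of the respondent's group / its sample share."""
    n = len(sample_groups)
    counts = Counter(sample_groups)
    return [population_shares[g] / (counts[g] / n) for g in sample_groups]


# Hypothetical sample of 10: 6 females and 4 males; 1 = favorable attitude.
groups = ["F"] * 6 + ["M"] * 4
favorable = [1, 1, 1, 0, 0, 0, 1, 0, 0, 0]
population = {"F": 0.5, "M": 0.5}  # known 50-50 split in the population

weights = poststratification_weights(groups, population)
unweighted = sum(favorable) / len(favorable)
weighted = sum(w * y for w, y in zip(weights, favorable)) / sum(weights)
```

Here females get a weight of about 0.83 and males a weight of 1.25. Because females are more favorable in this toy data (50% versus 25%), the unweighted figure of 0.40 overstates favorability, and the weighted figure comes out at 0.375.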

What are some common sampling mistakes you see made in marketing research and data science?

The most common mistake, I think, is to ignore the source of your data entirely. It would be unrealistic to use a sample of undergrads in your psychology class to represent all of humankind!

One other common mistake that I often see is that researchers ignore the complex survey data features and analyze the data as if they were simple random samples. Earlier, I outlined the impact the Big Four components have on point estimates and standard errors, and in most re-analyses I have done or seen, the conclusions drawn, and actions taken by the survey stakeholders based on these conclusions, are drastically different if we mistakenly assume simple random sampling.

In the past 10 years or so, survey methodologists have solidified their thinking about sources of errors and developed the total survey error (TSE) framework. I would encourage marketing researchers to familiarize themselves with the main concepts of TSE and start applying them in their work as well.

Has big data had an impact on sampling in any way?

Survey projects can often employ a much greater variety of data sources to draw the samples from. Some projects utilize satellite imagery, or even drone overflights, to create maps of the places from which the samples will be drawn and field interviewers deployed, in order to optimize workload.

On the other hand, whether or not a particular big data project would benefit from sampling often depends on the type of analytical questions asked. Deep Learning typically requires as large a sample as you can get. Critical threat detection must process every incoming record. However, many other big data projects that are primarily descriptive in nature may benefit from sampling. I have seen a number of projects where big data were used only to determine the direction of change, though a small sample may have sufficed.
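
A rough sense of how small such a sample can be comes from the textbook sample size formula for estimating a proportion. The sketch below assumes a simple random sample from a very large population, a 95% confidence level (z = 1.96) and the worst case p = 0.5; the function name is hypothetical.

```python
import math


def srs_sample_size(margin, p=0.5, z=1.96):
    """Records needed to estimate a proportion within +/- margin under
    simple random sampling, ignoring the finite population correction."""
    return math.ceil(z * z * p * (1 - p) / margin ** 2)


n_five_points = srs_sample_size(0.05)  # margin of error of 5 percentage points
n_one_point = srs_sample_size(0.01)    # margin of error of 1 percentage point
```

A margin of five percentage points needs 385 records, and a margin of one point needs roughly 9,600, whether the underlying table holds a million rows or a billion.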

What about social media research – are there particular sampling issues researchers need to think about?

Social media is a strange world, at least as far as trying to do research goes. Twitter is full of bots and business accounts. Some people have multiple accounts, and may behave differently on them, while other people may only post sporadically. One needs to distinguish the population of tweets, the population of accounts, its subpopulation of accounts that are active, and the population of humans behind these accounts. It is the last of these we researchers are usually after when we talk about sentiment or favorability. Still, this last population is not the population of consumers since many people don’t have social media accounts, and their attitudes and sentiments cannot be analyzed via social media.

Are there recent developments in sampling theory and practice that marketing researchers and data scientists should know about?

The mathematics of sampling are still the same as outlined by Mahalanobis, Hansen and others in the 1940s. While theoretical developments continue to sprout, most of the new developments seem to be on the technological side and on the database side. For instance, we now have effective ways to determine whether a given cell number is active before we start dialing it. Newly developed commercial databases allow us to get a list of valid mailing addresses in a given area, and then find out more about the people who live at these addresses based on their commercial or social network activity. Sampling statisticians need to know way more than just the mathematical aspects of sampling these days, and need to understand how to interact with the data sources that they will draw their samples from.

Stas Kolenikov, Senior Scientist at Abt Associates, received his PhD in Statistics from the University of North Carolina at Chapel Hill. His primary responsibilities in the company involve sampling designs, weighting of the data, and other statistical tasks related to surveys. He has taught statistics at the University of Missouri and at conferences sponsored by the American Association for Public Opinion Research (AAPOR), the Statistical Society of Canada, the American Statistical Association, and StataCorp.

This article was first published in Greenbook in July, 2017. The background photo is of Prasanta Chandra Mahalanobis, a pioneer in sampling and one of the most prominent statisticians of the 20th century.