Introduction: The purpose of this article is to provide a general understanding of the concepts of sampling as applied to health-related research. Sample Size Estimation: It is important to select a representative sample in quantitative research in order to be able to generalize the results to the target population. The sample should be of the required size and must be selected using an appropriate probability sampling technique. There are many hidden biases which can adversely affect the outcome of the study. Important factors to consider when estimating the sample size include the size of the study population, the confidence level, the expected proportion of the outcome variable (for categorical variables) or the standard deviation of the outcome variable (for numerical variables), and the required precision (margin of accuracy) of the study. The greater the precision required, the larger the required sample size. Sampling Techniques: The probability sampling techniques applied in health-related research include simple random sampling, systematic random sampling, stratified random sampling, cluster sampling, and multistage sampling. These are preferred over the nonprobability sampling techniques because they allow the results of the study to be generalized to the target population.

Selecting a sample that is representative of the general population is an important part of quantitative research. One of the major reasons that articles are rejected by good quality peer-reviewed journals is a nonrepresentative sample or an inadequate sample size. [1] The results of a poorly selected sample that is different from the target population cannot be applied to the general population. [2] A smaller than required sample size may not have sufficient power to identify significant differences or associations that may be present in the target population. The purpose of this article is to provide an overview of the importance of sampling in health research and to provide the readers with useful tips and resources for selecting a representative sample.

The first step in understanding the sampling process is to be familiar with the terminology [Table 1]. A sample is a subset of the total population that is of interest for the study topic. This "total" population is called the target population, to which the results of the study can be generalized. [3] For example, the outcomes of a study based on patients admitted in a tertiary care hospital in a major city cannot be generalized to patients presenting with the same condition to other types of health care facilities in smaller towns. The sample itself is selected from a section of the target population that is accessible to the researcher, which is called the study population. [4] The study population may be as simple as a list of patients admitted with a certain disease, or it may be as obscure as all patients visiting any health care facility with different signs and symptoms.

Even if the entire available study population is selected, it is important to keep in mind that the study population may still be inherently different from the target population. [5] Patients who present with a specific disease to a tertiary health care facility are not representative of all the patients with that disease in that area. The difference may be in the severity of the disease or even in the demographics of the patients, depending on the type of health care facility, for example, a Ministry of Health hospital, a military hospital, or a private hospital. Patients who present to a specific health care center may be different from those who go to another health center or another health care provider. Hence, it is not advisable to generalize the results from a single hospital-based study to the whole city, let alone the entire country. [6],[7] Another important point to consider is that people who agree to participate in the study may be different from the nonresponders. In general, responders tend to be more health conscious and more literate, or they may be more likely to have a chronic condition as compared to an acute exacerbation of the disease. Hence, the results of the study may differ from the outcome in the general population in either a positive or a negative direction, depending on how the responders differ from the nonresponders. [8]

It is important to show that the selected sample is representative of the study population and the target population with regard to the demographic and other relevant characteristics that may affect the outcome of the study. [9] For example, in a study comparing the outcomes of diabetic patients managed by an endocrinologist with those managed by family physicians, it is important to confirm that the two groups have similar demographic and socioeconomic characteristics with regard to age, gender, income, and education, since all of these are related to the outcome. [10] It is also important to consider the severity and duration of disease, since patients presenting to the endocrinologist may be more likely to already have complications due to diabetes mellitus. It is recommended to obtain the relevant demographic and background information from the responders to demonstrate that they are representative of the target/study population. [10] Additional information that can be easily obtained, such as area of residence, body mass index (BMI), and smoking status, should also be collected about the nonresponders and the cases lost to follow-up. This will be useful to demonstrate that the responders are similar to the nonresponders with regard to these background variables. [11]

Sample size estimation

A sample must be of the required size in order to have the required degree of accuracy in the results as well as to be able to identify any significant difference/association that may be present in the study population. [12] Determining the minimum required sample size for achieving the main objectives of the study is of prime importance for all studies but is generally neglected by most novice researchers. A common practice is to select all the cases that are available (consecutive sampling) in a given period of time or to select a sample size based on a previous study. [13] Another practice is to select a sample of 50 or 100 patients depending upon the time and resources available. [14] While the above assumptions may be adequate in some cases, they are generally not appropriate, especially for studies which require the comparison of two or more groups with respect to one or more outcomes of interest.

The factors that need to be considered when determining the required sample size include the size of the study population (from which the sample is to be selected), the confidence level (generally set at 95%), the expected prevalence or variance of the main outcome variable that is being studied, and the margin of error/accuracy that is acceptable for the study. [12],[15] In studies comparing two or more groups, the power of the study is generally set at 80%, and additional information regarding the expected difference between the groups will also be required. [16] Nowadays, it is not necessary to look up difficult formulae and go through complicated calculations in order to determine the required sample size. There are a number of free online tools and easily accessible websites, such as Open-Epi, [17] RaoSoft, [18] and Pi-face, [19] which can estimate the required sample size for a number of combinations of the estimated parameters for the study population.

The researcher does need to do some preparation before estimating the required sample size. The simplest scenario is a single-sample study, where the prevalence of a specific variable is required in the study population, e.g., the prevalence of diabetes mellitus or its complications. The additional information needed to determine the required sample size includes the estimated size of the study population (if very large, then use 20,000), the expected prevalence of the main variable (if unknown, then use 50%), and the required margin of accuracy (generally set at 10% or 5%). [20] The margin of accuracy describes how close the sample result should be to the true population value: the more precise the required result, the greater the sample size required. Generally, for an expected prevalence of around 50% for the outcome variable and a 95% confidence level, a margin of accuracy of ±10% requires a sample size of around 100, which increases to around 400 for an accuracy of ±5% and around 10,000 for a ±1% margin of accuracy.
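The figures quoted above can be checked with a short calculation. The following is a minimal sketch, assuming the standard normal-approximation formula for a single proportion with an optional finite population correction; the function name and its parameters are illustrative and are not taken from any of the online calculators cited in this article, which may round or adjust slightly differently.

```python
import math
from statistics import NormalDist


def sample_size_proportion(p=0.5, margin=0.05, confidence=0.95, population=None):
    """Minimum sample size for estimating a single proportion (prevalence).

    p          -- expected prevalence (0.5 is the most conservative choice)
    margin     -- required margin of accuracy, e.g. 0.05 for +/-5%
    confidence -- confidence level, e.g. 0.95
    population -- size of the study population; None means "very large"
    """
    # Two-sided critical value of the standard normal distribution
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    # Sample size for a very large population
    n = z ** 2 * p * (1 - p) / margin ** 2
    # Finite population correction (assumed form; reduces n for small populations)
    if population is not None:
        n = n / (1 + (n - 1) / population)
    return math.ceil(n)


for margin in (0.10, 0.05, 0.01):
    print(margin, sample_size_proportion(p=0.5, margin=margin))
```

Under these assumptions, the three margins give 97, 385, and 9,604 respectively, in line with the approximate values of 100, 400, and 10,000 mentioned above. Setting population=20000 with a ±5% margin reduces the requirement only slightly, to 377.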

When determining the sample size for estimating the mean value of a numerical variable (e.g., BMI, cholesterol level, etc.), the additional information required is the expected variance of that variable in the target population. [21] This information can be obtained from a literature review of similar studies in the form of the standard deviation (SD) of the variable. The higher the SD, the greater will be the required sample size. In case the SD is not known for the target population, it can be estimated by taking the difference between the estimated "highest" and "lowest" values in the population and dividing it by four (±2 SD on either side of the mean for the "normal" distribution). [22] For example, the BMI for a group of diabetics is expected to have a high value of 48 and a low value of 16 kg/m². Hence, the "normal" range is 48 - 16 = 32, which gives an estimated SD of 8 (32 divided by 4). The other information required for determining the sample size is the accuracy of the estimated mean, that is, how close it should be to the actual population mean. [23] In the above example, the accuracy for BMI can be set at ±1, ±2, or ±4 kg/m². The general rule is that the more precise the required accuracy, the greater the required sample size. [24] A summary of the information required for estimating the sample size is given in [Table 2].
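The BMI example can be worked through in a few lines. This is a minimal sketch, assuming the usual normal-approximation formula for the sample size of a mean, n = (z × SD / accuracy)², at a 95% confidence level; the function names are illustrative only.

```python
import math
from statistics import NormalDist


def sd_from_range(highest, lowest):
    """Rough SD estimate: the expected range divided by four (+/-2 SD)."""
    return (highest - lowest) / 4


def sample_size_mean(sd, accuracy, confidence=0.95):
    """Minimum sample size for estimating a mean to within +/-accuracy."""
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    return math.ceil((z * sd / accuracy) ** 2)


sd = sd_from_range(48, 16)  # 8.0 kg/m^2, as in the BMI example
for accuracy in (1, 2, 4):
    print(accuracy, sample_size_mean(sd, accuracy))
```

With an SD of 8 kg/m², the required sample size falls from 246 for an accuracy of ±1 kg/m² to 62 for ±2 kg/m² and 16 for ±4 kg/m², which illustrates how quickly the requirement grows as the accuracy is tightened.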

The requirements for determining the sample size for comparing two (or more) groups become more complex, since estimates are needed of the expected prevalence in each of the groups (for categorical variables) or of the expected difference of means (for numerical variables). But the basic rule is the same - the greater the variability of the variable under study or the more precise the required accuracy, the greater is the required sample size. [12] Table 3 shows the estimated sample sizes for a categorical variable (hypertension) and a numerical variable (systolic blood pressure) for comparing these variables between smokers and nonsmokers at different levels of accuracy. It is up to the researcher to select the required criteria according to the study objectives and the available resources. It should be kept in mind that these are all based on estimates, and if the sample results are found to have more variability than was assumed in the estimation, then the study may not have sufficient power to detect a statistically significant difference. If there is provision for doing a pilot study, then the prevalence or SD can be estimated from a smaller sample from within the study population, so that the required sample size can be determined more accurately. [25]
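For readers who wish to see how the two-group requirements combine, the following is a minimal sketch, assuming the common normal-approximation formulas for comparing two proportions and two means with equal group sizes, a two-sided 5% significance level, and 80% power. The input values are hypothetical and are not taken from Table 3; online calculators may apply continuity corrections and give somewhat larger figures.

```python
import math
from statistics import NormalDist


def _z_total(alpha=0.05, power=0.80):
    """Sum of the critical values for two-sided significance and power."""
    nd = NormalDist()
    return nd.inv_cdf(1 - alpha / 2) + nd.inv_cdf(power)


def n_per_group_proportions(p1, p2, alpha=0.05, power=0.80):
    """Sample size per group for detecting a difference between two proportions."""
    variance = p1 * (1 - p1) + p2 * (1 - p2)
    return math.ceil(_z_total(alpha, power) ** 2 * variance / (p1 - p2) ** 2)


def n_per_group_means(sd, difference, alpha=0.05, power=0.80):
    """Sample size per group for detecting a difference between two means,
    assuming the same SD in both groups."""
    return math.ceil(2 * (_z_total(alpha, power) * sd / difference) ** 2)


# Hypothetical inputs: 30% vs. 20% prevalence of hypertension,
# and a 10 mmHg difference in systolic blood pressure with an SD of 20 mmHg.
print(n_per_group_proportions(0.30, 0.20))
print(n_per_group_means(sd=20, difference=10))
```

Under these assumptions, about 291 subjects per group are needed for the comparison of proportions and about 63 per group for the comparison of means. Halving the expected difference roughly quadruples the requirement in both cases.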

The other important issue related to sampling is selecting the required sample in such a manner that it is representative of the study population. [7] It is a common pitfall to opt for the easier option of convenience sampling, where "all" the available persons in the study population are selected for the study until the required sample size is reached. This is nonprobability sampling, where the sample is less likely to be representative of the study population due to inherent biases in the sampling process. [13] Other forms of nonprobability sampling include purposive sampling, quota sampling, and snowball sampling, where the sample is selected according to some predetermined criteria. These types of sampling techniques are more appropriate for small-scale studies which are not meant to be generalized to a larger population. [13]

The more relevant sampling technique is called "probability sampling" or "random sampling." [26] It is important to note here that the word "random" as used in this context is different from its everyday usage. It is misleading to state that the sample was chosen at random from all the patients coming to the outpatient clinic. In order to be classified as random or probability sampling, every person in the study population must have an equal or known probability of being included in the sample. [7] It is quite common to overlook hidden biases in the sampling process which adversely affect the outcome of the study. For example, consider a study to determine the satisfaction of patients coming to a health care center, in which the decision was to sample every third patient coming out of the center. This appears to be "unbiased" if every third person is selected accordingly. But one hidden factor is related to the outcome of the study, that is, satisfaction with the care provided. A person who is not satisfied with the health care provided would be unlikely to return to the center or would come only once a month, while a person who is satisfied would return more frequently, maybe 2-3 times a month. Hence, it is quite likely that the satisfaction survey shows a more positive result than the actual perception. [27] One way to account for this hidden bias is to interview only "new" patients who are visiting the clinic for the 1st time, or it may be sufficient to just ask the respondent how many times s/he has visited the clinic in the last month or year. [28] A similar bias may be associated with random digit dialing for a phone survey. The computer dials numbers randomly, so it may appear that there is no bias in the sample selection. In fact, there is still a hidden bias: people who have two phone numbers (or dual-SIM phones) are twice as likely to be selected as the majority of people who have only one number. [29] People with more than one phone are more likely to have a higher income, so this may bias any study which asks about their perceptions of health care insurance or even about choosing between prepaid and postpaid mobile phone services. This type of bias can be controlled for by recognizing it at the planning stage of the survey and including a question such as "How many phone numbers do you have?" in the survey. The answer can be used to appropriately weight the responses of such respondents at the final analysis stage. [29]
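The weighting step for the phone survey can be illustrated with a small calculation. This is a minimal sketch, assuming that a respondent's chance of selection is proportional to the number of phone numbers s/he has, so that each response is weighted by the inverse of that number; the data and the function name are hypothetical.

```python
def weighted_proportion(responses):
    """Proportion answering "yes", weighting each respondent by
    1 / (number of phone numbers) to offset unequal selection chances.

    responses -- list of (answered_yes, number_of_phone_numbers) pairs
    """
    weights = [1 / phones for _, phones in responses]
    yes_weight = sum(w for (yes, _), w in zip(responses, weights) if yes)
    return yes_weight / sum(weights)


# Hypothetical answers: two respondents with two numbers each, two with one.
survey = [(True, 2), (True, 2), (False, 1), (True, 1)]
unweighted = sum(yes for yes, _ in survey) / len(survey)
print(round(unweighted, 3), round(weighted_proportion(survey), 3))
```

In this made-up example the unweighted proportion is 0.75, while the weighted proportion is about 0.667, because the two respondents with two phone numbers each count for half as much. The same idea could be applied to the clinic example by weighting each patient by the inverse of the reported number of visits.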

The types of probability sampling methods include simple random sampling, systematic random sampling, and stratified random sampling [Table 4] - these three methods are more relevant when a sample frame (list of the people in the study population) is available. [7] Simple random sampling is as simple as picking chits (names or numbers written on pieces of paper) from a box for a small study population of up to 30-50 people. For a larger study population, a computer-generated random number table can be used to select the respondents, e.g., the nth person coming out of a clinic or the nth person in each household. [7] Systematic random sampling is applicable when the study population is relatively large (100 or more) and a list is available of all the members, e.g., employees in a hospital, medical students in a class, or even beds in a hospital. The total number of subjects in the list is divided by the required sample size to obtain the "skip number," e.g., to select 25 out of a list of 200, the skip number will be 8 (200/25 = 8), so every 8th person on the list is selected. The next step is to choose a number randomly between 1 and 8, which will be the first person selected, and then systematically select every 8th person from the list till the end of the list is reached, e.g., 3, 11, 19, 27, …, 195. It is important to remember that the first person should be chosen randomly - arbitrarily selecting the 1st person or the 8th person on the list will lead to zero probability of the other persons in the list being selected. [30] Stratified random sampling is a form of systematic random sampling with the addition that the list is stratified (arranged by categories) according to a predetermined characteristic, e.g., gender, level of employees, or class in medical college. After arranging the list according to the specified criterion, the same process of selecting every nth person is followed as in systematic random sampling. [30] The stratified random sampling technique ensures that the sample contains approximately the same proportion of each category of the specified characteristic as the study population. This is important when the outcome variable that is being studied is directly related to that particular characteristic, e.g., smoking and gender, or employee satisfaction and level of employees. The other two probability sampling methods, cluster sampling and multistage sampling, are more appropriate for community-based or large-scale surveys and will not be described in detail in this article. More information on these two methods can be obtained from other detailed texts on sampling. [7],[9],[10],[15],[30] The issue of avoiding bias due to nonresponse in sampling will be discussed in detail in the next article on data collection methods.
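The three list-based methods described above can be sketched in a few lines of code. This is a minimal illustration using only the Python standard library, assuming that the sample frame is available as a list and that the frame size is a multiple of the required sample size; the function names are illustrative. The stratified version follows the description given in this article (sort the list by the stratifying characteristic, then sample systematically), which is only one of several ways of drawing a stratified sample.

```python
import random


def simple_random_sample(frame, n):
    """Every subset of size n has the same chance of being selected."""
    return random.sample(frame, n)


def systematic_sample(frame, n, start=None):
    """Select every k-th member of the list after a random start.

    start -- position of the first person (1 to k); chosen randomly if omitted
    """
    k = len(frame) // n  # the "skip number"
    if start is None:
        start = random.randint(1, k)  # random start between 1 and k inclusive
    return frame[start - 1::k][:n]


def stratified_systematic_sample(frame, n, key, start=None):
    """Arrange the list by a characteristic, then sample systematically."""
    ordered = sorted(frame, key=key)
    return systematic_sample(ordered, n, start)


# The worked example: 25 people out of a list of 200, starting at person 3.
frame = list(range(1, 201))
print(systematic_sample(frame, 25, start=3))  # 3, 11, 19, 27, ..., 195
```

Passing start=3 here only reproduces the worked example; in practice the argument should be omitted so that the starting point is chosen randomly, as emphasized above.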

Table 4: List of different probability and nonprobability sampling methods

Sampling is an important consideration in all quantitative research which aims to generalize the findings of the study to a larger population. It is essential to have the required sample size as well as to select a representative sample using the appropriate sampling technique.

National Research Council (US) Committee on Guidelines for the Use of Animals in Neuroscience and Behavioral Research. Sample Size Determination. In: Guidelines for the Use of Animals in Neuroscience and Behavioral Research. Washington DC: National Academies Press; 2003. Appendix A. Available from: http://www.ncbi.nlm.nih.gov/books/NBK43321/#a20007f55ddd00182. [Last accessed on 2014 Sep 25].
