a) Identifying your question
One of the first things you need to clarify when designing a survey is exactly what you want to find out. Start by writing your question as clearly as you can. Include as much detail as possible so that everyone else will interpret the question in the same way as you.

For example, if you wanted to find out "What times do students get up in the morning?" you would need to clarify:

Is it a normal school day, a weekend or a holiday? For example, “What time do students get up on a normal school day?”

Will it matter if students have different school starting times? Compare the question “How long before school starts do students get up?”

What units do you want to use to collect the data? “How many minutes before school starts do students get up?”

How will part units be reported? Do you want data to the closest whole number? If you plan to have fractions can these be decimals?
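The refined question "How many minutes before school starts do students get up?" can be answered from two recorded times. The sketch below is only an illustration: the function name and the 24-hour "HH:MM" input format are assumptions, not part of the original survey.

```python
from datetime import datetime

def minutes_before_school(got_up: str, school_starts: str) -> int:
    """Return the whole minutes between getting up and school starting.

    Both times are 24-hour "HH:MM" strings for the same day (assumed format).
    """
    fmt = "%H:%M"
    gap = datetime.strptime(school_starts, fmt) - datetime.strptime(got_up, fmt)
    # "HH:MM" inputs always differ by a whole number of minutes.
    return int(gap.total_seconds() // 60)

print(minutes_before_school("06:45", "08:30"))  # 105
```

Recording the answer in one unit (minutes) means students with different school starting times can still be compared.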

State your definitions
You will also need to state some definitions:

‘student’ is defined as Year 7 and Year 11 Australian school students

‘getting up time’ is the time students get out of bed on a school day.

It is important that you maintain these definitions throughout your investigation and in any report. If your question is not clearly defined, the participants in your survey may interpret the question differently and your results won't be accurate.

b) Deciding who to include in your sample

Participant characteristics
Next you need to specify the scope of your sample. For example, are you looking at a particular age group, year level or location? In these cases, your question might be “How many minutes before school starts do Year 7 students get up compared with Year 11 students?” or “How many minutes before school starts do students in Queensland get up compared with students in South Australia?”.

Sample size
Estimates are made about the total population and subgroups based on the information from the sample. Generally, larger samples will give a more accurate representation of the population. However, it can be difficult to obtain accurate information on smaller groups within the population if the sample size is small.
In addition, the level of accuracy can usually be measured. There are formulae to determine the size of the sample that should be taken depending on the level of confidence required. One of the simplest rules of thumb is: Sample size = √n
(where n is the size of the population)
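The square-root rule above can be sketched in a few lines. The function name is hypothetical, and rounding up to the next whole person is an assumption, since the rule itself does not say how to round.

```python
import math

def sqrt_sample_size(population_size: int) -> int:
    """Sample size from the square-root rule, rounded up to a whole person."""
    if population_size < 0:
        raise ValueError("population size cannot be negative")
    root = math.isqrt(population_size)  # integer part of the square root
    # Round up unless the population size is a perfect square.
    return root if root * root == population_size else root + 1

print(sqrt_sample_size(900))   # 30
print(sqrt_sample_size(1000))  # 32
```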

Randomness
To allow predictions to be confidently made about the total population, samples need to be randomly selected as well as of sufficient size. For data to be selected randomly, each data item must have the same chance of being selected as any other. Pulling data items from a hat or using the random number generator on a calculator are common ways of ensuring that data are selected randomly. Data not selected randomly may be biased towards a particular outcome.
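Pulling names from a hat can be imitated with a random number generator. The sketch below uses Python's standard random module; the list of students and the function name are made up for illustration.

```python
import random

def draw_random_sample(population, sample_size, seed=None):
    """Pick sample_size items so that every item has the same chance."""
    rng = random.Random(seed)  # a fixed seed makes the draw repeatable
    # sample() selects without replacement, like names pulled from a hat.
    return rng.sample(population, sample_size)

students = [f"Student {i}" for i in range(1, 101)]  # hypothetical class list
sample = draw_random_sample(students, 10, seed=1)
print(sample)
```

Leaving the seed as None gives a different sample each time, which is what you would normally want for a real survey.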

Types of Sampling
There are a number of ways that a sample can be randomly drawn from a population. For example, you may want to ensure that each subgroup of a population is represented in the same proportion as in the general population.
For more information on types of sampling see our Glossary page.
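One way to keep each subgroup in the same proportion as in the population is often called proportional stratified sampling. The sketch below is a minimal illustration under assumed data: the group names, sizes and function name are hypothetical.

```python
import random

def stratified_sample(groups, total_sample_size, seed=None):
    """Randomly sample each subgroup in proportion to its size.

    groups maps a subgroup name to the list of its members.
    """
    rng = random.Random(seed)
    population_size = sum(len(members) for members in groups.values())
    sample = {}
    for name, members in groups.items():
        # Rounding each share can leave the total one or two off the target.
        share = round(total_sample_size * len(members) / population_size)
        sample[name] = rng.sample(members, share)
    return sample

groups = {
    "Year 7": [f"Y7-{i}" for i in range(120)],
    "Year 11": [f"Y11-{i}" for i in range(80)],
}
picked = stratified_sample(groups, 20, seed=1)
print({name: len(members) for name, members in picked.items()})
# {'Year 7': 12, 'Year 11': 8}
```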

Step 2: Collecting data

Once you have decided on your question, how many people will be in your sample, and randomly selected them, you will need to consider how to collect the data. Will it be through an interview or will you collect written responses? The data you collect will also need to be in a form that is easily organised for analysis. For example, you need consistency in units and fractional answers, so if you are collecting lengths, request that the data be recorded to the nearest centimetre or half centimetre.

Interview
In an interview, a participant can ask questions if they haven’t understood something.

Written response
Written responses can be completed by many participants at the same time and are quicker than interviews.

Variables
A variable is any measurable characteristic or attribute that can have different values for different subjects – for example, eye colour, distance from school etc.
Characteristic is another way of saying variable. For example, height, age or country of birth are all characteristics or variables of people.

You need to decide what statistics you will use and what summary information will enable you to answer your survey question.
For example:

Will you find the mean or median or both?

Will you exclude extreme values?

When describing a distribution, statisticians usually comment on its centre, spread and shape.

a) Measures of centre
The centre of a set of data is important. Often you want to know which value occurs most commonly or the average of a set of values.

Numerical data
Mean: The arithmetic mean is calculated by finding the sum of all values in the data set and then dividing it by the number of values in the set. The mean is often called the average.
Median: The median is the middle value of a data set when all the values are arranged from smallest to largest. If a data set contains an even number of values, the median is taken as the mean of the middle two. The median is the preferred measure of centre when the data is not symmetric or contains outliers.

Numerical and categorical data – Figure 14
Mode: The mode is the value that occurs most often in a data set.
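The three measures of centre can be checked with Python's standard statistics module. The data below are invented for illustration and are not taken from Figure 14.

```python
import statistics

# Hypothetical responses: minutes before school starts that students get up.
minutes = [45, 60, 60, 75, 90, 30, 60, 120]

print(statistics.mean(minutes))    # 67.5  (sum of 540 divided by 8 values)
print(statistics.median(minutes))  # 60.0  (mean of the two middle values)
print(statistics.mode(minutes))    # 60    (occurs three times)

# The mode also works for categorical data, where mean and median do not.
travel = ["bus", "car", "bus", "walk"]
print(statistics.mode(travel))     # bus
```

Here the single large value of 120 pulls the mean above the median, which is one reason the median is preferred when data contain outliers.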

b) Measures of Spread
The spread of a data set, when combined with its centre, will give a more complete picture.

Range
The range is the distance from the minimum value to the maximum value in the data set.
Range = maximum – minimum
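As a quick check of the formula, the range of a small invented data set can be computed directly. The values are hypothetical.

```python
# Hypothetical responses: minutes before school starts that students get up.
minutes = [45, 60, 60, 75, 90, 30, 60, 120]

data_range = max(minutes) - min(minutes)  # Range = maximum - minimum
print(data_range)  # 90
```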

Interquartile range – Figure 15
Quartiles: As the name suggests, quartiles divide data into four equal sets.
When observations are placed in ascending order according to their value, the first or lower quartile is the value of the observation at or below which one-quarter (25%) of observations lie. The second quartile is the median, at or below which half (50%) of observations lie. The third or upper quartile is the value of the observation at or below which three-quarters (75%) of the observations lie.
Another way to think about this is: the median divides the data into two equal sets: the lower quartile is the value of the middle of the lower half, and the upper quartile is the value of the middle of the upper half. The difference between upper and lower quartiles (Q3 - Q1) also indicates the spread of a data set. This is called the interquartile range (IQR).

Interquartile Range = Upper quartile - Lower quartile
The interquartile range spans the middle 50% of a data set, and eliminates the influence of outliers because, in effect, the highest and lowest quartiles are removed.
Calculating the IQR will help you to identify potential outliers in the data. Any data above Q3 + 1.5 × IQR (upper fence) or below Q1 – 1.5 × IQR (lower fence) should be investigated to decide whether or not these observations need to be excluded from the data before it is analysed further.
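The quartiles, IQR and fences can be sketched using the "middle of each half" method described above. Other software may use slightly different quartile rules and give slightly different values; the data and function name here are hypothetical.

```python
import statistics

def quartile_summary(values):
    """Return Q1, Q3, IQR and the lower and upper fences."""
    data = sorted(values)
    half = len(data) // 2
    lower_half = data[:half]
    # With an odd number of values the median itself is left out of both halves.
    upper_half = data[half + 1:] if len(data) % 2 else data[half:]
    q1 = statistics.median(lower_half)
    q3 = statistics.median(upper_half)
    iqr = q3 - q1
    return q1, q3, iqr, q1 - 1.5 * iqr, q3 + 1.5 * iqr

minutes = [30, 45, 60, 60, 60, 75, 90, 120, 240]  # invented responses
q1, q3, iqr, lower_fence, upper_fence = quartile_summary(minutes)
print(q1, q3, iqr)               # 52.5 105.0 52.5
print(lower_fence, upper_fence)  # -26.25 183.75

# Values outside the fences are only potential outliers to investigate.
outliers = [x for x in minutes if x < lower_fence or x > upper_fence]
print(outliers)                  # [240]
```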

Outliers
A potential outlier might be a mistake or an extreme value so you need to check the original data to determine if it is to be discarded or retained. However, cleaning data by discarding extreme values might give you a false view of variations that can be found in datasets. In turn, this could result in inaccurate modelling based on the data and false conclusions.

The standard deviation – Figure 17
The standard deviation measures the average distance each value in a data set is from the mean. A data set with a higher standard deviation will be more spread out than one with a lower standard deviation. Compare the following two data sets:
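As a hypothetical illustration (these are not the data sets from the original figure), the two invented sets below have the same mean but very different spreads. The sketch uses the population standard deviation; the text does not say whether a population or sample formula is intended, so that choice is an assumption.

```python
import statistics

set_a = [58, 59, 60, 61, 62]  # values close to the mean
set_b = [40, 50, 60, 70, 80]  # values far from the mean

print(statistics.mean(set_a), statistics.mean(set_b))  # 60 60

# pstdev is the population standard deviation (divides by n, not n - 1).
print(round(statistics.pstdev(set_a), 2))  # 1.41
print(round(statistics.pstdev(set_b), 2))  # 14.14
```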

c) Shape
The standard distribution – Figure 18
Data which has a standard distribution is characterised by a symmetric curve with a single peak. It is also known as the normal or bell-shaped distribution.
You would expect data such as the height of Year 9 students or the length of cane toads to show a standard distribution. A standard distribution results from most of the data clustering around the mean and less data occurring further away from the mean.

Skew – Figure 20
Some non-symmetric data distributions can be described as skewed.
Data which is positively skewed typically has a cluster of lower end values and a taper of higher end values. Data which is negatively skewed typically has a cluster of higher end values and a taper of lower end values.
You could expect data such as total weekly income to show a positive skew: most people cluster at low or medium incomes, and fewer people are on a higher weekly income. Other data you could expect to show a positive skew might be the number of hours spent on Facebook by age. It's likely that the amount of time spent on Facebook will peak in the late teens to twenties and then taper off as a person's age increases.
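The direction of skew can also be checked numerically. The sketch below computes the moment coefficient of skewness, a standard measure that is not mentioned in the text above: it is positive when the taper is towards higher values and negative when the taper is towards lower values. The data are invented.

```python
def skewness(values):
    """Moment coefficient of skewness: third moment over variance ** 1.5."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((x - mean) ** 2 for x in values) / n  # variance (second moment)
    m3 = sum((x - mean) ** 3 for x in values) / n  # third central moment
    return m3 / m2 ** 1.5

cluster_low = [1, 1, 2, 2, 3, 10]          # cluster of low values, high taper
cluster_high = [-x for x in cluster_low]   # mirror image of the same data
symmetric = [1, 2, 3]

print(skewness(cluster_low) > 0)   # True  (positive skew)
print(skewness(cluster_high) < 0)  # True  (negative skew)
print(skewness(symmetric))         # 0.0
```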

Figure 14: Mean median and mode of a data set

Figure 15: Calculating 5 figure summary statistics

Figure 16: Box and Whisker plot with upper and lower fences

Figure 17: Comparing the standard deviation of two data sets

Figure 18: Standard distribution based on 2011 C@S random sample

Figure 19: Bi modal graph of distribution of height

Figure 20: Positive skew, distribution of age of Indigenous males

Step 6: Drawing conclusions

Communicating the results of your investigation is a critical part of the survey process. Ensuring the accuracy of any interpretations and avoiding misinterpretations are crucial. Keeping in mind the purpose of the investigation and your audience will help to keep your conclusions on track and avoid including unnecessary information.

Accuracy of Data
All your calculations need to be accurate, verifiable from the data and clearly communicated using simple language.

Misinterpretation
To communicate your results effectively, you need to avoid misinterpreting the data, for example by using the mean when the median is more appropriate or by not taking seasonal variation into account.

Stating your conclusions
With statistics, there is always a risk that the results you have do not tell the whole story. You can use the following checklist to help judge the reliability of your statistical information.

Do your conclusions communicate the message told by the data?

Are your conclusions based on results rather than on your opinions?

Have you considered alternative explanations for the same results?

Is your report set out logically including using an organisational framework such as headings and sub headings?

Have you included the source of any information you have used or referred to?

Have you included relevant tables and graphs?

Are your findings clear, related to your aim and limited to necessary information?

Audience

Have you considered your audience and used appropriate language?

Have you anticipated questions your reader might have? For example, have you explained unusual or unexpected results? Have you justified your choice of analysis, indicated your sampling process etc?

Can your reader check your conclusions by viewing your analysis?

Sampling Error
Finally, if you intend for your results to be applicable in other contexts, it is important to understand the limits that might apply.
The difference between an estimate based on a sample survey and the true value that would result if a census of the whole population were taken is called the sampling error. Sampling error can be measured mathematically and is influenced by the size of the sample. In general, the larger the sample size the smaller the sampling error.
The way a sample is drawn is also important. In general, a random sample will result in data that is more able to be generalised to the population.
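One common way to measure sampling error for a mean from a simple random sample is the standard error, the standard deviation divided by √n. This formula is not given in the text above, and the spread of 20 minutes used below is an invented figure.

```python
import math

def standard_error(standard_deviation: float, sample_size: int) -> float:
    """Standard error of a sample mean: standard deviation / sqrt(n)."""
    return standard_deviation / math.sqrt(sample_size)

# Hypothetical spread of 20 minutes in getting-up times.
for n in (25, 100, 400):
    print(n, standard_error(20, n))
# 25 4.0
# 100 2.0
# 400 1.0
```

Under this formula the sample size has to be quadrupled to halve the standard error, so larger samples help, but with diminishing returns.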

Unless otherwise noted, content on this website is licensed under a Creative Commons Attribution 2.5 Australia Licence together with any terms, conditions and exclusions as set out in the website Copyright notice. For permission to do anything beyond the scope of this licence and copyright terms contact us.