On Likert scales, ordinal data and mean values

This post has been prompted by an edited collection that I was recently asked to review. Substantive comments on the book will be published elsewhere, so you may want to watch this space for updates; but what I want to do in this post, instead, is share some thoughts regarding research methods and the role of reviewers.

Specifically, what sparked my interest was one study in the collection, which used Likert scales to record participants’ attitudes towards a certain educational construct. Those who are not familiar with the fascinating minutiae of quantitative research can find a discussion of Likert scaling and ordinal data in the section that immediately follows. Those of you who were unlucky enough to have studied statistics may want to skip ahead to the section after it.

Likert scales and ordinal data

A Likert-type question (or ‘item’) asks respondents to select one of several (usually four or five) responses that are ranked in order of strength. Here’s an example:

Rate your agreement (or lack thereof) with the following statements using the scale below:

1 = Strongly agree
2 = Agree
3 = Neither agree nor disagree
4 = Disagree
5 = Strongly disagree

Apples are rubbish
Yoghurt is my favourite food
Beans are evil
Fish fingers and custard taste great

Sometimes sets of similar items are dispersed in the same questionnaire in order to probe different aspects of the same construct. When these items are put together, the combined findings can give us information about an underlying quality or belief. I will not go into this in more detail here (but if you feel so inclined, do have a go at finding what the underlying construct is in the example above!). Likert scales are very frequently used to measure constructs like attitudes towards different things, satisfaction rates and more. They are very flexible and very useful, provided you use them carefully.

Interpreting Likert scales


Likert items and scales produce what we call ordinal data, i.e., data that can be ranked. For instance, people who select response (1) to the last item above like fish fingers and custard more than people who choose responses (2), (3), (4) and (5). People who choose response (2) like this snack more than those who choose responses (3), (4) and (5) and so on. In addition to being ranked, ordinal data can be tallied: for example, I might want to count how many people chose each of the responses and compare their numbers. This, however, is almost the extent of what one can do with such data.

The problem with Likert items is that many researchers –including the ones whose paper prompted this post– tend to use them in order to do things that they were never designed to do. Calculating average scores is one of them, and here’s why it’s wrong:

Imagine that ten survey participants were asked about their attitudes towards fish fingers and custard. The table below shows a hypothetical distribution of answers:

Response                     N    %
Strongly agree               1    10
Agree                        1    10
Neither agree nor disagree   3    30
Disagree                     2    20
Strongly disagree            3    30

The wrong way to do it

If a researcher were interested in knowing the beliefs of a ‘typical person’ (whatever that might be), they might be tempted to calculate a mean score for this data. The formula one uses to calculate weighted means is:

[(number of people who selected response 1)*(weighting of response 1) + (number of people who selected response 2)*(weighting of response 2) + … + (number of people who selected response n)*(weighting of response n)] / (total number of respondents)

In the example above, this would yield:

[(1*1)+(1*2)+(3*3)+(2*4)+(3*5)]/10 = 35/10 = 3.5

Going back to the descriptors, the researcher would then ascertain that a response of 3.5 corresponds to something between ‘no opinion’ and ‘disagreement’. They would therefore pronounce something along the lines of: ‘Our study revealed mild disagreement regarding the palatability of fish fingers and custard (M=3.5)’.
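Just to make the arithmetic explicit, the calculation can be sketched in a few lines of Python (the counts are the hypothetical ones from the table above; the variable names are mine):

```python
# The (flawed) weighted-mean calculation from the example above.
# Keys are the response codes (1 = strongly agree ... 5 = strongly disagree),
# values are the number of respondents who chose each one.
counts = {1: 1, 2: 1, 3: 3, 4: 2, 5: 3}

total = sum(counts.values())                          # 10 respondents
weighted_sum = sum(code * n for code, n in counts.items())
mean = weighted_sum / total

print(mean)  # 3.5
```

The code is trivial; the problem, as explained below, lies in the assumptions that the number it produces relies on.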

A better way

Plainly put, the option suggested above is not an optimal interpretation (update: I feel less strongly about this than I used to in 2013, but I still think it is usually wrong). Such a calculation relies on the assumption that the psychological distance between ‘strong agreement’ and ‘agreement’ is the same as that between ‘agreement’ and ‘no opinion’. Similarly, it seems to imply that the distance between ‘agreement’ and ‘strong disagreement’ is four times greater than that between ‘agreement’ and ‘strong agreement’. The mathematical model needs these assumptions in order to work, but they are simply not in the questionnaire design. And even if we forced them into the questionnaire, that would constitute a gross distortion of psychological attitudes and the social world to fit our statistical mould.


To put it in the simplest terms possible: ordinal data cannot yield mean values. If you think that they can (and some statistics guidance websites might encourage you to think so), you can still take your chances, but make sure you justify your choice well when you write up your methods section. What you want to do instead, if you are interested in finding what the ‘average’ or ‘typical’ response is, is look at the median response. The median is a type of average value, like the mean, except that it is the value in the exact middle of the ordered data, i.e., with half the responses at or below it and half at or above it. The comments below include worked examples of how to calculate the median.
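To see how this plays out on the hypothetical data above, here is a sketch using Python’s standard statistics module. Note that with an even number of responses the ordinary median interpolates between two scale points; for purely ordinal data, median_low and median_high have the advantage of returning a value that actually exists on the scale:

```python
import statistics

# Expanding the frequency table above into individual responses
# (1 = strongly agree ... 5 = strongly disagree).
responses = [1] * 1 + [2] * 1 + [3] * 3 + [4] * 2 + [5] * 3

print(statistics.median(responses))       # 3.5 (interpolated between 3 and 4)
print(statistics.median_low(responses))   # 3 -- stays on the original scale
print(statistics.median_high(responses))  # 4
```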

On the review process

As I wrote at the beginning of this post, one of the papers in the volume that I reviewed reported on findings that had been generated by extracting mean values from Likert questions, i.e., by subjecting ordinal data to a type of analysis that they don’t support. In the authors’ defence, they were neither the first nor the last to engage in this controversial practice: averaging ordinal data is as widespread as it is wrong. Unfortunately, this problem had gone unnoticed by the editors of the collection, and by the peer-reviewers employed by the press. As the book had already been published, I was left wondering whether there was anything to be gained by flagging it at this stage.


It is the nature of the peer-review process that papers are often reviewed by people who can make intelligent substantive judgements on the findings, but may not always have the requisite background to comment on the research process. For better or for worse, research methods are too diverse and too specialized for reviewers to have more than a passing acquaintance with most of them. In addition, there are limits to the time one can reasonably spend providing unpaid service to the profession, and these often preclude reading up on research methodology every time one comes across a novel research design. Now and again, reviewers have to take it on faith that the people who conducted a study knew what they were doing, and they must trust that there are no major flaws in the methods. So, rather than double-checking such matters, we tend to focus our feedback on more substantive aspects of the research (e.g., Are the claims made commensurate with the scope of the study? Do the findings add significantly to the existing body of knowledge?). Mistakes in the methodology will, on occasion, slip by.

So what is one to do when one has to provide an informed opinion on the quality of a study that has a major flaw, bearing in mind that the people responsible for finding this flaw failed to spot it, or deemed it unimportant? In this case, I decided to let it pass: the findings of the study were inconclusive and broadly consistent with what was already known about the phenomenon in question. I thought that there was little harm in having in the literature one more voice that added a weak agreement to the prevailing views – even if this voice was not informed by very sound empirical evidence.

If there is a take-home message in all this, it is that readers should once again be cautioned against putting too much faith in the published literature: just because something has been printed does not mean that it is right.


171 comments

Hi, thanks for your informative blog post. I stumbled across your blog when searching for “averaging ordinal data”. I also had a tendency to calculate the mean of ordinal data. I have data from a survey with a 5-point Likert scale and I want to try to summarize them. Actually, I wanted to find a measure of central tendency, so I tried the median, but I realized the median may also not be 100% correct.

Say I have 6 respondents and say 3 respondents answer “neither easy nor difficult” and 3 respondents answer “difficult”.
If I were to calculate the median from this data (3,3,3,4,4,4), I would get 3.5.

How would I interpret 3.5? It is also halfway between “neither easy nor difficult” to “difficult”.

In this case, do you think I should just use the mode as a measure of central tendency?

Hi Ilma! Ordinal data have a median and a mode, but not a mean. This means that your calculation of the median is, statistically speaking, appropriate, even if it is -perhaps- difficult to interpret. Calculating the mode would also be appropriate. Which of the two measures is most suitable would depend on your research questions.

In my opinion, the best way to present a data set such as yours would be as a bar chart. This would help readers see how all the responses are distributed, without any loss of detail.

However, if you want to condense this information into a single number, you’ll have to use a measure of central tendency: the mode or the median. SPSS, or any equivalent statistical package, will generate both values for you, and I believe MS-Excel also contains a function for calculating these values. Even if you don’t have access to such software, your dataset is not too big, so it should be easy to calculate them manually.

The mode is the easiest to calculate: it’s just the option that most people chose. To calculate the median, you will have to arrange your responses in order of magnitude e.g.,

1,1,1,2,2,2,2,3,3,3,3,4,4,4,4,4,4,5,5,5

Then you start crossing out one number from the end of the list and one from the beginning, and repeat until you are left with a single number (or two). That number is your median.
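If you would rather not do the crossing-out by hand, Python’s standard statistics module (or any similar tool) will do the same work for you. Using the example list above:

```python
import statistics

# The ordered responses from the example above
responses = [1, 1, 1, 2, 2, 2, 2, 3, 3, 3, 3, 4, 4, 4, 4, 4, 4, 5, 5, 5]

print(statistics.mode(responses))    # 4 (the most common answer)
print(statistics.median(responses))  # 3.0 (the middle of the ordered list)
```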

After describing the distribution of responses and their central tendency (mode/median), you would normally want to compare the two groups. If the responses are very different, you should then try to account for what might be causing the difference.

Hi sir
I am using a Likert scale in a survey as follows:
1 = fully implemented
2 = partially implemented
3 = not implemented yet
I want to analyse using the mean value. Can you tell me how to calculate the ranges of means that correspond to each scale point?

The scale you are describing is not a Likert Scale. It is a simple ordinal scale, and as such it does not yield a mean. You can calculate a median value, following instructions found elsewhere in this blog.

I am very thankful to you sir, thank you very much.
Please tell me the difference between a Likert scale and an ordinal scale, as I have to defend myself, and also share how to calculate ranges for means of a Likert scale… thanks a lot

It is hard to answer your question without reference to any particular study. The mode is the answer most people selected. The median is the answer in the middle of the ordered responses, a ‘middle ground’ of sorts.

However, both numbers are just answers. They only make sense in the context of the question that they are supposed to be answering.

There are several statistical procedures you may want to use, depending on your research questions. If you need a measure of central tendency, you will have to calculate the median. You can refer to previous comments for instructions and examples of how to calculate that.

Hello, my case is a bit different: I applied a 4-item scale anchored from 1 (strongly disagree) to 5 (strongly agree) to evaluate a certain variable (e.g., attitude toward drug consumption) in an experiment with 80 subjects. Have I made a mistake by calculating the mean of the answers of every single subject for that scale in order to summarize the data and create a macrovariable “attitude”?

Say Peter’s answers were 2, 3, 4, 2. Peter’s mean for the variable attitude: 2.75. Can I work with that, bearing in mind that the objective of the study is to observe relationships between different variables (using ANOVA, correlations, etc.)? Thanks in advance.

I’d hesitate to answer without more context. My feeling is that for this kind of calculation to work, you need to assume that the ‘distance’ between “strongly disagree” and “disagree” is the same as the one between “disagree” and “not sure”, and so on. If you can convincingly argue that this is how your respondents interpreted the scale (perhaps the wording of the question encouraged them to think so?), the mean attitude might make some sense.

Otherwise, I think it would be preferable to use the median in each set of scales: So, to use your example, if Peter’s answers were 4, 3, 2, 2, the median would be 2.5. In most cases you will find that the difference is small in practice, but -in my opinion- it is theoretically important.

If re-calculating the variables is no longer an option, I think that the best way forward would be to acknowledge in the write-up that this was not the statistically optimal treatment, and then go on to argue that it is nevertheless a robust procedure that will generate plausible results, even when its underlying assumptions are violated.
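For what it’s worth, both per-subject summaries can be computed in a couple of lines with Python’s statistics module (using the hypothetical answers from the example above):

```python
import statistics

# Peter's four item scores from the example above
peter = [4, 3, 2, 2]

print(statistics.median(peter))  # 2.5 -- the per-subject median
print(statistics.mean(peter))    # 2.75 -- the mean, for comparison
```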

Hello
I’m doing this research on the Acceptability of an English Speaking policy on campus. Our questionnaire is composed of mainly likert scale with a 5 point scale where
1=Strongly Disagree
2=Disagree
3=Neutral
4=Agree
5=Strongly Agree

For example, we are trying to find out if the respondents agree with using English during lessons only.
The number of respondents who chose 1 is 2, those who chose 2 are 9, those who chose 3 are 24, those who chose 4 are 18, and those who chose 5 are 7.

The total number of respondents is 60.
How do I interpret this data? Thank you

Hi, I have conducted a survey to see what influences people in choosing a destination… One of the questions is a likert scale to measure likeliness in visiting their dream destination, of 10 points, with 1 = not likely at all, and 10 = very likely (i have not given a definition for the points in between)… does this mean that i can calculate the ‘mean’? what statistics should i calculate pls? thanks :)

The short answer is, yes, you can calculate the mean. Here’s a somewhat longer one, which explains why:

To begin with, I am not sure I’d call this type of item a ‘Likert’ item. A key difference is that Likert items are bivalent (i.e., they extend in two directions) and symmetrical, which is not the case with your item. Moreover, the ranks of a Likert scale could extend infinitely in both directions, which is not the case here. What you have is a rating scale that measures the likelihood of something happening.

The way I understand it, likelihood or probability can, in principle at least, be measured on a continuous cline ranging from ‘impossible’ (0%) to ‘certain’ (100%). Your scale reflects this, except that for reasons of simplicity, you sensibly asked respondents to position themselves on a scale with 10 equidistant (evenly spread out) anchors. This type of rating scale generates what we call ‘interval’ data. These data are just like the ones that Likert-type items produce, with the added advantage that -by virtue of being evenly distributed- they can be used in more sophisticated ways. This includes reducing them to a mean value, and calculating the standard deviation to see how widely the responses are spread around the mean.
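As an illustration, here is how the mean and standard deviation of such interval data might be computed with Python’s standard statistics module (the ratings below are invented for the example):

```python
import statistics

# Hypothetical responses on the 1-10 likelihood scale
ratings = [7, 8, 6, 9, 10, 7, 5, 8]

print(statistics.mean(ratings))   # 7.5
print(statistics.stdev(ratings))  # sample standard deviation (about 1.6 here)
```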

I’m currently analysing data from a sensory evaluation which I carried out on three different types of fluid thickeners with 45 participants. I asked participants to rank each drink from Most Acceptable = 1 to Least Acceptable = 3 for the characteristics of taste, texture and appearance.
Overall acceptability was assessed by summing the ranks for taste, texture and appearance. The lowest score that each thickening agent received out of the maximum possible score of 9 indicated the most acceptable overall, while the highest score indicated the least acceptable overall.

I’m struggling to carry out a Friedman test on my data with the measure set as “ordinal” in SPSS version 21.0. However, if I set my data as “scale” data, the test runs. I’m wondering if it would be appropriate to run my test with the measure set as “scale”.

I am afraid that I am not familiar enough with the Friedman test to be able to advise you.

My understanding is that it can work with both ordinal and continuous (scale) variables. If you keep getting error results, this might mean that there is something in your data which violates the assumptions of the Friedman test. Unfortunately, without access to your data I am not able to tell what that might be, and chances are that you’d need someone more qualified than me to figure it out anyway.

While I suppose that you could change the settings to make the analysis work, my feeling is that doing so might be akin to sweeping the problem under the proverbial carpet. I think that it may be more useful to have the data checked by someone who is more familiar with your research, and find out what is causing this unexpected behaviour.

Thank you for a useful and informative post (which I’ve just discovered) – though I disagree with your strong view of Likert scales always eliciting ordinal data.

*The* problem is assuming that the interval between two adjacent response options is always the same. This doesn’t make sense when labelling all the options, as this clearly makes the data ordinal (or nominal). However, if only the first and the last response options are labelled and the respondent is asked for the strength of their reported opinions/ feelings (e.g., on a scale of 1-6, where 1=very bad and 6=very good), then the intervals can be assumed to be equal.

I am not hoping to persuade you – I just think it is fair that this alternative point of view is added to this discussion.

Copy-pasting below something I wrote about this a while ago, with some references:

‘There has been some controversy regarding the nature of the data produced by self-reported scales, these being considered a grey area between ordinal and continuous variables (Field, 2009; Kinnear & Gray, 2008). Although attitudes and feelings cannot be measured with the same precision of pure scientific variables, it is generally accepted in the social sciences that self-reported data can be regarded as continuous (interval) and used in parametric statistics (Agresti & Finlay, 1997; Pallant, 2007; Sharma, 1996). […] Blunch (2008, p. 83) maintains that treating self-reported scales as interval/ continuous variables is most realistic if the scales have at least 5 possible values and the variable distribution is “nearly normal”.’

Hello, please help me understand this clearly. i have 9 questions sent to 32 respondents and they answer according to 5 scaled views like the likert scale. 1. very significant 2. significant 3.somewhat significant 4.somewhat insignificant 5.not significant. please help me in analyzing or if there is any other example close to what i have stated. i wish you can respond to me very soon. thanks

Hi there. I’m designing a protocol on the relationship between income and happiness in terms of (a) how people spend and (b) whether they feel society’s culture influences them. I’m using a 5-point Likert scale (very happy, happy, indifferent, unhappy, very unhappy) in my questionnaire and have four different sections: questions on material spending, questions on investing, questions on charitable spending, and finally questions on cultural influences. Would there be a specific analytical technique you would use to interpret data like this other then the median, mode, IQR? I keep coming across chi squared test, would that be suitable? I really enjoyed your piece by the way, it made it very easy to understand.

Thanks for your kind words. There are, roughly speaking, three things you can do with your data:
1) describe each variable separately (this is where you’d use the median & IQR)
2) check to see if there is a correlation between variables (Spearman and/or Pearson correlations)
3) check whether different demographic groups in your sample (e.g., men vs. women) respond in different ways (this would involve a cross-tabulation)

A chi-square test is a way to see if the differences observed in a cross-tabulation are statistically significant. It works best if you have a large number of respondents, and few categories to cross-tabulate. When you use SPSS to run a cross-tab, you get a ‘contingency table’ with one variable in the rows and one in the columns, and each cell displays an ‘expected’ and an ‘actual’ value. The ‘expected’ value should be >5 in most cases, otherwise the test gives inaccurate results. In such a case, Yates’s Correction is a good alternative to use.
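To make the expected-count check concrete, here is a rough Python sketch of how the expected values and the chi-square statistic are derived from a cross-tabulation (the observed counts are invented for the example; SPSS does all of this for you):

```python
# A hypothetical 2x3 cross-tabulation: rows are groups (e.g. men / women),
# columns are collapsed response categories (agree / neutral / disagree).
observed = [
    [20, 15, 25],   # group 1
    [10, 18, 12],   # group 2
]

row_totals = [sum(row) for row in observed]
col_totals = [sum(col) for col in zip(*observed)]
grand_total = sum(row_totals)

# Expected count for each cell: (row total * column total) / grand total
expected = [[r * c / grand_total for c in col_totals] for r in row_totals]

# The chi-square statistic sums (observed - expected)^2 / expected over all cells
chi_square = sum(
    (o - e) ** 2 / e
    for obs_row, exp_row in zip(observed, expected)
    for o, e in zip(obs_row, exp_row)
)

# The rule of thumb mentioned above: expected counts should be > 5
print(all(e > 5 for row in expected for e in row))  # True for these counts
```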

Thanks for the quick reply, I really appreciate it. I have a potential sample size of 1500 people, will this be enough to use the x-square test? And finally I also study at the University of Manchester so I completely trust your judgement!

Hi there!
I have a quick silly question regarding the correlation of different statements. I would appreciate any help!
How can I interpret this correlation?
For example, I have 7 statement within one block (5-point Likert-scale) and respondents have to choose between whether they totally agree, agree, …and s.o.
SPSS Correlation (Spearman) showed that between statements no. 1 and no. 7 there exists a correlation of .657**.

1)”It is difficult for auditors to understand the main risks of fair value model”
7)”It is difficult for auditors to understand the model, because of the lack of specific education”:

How to correctly describe this correlation when reporting?
And if it is a negative correlation, then how do I report it?
(for instance, -.556)

A positive correlation between two statements means that people who agree with the one tend to agree with the other as well. A negative correlation means that people who agree with one statement tend to disagree with the other. Correlations range from -1 to 1. The further away from 0 you get, the stronger the correlation is. The double asterisk you got means that your results were statistically significant; results that haven’t been flagged with asterisks are not conclusive.
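For anyone curious about what the software is doing under the bonnet, here is a rough Python sketch of Spearman’s rho, which is simply a Pearson correlation computed on the (tie-adjusted) ranks. The two response lists are invented for the example:

```python
def average_ranks(values):
    """Rank values from 1, giving tied values the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Find the run of tied values starting at position i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1          # average of 1-based positions i..j
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks

def spearman_rho(x, y):
    """Spearman's rho: Pearson correlation computed on the ranks."""
    rx, ry = average_ranks(x), average_ranks(y)
    n = len(rx)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    sx = sum((a - mx) ** 2 for a in rx) ** 0.5
    sy = sum((b - my) ** 2 for b in ry) ** 0.5
    return cov / (sx * sy)

# Hypothetical answers to statements 1 and 7 from ten respondents:
# people who agree with one broadly agree with the other, so rho is high.
s1 = [1, 2, 2, 3, 3, 4, 4, 4, 5, 5]
s7 = [1, 1, 2, 2, 3, 3, 4, 5, 4, 5]
print(round(spearman_rho(s1, s7), 3))
```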

Thank you for this great post and for the effort you make to reply to all of the questions. I used a 3-point Likert scale (agree, disagree, uncertain) to measure attitudes towards writing. I feel that my data should be nominal. Is that correct? Another thing: it gets me confused when I try to interpret the mean, as some studies treat a mean of 2.3 as uncertain.

Nominal/categorical data are data that are categorically discrete (e.g., baby names, places of birth etc.). You could treat your data as nominal, if you want, but an argument can be made that these responses can be ranked: someone who ‘agrees’ has stronger beliefs than someone who is ‘uncertain’. If you treat the data as nominal, the only measure of central tendency you can use is the mode (i.e., you can only report which answer is the most common). If you treat them as ordinal data, then you can also calculate the median, which is a slightly more refined measure of central tendency. Either way, I would not recommend calculating a mean. The studies you mention seem to calculate a mean for these data, and -unless they have a good rationale for doing so- may be statistically suspect.

Sir,
Could you please tell me how to find the expectation and perception scores from the SERVQUAL dimensions? I’m researching the service quality of a telecom firm. I’m doing the analysis in SPSS.
I have to calculate an average tangibility score, which is measured on a 5-point Likert scale. But under tangibility there are 5 conditions, which means I have 5 questions under the tangibility dimension, each with a 5-point scale. I need to find the average tangibility score from these 5 questions. Please help.

Could you please give me your opinion? I am supposed to finalize my paper and I am getting worried… I understand that calculating the mean of a Likert scale is not ideal. But what I did is generate the total score of each respondent on the scale items and then take the average (to compare 2 groups).

I have a likert scale of 3 with 16 scale items. I used the data to present them in frequencies statistics and also I used chi square to establish association between two groups.

But then I also presented my data in mean and St.dev to have an idea on the skewness of each group.

I calculated the total score of the respondents’ answers to the scale items by adding: number of Agree*3 + number of Uncertain*2 + number of Disagree*1 = total score

Then I converted it to a scale of 100: total score*100/48, where 48 comes from 16 Likert items * 3.

Finally, I calculated the average of the total scores in each group, with a cut-off point (32) corresponding to the uncertain level (16*2). This way I could interpret the tendency (skewness) of each group towards strong agreement.

Thanks for your kind words. I am sure your supervisors will be able to give you better advice on the appropriate techniques to use with your data set, but here’s my two cents.

I am not sure I understand what you did to calculate for the skew in the distribution, but it’s good to see that you have taken some measure to correct for it. To be honest, I am still skeptical about the way you calculated the averages, for the reasons that I have outlined in the post. However, since you are now finalising your assignment, I think it would probably be a bad idea to re-do the statistics at this point. What I would do, in your place, would be to add a couple of lines in the methods section acknowledging that averaging the scores of Likert-type items is controversial, but that it has been used extensively in social sciences (here are some references: http://achilleaskostoulas.com/2014/04/07/on-likert-scales-levels-of-measurement-and-nuanced-understandings/), and then go on to explain what you did and why you think it’s appropriate.

Hello Sir,
I have collected data through a questionnaire. The data is on a five-point Likert scale (from 1. strongly disagree to 5. strongly agree), with four constructs. Each construct has four questions. I was told that I can convert this ordinal data to interval scale data by assigning points to each response (e.g. 1 point to strongly disagree, 2 points to disagree, 3 points to neutral, 4 points to agree and 5 points to strongly agree). By doing so I can then treat this data as continuous data and carry out linear regression. Any thoughts? Many thanks

Hi, It seems to me that what you are describing is a Likert scale ‘proper’, i.e., a composite score derived from multiple related items. Each item only produces ordinal data, but arguably the composite score is what we call interval data (NB not continuous!). Interval data are just like ordinal ones, with the added property that the points that make up the scale are equally spaced. You should be able to run linear regression with such data.

Hi sir, I am doing research with five-point Likert scales. May I know whether Likert scales are considered continuous or categorical? I need to know which test is suitable to compare the demographic profile of my respondents with my dependent variables. Thanks sir

Respected sir, thank you for your response. Still, I want to know a few more things, as I am a student and fresh in research work. I have administered a Likert scale to 100 teachers and I have collected the data. Please tell me how to calculate the median and mode, how to express variability in terms of the range and interquartile range, and how to display the data in a dotplot.

I need to determine agreement on usefulness of a terminology consisting of about 500 terms. 45 participants (in 3 groups of 15 each) will evaluate each one of 500 terms from strongly agree, agree, neither agree nor disagree, disagree, and strongly disagree. My thought was to calculate the percentage of items per participant that are rated either agree or strongly disagree, and compare percentages by group.

Pls, the title of my project is Abusive supervision and self efficacy as PREDICTORS OF Organisational Citizenship behavior (OCB).

The specific objectives of the study are as follows:
i. To establish the relationship between abusive supervision and OCB.
iii. To establish the relationship between self efficacy and OCBs.

The Scale I used (pls, I do not want to include demographical section here to minize space):

Abusive supervision was measured using a 15-item scale developed by Tepper (2000). Respondents indicated the frequency with which their supervisor exhibited certain abusive behaviours on a 5-point Likert-type scale which ranges from 1 (I cannot remember him/her ever using this behaviour with me) to 5 (He/she uses this behaviour with me). Sample items include “My boss ridicules me” and “My boss puts me down in front of others”.

Self Efficacy: A 10-item scale developed by Jerusalem and Schwarzer (1995) to assess a general sense of perceived self-efficacy, with the aim of predicting coping with daily hassles as well as adaptation after experiencing all kinds of stressful life events, was utilised. It is a four-point response scale ranging from 1 (Not at all true) to 4 (Exactly true). Sample items of the scale include “I can always manage to solve difficult problems if I try hard enough” and “I can usually handle whatever comes my way”.

Organisational Citizenship Behaviour Scale: The instrument utilised in the measurement of OCB is a 21-item scale developed by Onyishi (2007). It is used to determine the degree to which people engage in discretionary behaviours that go beyond the formal requirements of the job. It is a self-report Likert-format response scale ranging from 1 (Never) to 5 (Very often). The scale measures two dimensions of OCB: citizenship behaviour directed at the organisation (OCB-O) and citizenship behaviour directed towards specific individuals in the workplace (OCB-I).

Please, my challenge is how do I do the correlation analyses (i.e., compute the 21 items in the OCB scale into one score, so as to regress abusive supervision and self-efficacy on OCB)?

Pls, My challenge:
1) Should I regroup the 15 items in section B to become 1 item, 10 in section C to become 1 item… and 21 items in section C to become 1 item using the spearman’s rank corr. Coeffient. formula. Then regress Abusive supervision and Self efficacy on OCB ?

1) Should I use the spearman’s rank corr. Coeffient. formula to compute the 15 items(questions) under abusive supervision, and the 10 items (questions) under self efficacy… and 21 items(questions) under OCB. Then carry out multiple regression of Abusive supervision and Self efficacy on OCB ?

2) pls, I do not understand what rank 1= ?, rank 2 =? data 1=? and data 2 =? since I have 3 data set of data on my questionnaire (i.e Abusive supervision, Self Efficacy and OCB)

3)pls, do u mean that once i use spearman rho, I do not need to use linear regression on my data any more?

Hello, I am reading some examples of Likert scale scoring ranges and I came across a question whose answer I wanted to confirm with you. The question is as follows: “What is the range of scores on the following Likert scale?” The aforementioned scale is a 20-item, 5-point Likert scale. I take it that the range of scores should be 20-100, is this correct? Or am I forgetting something?

Hello!
In a survey conducted using questionnaires, each question was answered on two Likert scales: one for day-to-day activities and the other for emergencies, with 15 respondents.
I am trying to identify the leadership style, motivation style and the level of importance of hygiene factors for emergency and day-to-day activities.
How can I calculate the range and reliability for these data?
Thanks!

The range, for each item, would be the difference between the lowest and the highest response that your participants gave to each item. If you view the items as a single scale, then you can sum up the answers given by each participant, and estimate the difference between the respondent with the highest score and the one with the lowest. That would be the range of the scale.
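The calculation above can be sketched in a few lines of Python; the answer matrix below is invented for illustration, and assumes a small 4-item, 5-point scale:

```python
import numpy as np

# Hypothetical 5-point Likert answers: rows = respondents, columns = items.
answers = np.array([
    [4, 5, 3, 4],
    [2, 1, 2, 3],
    [5, 5, 4, 5],
])

item_ranges = answers.max(axis=0) - answers.min(axis=0)  # range per item
totals = answers.sum(axis=1)                 # each respondent's summed score
scale_range = totals.max() - totals.min()    # range of the whole scale
```

Here the summed scores are 16, 8 and 19, so the range of the scale is 19 − 8 = 11.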

I am not entirely sure what measure of ‘reliability’ you are interested in. It is common to estimate the internal consistency of each scale by calculating Cronbach’s alpha (α) coefficient for the entire scale. This is a number ranging from 0 to 1. If a scale has high internal consistency (say, α >.80), this means that respondents who strongly agree with one item, tend to strongly agree with most other items as well. Similarly, respondents who disagree with an item tend to disagree with most others, and so on. If the alpha is too low, it is possible that there are a couple of items in the inventory which are too different from the others. In such cases, you will need to find those ‘rogue’ items, and either discard that data, or treat it separately.
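For readers who want to compute internal consistency themselves, here is a minimal sketch of Cronbach’s alpha in Python; the function name and data layout are my own, and scores are assumed to form a respondents × items matrix:

```python
import numpy as np

def cronbach_alpha(scores):
    """Cronbach's alpha for a (respondents x items) matrix of scale scores."""
    scores = np.asarray(scores, dtype=float)
    k = scores.shape[1]                          # number of items
    item_vars = scores.var(axis=0, ddof=1)       # variance of each item
    total_var = scores.sum(axis=1).var(ddof=1)   # variance of the summed scale
    return (k / (k - 1)) * (1 - item_vars.sum() / total_var)
```

A perfectly consistent scale (everyone who rates one item highly rates the others highly too) yields an alpha of 1; noisier scales yield lower values.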

Hi sir, I’m conducting a survey for my final paper. I’m using two questionnaires: one is a three-point Likert scale and the other is a five-point Likert scale. My study is about learning styles and language learning strategies. How am I supposed to find out the extent of the relationship between these two variables, and the extent to which my subjects utilise them?

Hello sir, I wanted to ask how to key in the data for my research. You see, my group is doing research about city bus users’ experiences from a safety and a technical perspective. We are using a Likert scale with two different categories of questions, safety and technical, where each category has its own questions. For example, ‘the bus condition is clean’ is part of the technical category, and ‘I have been harassed on the bus’ is part of safety. So how do I calculate from the respondents’ answers, and how do I key them in in SPSS? Thank you.

I am not sure I understand your question. You should normally define a different variable for each questionnaire item. SPSS will automatically name the variables for you, but you can use names such as SAF01, SAF02, TEC01, TEC02, etc. for convenience. Each variable should have several values (one for each possible answer, such as ‘strongly agree’, ‘agree’, and so on). You can use numbers to correspond to these values (e.g., Strongly Agree = 1, Agree = 2, etc.). Then you key in the answers given by each participant: for instance, the respondent of Questionnaire 1 has replied “1” to SAF01, “3” to SAF02, and so on. That should take care of keying in.

To answer “how you calculate”, I would need to know what you are trying to find out. It is your research questions, not the data, that drive the methods.

Hello sir, I am working on my paper, which aims to determine the level of students’ awareness of disaster risk reduction (specifically for the following natural disasters: earthquake, fire, flood and typhoon). A 48-item test was made to determine their level of awareness. It is composed of 12 items per disaster (e.g., 3 questions for earthquake prevention and mitigation, 3 for earthquake preparedness, 3 for earthquake response and 3 for earthquake recovery and rehabilitation). This sums up to a test of 48 questions. There are 167 students as my respondents (students from different grade levels: Grades 5, 6, 7, 8 and 9). Please help me with how to use the data collected to interpret the level of awareness. The data I gathered were the scores of the students per grade level and the number of errors per item. Would it be appropriate to use a Likert scale for the level of awareness (1 = not at all aware, 2 = slightly aware, 3 = moderately aware, 4 = very aware, 5 = extremely aware) and assign a range of scores per level of awareness, or do I need to use a statistical tool to interpret my data? Please shed light on my concern. Thank you and more power.

If I understand correctly, you have two questions: (a) what kinds of statistical procedures you need to run in order to analyse your data, and (b) how to do these procedures (i.e., what ‘statistical tool’ to use).

The methods of analysis you should use depend on your research questions, i.e., what you are interested in finding out. I am afraid that I cannot answer your question without knowing what the research questions are. However, you will likely need to estimate the central tendency and dispersion for each item: you can find some ideas here. You might also need to compare responses given by different groups of your sample, e.g. responses pertaining to earthquakes against responses pertaining to fire. This is done by crosstabulating: creating a contingency table where one variable (e.g., Grade) forms the rows and another (e.g., questionnaire item) forms the columns. Ideally, you should also run a test like chi-square to determine the statistical significance of your findings, but it seems that you have too many contingencies and the sample is not large enough for that. It might also be possible to conflate some of the items to reduce the number of contingencies, but I can’t advise you about that without knowing how your questionnaire items were worded.
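As a rough sketch of what such a cross-tabulation might look like in Python (the grades and awareness labels below are invented for illustration):

```python
import pandas as pd
from scipy.stats import chi2_contingency

# Hypothetical data: one row per student.
df = pd.DataFrame({
    "grade": [5, 5, 6, 6, 7, 7, 8, 8, 9, 9, 5, 6, 7, 8, 9, 5],
    "aware": ["low", "high", "low", "high", "high", "high", "low", "high",
              "high", "low", "low", "low", "high", "high", "high", "low"],
})

table = pd.crosstab(df["grade"], df["aware"])     # contingency table
chi2, p, dof, expected = chi2_contingency(table)  # chi-square test
```

With expected counts as small as these, the chi-square result would be unreliable, which is exactly the sample-size caveat raised above.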

As for the latter question, a statistical package such as SPSS could make your work much easier, but if you don’t have access to that, many of the procedures can be done with widely available packages, like Excel. There is some advice about that in this post.

My sincere gratitude for your quick reply to my concern. I am interested in knowing the level of awareness of my students with regard to earthquake, fire, flood and typhoon along the four thematic areas: prevention and mitigation, preparedness, response, and recovery and rehabilitation.
Specifically this is the statement of my research
The study seeks to analyze the integration of risk reduction in the science curriculum along prevention and mitigation, preparedness, response, recovery and rehabilitation.
Specifically this study seeks to:
1. Determine the level of awareness of students on disaster risk reduction
2. Identify the integration of disaster risk reduction concepts in the Science Curriculum along the four thematic areas.
3. Identify the strategies in integrating disaster risk reduction as utilized by the teachers.

The first question will be answered using the 48-item questionnaire (a multiple-choice test with a table of specifications).

Thanks Erica! I haven’t used a reference because I was writing from memory (in the post, I only intended to mention statistics in passing). That said, I think that most statistics books will have some reference on how to calculate the mean (e.g., Muijs 2004: 99). You will probably find it harder to find a reference about calculating the mean for Likert scales in particular, because it is considered by many statisticians to be poor practice. This post contains some references to statistics books where it is argued that you can, in some cases, calculate the mean – I am not sure whether any of them provides a formula though.

Thanks for your quick answer. I’m doing a survey in my area of research. I used the Likert scale method to formulate the questions, but I’m having difficulty finding the best way to analyse the responses. I’ll look at the references suggested. Thank you for your time.

I used a survey to collect data on a 5-point Likert scale. my research operating model is based on Structural Equation Modelling (SEM). I am planning to use SPSS AMOS to do the analysis. The question is, how can I use my Likert scale data in the AMOS? Please help.

Good afternoon Sir,
I am assessing the level of awareness of my students about disasters. I formulated 48 multiple-choice questions (12 for earthquake, 12 for typhoon, 12 for fire and 12 for flood). I have 167 students (Grades 5–9). In the analysis of my data, is it possible to assign values according to the level of awareness of the students?

How are you, Achilleas? I am conducting my research on the topic of the corporate social responsibility of hotels. The aim is to identify the hotels’ corporate social responsibility practices. I use two five-point scales: one with the responses yes, occasionally, no, don’t know and not applicable, and one ranging from strongly agree to strongly disagree. I want to use the mean as a cut-off point to identify the activities. Is it possible to accept or reject items with means above or below this point? Thanks in advance.

Hi Tesfaye. A few points:
1. The first scale you describe is not a Likert scale.
2. Both scales produce ordinal data, so calculating their mean is problematic. You should use the median instead.
3. I am not sure I understand what you mean by cut-off point.

A cut-off point means identifying the activities of the hotels based on the responses: add the five scale values, (5+4+3+2+1)/5 = 3. Below 3 is taken as negative and rejected; above 3 is taken as positive and accepted. Based on this, I identify the hotels’ corporate activities. Thank you. There is also one more question: are ‘very large level, large level, no level, low level, very low level’ questions considered Likert scale questions? Thanks very much. I like your comments.

Hi. I am using an eating behaviour questionnaire with a 1–5 Likert response scale. The questionnaire has 3 subscales measuring 3 dimensions of eating behaviour. Can I use the mean scores for each subscale for further analysis? I am unable to use the total scores of each subscale, as there are missing data points in some questionnaires. Also, the distributions of the responses (mean scores) for each subscale are highly skewed. I have a sample size of 500+. Can I use Pearson’s correlation, or do I have to go with Spearman’s? Thanks

The first problem to address is the missing data. These have to be filled in. There are several ways to do this (e.g., insert the mode, use a central value as a default answer), but if you are working on an academic paper you must document how you addressed this problem in the methods section of your paper.

Regarding the other questions:
– It’s best not to use the mean.
– I do not know why the responses are skewed, and cannot advise you further without consulting your data, questionnaire and research questions.
– It depends on what part of the data you are using.

Thanks for the advice. On the missing data, when you say insert a central value, do you mean the median? Can I insert the mode/central values in SPSS? Also, would it be wrong to replace the missing values with the ‘person mean’ or ‘item mean’, as suggested by Downey & King (1998)? And how about multiple imputation? (Sorry about the string of questions!)
I used the mean scores in the analysis, as this was the method used in papers by the original author of the questionnaire (the Dutch Eating Behaviour Questionnaire). I presume their data followed a normal distribution. In my case the data are skewed, as many responses were low values (1 or 2) on the Likert scale for certain eating behaviour subscales.

Inserting the median value for each item (the item median) is one option. Another one is inserting the median value for each person, assuming that your scale is cohesive. Or you might argue that participants who didn’t answer have no strong views either way, so you would insert the central value (i.e., 3 in a five-point scale). Using Person and Item means are also reasonable options, but I would need to know more about your data before recommending either, and I am also rather wary about using means in ordinal data. Overall, there is no single ‘right’ method. Rather, there are advantages and disadvantages associated with each alternative, but going into them would involve an extended discussion, which is outside the scope of a blog comment.
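As a rough sketch of two of these options in Python (the dataset is invented; NaN marks a missing answer on a five-point scale):

```python
import numpy as np
import pandas as pd

# Hypothetical 5-point Likert responses; NaN marks a missing answer.
df = pd.DataFrame({
    "q1": [4, 5, np.nan, 2, 4],
    "q2": [3, np.nan, 3, 2, 5],
    "q3": [4, 4, 3, 1, np.nan],
})

# Option 1: item-median imputation (fill each gap with that item's median).
item_imputed = df.fillna(df.median())

# Option 2: assume 'no strong views' and fill with the neutral midpoint (3).
neutral_imputed = df.fillna(3)
```

Whichever option is chosen, it should be documented in the methods section, as noted above.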

It is possible that the data in the original paper were normally distributed, in which case an argument can be made for using parametric methods, but it seems like a bad idea in your case.

Good evening, sir. How should I use the 8-point Likert scale to compute the results of my questionnaire, taken from Mohammadi’s research? The questionnaire consisted of 27 items which asked learners to rate their replies on an 8-point Likert scale ranging from “Strongly Agree = 8” to “Strongly Disagree = 0”. It assessed learners’ views towards second language learning in five areas, namely self-image, inhibition, risk-taking, ego-permeability and ambiguity tolerance. Thank you.

I am afraid that I am not familiar with the research you are describing, so I cannot comment on it. The way you will use your data depends on your research questions – not your instrument, i.e. it depends on what you are trying to find out with your research.

This blog post has really sorted out most of my concerns regarding data analysis. But I still have a few queries, and the list is a little bit extensive.

I am doing research on perceptions of the credibility of online information and news.

1. I am asking respondents the degree to which they agree with a few statements, viz. “online information is believable”, on a scale of 1–10, where 1 means not believable at all and 10 means 100% believable. There are four similar statements, and their mean will make up the credibility index.
So should I consider it a simple rating scale or a Likert scale? And if it is a Likert scale, should the analysis be done by taking the median?
Previous research has used the mean to calculate the same.

2. There is another series of questions on a 1–10 scale, where I ask respondents to tell how often they do the following online: “check the information’s author”, “check the author’s credentials”, etc.
So how should I analyse this set of data? My inclination is towards calculating a mean of this set of data.


I am sorry, I am afraid I cannot answer this question because it’s too vague. To be able to give you useful advice, I’d need to know more about your research questions, your sample and your data. However, you might find some help in this post.

Very well written. I have trouble making people believe that using the average or a weighted average for Likert scales is not appropriate. They just say that their university lecturers taught them to do it this way. I wonder whether I should write to the university and tell them that they are wrong.

I did a survey about retargeting and want to investigate the influence of cookie knowledge and privacy concern on attitudes toward retargeting. I used one true/false scale (about cookie knowledge) and four Likert scales: attitude toward retargeting (6 questions, 5-point items), privacy concern (5 questions, 7-point items), attitude toward advertising in general (9 questions, 5-point items), and persuasion knowledge (6 questions, 5-point items).

I have already put them into SPSS and computed the median of each scale. Now I would like to investigate whether higher cookie knowledge means a more positive attitude toward retargeting, and whether age has an influence on this (e.g., do younger people have higher cookie knowledge, and thus a more or less positive attitude toward retargeting?).

I don’t know the right way to look into this, or what kind of analysis I need to use to get the right information. I hope you are able to help me.

One way to approach this problem would be to run a cross-tabulation. This would compare the actual responses of people in the ‘yes’ and ‘no’ categories, against what they might have responded if cookie knowledge did not influence responses. You can find instructions on how to do this here. You can confirm whether the difference is statistically significant by running a test called ‘chi-square’. To do this, you need to check the ‘chi-square’ box in the ‘Cross-tabs: statistics’ dialogue box (the second one in the webpage that I linked). Best of luck with your project.

Hello, thank you for your life-saving post! I tried to combine data for some questionnaire questions and summarise my data using the information above, but the median value is something like 2.5 or 1.5 for some items. My data are quantitative, 5-point Likert scale data. What should I do now? How should I interpret my data?

Thanks for your nice comments. Having a decimal in the median is not unusual, and you shouldn’t worry about that. Now, as for interpreting the data, it’s difficult to give advice without knowing more about your project and dataset.

The 2.5 median seems to suggest a balanced set of opinions. This could be because most people answered near the ‘centre’ or because responses were evenly split between very positive and very negative ones. The IQR could give you a clue about that.
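A toy example of how the IQR separates the two cases (both datasets are invented, and both have medians near the centre of a five-point scale):

```python
import numpy as np

def iqr(x):
    """Interquartile range: spread of the middle 50% of responses."""
    q1, q3 = np.percentile(x, [25, 75])
    return q3 - q1

centred = [2, 3, 3, 2, 3, 2, 3, 2]     # most answers near the middle
polarised = [1, 5, 1, 5, 1, 5, 1, 5]   # evenly split between the extremes
```

Here `iqr(centred)` is 1.0 while `iqr(polarised)` is 4.0, even though the medians of the two datasets are similar.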

Hi Sir Achilleas!
You’re a great help to all of us seeking answers to our questions.
I need your ideas on the tools to be used in an experimental research project entitled: Anti-pruritic activity of Plumeria acuminata (kalachuchi) bark latex in albino mice with induced local pruritus.
Five treatment groups (6 mice each) are compared: 2 control groups (positive control and negative control) and 3 experimental groups (50%, 75%, 100% solution). I want to find out if there is a significant difference between 2 groups, and also among the 5 groups. Should I go for parametric or non-parametric statistics?
Consider that a 4-point scale is used (0 = no pruritus, 1 = mild but not causing impairment, 2 = moderate, causing impairment, 3 = severe, causing sleepless nights), which is ordinal.
Another 4-point scale is used (0 = complete relief, 1 = significant improvement, 2 = mild improvement, 3 = no improvement).
Please enlighten me.
Thank you in advance and more power!

Hi! I just wanted to point out that the scale you are describing is not a Likert scale, so the caveats I’ve discussed in my post do not necessarily apply. Rather, this is an ordinal (or arguably interval) scale.

Your choice of statistical methods (parametric or non-parametric) will depend on whether you consider the scale to be ordinal or interval; it will also depend on the distribution of your data. In case of doubt, go for non-parametric procedures, just to be on the safe side.

Please, I need your opinion on this. I conducted a research project involving 668 respondents, using a questionnaire with a modified 4-point Likert scale. The questionnaire contained 50 items arranged into 10 sections. I analysed my data using frequency counts and percentages, which to me show more clarity. But an argument came up that the mean and standard deviation were the appropriate techniques for the kind of data I collected. Please, what’s your opinion? Also, what do you think about 4-point Likert scales? Thanks.

1. The mean and standard deviation could be argued to be appropriate, as long as your ten scales are internally consistent (i.e., have a high Cronbach’s alpha), and, ideally, if the responses are distributed normally. In such a case, these measures would give you a more refined picture of your data, compared to modes and IQRs. I would still argue that, on theoretical grounds, it’s wrong to use such measures with ordinal data, but there are convincing arguments (including Likert’s own thesis) that a well-crafted scale (i.e., a composite of multiple items) will produce data that behave as if they were interval/continuous – so it makes practical sense.

2. Personally, I prefer using 4-point scales, because they make a person choose a positive or negative response (they are sometimes called forced-choice scales). As a result, your data are less likely to show the effect of ‘central tendency bias’, i.e., the tendency among respondents to select the ‘neutral’ response. A possible counter-argument is that doing so will lose some of the granularity of the scale. So, I guess, this is not a question of choosing a ‘right’ or ‘wrong’ method, but rather the method that is a better fit to your research needs.

Thanks for your prompt response. However, based on your submission, theoretically speaking it is better to stick to the rules for analysing ordinal data. My worry is that I might not be able to present my report if it is analysed with the mean and standard deviation as well as with percentages and frequencies. In conclusion, what am I going to lose if I stick with my frequencies and percentages?

Dear Achilleas
You are absolutely amazing you have already saved so much of my time with some of your answers already.
I have one more small question left, though. I ran an employee engagement survey containing 12 factors, and each factor involved 3 items (36 items overall). They were measured on a 1-to-6-point scale. I am conducting a regression analysis and I would like to know how to find the value for each factor. For instance, vigour was measured with 3 items (Vigor1, Vigor2 and Vigor3), e.g. “I feel strong and vigorous when I’m studying”. What is the best way to calculate the value for vigour?
Thank you so much for your time.
Kind regards

Hi
I have gathered information on a 7-point Likert scale. Now, I would like to add a few questions under one of the sub-categories. When I did that using the average, I got results like 2.2, 3.6, 4.21, etc.; moreover, the descriptive results for such a sub-category do not map onto the 1–7-point satisfaction scale.
I would be grateful if anyone could help me in this context.

Hello Iftikhar. I have taken the liberty to edit the all caps in your question, which some people might construe as a sign of rudeness or, at the very least, indifference to the readers. I cannot help you with your question, because I do not understand what you are asking. Sorry.

Hello Achilleas…
Please help me find a solution to a problem I have faced while doing my research on the impact of employee engagement on turnover intention. There are positive and negative statements in my questionnaire under the turnover intention variable, which measure responses using a five-point Likert scale.
How can I use these data when analysing them in SPSS?


My purpose in using a 5-point Likert scale (5 = strongly agree, 4 = agree, 3 = not sure, 2 = disagree, 1 = strongly disagree) is to find out which students have high motivation and which have low motivation. The scale specifies that the higher the score, the higher the motivation. It consists of 30 items.

If I follow the nature of the 5 point likert scale, the range for qualitative interpretation is…

1.00-1.80 very low
1.81-2.60 low
2.61-3.40 average
3.41-4.20 high
4.21-5.00 very high

Since I used an established inventory, I followed the 5-point Likert scale instead of changing it to a 4-point Likert scale, which could have made this easier, since I would not have a middle or average range.

For my study, I am more interested in classifying the students into 2 groups rather than 5 groups. Here is my question: would it be best to just follow the 5 groupings, or is it OK if I make it 2 groups, with the range 1.00–3.00 as low and 3.01–5.00 as high?

My study only involves 14 students and this step of classifying them into 2 groups based on their level of motivation is just the first step in my data analysis. Thanks

If you are using an established inventory, it should come with instructions suggesting possible ways to interpret the data. Prima facie, I see no statistical reason why not to use two groupings rather than five, if it makes more sense in your study. The obvious downside is that it will make it harder to compare results with other researchers who have used the scale.

My main concern is that I am not at all sure you should be calculating a mean. My take is that these are ordinal data, and should not be subjected to this kind of analysis.

It is sad how you misinform people. First, calculating the mean of ordinal variables is not as problematic as you think; you can even run parametric tests on ordinal Likert scales. Please check the literature (one place to start, with some nice references, is Norman 2010). Second, the research question by Nazir Bano is perfectly valid. If you wanted to study the effect in real life, I would use mixed-effects modelling and data from standardised tests (like PISA), combined with observations and self-reports by the teacher.

You may find that I have referred to Norman (2010) elsewhere in this blog, and while I think it is a reasonably well-argued line of thinking, I am not sure how it contradicts what I have written here. Norman argues that parametric procedures are robust enough to withstand violations of their assumptions. This does not mean that such procedures are optimal practice, and Norman does not make this argument anyway.

As regards the second part of your comment, you seem to be missing that the question has no operational definitions of ‘effectiveness’ or ‘method’, it assumes that methods are applied consistently across time, and presupposes somehow controlling for any learning that takes place outside the classroom, among other problems. If you think you can answer such a question in any meaningful way, I suggest that you put your money where your mouth is and answer it.

No, calculating the mean in this ordinal scale is statistically wrong. You can use the median and mode, if you want a measure of central tendency; you can also present frequency distributions for every response.

Hi, Sir. I did not have any trouble understanding your discussion of Likert scales. I just want to leave this comment because you are cool – you answer the queries of my co-readers and co-researchers. Thank you, Sir. May God bless you, and may you help more people from now on.

Hi Achelleas,
Thank you for your tutorial on calculating the mean of ordinal data. I was looking for literature to support my advice to my student (that the mean score and standard deviation he calculated for a Likert scale had no meaning in the context of his study) when I came across your blog.
He surveyed 192 patients concerning quality of hospital service. The responses were: Strongly agree = 42; Agree = 104; Neutral =32; Disagree = 14, Strongly Disagree = 0. He calculated the mean of responses as 2.09 and Standard deviation as 0.82.
Obviously, the mean and SD are meaningless when interpreting these results. As suggested in your blog, I will advise the student to use the median and mode to describe the central tendency of the responses.
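For what it’s worth, both of those statistics can be read straight off the frequency counts. A quick sketch in Python, assuming the coding Strongly agree = 1 through Strongly disagree = 5 (which is consistent with the reported mean of 2.09):

```python
import numpy as np

# Expand the reported frequencies into individual responses.
# Assumed coding: SA=1, A=2, Neutral=3, D=4, SD=5.
responses = np.repeat([1, 2, 3, 4, 5], [42, 104, 32, 14, 0])

median = np.median(responses)            # the middle response
mode = np.bincount(responses).argmax()   # the most frequent response
```

Both come out as 2 (‘Agree’), which is far more interpretable than a mean of 2.09.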

Hi! You have stated that “The mathematical model needs these assumptions in order to work, but they are simply not in the questionnaire design. And even if we forced them into the questionnaire, that would constitute a gross distortion of psychological attitudes and the social world to fit our statistical mould.” Can you kindly elaborate on this or direct me to the appropriate resources on which these statements are based? Thanks.

You’d only do mean score ranking if you thought that the anchor points were equidistant, i.e., that something ‘poor’ is 20% of the quality of something ‘excellent’. If you make this assumption – and I don’t think you should make it – then you can use the formula described in the post, which produces a mean value of 3.94.

I’m grouping these statements according to 4 hypotheses. What is the best way to analyse the data? Should I just describe the distribution of each category by combining SA/A vs SD/D (and I don’t know what to do for each item), or should I group the statements for every hypothesis together and calculate how many people agree with all of these questions?

Yes, I understand that, but it is still unclear what you are trying to do: Are you trying to describe a population or test whether something is true? It’s difficult to be specific without knowing more about your research questions or sample, and I really think your supervisor should be able to give you better advice than I can.

Very broadly, if you’re trying to confirm/disprove a hypothesis, then you need to crosscheck how the responses to the Likert items (dependent variables) map out against the other variables like age, field of study and year of study (independent variables). You can do this by grouping your items into one or more multi-item scales (assuming they are cohesive enough) and running a t-test. Alternatively you could run a cross-tab and chi square test with your independent variables and each item.
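A sketch of the t-test route in Python, with invented summed scale scores for two groups defined by an independent variable (e.g., year of study):

```python
import numpy as np
from scipy.stats import ttest_ind

# Hypothetical summed multi-item scale scores for two groups.
year_1 = np.array([18, 22, 25, 20, 24, 19, 23])
year_2 = np.array([26, 28, 24, 30, 27, 29, 25])

t_stat, p_value = ttest_ind(year_1, year_2)
```

A small p-value (for these made-up numbers, well below .05) would suggest that the two groups differ on the underlying construct.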

I’m trying to find out whether the attitudes of the respondents are positive or negative, and whether the three variables I mentioned have an effect on that. What do you mean by ‘cohesive enough’? I tried the alpha test for internal reliability, but for some subscales I get an alpha < 0.7.

That’s unfortunate, as it means that the scale items do not really measure the same thing. Perhaps you can try again, after you remove selected items from the scales.

You could condense the responses to ‘positive’/’negative’ and do item-by-item cross-tabs, but I suspect that the low number of respondents and the large number of variables will prevent you from getting anything really conclusive.

Kind of, depending on what statistical procedures you plan to do. If you do cross-tabs and chi-square with multiple values for each variable, it may throw your statistics off. In that case, you’re probably better off combining values (not variables!). If you’re just doing descriptives, it should be OK.

I am sorry, I did not understand you very well here. What do you mean by combining values?

What I ended up doing is a normal descriptive analysis for each statement, showing the frequency of every answer. When commenting on the results, I combine SA with A vs SD with D, and I have a third, neutral category. I’ll treat every variable separately for each statement.

I am currently drafting my graduate thesis on knowledge management (KM) and climate resilience. The main objective of my research is to look at the relationship between these two concepts/constructs.

I have identified four variables for KM: gathering, storing, sharing and use; and three variables for resilience: buffer capacity, self-organisation and learning capacity.

Hi! Scales (1) and (2) are ordinal, because you can meaningfully place the values in a row from ‘least’ to ‘most’.

The final scale seems different: I assume that a drought is not ‘more’ or ‘less’ of a risk than an earthquake, just different. These scales are called ‘nominal’ or ‘categorical’. In this case, your analysis options are more limited: you can only use frequency counts and the mode as a measure of central tendency.

Thank you very much for the prompt response, sir. I just have a follow-up question: is it OK to have 0 in the scale? I read in some sources that 0 means ‘missing data’; if this is so, can I use 1 instead of 0 for ‘no formal education’?

And should the scales for both KM and climate resilience be equal? For example, should they both have a 5-point scale, or can I use a 5-point scale for KM and a 6-point scale for the climate resilience statements?

I have a sample size of 355 and want to compare the reasons for visiting a fast food outlet. I have listed 10 reasons, e.g. while travelling, for dinner, for spending time with friends and family, etc. I collected the data on a 5-point Likert scale, with 1 as never, 2 as rarely, 3 as sometimes, 4 as often and 5 as very often. What statistical tool should I apply? Should I compare the means or the mean ranks of the 10 reasons for visiting a fast food outlet?

I’m afraid that I cannot help you with this question because you have not told me what your research question is, i.e., what you’re trying to find out. Your supervisor may be a more appropriate person to seek advice from. They may also be able to advise you on the pragmatics of asking for assistance. Good luck with your project.