A Search for the optimum feedback scale

At Getfeedback, our responsibility to provide users with effective measurement tools requires us to keep abreast of the latest research on survey response scales. This article responds to the many queries we receive from HR professionals who must develop appropriate employee feedback or 360-degree feedback surveys for their organisations. To determine which scales best suit these purposes, the research evidence offered in support of the various response scales must be reviewed.

It is impossible to recommend a single perfect response scale, as the optimum scale must be moulded to the survey's requirements. When defining a scale to meet specific requirements there are several decisions to make: whether the scale is odd or even numbered, the number of points on the scale, the labelling of those points, and the general reliability and validity of the scale.

There is much debate as to whether even or odd numbered scales are more effective. Odd numbered scales are generally regarded as allowing for a ‘neutral’ option such as ‘neither agree nor disagree’ or ‘don’t care.’ Supporters of the neutral point argue that a ‘don’t know’ option ensures that respondents do not manufacture opinions on the spot. Advocates of even numbered scales counter that in reality people are never neutral on issues and always have an opinion, even if they had not previously conceived of it. Moser and Kalton (1972:344)[1] argue that ‘there is clearly a risk in suggesting a non-committal answer to the respondent,’ believing that a mid-point allows respondents to ‘opt out,’ which in turn produces uninformative data. The advantage of a scale which forces a view is that, when pooled with all other responses, it provides a much needed benchmark.

The problem with ‘forced choice’ scales is that they tend to skew the overall results positively. In a forced choice situation respondents prefer to ‘be nice,’ rating items positively rather than negatively, i.e. choosing ‘somewhat agree’ rather than ‘somewhat disagree.’ It is possible, however, to counteract this ‘halo effect’ (where respondents with an overall feeling of like or dislike give high or low ratings to all features) by alternating the direction of successive ratings. A second way of reducing positive bias, discussed in greater detail below, is to word points positively so that criticism seems less negative.
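Alternating the direction of successive ratings only works if the negatively worded items are reverse-coded before scores are aggregated. As a minimal sketch of that step (the item positions, 5-point maximum, and function names here are illustrative assumptions, not a Getfeedback specification):

```python
def reverse_score(score, scale_max=5):
    """Reverse-code a rating on a 1..scale_max scale (5 -> 1, 4 -> 2, ...)."""
    return scale_max + 1 - score

def normalise_responses(scores, reversed_items, scale_max=5):
    """Return scores with negatively worded items reverse-coded,
    so that a high value always means a favourable rating."""
    return [reverse_score(s, scale_max) if i in reversed_items else s
            for i, s in enumerate(scores)]

# Suppose items at positions 1 and 3 were worded negatively:
print(normalise_responses([5, 5, 4, 1], reversed_items={1, 3}))  # [5, 1, 4, 5]
```

Once all items point the same way, a respondent who rates everything at the top of the printed form no longer produces a uniformly high aggregate score, which is what dilutes the halo effect.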

Of equal importance to positive response bias is the ethical concern that forced choice responses raise. Researchers have contended that it is somewhat unethical to force responses in this way. Rugg and Cantril (1944:33)[1] argued for the middle alternative ‘in that it provides for an additional graduation of opinion.’ Indeed, surveys with a neutral option have been found to achieve a higher response rate, possibly indicating that respondents feel more comfortable using them. Ultimately, whether an even or odd numbered scale is used depends on the requirements of the research: if, for example, it is necessary to distinguish satisfied from dissatisfied customers, an even numbered scale would be more suitable as it forces every response to be either positive or negative.

In addition to the above debate, there is also a methodological problem with the central point of a Likert-type scale: it can be ambiguous. It may imply a neutral position, i.e. that the respondent has no opinion, or it may mean that the respondent is torn between feelings in both directions. Partly as a consequence, overall scores near the centre of the distribution are also ambiguous: they could be composed of a large number of ‘undecided’ answers, or they could be a combination of ‘strongly for’ and ‘strongly against’ answers. In addition, analysis of results which include a mid-point may not expose the fact that respondents were answering in a ‘devil may care’ fashion.

A second decision in defining an optimal response scale is the number of points to use. The number of response options affects the scale’s reliability (its ability to provide the same feedback regardless of which sample of the population you select) and discriminability (‘the ability to discriminate between degrees of the respondents’ perceptions of an item’[2]). Cohen (1983)[2] concluded that a minimum of three points is necessary, whilst a maximum of nine points can be used effectively (Bass, Cascio & O’Connor, 1974)[2].
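Reliability in this sense is commonly estimated with Cronbach's alpha, which compares the variance of individual items with the variance of respondents' total scores. A minimal sketch, assuming equal-length response rows and invented demonstration data (none of this is a method the studies cited above prescribe):

```python
from statistics import pvariance

def cronbach_alpha(responses):
    """Cronbach's alpha for a list of respondents,
    each given as a list of item scores."""
    k = len(responses[0])                      # number of items
    items = list(zip(*responses))              # transpose to per-item columns
    item_vars = sum(pvariance(col) for col in items)
    total_var = pvariance([sum(row) for row in responses])
    return k / (k - 1) * (1 - item_vars / total_var)

# Four respondents rating three items on a 5-point scale (invented data):
print(round(cronbach_alpha([[4, 5, 4], [3, 3, 3], [5, 5, 4], [2, 3, 2]]), 2))  # 0.96
```

Values near 1 indicate that the items move together across respondents; a scale whose reliability collapses when the number of points changes would show up as a drop in this statistic.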

Scales of ten points or more tend to be employed less frequently, as it is usually difficult to make distinctions as fine as a 10-point scale requires, and the larger the number of choices offered, the harder the scale is for respondents to use. Although a higher number of points may seem to gather more discriminating data, there is some debate as to whether respondents actually discriminate carefully enough to make these scales valuable. Overall, the extreme categories are found to be under-used. It is, however, possible to counteract this by making end points sound less extreme or (particularly for a 10-point scale) by pooling responses from a group of end categories.

It is common to find that 10-point scales are condensed into three or five point scales for reporting purposes. It would therefore seem simpler to use a four or five point scale in the first place, especially as five points fit neatly with the five statements of the semantic scale ranging from ‘very good’ to ‘very poor’; such a scale yields a good distribution of responses and enables researchers to pick out differences of opinion easily. At the other end of the spectrum, two and three point scales have little discriminative value and are therefore rarely recommended for satisfaction research.
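Condensing a 10-point scale into five reporting bands can be done by pairing adjacent points. A sketch of that mapping (the 1–2, 3–4, … pairing is one plausible convention, not a rule drawn from the research above):

```python
def collapse_to_five(score):
    """Map a 1-10 rating onto a 1-5 band by pairing adjacent points."""
    if not 1 <= score <= 10:
        raise ValueError("score must be between 1 and 10")
    return (score + 1) // 2

print([collapse_to_five(s) for s in range(1, 11)])  # [1, 1, 2, 2, 3, 3, 4, 4, 5, 5]
```

Because the final report only ever distinguishes five bands, the extra granularity of the 10-point instrument is discarded, which is the argument for starting with four or five points.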

Chang’s (1994)[2] overview of previous research found that a variety of conclusions had been drawn on the issue of reliability: that reliability is independent of the number of points on the scale, or that it is maximised by a 7-point, or 5-point, or 4-point, or 3-point scale! Chang argues that there are two issues to consider for reliability: respondents’ knowledge of the subject and the similarity of their frames of reference. He believes that the more response options are available, the greater the likelihood of error, as respondents’ frames of reference are likely to differ over the meanings of the points. There are almost always problems in defining the end points of scales relating to, for instance, ‘honesty,’ as different respondents may use different frames of reference unless they are informed of the purpose of the rating procedure. Similarly, users of 360-degree feedback will have different views of what constitutes ‘excellent’ behaviour depending on their role within the organisation. In the same way, an Olympic athlete will have a better idea of setting stretching performance targets than most employees, because Olympic performance depends on being the best.

It is often the case in 360-degree and employee feedback that respondents do not have the necessary information to comment on other people’s behaviour, so a ‘not able to rate/insufficient evidence’ category must be included. Chang suggests that if respondents lack knowledge about the subject being surveyed then they will overuse the end points of a longer scale.

Research into a 3-point scale (consisting of ‘strength,’ ‘adequate,’ and ‘development needed’) has been undertaken at Getfeedback. Feedback from respondents indicated that they felt heavily constrained with only three categories to choose from and, in line with the positive response bias, were reluctant to use ‘development needed’ on more than a few occasions. The 3-point scale was employed on a questionnaire measuring 5 competences with 7 questions for each competence. Results showed a high degree of positive skew: 65% of responses were ‘adequate,’ 25% were ‘strength,’ whilst only 10% were ‘development needed.’ This exemplifies the power of label descriptions; the tendency to respond positively in this instance made the 3-point scale an inadequate measuring tool.
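A distribution like the one reported above is straightforward to reproduce for any batch of ratings. A sketch that tallies responses into percentage shares (the sample data below are constructed to mirror the 65/25/10 split, not the actual study responses):

```python
from collections import Counter

def distribution(ratings):
    """Return each label's share of the responses as a percentage."""
    counts = Counter(ratings)
    total = len(ratings)
    return {label: 100 * n / total for label, n in counts.items()}

sample = ["adequate"] * 13 + ["strength"] * 5 + ["development needed"] * 2
print(distribution(sample))
# {'adequate': 65.0, 'strength': 25.0, 'development needed': 10.0}
```

Tallying responses this way makes positive skew visible at a glance: a healthy scale should spread responses across categories rather than concentrating them in the middle, positive ones.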

Further research at Getfeedback has shown that highly detailed descriptions of response options are more effective than vaguer ones. For example, a scale such as:

5 Consistently exhibits exceptional behaviour. Is an inspiration to colleagues.

4 Always exhibits behaviour and is at times exceptional

3 Usually exhibits behaviour with an effective outcome

2 Sometimes exhibits behaviour effectively — development would improve consistency of the behaviour

1 Rarely/never exhibits behaviour — significant development required
n/a Not able to rate

is more useful to respondents than ‘1- True, 2- Inclined to be true, 3- Inclined to be false, 4- False.’

The labels of each point can influence the reliability and discriminability of a scale. For example, definitions can be written to offer more positive than negative options, resulting in skewed data, so care must be taken to avoid bias of this kind. Problems arising from differing frames of reference can be reduced by using only two labels to anchor the end points, leaving an equal number of unlabelled intervals (represented by digits) between them. While this may help, it is of greater importance to ensure that respondents understand the meaning of each point on the scale, which seems possible only if every point is labelled. Not only does this avoid misinterpretations of numerical points, it also allows the report to be written in concrete, pre-determined terms.

Labels affect the validity of the survey regardless of the number of labels employed. To determine the validity of a scale (whether or not the questions are relevant to what is being tested and to the objectives of your survey), the survey must be tested and re-tested. A well-established scale that has been in use for many years will provide more valid data, so when defining an optimum response scale it is important to consider scales that are frequently used. The Mayflower organisation[2], which regularly implements surveys, has recommended four five-point scales found to be especially effective:

1) Far too much, too much, about right, too little, far too little.

2) Much higher, higher, about the same, lower, much lower.

3) One of the best, above average, average, below average, one of the worst.

4) Very good, good, fair, poor, very poor.

In ‘Choosing the Right Scale’[3] (1995), NCS Pearson Inc. recommends two scales for the measurement of requirement and expectation (used in mail surveys). The four-point requirement scale is as follows:

4 Exceeded

3 Met

2 Nearly met

1 Missed

This scale is recommended for its reliability and discriminability, and is particularly suitable for dissatisfied respondents, who prefer to use more positive terms such as ‘nearly met.’ Similarly reliable and discriminating is the five-point expectations scale:

5 Significantly above

4 Above

3 Met

2 Below

1 Significantly below

In determining which scale is best for 360-degree and employee feedback surveys, the important issues to consider are:

Reliability: Lissitz and Green (1975) suggest that reliability starts to level off after 5 points, so 5-point Likert-type scales are the most reliable.

Discriminability: again, Likert-type scales are highly recommended, as they offer enough information to discriminate between participants’ differing viewpoints.

Validity: evidence has shown that the most valid scales are those that have been employed effectively for some time.

The even vs. odd numbered scales debate: evidence indicates that ‘forced choice’ scales are really only suitable in certain circumstances (such as customer satisfaction surveys). It is important to have an equal number of positive and negative points to choose from, as well as a neutral option for respondents who do not feel they can make an informed decision.

Labelling: all points should be labelled, to avoid confusion and minimise error due to differing frames of reference. The power of labels must not be underestimated; concise detail and positive wording (i.e. ‘nearly met’ instead of ‘poor’) are vital to the success of the survey.

Finally, although a 5-point Likert-type scale seems the optimum response scale for 360-degree and employee feedback, whatever scale is employed must relate to what is being surveyed and be as free from bias as possible. Only when all of these points have been considered will the end result be the best response scale for your survey.