
A rating scale is a set of categories designed to elicit information about a quantitative attribute in social science. Common examples are the Likert scale and 1-10 rating scales for which a person selects the number which is considered to reflect the perceived quality of a product.

Background

In psychometrics, rating scales are often referenced to a statement that expresses an attitude or perception toward something. The most common example of such a rating scale is the Likert scale, in which a person is asked to select a category label from a list indicating the extent of disagreement or agreement with a statement.

The basic feature of any rating scale is that it consists of a number of categories, which are usually assigned integers. A Likert scale, for example, might be used as follows.

Statement: I could not live without my iPod.

Response options:

1. Strongly Disagree

2. Disagree

3. Agree

4. Strongly Agree

It is common to treat the numbers obtained from a rating scale directly as measurements by calculating averages or, more generally, by performing arithmetic operations on them. Doing so is not, however, justified. In terms of the levels of measurement proposed by S.S. Stevens, the data are ordinal categorisations. This means, for example, that agreeing strongly with the above statement implies a more favourable perception of iPods than merely agreeing with it. However, the numbers are not interval-level measurements in Stevens' schema, which means that equal differences between the numbers do not necessarily represent equal differences in the degree to which one values iPods. For example, the difference between strong agreement and agreement is not necessarily the same as the difference between disagreement and agreement. Strictly, even demonstrating that categories are ordinal requires empirical evidence based on patterns of responses (Andrich, 1978).
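The point about ordinal data can be illustrated with a small sketch. The two codings below are hypothetical: both preserve the order of the four response categories, but the second assigns a larger number to Strongly Agree. The responses for the two groups are likewise invented for illustration. Because only the order of the categories is meaningful, any conclusion that changes between the two codings is an artefact of the numbers chosen.

```python
from statistics import mean, median

# Two hypothetical codings of the same ordered categories.
# Both preserve the order; only the spacing between the numbers differs.
EQUAL_STEPS = {"Strongly Disagree": 1, "Disagree": 2, "Agree": 3, "Strongly Agree": 4}
STRETCHED_TOP = {"Strongly Disagree": 1, "Disagree": 2, "Agree": 3, "Strongly Agree": 10}

# Invented responses to the statement from two groups of respondents.
group_a = ["Disagree", "Disagree", "Strongly Agree"]
group_b = ["Agree", "Agree", "Agree"]


def summarise(responses, coding):
    """Return (mean, median) of the responses under the given coding."""
    scores = [coding[r] for r in responses]
    return mean(scores), median(scores)


for name, coding in [("equal steps", EQUAL_STEPS), ("stretched top", STRETCHED_TOP)]:
    mean_a, median_a = summarise(group_a, coding)
    mean_b, median_b = summarise(group_b, coding)
    print(f"{name}: mean A > mean B is {mean_a > mean_b}, "
          f"median A > median B is {median_a > median_b}")
```

Under these assumptions the comparison of means reverses between the two codings, while the comparison of medians does not, since the median depends only on the order of the categories.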

More than one rating scale is required to measure an attitude or perception, because the polytomous Rasch model for ordered categories requires statistical comparisons between the categories (Andrich, 1978). In terms of classical test theory, more than one question is required to obtain an index of internal reliability such as Cronbach's alpha (Cronbach, 1951), which is a basic criterion for assessing the effectiveness of a rating scale and, more generally, a psychometric instrument.
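As a minimal sketch of why several questions are needed, Cronbach's alpha can be computed with the standard formula alpha = k / (k - 1) * (1 - sum of item variances / variance of total scores), where k is the number of items. The function name cronbach_alpha and the response matrix are illustrative assumptions, not part of any library or of the cited papers. The calculation treats the category codes as numeric scores, as classical test theory does.

```python
from statistics import variance


def cronbach_alpha(rows):
    """Cronbach's alpha for a matrix with one row per respondent
    and one column per item (hypothetical helper)."""
    k = len(rows[0])
    if k < 2:
        # With a single item there is nothing to compare it against.
        raise ValueError("alpha needs at least two items")
    items = list(zip(*rows))                      # one tuple per item
    item_variance_sum = sum(variance(col) for col in items)
    total_variance = variance([sum(row) for row in rows])
    return (k / (k - 1)) * (1 - item_variance_sum / total_variance)


# Invented data: five respondents answering three items coded 1 to 4.
responses = [
    [4, 3, 4],
    [2, 2, 1],
    [3, 3, 3],
    [1, 2, 1],
    [4, 4, 3],
]
alpha = cronbach_alpha(responses)
print(round(alpha, 3))
```

Passing a matrix with a single column raises an error in this sketch, which mirrors the point that one question per respondent gives no basis for an internal reliability index.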

In almost all cases, online rating scales only allow one rating per user per product, though there are exceptions such as Ratings.net, which allows users to rate products in relation to several qualities. Most online rating facilities also provide few or no qualitative descriptions of the rating categories, although again there are exceptions such as Yahoo! Movies, which labels each of the categories between F and A+, and BoardGameGeek, which provides explicit descriptions of each category from 1 to 10. Often, only the top and bottom categories are described, such as on IMDb's online rating facility.

With each user rating a product only once, for example in a category from 1 to 10, there is no means for evaluating internal reliability using an index such as Cronbach's alpha. It is therefore impossible to evaluate the validity of the ratings as measures of viewer perceptions. Establishing validity would require establishing both reliability and accuracy (i.e. that the ratings represent what they are supposed to represent).

Another fundamental issue is that online ratings usually involve convenience sampling much like television polls, i.e., they represent only the conglomeration of those inclined to submit ratings.

Sampling is one factor that can lead to results that have a specific bias or are only relevant to a specific subgroup. To illustrate the importance of such factors, consider an example. Suppose that a film's marketing strategy and reputation are such that 90% of its audience are attracted to the particular kind of film; i.e. it does not appeal to a broad audience. Suppose also that the film is very popular among the audience that does see the film and, in addition, that those who feel most strongly about the film are inclined to rate the film online. This combination may lead to very high ratings of the film which do not generalize beyond the people who actually see the film (or possibly even beyond those who actually rate it).
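The film example can be made concrete with a small sketch. All counts and rating values below are invented: viewers attracted to this kind of film are assumed to choose category 9, everyone else category 4, and the three groups (general public, audience, online raters) are assumed to differ only in their mix of the two. The sketch compares the share of ratings of 8 or higher rather than an average, in keeping with the ordinal nature of the data.

```python
from collections import Counter

# Hypothetical rating each kind of viewer would give on a 1 to 10 scale.
FAN_RATING = 9
NON_FAN_RATING = 4


def share_high(n_fans, n_non_fans, threshold=8):
    """Proportion of ratings at or above the threshold category."""
    ratings = Counter({FAN_RATING: n_fans, NON_FAN_RATING: n_non_fans})
    total = sum(ratings.values())
    high = sum(count for rating, count in ratings.items() if rating >= threshold)
    return high / total


# Invented group compositions (fans, non-fans).
general_public = share_high(100, 900)  # few people like this kind of film
audience = share_high(90, 10)          # 90% of viewers are attracted to it
online_raters = share_high(60, 2)      # those who feel strongly submit ratings

for label, share in [("general public", general_public),
                     ("audience", audience),
                     ("online raters", online_raters)]:
    print(f"{label}: {share:.0%} rate the film 8 or higher")
```

With these made-up numbers the share of high ratings is 10% in the general public, 90% in the audience and about 97% among online raters, so the published ratings describe the raters rather than the wider population.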

Qualitative description of categories is an important feature of a rating scale. For example, if only the points 1-10 are given without description, some people may select 10 rarely whereas others may select the category often. If, instead, "10" is described as "near flawless", the category is more likely to mean the same thing to different people. This applies to all categories, not just the extreme points. Even with category descriptions, some may be harsher raters than others. Rater harshness is also a consideration in marking essays in educational contexts [1].

These issues are compounded when aggregated statistics such as averages are used for lists and rankings of products. User ratings are at best ordinal categorizations. While it is not uncommon to calculate averages or means for such data, doing so cannot be justified, because an average is meaningful only if equal intervals between the numbers represent equal differences between levels of perceived quality. The key problems with aggregate data based on the kinds of rating scales commonly used online are as follows:

Averages should not be calculated for data of the kind collected.

It is usually impossible to evaluate the reliability or validity of user ratings.

Products are not compared with respect to explicit, let alone common, criteria.

Only users inclined to submit a rating for a product do so.

Data are usually not published in a form that permits evaluation of the product ratings.