Ringo provides a range of functions apart from making recommendations.
For example, when rating an artist or album, a person can also write a
short review, which Ringo stores. Two actual reviews entered by users
are shown in Figure 4. Notice that the authors of
these reviews are free to decide whether to sign these reviews or keep
them anonymous. When a user is told to try or to avoid an artist, any
reviews for that artist written by similar users are provided as well.
Thus, rather than a single ``thumbs-up, thumbs-down'' review being given
to the entire audience, each user receives personalized reviews from
people who have similar tastes.

Tori Amos has my vote for the best artist ever. Her lyrics and music
are very inspiring and thought provoking. Her music is perfect for
almost any mood. Her beautiful mastery of the piano comes from her
playing since she was two years old. But, her wonderful piano
arrangements are accompanied by her angelic yet seductive voice. If you
don't have either of her two albums, I would very strongly suggest
that you go, no better yet, run down and pick them up. They have been
a big part of my life and they can do the same for others. ---- user@place.edu

I'd rather dive into a pool of dull razor blades than listen to Yoko
Ono sing. OK, I'm exaggerating. But her voice is *awful* She ought to
put a band together with Linda McCartney. Two Beatles wives with
little musical talent.

Figure 4:
Two sample reviews written by users.

In addition, Ringo offers other miscellaneous features which increase
the appeal of the system. Users may add new artists and
albums into the database. This feature was responsible for the
growth of the database from 575 artists at inception to over 2500
artists in the first 6 weeks of use. Ringo,
upon request, provides a dossier on any artist. The dossier includes a
list of that artist's albums and the straight averages of the scores given
to the artist and the artist's albums. It also includes any additional
history about the artist, which can be submitted by any user. Users can also
view a ``Top 30'' and ``Bottom 30'' list of the most highly and most poorly
rated artists, on average. Finally, users can subscribe to a periodic
newsletter keeping them up to date on changes and developments in
Ringo.

ALGORITHMS AND QUANTITATIVE RESULTS

Ringo became available to the Internet public on July 1, 1994. The
service was originally advertised on only four specialized USENET
newsgroups. After a slow start, the number of people using Ringo grew
quickly. Word of the service spread rapidly as people told their
friends, or sent messages to mailing lists. Ringo reached the
1000-user mark in less than a month, and had 1900 users after 7 weeks.
At the time of this writing (September 1994), Ringo has 2100 users and
processes almost 500 messages a day.

Like the membership, the size of the database grew quickly.
Originally, Ringo had only 575 artists in its database. As we soon
discovered, users were eager to add artists and albums to
the system. At the time of this writing, there are over 3000 artists
and 9000 albums in Ringo's database.

Thanks to this overwhelming user interest, we have an
enormous amount of data on which to test various social information
filtering algorithms. This section discusses four algorithms that were
evaluated and gives more details about the ``winning'' algorithm.
For our tests, the profiles of 1000 people were considered. A profile
is a sparse vector of the user's ratings for artists. 1,876 different
artists were represented in these profiles.

To test the different algorithms, 20% of the ratings in each person's
profile were randomly removed. These removed ratings comprised the
target set; the remaining 80% formed the source set. To evaluate each
algorithm, we predicted a value
for each rating in the target set, using only the data in the source
set. Three such target sets and data sets were randomly created and
tested, to check for consistency in our results. For brevity, the
results from the first set are presented throughout this paper, as
results from all three sets only differed slightly.
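This evaluation setup can be sketched as follows (a Python illustration, not the original implementation; the `profiles` dictionary mapping each user to a dictionary of artist ratings is an assumed representation):

```python
import random

def split_profiles(profiles, target_fraction=0.2, seed=0):
    # Hold out a random 20% of each user's ratings as the target set;
    # the remaining 80% become the source set used for prediction.
    rng = random.Random(seed)
    source, target = {}, {}
    for user, ratings in profiles.items():
        artists = list(ratings)
        rng.shuffle(artists)
        k = max(1, round(target_fraction * len(artists)))
        held_out = set(artists[:k])
        target[user] = {a: ratings[a] for a in held_out}
        source[user] = {a: r for a, r in ratings.items()
                        if a not in held_out}
    return source, target
```

Repeating the split with different seeds yields the three independent target/source pairs used to check consistency.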

In the source set, each person rated on average 106 artists of the
1,876 possible. The median number of ratings was 75, and the most
ratings by a single person was 772! The mean score of a profile,
i.e., the average score a user gave across all rated artists, was 3.7.

Evaluation Criteria

The following criteria were used to evaluate each prediction scheme:

The mean absolute error of the predictions, $\bar{|E|}$, should be
minimized. The lower the mean absolute error, the more accurate the
scheme. We cannot expect to lower $\bar{|E|}$ below the error inherent
in people's own ratings of artists: if one provides the same list of
artists to a person at different points in time, the resulting ratings
will differ to some degree. The size of this inherent error has not
yet been measured, but we would expect it to be at least $\pm 1$ unit
on the rating scale, since ratings are given in whole units.

The standard deviation $\sigma$ of the errors should also be
minimized. The lower the deviation, the more consistently accurate
the scheme is.

Finally, $T$, the percentage of target values for which the scheme is
able to compute predictions, should be maximized. Some algorithms may
not be able to make predictions in all cases.
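A minimal sketch of these three criteria (Python; the function name and the convention of marking an unpredictable target with `None` are illustrative assumptions):

```python
import math

def evaluate(predictions, actual):
    # Signed errors (predicted - actual) for every target rating the
    # scheme managed to predict; None marks "no prediction possible".
    errors = [predictions[k] - actual[k]
              for k in actual if predictions.get(k) is not None]
    mae = sum(abs(e) for e in errors) / len(errors)   # mean absolute error
    mean_e = sum(errors) / len(errors)
    sigma = math.sqrt(sum((e - mean_e) ** 2 for e in errors)
                      / len(errors))                  # std. dev. of errors
    coverage = len(errors) / len(actual)              # the criterion T
    return mae, sigma, coverage
```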

Base Case Algorithm

A point of comparison is needed in order to measure the quality of
social information filtering schemes in general. As a base case, for each artist
in the target set, the mean score that artist received in the source
set is used as the predicted score. A social information filtering
algorithm is neither personalized nor accurate unless it is a
significant improvement over this base case approach.
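The base case can be sketched in a few lines (Python; `source` is assumed to map each user to a dictionary of artist ratings):

```python
def base_case_predict(source, artist):
    # Predict the straight mean of every score the artist received in
    # the source set, regardless of which user is asking.
    scores = [ratings[artist]
              for ratings in source.values() if artist in ratings]
    if not scores:
        return None  # nobody in the source set rated this artist
    return sum(scores) / len(scores)
```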

Figure 5 depicts the distribution of the errors $E$. $\bar{|E|}$ is
1.3, and the standard deviation $\sigma$ is 1.6.
The distribution has a nice bell curve shape about 0, which is what was
desired. At first glance, it may seem that this mindless scheme does
not behave too poorly. However, let us now restrict our examination to
the extreme target values, where the score is 6 or greater or 2 or
less. These values, after all, are the critical points.
Users are most interested in suggestions of items they would love or
hate, not of items about which they would be ambivalent.

Figure 5: The distribution of errors in predictions of the Base
Algorithm.

The distribution of errors for extreme values is shown by the dark
gray bars in Figure 5. The mean error and standard
deviation worsen considerably, with $\bar{|E|} = 1.8$ and $\sigma = 2.0$.
Note the lack of the desired bell curve shape. It is in fact
the sum of two bell curves. The right hill is mainly the errors for
those target values which are 2 or less. The left hill is mainly the
errors for those target values which are 6 or greater.

For the target values 6 or greater, the mean absolute error is much
worse, with $\bar{|E|} = 2.1$.
Why the great discrepancy in
error characteristics between all values and only extreme values?
Analysis of the database indicates that the mean score for each artist
converges to approximately 4. Therefore, this scheme performs well in
cases where the target value is near 4. However, for the areas of
primary interest to users, the base algorithm is useless.

Social Information Filtering Algorithms

Four different social information filtering algorithms were evaluated.
Due to space limitations, the algorithms are described here only
briefly. Exact mathematical descriptions as well as a more detailed
analysis of the algorithms can be found in [7].

The Mean Squared Differences Algorithm. The first algorithm
measures the degree of dissimilarity between two user profiles,
$U_x$ and $U_y$, by the mean squared difference between the two
profiles, taken over the set $A_{xy}$ of artists that both users have
rated:

\[ d(U_x, U_y) = \frac{1}{|A_{xy}|} \sum_{a \in A_{xy}} \bigl(U_x(a) - U_y(a)\bigr)^2 \]

Predictions can then be made by considering all users whose
dissimilarity to the user is less than a certain threshold $L$ and
computing a weighted average of the ratings provided by these most
similar users, where the weights are inversely proportional to the
dissimilarity.
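A sketch of this algorithm (Python; the profile dictionaries, the epsilon guard against division by zero, and the default threshold value are illustrative assumptions, not the original implementation):

```python
def msd(ux, uy):
    # Mean squared difference over the artists rated by both users;
    # None when the two profiles do not overlap at all.
    common = set(ux) & set(uy)
    if not common:
        return None
    return sum((ux[a] - uy[a]) ** 2 for a in common) / len(common)

def msd_predict(source, user, artist, L=2.0):
    # Weighted average over users whose dissimilarity is below the
    # threshold L, with weights inversely proportional to the
    # dissimilarity (a small epsilon avoids dividing by zero when
    # two profiles agree exactly).
    eps = 1e-6
    num = den = 0.0
    for other, ratings in source.items():
        if other == user or artist not in ratings:
            continue
        d = msd(source[user], ratings)
        if d is None or d >= L:
            continue
        w = 1.0 / (d + eps)
        num += w * ratings[artist]
        den += w
    return num / den if den else None
```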

The Pearson Algorithm. An alternative approach is to use the
standard Pearson $r$ correlation coefficient to measure similarity
between user profiles:

\[ r_{xy} = \frac{\sum_{a \in A_{xy}} (U_x(a) - \bar{U}_x)(U_y(a) - \bar{U}_y)}{\sqrt{\sum_{a \in A_{xy}} (U_x(a) - \bar{U}_x)^2} \sqrt{\sum_{a \in A_{xy}} (U_y(a) - \bar{U}_y)^2}} \]

where the sums run over the artists $A_{xy}$ rated by both users and
$\bar{U}_x$, $\bar{U}_y$ are the two users' mean ratings. This
coefficient ranges from $-1$, indicating a negative correlation,
through $0$, indicating no correlation, to $+1$, indicating a positive
correlation between two users. Again,
predictions can be made by computing a weighted average of other
users' ratings, where the Pearson $r$ coefficients are used as the
weights. In contrast with the previous algorithm, this algorithm makes
use of negative correlations as well as positive correlations to make
predictions.
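The coefficient can be sketched as follows (Python, with the same illustrative profile dictionaries as above; returning `None` on insufficient overlap or zero variance is an assumed convention):

```python
import math

def pearson_r(ux, uy):
    # Standard Pearson r over the artists rated by both users;
    # None when there is too little overlap or no variance.
    common = sorted(set(ux) & set(uy))
    if len(common) < 2:
        return None
    mx = sum(ux[a] for a in common) / len(common)
    my = sum(uy[a] for a in common) / len(common)
    num = sum((ux[a] - mx) * (uy[a] - my) for a in common)
    den = math.sqrt(sum((ux[a] - mx) ** 2 for a in common) *
                    sum((uy[a] - my) ** 2 for a in common))
    return num / den if den else None
```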

The Constrained Pearson $r$ Algorithm. Close inspection of the
Pearson $r$ algorithm and the coefficients it produced prompted us to
test a variant which takes the positivity and negativity of ratings
into account. Since the scale of ratings is absolute, we ``know''
that values below 4 are negative, while values above 4 are positive.
We modified the Pearson $r$ scheme so that the correlation coefficient
increases only when both people have rated an artist positively (above
4) or both negatively (below 4). More specifically, the standard
Pearson $r$ equation was altered to become:

\[ r_{xy} = \frac{\sum_{a \in A_{xy}} (U_x(a) - 4)(U_y(a) - 4)}{\sqrt{\sum_{a \in A_{xy}} (U_x(a) - 4)^2} \sqrt{\sum_{a \in A_{xy}} (U_y(a) - 4)^2}} \]

where the sums run over the artists $A_{xy}$ rated by both users.

To produce recommendations for a user, the constrained Pearson $r$
algorithm first computes the correlation coefficient between that user
and all other users. Then all users whose coefficient is greater than
a certain threshold $L$ are identified. Finally, a weighted average of
the ratings of those similar users is computed, where the weight is
proportional to the coefficient. This algorithm does not make use of
negative ``correlations'' as the Pearson $r$ algorithm does. Analysis
of the constrained Pearson $r$ coefficients showed that there are few
very negative coefficients, so including them makes little difference.
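A sketch of the constrained coefficient (Python; it replaces each user's mean with the fixed scale midpoint 4, so a product term is positive only when both users rate an artist on the same side of the midpoint):

```python
import math

def constrained_pearson_r(ux, uy, mid=4.0):
    # Measure deviations from the rating scale's fixed midpoint rather
    # than from each user's own mean: the coefficient grows only when
    # both ratings of an artist fall on the same side of 4.
    common = set(ux) & set(uy)
    if not common:
        return None
    num = sum((ux[a] - mid) * (uy[a] - mid) for a in common)
    den = math.sqrt(sum((ux[a] - mid) ** 2 for a in common) *
                    sum((uy[a] - mid) ** 2 for a in common))
    return num / den if den else None
```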

The Artist-Artist Algorithm. The preceding algorithms deal with
measuring and employing similarities between users. Alternatively,
one can use correlations between artists or albums to generate
predictions. The idea is simply an inversion of the previous three
methodologies. Say Ringo needs to predict how a user, Murray, will
like ``Harry Connick, Jr.'' Ringo examines the artists that Murray
has already rated and weighs each one with respect to its degree of
correlation with ``Harry Connick, Jr.'' The predicted rating is then
simply a weighted average of Murray's scores for those artists. An
implementation of such a scheme using the constrained Pearson $r$
correlation coefficient was evaluated.
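The inversion can be sketched as follows (Python; `artist_sim(a, b)` stands in for a precomputed constrained Pearson $r$ correlation between two artists and is an assumed helper, not part of the original system):

```python
def artist_artist_predict(source, user, artist, artist_sim):
    # Weighted average of the user's existing ratings, where each
    # already-rated artist is weighted by its correlation with the
    # target artist; negative or missing correlations are ignored.
    num = den = 0.0
    for rated, score in source[user].items():
        w = artist_sim(rated, artist)
        if w is None or w <= 0:
            continue
        num += w * score
        den += w
    return num / den if den else None
```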

Results

A summary of our results (for different values of the threshold $L$)
is presented in Table 1. More details can be found in [7]. Overall,
the constrained Pearson $r$ algorithm performed best on our dataset
when we take into account both the accuracy of its predictions and the
percentage of target values it can predict. The mean squared
differences and artist-artist algorithms may perform slightly better
in terms of the quality of the predictions made, but they are not able
to produce as many predictions.

Table 1: Summary of Results.

As expected, there is a tradeoff between the average
error of the predictions and the percentage of target values that can
be predicted. This tradeoff is controlled by the parameter
L, the
minimum degree of similarity between users that is required for one
user to influence the recommendations made to another.

Figure 6 illustrates the distribution of errors for the best algorithm
with the threshold $L$ equal to 0.6. The distribution for extreme
values approaches a bell curve, as desired. The statistics for all
values and extreme values are $\bar{|E|} = 1.1$, $\sigma = 1.4$ and
$\bar{|E|} = 1.2$, $\sigma = 1.6$, respectively. These results are
excellent, especially as the mean absolute error for extreme values
approaches that of all values. At this threshold level, 91% of the
target set is predictable.

QUALITATIVE RESULTS

Ultimately, what is more important than the numbers in the previous
section is the human response to this new technology. As of this
writing over 2000 people have used Ringo. Our source for a qualitative
judgment of Ringo is the users themselves. The Ringo system operators
have received a staggering amount of mail from users---
questions, comments, and bug reports. The results described in
this section are all based on user feedback and observed use
patterns.

One observation is that a social information filtering system becomes more
competent as the number of users in the system increases.
Figure 7 illustrates how the error in a recommendation relates to the
number of user profiles consulted to make the recommendation. As the
number of user scores used to generate a prediction increases, the
deviation in error decreases significantly. This is the case because
the more people use the system, the greater the chances are of finding
close matches for any particular user. The system may need to reach a
certain ``critical mass'' of collected data before it becomes useful.
Ringo's competence develops over time, as more people use the system.
Understandably then, in the first couple weeks of Ringo's life, Ringo
was relatively incompetent. During these days we received many
messages letting us know how poorly Ringo performed. Slowly, the
feedback changed. More and more often we received mail about how
``unnervingly accurate'' Ringo was, and less about how it was
incorrect. Ringo's growing group of regular ``customers'' indicates
that it is now at a point where the majority of people find the service
useful.

Figure 7 Caption.

However, many people are disappointed by Ringo's initial
performance. We are often told that a person must do one or two
iterations of rating artists before Ringo becomes accurate. A user
would rate the initial set, then receive predictions. If the user
knows any of the predicted artists are not representative of their
personal taste, they rate those artists. This will radically alter
the members of the user's ``similar user'' neighborhood. After
these iterations, Ringo works satisfactorily. This indicates that what
is needed is a better algorithm for determining the ``critical''
artists a user should rate so as to distinguish the user's tastes and
narrow down the group of similar users.

Beyond the recommendations, there are other factors which are
responsible for Ringo's great appeal and phenomenal growth. The
additional features, such as being a user-grown database, and the
provisions for reviews and dossiers add to its functionality.
Foremost, however, is the fact that Ringo is not a static system. The
database and user base are continually growing, and as they do,
Ringo's recommendations to each user change. For this reason, people
enjoy Ringo and use it on a regular basis.

RELATED WORK

Several other attempts have been made at building filtering services
that rely on patterns among multiple users. The Tapestry system
\cite{tapistry} makes it possible to request Netnews documents that
have been approved by other users. However, users must themselves know
who these similar people are and specifically request documents
annotated by those people; the social information filtering is thus
still left to the user.

During the development of Ringo, we learned about the existence of
similar projects in a similar state of development. One such example
is GroupLens [4], a system applying social information filtering to
the personalized selection of Netnews. GroupLens employs Pearson $r$
correlation coefficients to determine similarity between users. On
our dataset, the algorithms described in this paper performed better
than the algorithm used by GroupLens.

Two other recently developed systems are a video recommendation
service implemented at Bellcore, Morristown, NJ and a movie
recommendation system developed at ICSI, Berkeley, CA. Unfortunately,
as of this writing, there is no information available about the
algorithms used in these systems, nor about the results obtained.

The user modeling community has spawned a range of recommendation
systems which use information about a user to assign that user to one
of a finite set of hand-built, predefined user classes or
stereotypes. Based on the stereotype the user belongs to, the system
then makes recommendations to the user. For example [5]
recommends novels to users based on a stereotype classification. This
method is far less personalized than the social information filtering
method described in this paper. The reason is that in social
information filtering, every user in a sense defines a stereotype to
which another user can belong to some degree; the number of
stereotypes used to define a user's taste is therefore much larger.

Finally, some commercial software packages exist that make
recommendations to users. An example is Movie Select, a movie
recommendation software package by Paramount Interactive Inc. One
important difference is that these systems use a data set that does
not change over time. Furthermore, these systems also do not record
any history of a person's past use. As far as can be deduced from the
software manuals and brochures, these systems store correlations
between different items and use those correlations to make
recommendations. As such, the recommendations made are less
personalized than those of social information filtering systems.