Search

Consumer Rankings

On-line booksellers have changed the way
I decide which books to buy. Not only have they made it possible
for me to get most books within two to three days, but they have
also given me a means of comparison shopping previously unavailable
to me. If I am thinking of buying a book, I immediately look to see
what others have written about it. In most cases, these evaluations
clinch the deal, convincing me to buy a certain book instead of its
competition.

I have been thinking more about such systems since my recent
move to Modi'in, a new Israeli city halfway between Jerusalem and
Tel Aviv. Just before I moved to Modi'in, I was asked to take over
a small mailing list for residents of Modi'in. The list gives
people a chance to share local announcements and ideas related to
life in Modi'in.

Soon after becoming the list administrator, I realized that
subscribers often asked for recommendations, from doctors to lawn
services to after-school activities. Normally, an e-mail list
administrator who sees such constant repetitions will prepare a
FAQ, a list of common questions and their answers. But
recommendations are extremely subjective, and one person's favorite
barber may be someone else's nightmare.

This month, we will look at a set of CGI programs I wrote to
allow list subscribers to enter and rank their favorite products
and services in town. Because my web space provider does not offer
mod_perl, I had to use the CGI
standard for writing my programs.

This “ranking” system, as I call it, consists of three CGI
programs written in Perl, which use a relational database for
back-end storage. (These listings are not printed here due to space
considerations, but can be downloaded from the
LJ FTP site. See Resources.) I have used
MySQL, but there is no reason another relational database, such as
PostgreSQL or Oracle, could not be substituted for it. Some of the
SQL syntax might have to be changed in order to fit another
database server, but the majority should remain the same.

While the ranking system presented this month is not as
sophisticated as the one used by Amazon.com nor as flexible as that
used by Epinions.com, it does serve a simple purpose. Moreover, it
demonstrates how to produce a simple ranking system, which could
easily be extended to produce a rough version of
Epinions.com.

Creating the Tables

As always, the first step in creating a database/web
application is to consider how we want to store the information,
and then to create the tables in our relational database. In this
particular case, we will keep things simple, dividing ranked items
into categories, but without any hierarchy representing the
categories. We will thus be able to keep all restaurants in the
same category, but without any distinction between Italian
restaurants and French restaurants. Alternatively, we can create
two separate categories for Italian and French restaurants, but
then the system will see them as unrelated as barbers and
vacuum-cleaner salesmen.

We will also associate a name and e-mail address with each
ranking. It might be more elegant to place user names in a separate
table and refer to them with a numeric key. However, we are less
interested in tracking users than in making it possible to find
useful consumer information.

Given all this, I decided to implement the ranking system
with three tables: RankCategories, RankItems and Rankings.
RankCategories, as its name implies, contains the categories into
which the items are sorted, and can be defined as follows:

Notice how each category will have a unique
category_id, allowing us to refer to it by
number alone. By using the AUTO_INCREMENT
feature, MySQL can automatically set this number for us, ensuring
that it is unique. For a small web site, a
MEDIUMINT is large enough, ranging from 0 to
16,777,215. A large site might eventually have more than 17 million
rankings, in which case a larger size, such as
INT or BIGINT, might be a
good idea.

We also want to ensure that no two categories can have the
same name, and thus add a unique constraint to
the category_name column as well. Each category
can then have some descriptive text associated with it, which is
placed in category_description.

Once again, each item has a unique value in
item_id, and the item name is guaranteed to be
unique, thanks to the UNIQUE constraint.
However, RankItems adds a category_id column,
identifying the category in which this item sits.

RankItems defines the items available to be ranked, but does
not store the grades. That role is assigned to the Rankings table,
defined as follows:

Like the other tables, Rankings gives each ranking a unique
primary key, ranking_id. None of our
applications use ranking_id, and it is possible
that its inclusion in the table definition wastes some space on
disk and in memory. However, such a primary key will make it easier
to refer to items in Rankings if and when we build additional
applications for the ranking system.

Rankings then contains an item_id column,
which refers back to the primary key of the same name in RankItems.
Because each item is in a single category and RankItems contains a
category_id column, there is no need to name the
category in Rankings as well.

We ask each ranking user to provide his or her full name and
e-mail address. This information is displayed next to a ranking, in
order to give it a slightly more human touch. In addition, some
people's opinions will carry more weight than others, especially in
a small community of users, so it is worthwhile to identify
opinions by name.

The ranking consists of a numeric rank between 0 and 10,
stored in a TINYINT. It is accompanied by
optional (NULL) comments, stored in a
TEXT column, where the user can elaborate on his
or her ideas.

To ensure each user can rank each product only once, we have
MySQL require the combination of item ID and e-mail address to be
unique in the Rankings table. By combining the two in this way, the
database itself will reject any attempt to enter two rows in which
the combination of e-mail address and item_id
are identical.

Adding Categories

Now that we have seen how the data will be stored, we will
write several programs that allow users to enter information into
the tables. The interface presented here might seem a bit
primitive, but this doesn't really matter. Since the information is
stored in a database, we can always write new programs to improve
or change the interface.

Since each RankItem must be placed in a category, we must
first write a program that adds new categories to the system.
Listing 1, rank-category.pl (see Resources), is a simple example of
such a program.

When invoked with the GET
method, such as when a user enters a URL into a browser,
rank-category.pl produces an HTML form into which details about a
site can be entered. The form is submitted to rank-category.pl
which handles the data submitted with POST, adding a new row into
the RankCategories table. After the new row is added to the
database, rank-category.pl displays the HTML form once again. This
makes it relatively easy to add multiple categories.

rank-category.pl, as well as
the other programs presented this month, use
CGI.pm, the standard Perl module
(written by Lincoln Stein) for working with CGI programs. We also
import a number of subroutines into the current name space (grouped
together with the :standard tag), making it
easier to produce HTML output. The
p function, for example, surrounds
text with <P> and </P> tags. Thus the following:

print p("Hello");

is the same as

print "<P>Hello</P>";

but is more readable. In addition, CGI.pm's HTML subroutines
can work on multiple strings and can be nested:

print p("This will be in", b("bold"), "type");

CGI.pm automatically inserts whitespace between arguments to
these subroutines, so there is no need to put spaces at the end or
beginning of quoted strings.

To set HTML tag attributes, pass a hash reference to the
subroutine as the first argument. For example, the following code
comes from rank-category.pl (Listing 1):

The above code puts the text “rank-something” inside of a
hyperlink pointing to /cgi-bin/rank-something.pl, producing the
same HTML as the following, without having to worry about quotes or
whitespace:

By default, CGI.pm does not insert whitespace between HTML tags
generated using these routines. This might be more efficient for
the computers, but makes it difficult to examine and debug the HTML
output. We thus use CGI::Pretty, a
subclass of CGI.pm which inserts appropriate whitespace between the
tags. (If your system does not support CGI::Pretty, use CGI.pm
instead. The output will not be formatted as nicely, but will still
work.)

Regardless of how it is invoked, rank-category.pl always
produces an HTML form consisting of two text fields,
new_category_name and
new_category_description. These are submitted
back to rank-category.pl, which inserts the information into the
database.

The program's connection to the database is managed by DBI,
the Perl database interface available from CPAN (the comprehensive
Perl archive network, at
http://www.cpan.org/). DBI
provides a generic API to a relational database, in conjunction
with a specific driver (DBD) for each type of database server. DBDs
are available for most popular brands of database server and make
it possible to port Perl programs across platforms with a minimum
of difficulty.

Assuming that new_category_name and
new_category_description contain valid data, we
insert a new row into the RankCategories table by building up an
SQL query. Here is the Perl code that performs this task:

Notice how $sql contains question marks
(?) rather than actual data. These values are
set in $sth->execute, with the first argument
($new_category_name) being assigned to the first
place holder and the second argument
($new_category_description) assigned to the
second.

rank-category.pl does not
benefit in an obvious way from the use of place holders, which
speed up multiple invocations of the same SQL query, with minor
variations. However, using place holders also makes it possible to
avoid problems that can arise when interpolated variable values
contain ' and " characters.
DBI handles and quotes them automatically, making the program more
readable and eliminating the possibility of these sorts of
errors.

Creating the Ranking Form

Once one or more categories have been added to
RankCategories, users can begin to rank individual items. This is
accomplished with Listing 2, rank-something.pl (see Resources),
which produces an HTML form so that users can rank an item. Users
have the option of ranking something already in the database, or
adding a new item inside an existing category.

The form must be created by a program, rather than written as
a static document, because it displays pop-up menus of catalogs and
items from the database tables. When it is first invoked,
rank-something.pl retrieves the items and categories from RankItems
and RankCategories, respectively, putting them into hashes for easy
retrieval.

DBI makes it relatively easy to retrieve a table into a hash
with its fetchrow_hashref method.
However, the documentation explicitly states that fetchrow_hashref
is not very efficient, so we use
fetchrow_arrayref, retrieving each
row as its own array reference and storing the information in a
hash:

The “existing item” and “new item” sections of the form
both use the popup_menu function
provided by CGI.pm for the creation of
<select> lists. A pop-up menu has a name
and a set of potential values, each of which can be optionally
associated with descriptive text. For example:

The above HTML creates a three-element pop-up menu named
“number”. This menu can pass one of three options
(1, 2 or
3) to a CGI program. However, the user never
sees the values; instead, the menu is labeled with English words
corresponding to the values. While the user can select and submit
any of the menu's values, the two element is
selected by default.

Given that the existing items are placed in the
%items hash (just as the above code places
existing categories in the %categories hash), we
can create a pop-up menu with the following code:

popup_menu returns a text
string, appropriate for passing to
print or saving to disk (if the
program is creating a file of HTML-formatted text). The
name parameter sets the name
attribute, and the default parameter indicates
which attribute will be selected by default.

The values parameter takes an array
reference as an argument. In this case, the values should be the
keys of %items (i.e., the primary keys from
RankItems), but sorted in the order of each key's
item_name value. The solution is to sort
keys %items by value, rather than by key.
Adding a 0-value element to the front of the
values parameter means the first option will
always have a value of 0. MySQL does not use 0 in
AUTO_INCREMENT columns, and Perl sees 0 as
false—so setting the default value to 0 will never collide with an
actual value of item_id, and can easily be
identified when passed to our program via
POST.

By default, popup_menu will set the descriptive text for each
<option> to the value itself. However, the
labels parameter makes it possible to assign
custom labels to some or all values. The labels
parameter takes a hash reference as input. Since a hash in list
context turns into a comma-separated list of keys and values, we
can stick %items into the hash reference,
preceded by a mapping from 0 to the text “Choose
one”.

Unfortunately, HTML does not support hierarchical pop-up
menus. It would be easiest and best for the user if we could use
popup_menu to provide a menu of categories that would lead to a
list of associated items. Given the choice between requiring a
second program invocation (as we did with view-ranking.pl,
described below) and producing a flat list of items, I chose the
latter. Another approach is to use the method pioneered by Yahoo!
and used by Epinions, in which each category hyperlink is followed
by links to the most popular individual items. Implementing such an
approach would require some changes to rank-something.pl (and
presumably view-ranking.pl), but not to the underlying
database.

Inserting a New Ranking

When rank-something.pl is invoked with the POST method, it
looks for a new item that it will need to insert into RankItems.
The actual insertion is rather straightforward, using DBI's
prepare and
execute methods, as we have seen
before:

We can now be sure the item to be ranked is in RankItems. If the
item was already in RankItems, then we knew its primary key from
the <select> list in the HTML form.
However, how can we retrieve the primary key of the new item just
inserted? The DBI driver for MySQL provides an attribute, called
mysql_insertid, which returns the primary key of
the most recently inserted row. We can retrieve this value and
store it in $item_id, which otherwise would get
its value from the pop-up menu:

$item_id = $dbh->{"mysql_insertid"};

As we saw earlier, each ranking consists of an integer between 0
and 10 (from worst to best), as well as user comments about the
item. In order to avoid potential formatting problems, I decided to
remove all HTML tags from the comments. Using Perl's non-greedy
regular expressions, this is an easy task:

$comments =~ s|<.*?>||g;

The above substitutes all occurrences of <, followed by zero or
more characters, followed by >, with the empty string. The |g
modifier at the end of the s||| operator performs this operation
globally.

Next, we handle paragraph separators. We want people to be
able to enter more than one paragraph, but cannot let them use
<P> and </P> tags to do so. The solution is to treat
every occurrence of more than one \r (carriage return) or \n
(newline) character as a paragraph separator. UNIX machines rarely
produce \r characters in text input, but DOS/Windows systems end
lines with a combination of \r\n, and Macintoshes use a single \r.
The following turns any two (or more) of these into a <br>
tag, followed by two newlines:

Once again, we use question marks (?) as place
holders, potentially speeding up multiple queries and removing the
need for us to explicitly quote the individual items.

If the INSERT is successful, the user is given a short
message:

print p("Your ranking was successfully entered.");

Following this message, the user is given the chance to rank
another item in the database or view the current rankings database
by clicking on a hyperlink.

Retrieving Data

Finally, once data has been entered into Rankings, we can
write Listing 3, view-ranking.pl (see Resources), a program that
lets us look through the rankings and read them. There are many
ways to present the information, and I took the easy way in this
version of the program, forcing users to go through two menus (one
of categories and a second of items in that category) before seeing
the list of rankings for a particular item. Once again, we use the
GET/POST trick to write a program that both creates a form and
accepts its data.

If view-ranking.pl is invoked with GET, it creates a simple
HTML form with the categories in a pop-up menu:

This seemingly complex SQL query retrieves information from all
three tables, getting the category name from RankCategories and
information on this item from RankItems. The AVG function returns
the average value from all returned rows, making it possible to get
a sense of where the item should truly be ranked. Of course, there
is no way to stop a malicious user (or the owner of a ranked
business) from trying to skew the scales in a particular direction,
so it is just as important for users to read the comments and the
individual rankings as the average score.

Once we have retrieved general information on the item, we
perform a second SELECT,
requesting all rows for this item in chronological order:

HTML's <dt> and <dd> tags are perfect for formatting
these sorts of comments, handling the indentation automatically.
Perl's x operator, which multiplies a text
string, makes it simple to produce the correct number of stars
associated with a particular review.

Conclusion

The ranking software I presented this month is only in its
earliest stages and will undoubtedly have improved by the time this
issue of Linux Journal reaches your
hands.

Along with improvements in the user interface and the
possible addition of a hierarchical category system, there are
several related applications I would like to add. One would return
the highest-ranked items within a category, allowing a user to find
the best choice without having to wade through dozens or hundreds
of reviews. Hard-core users of this system might want to read only
those reviews which have appeared since the last time they visited
the site. Some additional personalization, including the use of
cookies to remember the user's name and e-mail address, would
reduce the amount of typing someone would have to do. An editing
facility for system administrators will undoubtedly prove useful as
the system grows. Finally, it is always nice to provide a search
function, just in case an Italian restaurant is accidentally
classified as a café.

The programs we examine this month demonstrate that it is not
particularly difficult to produce a simple user ranking system.
This is especially true if we take advantage of the power a
relational database affords us in storing and retrieving data. Best
of all, subscribers to my e-mail list can now spend their time
trading information, leaving the recommendations to a set of CGI
programs.

Reuven M. Lerner
, an Internet and Web
consultant, moved to Modi'in, Israel following his November
marriage to Shira Friedman-Lerner. His book Core Perl
will be published by Prentice-Hall in the spring. Reuven can be
reached at reuven@lerner.co.il. The ATF home page, including
archives, source code and discussion forums, is at
http://www.lerner.co.il/atf/.