As you will see in the chapter on scaling, it may become important to
facilitate occasional face-to-face meetings among subgroups of users.
Thus it will be helpful to record their country of residence and
postal code (what Americans call "Zoning Improvement Plan code" or
"ZIP code").

Fat versus Skinny: the Ideas

Suppose that the system specifications start off with very simple
requirements for the user database, one that can be handled by the
following single table:

Notice that the comment about password encryption is placed above,
rather than below, the column name and that the primary key constraint
is clearly visible to other programmers. It is good to get into the
habit of writing data model files in a text editor and including
comments and examples of the queries that you expect to support. If
you use a desktop application with a graphical user interface to
create tables you're losing a lot of important design information.
Remember that the data model is the most critical part of your
application. You need to think about how you're going to communicate
your design decisions to other programmers.

After a few weeks online, someone says, "wouldn't it be nice to see the
user's picture and hyperlink through to his or her home page?"

The table just keeps getting fatter. As the table gets fatter, more
and more columns are likely to be NULL for any given user. With
Oracle 9i you're unlikely to run up against the hard database limit of
1000 columns per table. Nor is there a storage efficiency problem.
Nearly every database management system is able to record a NULL value
with a single bit, even if the column is defined
char(500) or whatever. Still, something seems unclean
about having to add more and more columns to deal with the possibility
of a user having more and more phone numbers.

Medical informaticians have dealt with this problem for many years.
The example above is referred to as a "fat data model." In the
hospital world you'll very likely find something like this for storing
patient demographic and insurance coverage data. But for laboratory
tests, the fat approach begins to get ugly. There are thousands of
possible tests that a hospital could perform on a patient. New tests
are done every day that a patient is in the hospital. Some hospitals
have experimented with a "skinny" data model for lab tests. The table
looks something like the following:

Note that this table doesn't have a lot of integrity constraints. If
you were to specify patient_id as unique that would
limit each hospital patient to having only one test done. Nor does it
work to specify the combination of patient_id and
test_date as unique because there are fancy machines that
can do multiple tests at the same time on a single blood sample, for
example.

Figure 5.1:
Example user record that is split between a skinny table and a second table.

If you're using a fancy commercial RDBMS and wish to make queries like
this really fast, check out bitmap indices, often documented under
"Data Warehousing". These are intended for columns of low
cardinality, i.e., not too many distinct values compared to the number
of rows in the table. You'd build a bitmap index on the
field_name column.

Note that numbers are stored in a column of type VARCHAR. Won't this
preclude queries such as "Find the average income of a registered
user"? Not if you're using Oracle. Oracle is smart about
automatically casting between character strings and numbers. It will
work just fine to

One complication of this kind of data model is that it is tough to use
simple built-in integrity constraints to enforce uniqueness if you're
also going to use the users_extra_info for many-to-one
relations.

For example, it doesn't make sense to have two rows in the info table,
both for the same user ID and both with a field name of "birthdate".
A user can only have one birthday. Maybe we should

(Note that this will make it really fast to fetch a particular field
for a particular user as well as enforcing the unique constraint.)

But what about "home_phone"? Nothing should prevent a user from
getting two home phone numbers and listing them both. If we try to
insert two rows with the "home_phone" value in the
field_name column and 451 in the user_id
column, the RDBMS will abort the transactions due to violation of the
unique constraint defined above.

How to deal with this apparent problem? One way is to decide that the
users_extra_info table will be used only for
single-valued properties. Another approach would be to abandon the
idea of using the RDBMS to enforce integrity constraints and put logic
into the application code to make sure that a user can have only one
birthdate. A complex but complete approach is to define RDBMS
triggers that run a short procedural program inside the RDBMS—in
Oracle this would be a program in the PL/SQL or Java programming
languages. This program can check that uniqueness is preserved for
fields that indeed must be unique.

Fat versus Skinny: the Decision

Deciding between fat-style and skinny-style is an engineering
judgement call. You can certainly build a working system using either
approach, but how much guidance does that give you? You know that you
can build a computer program in any Turing-complete computer language,
but that doesn't make Assembler as practical as Basic, C#, Eiffel,
Java, Lisp, or ML.

One argument in favor of fat-style is maintainability and
self-documentation. Fat is the convention in the database world. A
SQL programmer who takes over your work will expect fat. He or she
will sit down and start to understand your system by querying the
data dictionary, the RDBMS's internal representation of what
tables are defined. Here's how it looks with Oracle:

This is an adequate data model in the same sense that raw instructions
for a Turing machine is an adequate programming language. Querying
the data dictionary would be of no help toward understanding the
purpose of the application. One would have to sample the contents of
the rows of my_data to see what was being stored.
Suppose, by contrast, you were poking around in an unfamiliar database
and encountered this table definition:

Note the use of ISO country codes, constrained by reference to a table
of valid codes, to represent country in the table above. You don't
want records with "United States", "US", "us", "USA", "Umited Stares",
etc. These are maintained by the ISO 3166 Maintenance agency, from
which you can download the most current data in text format. See http://www.iso.ch/iso/en/prods-services/iso3166ma/index.html.

The author's source code comments have been stripped out, yet it is
reasonably clear that this table exists to support an online address
book. Moreover the purpose of each column can be inferred from its
name. Quite a few columns will be NULL for each address book entry,
but not so many that the table will be absurdly sparse. Because NULL
columns take up so little space in the database, you shouldn't decide
between skinny and fat based on presumed data storage efficiency.

Skinny is good when you are storing wildly disparate data on each
user, such that you'd expect more than 75 percent of columns to be
NULL in a fat data model. Skinny can result in strange-looking SQL
queries and data dictionary opacity.

User Groups

One of the most powerful constructs in an online community is a user
group. A group of users might want to collaborate on publishing some
content. A group of users might want a private discussion forum. A
group of users might be the only people authorized to perform certain
actions or view certain files. The bottom line is that you'll want to
be able to refer to groups of users from other objects in your
database.

When building user groups you might want to think about on-the-fly
groups. You definitely want to have a user group where each member is
represented by a row in a table: "user #37 is part of user group
#421". With this kind of data model people can explicitly join and
separate from user groups. It is also useful, however, to have groups
generated on-the-fly from queried properties. For example, it might
be nice to be able to say "this discussion forum is limited to those
users who live in France" without having to install database triggers
to insert rows in a user group map table every time someone registers
a French address. Rather than denormalizing the data, it will be much
cleaner to query for "users who live in France" every time group
membership is needed.

A typical data model will include a USERS table and a USER_GROUPS
table. This leads to a bit of ugliness in that many of the other
tables in the system must include two columns, one for user_id and one
for user_group_id. If the user_id column is not NULL, the row belongs
to a user. If the user_group_id is not NULL, the row references a
user group. Integrity constraints ensure that only one of the columns
will be non-NULL.

Representing Membership in a Group (First Normal Form)

Suppose that you have a USERS table and a USER_GROUPS table. How do
you represent the fact that User 234 is a member of Groups 17 and 18?
It might seem that the simplest way to do this is with a data
structure stored in a column within the USERS table:

In this case, we'd store the string "17 18" in the
group_memberships column. This is known as a
repeating group or a multivalued column and it has the
following problems:

you might not have enough space if the number of values in the
column grows larger than anticipated

the combination of table name, column name, and key value no longer
specifies a datum

the basic INSERT, UPDATE, and SELECT operations are not sufficient
to manipulate multivalued columns

programmers' brains will have to adapt simultaneously to unordered
data in table rows and ordered data inside a multivalued column

design opacity: If you use multivalued columns even once, people
will never know what to expect when they look under the hood of your
design; did you use multiple tables to express a many-to-one relation or
multivalued columns?

To get the data model into First Normal Form, in which there
are no multivalued columns, you'd create a mapping table:

Note that in Oracle the unique constraint results in the
creation of an index. Here it will be a concatenated index starting
with the user_id column. This index will make it fast to ask the
question "To which groups does User 37 belong?" but will be of no use
in answering the question "Which users belong to Group 22?"

A good general rule is that representing a many-to-one relation
requires two tables: Things A and Things B, where many
Bs can be associated with one A. Another general rule is that
representing a many-to-many relation requires three tables: Things A,
Things B, and a mapping table to associate arbitrary numbers of As
with arbitrary numbers of Bs.

Derivable Data

Storing users and groups in three tables seems as though it might be
inefficient and ugly. To answer the question "To which groups does
Norman Horowitz belong" we must JOIN the following tables:
users, user_groups, user_group_map:

Note the use of the _p suffix to denote a boolean column.
Oracle does not support a boolean data type and therefore we simulate
it with a CHAR(1) that is restricted to "t" and "f". The "p" in the
suffix stands for "predicate" and is a naming convention that dates
back to Lisp programmers circa 1960.

If this is a popular group, there is a temptation among new database
programmers to denormalize the data model by adding a column to
the users table, e.g., tanganyikan_group_member_p. This
column will be set to "t" when a user is added to the Tanganyikans
group and reset to "f" when a user unsubscribes from the group. This
feels like progress. We can answer our questions by querying one
table instead of three. Historically, however, RDBMS programmers have
been bitten badly any time that they stored derivable data,
i.e., information in one table that can be derived by querying other,
more fundamental, tables. Inevitably a programmer comes along who is
not aware of the unusual data model and writes application code that
updates the information in one place but not another.

This results in a virtual table containing all the columns of users
plus an additional column called
tanganyikan_group_membership that is 1 for users who are
members of the group in question and 0 for users who aren't. In
Oracle, if you want the column to bear the standard ANSI
boolean data type values, you can wrap the DECODE function around the
query in the select list:

Notice that we've added an "_p" suffix to the column name, harking
back to the Lisp programming language in which functions that could
return only boolean values conventionally had names ending in "p".

Keep in mind that data model complexity can always be tamed with
views. Note, however, that views are purely syntactic. If a query is
running slowly when fed directly to the RDBMS, it won't run any faster
simply by having been renamed into a view. Were you to have 10,000
members of a group, each of whom was requesting one page per second
from the group's private area on your Web site, doing three-way JOINs on
every page load would become a substantial burden on your RDBMS
server. Should you fix this by denormalizing, thus speeding up
queries by perhaps 5X over a join of indexed tables? No. Speed it up
by 1000X by caching the results of authorization queries in the
virtual memory of the HTTP server process.

Clean up ugly queries with views. Clean up ugly performance problems
with indices. If you're facing Yahoo! or Amazon levels of usage, look
into unloading the RDBMS altogether with application-level caching.

Access Control and Approval

Suppose that you are building a corporate knowledge-sharing site. You
might decide to place the server on the public Internet to facilitate
employee access while at home or traveling. Perhaps some close
collaborators within other organizations will be allowed access.
However, you won't want random people registering at the site and
getting immediate access. Each new user should probably have to be
approved by an administrator.

Or perhaps you're building a public online learning community. You
want users to be identified and accountable at the very least to their
Internet Service Provider. So you'll want to limit access to only
those registrants who've verified receipt of an email message at the
address that they supplied upon registering. You may also want to
reject registration from users whose only email address is at
hotmail.com or a similar anonymous provider.

A community may need to change its policies as the membership grows.

One powerful way to manage user access is by modeling user
registration as a finite-state machine, such as the one shown in
figure 5.1.

Figure 5.2:
A finite-state machine approach to user registration. A reader starts
in the "not a user" state. After filling out a registration form, he
progresses to the "Need Email Verification/Need Admin Approval" state.
After responding to an email message from the server he is moved into
the "Need Admin Approval" state. Suppose that on this site we have a
rule that anyone whose email ends in "mit.edu" is automatically
approved. In that case the reader is moved to the "Authorized" state,
which is where he will stay unless he decides to leave the service
("Deleted") or is deemed to be an unreasonable burden on moderators
("Banned").

Rather than checking columns admin_approved_p, email_verified_p,
banned_p, deleted_p in the users table on every
page load, this approach allows application code to examine only a
single user_state column.

The authors built a number of online communities with this same
finite-state machine and for each one made a decision with the
publisher as to whether or not any of these state transitions could be
made automatically. The Siemens Sharenet knowledge sharing system,
despite being inaccessible from the public Internet, elected to
require administrator approval for every new user. By contrast, on
photo.net users would go immediately from "Not a user" to
"Authorized".

Exercise 1: Data Model

Questions: Do you store users' passwords in the database encrypted or
non-encrypted? What are the advantages and disadvantages of
encryption? What columns in your tables will enable your system to
handle the query "Find me users who live within 50 kilometers of User
#37"?

Make sure that your data model and answers are Web-accessible and easy
to find from your main documentation directory, perhaps at the URL
/doc/.

Multi-Page Design and Flow

This book defers discussion of individual page design until the
"Content Management" chapter, but we need to think about page flow design
right now. Why? The bad design of a single page will offend a user;
the bad design of the page-to-page flow of a site will defeat a
user.

One of the things that users love about the Web is the way in which
computation is discretized. A desktop application is generally a
complex miasma in which the state of the project is only partially
visible. Despite software vendors having added multiple-level Undo
commands to many popular desktop programs, the state of those programs
remains opaque to users.

The first general principle of multi-page design is therefore Don't
break the browser's Back button. Users should be able to go
forward and back at any time in their session with a site. For
example, consider the following flow of pages on a shopping site:

choose a book

enter shipping address

enter credit card number

confirm

thank-you

A user who notices a typo in the shipping address on the confirm page
should be able to return to the shipping address entry form with the
Back button or the "click right" menu attached to the Back button,
correct the address and proceed from there. See the "Choosing between
GET and POST" section later in this chapter.

A second general principle is Have users pick the object first and
then the verb. For example, consider the customer service area of
an e-commerce site. Assume that Jane Consumer has already identified
herself to the server. The merchant can show Jane a list of all the
items that she has ever purchased. Jane clicks on an item (picking
the object) and gets a page with a list of choices, e.g., "return
for refund" or "exchange". Jane clicks on "exchange" (picking the
verb) and gets a page with instructions on how to schedule a
pickup of the unwanted item and pages offering replacement goods.

How original is this principle? It is lifted straight from the Apple
Macintosh circa 1984 and is explicated clearly in Macintosh
Human Interface Guidelines (Apple Computer, Inc.;
Addison-Wesley, 1993; full text available online at http://developer.apple.com/documentation/mac/HIGuidelines/HIGuidelines-2.html).
In a Macintosh word processor, for example, you select one word from
the document with a double-click (object). Then from the pull-down
menus you select an action to apply to this word, e.g., "put it into
italics" (verb). Originality is valorized in contemporary creative
culture, but it was not a value for medieval authors and it does not
help users. The Macintosh was enormously popular to begin with, and
its user interface was copied by the developers of Microsoft Windows,
which spread the object-then-verb idea to tens of millions of people.
Web publishers can be sure that the vast majority of their users will
be intimately familiar with the "pick the object then the verb" style
of interface. Sticking with a familiar user interface cuts down on
user time and confusion at a site.

These principles are especially easy to apply to user administration
pages, for example. The administrator looks at a list of users and
clicks on one to select it. The server produces a new page with a
list of possible actions to apply to that user.

Exercise 2: Page Flow

Start by sketching the page flow for user registration and
administration. There should be one circle or box for every URL in
the system and one arc for every possible transition from URL A to URL
B. If you have a lot of URLs that are form targets and perform
database updates, but redirect to other pages in the system for
display, you may want to distinguish those URLs with a light or dashed
outline.

Ideally this drawing should be scanned and made available in your
online documentation.

Figure 5.2 is an example of the kind of drawing we're looking for.

Figure 5.3:
Page flow documentation for a standalone birthday
reminder service. Email reminders are sent out either the day of, the
day before, or one week before the date each year. For more info on
this application, see chapter 15 of Philip and Alex's Guide to
Web Publishing at http://philip.greenspun.com/panda/.
Drawing by Mina Reimer.

Choosing between GET and POST

Following an ordinary hyperlink on an HTML page results in a GET
request to the server under the HTTP protocol. When programming a
page with an HTML form, you have a choice between using
METHOD=GET and METHOD=POST. A heavy
reliance on POST will result in a site that breaks the browser Back
button. An attempt to go back to a page that was the result of a POST
will generally bring up a "Page Expired" error message and possibly a
dialog box asking whether the user wishes to resubmit information by
using the "Refresh" button.

Some of our students asked for further guidance on how to choose
between GET and POST and here's the response from Ben Adida, part of
the course's teaching staff in fall 2003:

Most of you may be wondering, why GET vs. POST in submitting forms?
Oftentimes, one will use POST just to keep pretty URLs (without
?var=val&var=val). But that's the wrong way to think about it.
A GET implies that you are getting information. You can resubmit a GET
any number of times: you are just querying information, not performing
any actions on the back-end.
A POST implies that you are performing some action with side-effect:
inserting a row, updating a row, launching a missile, etc... That's
why when you try to reload a POST page, your browser warns you: are
you sure you want to launch another missile?
In general, you should strive to respect the above principles. Here are
two key examples:
- searching users or content. That should be a GET.
- Inserting a user or updating a profile. That should be a POST.
Of course, HTML and HTTP have some restrictions that complicate things:
a) GET forms are limited in length by how much your browser can send
in a URL field. This can be a problem for very complicated search
forms, though probably not an issue at this stage. If you do hit
that limit though, then it's okay to use a POST.
b) POST forms can only be performed by having an HTML button, or by
using JavaScript to submit a form. JavaScript is not ideal. Thus,
sometimes you want to have a link that is effectively an action
with side-effect (e.g. "ban user"), but you make it a GET.
You can use redirects (HTTP return code 302) to make your life easier.
The nice thing about correct 302's is that the URL that issues a 302 is
never kept in a browser's history, so it is never queried twice unless
the user does something really conscious (like click back and actively
resubmit the form). Specifically:
1) when you POST data for an insert or update, have your script
process the POST, then redirect to a thank-you page. That way,
if the user clicks "reload", they are simply reloading the
thank-you page, which is just a GET and won't cause side-effects
or warnings. You can also redirect to something more meaningful,
perhaps the list of recently registered users once you've edited
one.
2) when you use a GET link to actually perform an action with
side-effect, you can also have that target script perform its
action then immediately redirect to a script with no side
effects. This will prevent the accidental repetition of an
action.
Scripts that have side effects should not be reachable at URLs that
may enter the cache and be randomly re-requested by the browser. The
one exception is if the POST causes some kind of error: it's mostly
okay for the script that processes the POST to display the error
instead of redirecting to some kind of error-displaying page (which
would be clunky to build anyways).
.NET folks: when you use ASP.NET and postback, you have no choice of
method: it's always POST. Remember to follow the above rule for POST:
you can have your handler method perform the side-effects but it
should then redirect somewhere else instead of returning content.
I hope this helps in making your sites more usable. Let me know if
you have any questions.
-Ben
PS: None of the above solves the "double-click" problem which is what
happens if a user double-submits a form quickly (especially those
users who don't know the difference between single and double
clicking). We'll talk about double-click protection later.

Exercise 3

Build the basic user registration and login pages. Use HTTP cookies
to make the rest of the semester's work easier.

Questions: Can someone sniffing packets learn your user's password?
Gain access to the site under your user's credentials? What happens
to a user who forgets his or her password?

Exercise 4

Build the site administrator's pages for working with users. The
site administrator should be able to (1) see recently registered
users, (2) look up a particular user, (3) exclude a user from the
site, and (4) see current and historical statistics on user
registration.

Questions: How can the administrator control who is permitted to
register and use the site? What email notification options does the
site administrator have that relate to user registration?

Exercise 5

Look at your tables again for referential integrity constraints and
query performance. How long will it take to look up a user by email
address? What if this email address is capitalized differently from
what you've stored in the database? Is it possible to have two users
with the same email address? (Note that by Internet standards a
lowercase email address or hostname is the same as an uppercase email
address or hostname.)

Many Web applications contain content that can be viewed only by members
of a specific user group. With your data model, how many table rows
will the RDBMS have to examine to answer the question "Is User #541 a
member of Group #90"? If the answer is "every row in a big table",
i.e., a sequential scan, what kind of index could you add to
speed up the query?

Time and Motion

The data modeling, page flow sketching, and questions about query
performance will probably be done by the team as a group and require
approximately three hours. Implementing the page scripts could be
divided up among team members, but in any case should be doable in ten
programmer-hours.
Return to Table of Contents