When it comes to building a large e-commerce Web site, everyone is full
of advice. Developers will tell you that only a site built in C++ or
Java (depending on which they prefer) can scale up to handle heavy
traffic. Application server vendors will insist that you need a
packaged all-in-one solution for the software. Hardware vendors will
tell you that you need the top-of-the-line mega-machines to run a large
site. This is a story about how we built a large e-commerce site using
mainly open-source software and commodity hardware. We did it, and you
can do it, too.

Perl has long been the preferred language for developing CGI scripts.
It combines supreme flexibility with rapid development. Programming Perl
is still O'Reilly's top-selling technical book, and community
support abounds. Lately though, Perl has come under attack from certain
quarters. Detractors claim that it's too slow for serious development
work and that code written in Perl is too hard to maintain.

The mod_perl Apache module changes the whole performance picture for
Perl. Embedding a Perl interpreter inside of Apache provides
performance equivalent to Java servlets, and makes it an excellent
choice for building large sites. Through the use of Perl's
object-oriented features and some basic coding rules, you can build a
set of code that is a pleasure to maintain, or at least no worse than
other languages.

When you combine Apache, mod_perl and open-source code available from
CPAN (the Comprehensive Perl Archive Network), you get a set of
features equivalent to a commercial application server:

Session handling

Load balancing

Persistent database connections

Advanced HTML templating

Security

You also get some things you won't get from a commercial product, such as
a direct line to the core development team through the appropriate
mailing list and the ability to fix problems yourself instead of
waiting for a patch. Moreover, each part of the system is under your
control, making you limited only by your team's abilities.

When we first arrived at eToys in 1999, we found a situation that is
probably familiar to many who have joined a growing startup Internet
company. The system was based on CGI scripts talking to a MySQL
database. Static file serving and dynamic content generation were
sharing resources on the same machines. The CGI code was largely
written in a Perl4-ish style and not as modular as it could be; which
was not surprising since most of it was built as quickly as possible by
a small team.

Our major task was to figure out how to get this system to scale large
enough to handle the expected Christmas traffic. The toy business is
all about seasonality, and the difference between the peak selling
season and the rest of the year is enormous. The site had barely
survived the previous Christmas, and the MySQL database didn't look
like it could scale much further.

The call had already been made to switch to Oracle, and a DBA team was
in place. We didn't have enough time to do a redesign of the software,
so we had to scramble to put in place whatever performance improvements
we could finish by Christmas.

Apache::PerlRun is a module that exists to smooth the transition
between basic CGI and mod_perl. It emulates a CGI environment, and
provides some (but not all) of the performance benefits associated with
code written for mod_perl. Using this module and the persistent
database connections provided by Apache::DBI, we were able to do a
basic port to mod_perl and Oracle in time for Christmas, and
combined with some new hardware we were ready to face the Christmas
rush.

The peak traffic lasted for eight weeks, most of which were spent
frantically fixing things or nervously waiting for something else to
break. Nevertheless, we made it through. During that time, we collected
the following statistics:

It was clear that we would need to do a redesign for 2000. We had
reached the limits of the current system and needed to tackle some of
the harder problems that we had been holding off on.

Goals for the new system included moving away from offline page
generation. The old system had been building HTML pages for each
product and product category on the site in a batch job and dumping
them out as static files. This was effective when we had a small
database of products since the static files gave such good performance,
but we had recently added a children's bookstore to the site that
increased the size of our product database by an order of magnitude and
made the time required to generate each page prohibitive. We needed a
strategy that would only require us to build pages that customers were
actually interested in and would still provide solid performance.

We also wanted to re-do the database schema for more flexibility, and
structure the code in a more modular way that would make it easier for
a team to share the development work without stepping on one another. We
knew that the new codebase would have to be flexible enough to support
a continuously evolving set of features.

Not all of the team had significant experience with object-oriented
Perl, so we brought in Randal Schwartz and Damian Conway to do training
sessions with us. We created a set of coding standards, drafted a
design and built our system.

The machine strategy for the system is a fairly common one: low-cost
Intel-based servers with a load-balancer in front of them, and big iron
for the database.

Like many commercial packages, we have separate systems for the
front-end Web servers (which we call proxy servers) and the application
servers that generate the dynamic content. Both the proxy servers and
the application servers are load-balanced using dedicated hardware from
f5 Networks. More details about each of the systems shown here is
provided below.

We chose to run Linux on our proxy and application servers, a common
platform for mod_perl sites. The ease of remote administration under
Linux made the clustered approach possible. Linux also provided solid
security features and automated build capabilities to help with adding
new servers.

The proxy servers run a slim build of Apache, without mod_perl. They
have several standard Apache modules installed, in addition to our own
customized version of mod_session, which assigns session cookies.
Because the processes are so small, we can run as many as 400 Apache
children per machine. These servers handle all image requests
themselves, and pass page requests on to the application servers. They
communicate with the app servers using standard HTTP requests, and
cache the page results when appropriate headers are sent from the app
servers. The cached pages are stored on a shared NFS partition of a
Network Appliance filer. Serving pages from the cache is nearly as fast
as serving static files, i.e. very fast.

This kind of reverse-proxy setup is a commonly recommended approach
when working with mod_perl, since it uses the lightweight proxy
processes to send the content to clients (who may be on slow
connections) and frees the resource-intensive mod_perl processes to
move to the next request. For more information on why this
configuration is helpful, see the mod_perl developer's guide at
http://perl.apache.org/guide/.

The application servers run mod_perl, and little else. They
have a local cache for Perl objects, using Berkeley DB. The Web
applications run here, and shared resources like HTML templates are
mounted over NFS from the NetApp filer. Because they do the heavy
lifting in this setup, these machines are somewhat beefy, with dual
CPUs and 1GB of RAM each.

There is a third set of machines dedicated to handling searches. Since
searching was such a large percentage of overall traffic, it was
worthwhile to dedicate resources to it and take the load off the
application servers and database.

The software on these boxes is a multi-threaded daemon that we
developed in-house using C++. The application servers talk to the
search servers using a Perl module. The search daemon accepts a set of
search conditions and returns a sorted list of object IDs of the
products whose data fits those conditions. Then the application servers
look up the data to display these products from the database. The
search servers know nothing about HTML or the Web interface.

This approach of finding the IDs with the search server and then
retrieving the object data may sound like a performance hit, but in
practice the object data usually comes from the application server's
cache rather than the database. This design allows us to minimize the
duplicated data between the database and the search servers, making it
easier and faster to refresh the index. It also allows us to reuse the same
Perl code for retrieving product objects from the database, regardless
of how they were found.

The daemon uses a standard inverted word list approach to searching.
The index is periodically built from the relevant data in Oracle. If
you prefer an all-Perl solution, then there are modules on CPAN that
implement this approach, including Search::InvertedIndex and
DBIx::FullTextSearch. We chose to write our own because of the
tight performance requirements on this part of the system, and because
we had an unusually complex set of sorting rules for the returned IDs.

We took pains to make sure that we would be able to provide
load-balancing among nodes of the cluster and fault-tolerance in case
one or more nodes failed. The proxy servers are balanced using a random
selection algorithm. A user might end up on a different one on each
request. These servers don't hold any state information, so the goal is
just to distribute the load evenly.

The application servers use "sticky" load balancing. That
means that once a user goes to a particular app server, all of her
subsequent requests during that session will also be passed to the same
app server. The f5 hardware accomplishes this using browser cookies.

The load balancers run a periodic service check on each server and
remove any servers that fail the check from rotation. When a server
fails, all users that were "stuck" to that machine are
moved to another one.

In order to ensure that no data is lost if an app server dies, all
updates are written to the database. As a result, user data like the
contents of a shopping cart is preserved even in cases of catastrophic
hardware failure on an app server. This is essential for a large
e-commerce site.

The database has a separate failover system, which we will not go into
here. It follows standard practices recommended by our vendors.

The code is structured around the classic Model-View-Controller
pattern, originally from SmallTalk and now often applied to Web
applications. The MVC pattern is a way of splitting an application's
responsibilities into three distinct layers.

Classes in the Model layer represent business concepts and data, like
products or users. These have an API but no end-user interface. They
know nothing about HTTP or HTML and can be used in non-Web applications
like cron jobs. They talk to the database and other data sources, and
manage their own persistence.

The Controller layer translates Web requests into appropriate actions
on the Model layer. It handles parsing parameters, checking input,
fetching the appropriate Model objects, and calling methods on them.
Then it determines the appropriate View to use and sends the resulting
HTML to the user.

View objects are really HTML templates. The Controller passes data from
the Model objects to them and they generate a Web page. These are
implemented with the Template Toolkit, a powerful templating system
written in Perl. The templates have some basic conditional statements
and looping in them, but only enough to express the formatting logic.
No application control flow is embedded in the templates.

The core of the performance strategy is a multi-tier caching system. On
the application servers, data objects are cached in shared memory with
a backing store on local disk. Applications specify how long a data
object can be out of sync with the database, and all future accesses
during that time are served from the high-speed cache. This type of
cache control is known as ``time-to-live.'' The local cache is
implemented using a Berkeley DB database. Objects are serialized
with the standard Storable module from CPAN.

Data objects are divided into pieces when necessary to provide finer
granularity for expiration times. For example, product inventory is
updated more frequently than other product data. By splitting up the
product data, we can use a short expiration for inventory that keeps
it in tighter sync with the database, while still using a longer
expiration for the less volatile parts of the product data.

The application servers' object caches share product data between them
using the IP Multicast protocol and custom daemons written in C. When a
product is placed in the cache on one server, the data is replicated to
the cache on all other servers. This technique is successful
because of the high locality of access in product data. During the 2000
Christmas season this cache achieved a 99 percent hit ratio, thus taking a
large amount of work off the database.

In addition to caching the data objects, entire pages that are not
user-specific, like product detail pages, can be cached. The
application takes the shortest expiration time of the data objects used
in the pages and specifies that to the proxy servers as a page
expiration time, using standard Expires headers. The proxy servers
cache the generated page on a shared NFS partition. Pages served from
this cache have performance close to that of static pages.

To allow for emergency fixes, we added a hook to mod_proxy that
deletes the cached copy of a specified URL. This was used when a page
needed to be changed immediately to fix incorrect information.

An extra advantage of this mod_proxy cache is the automatic handling
of If-Modified-Since requests. We did not need to implement this
ourselves since mod_proxy already provides it.

Users are assigned session IDs using HTTP cookies. This is done at the
proxy servers by our customized version of mod_session. Doing it at
the proxy ensures that users accessing cached pages will still get a
session ID assigned. The session ID is simply a key into data stored on
the server-side. User sessions are assigned to an application server
and continue to use that server unless it becomes unavailable. This is
called "sticky" load balancing. Session data and other data
modified by the user -- such as shopping cart contents -- is written to
both the object cache and the database. The double write carries a
slight performance penalty, but it allows for fast read access on
subsequent requests without going back to the database. If a server
failure causes a user to be moved to a different application server,
then the data is simply fetched from the database again.

A large e-commerce site is a popular target for all types of attacks.
When designing such a system, you have to assume that you will be
attacked and build with security in mind, at the application level as
well as the machine level.

The main rule of thumb is "don't trust the client!"
User-specific data sent to the client is protected using multiple
levels of encryption. SSL keeps sensitive data exchanges private from
anyone snooping on network traffic. To prevent "session
hijacking" (when someone tampers with their session ID in order
to gain access to another user's session), we include a Message
Authentication Code (MAC) as part of the session cookie. This is
generated using the standard Digest::SHA1 module from CPAN, with a
seed phrase known only to our servers. By running the ID from the
session cookie through this MAC algorithm, we can verify that the data
being presented was generated by us and not tampered with.

In situations where we need to include some state information in an
HTML form or URL and don't want it to be obvious to the user, we use
the CPAN Crypt:: modules to encrypt and decrypt it. The
Crypt::CBC module is a good place to start.

To protect against simple overload attacks, when someone uses a program
to send high volumes of requests at our servers hoping to make them
unavailable to customers, access to the application servers is
controlled by a throttling program. The code is based on some work by
Randal Schwartz in his Stonehenge::Throttle module. Accesses for
each user are tracked in compact logs written to an NFS partition. The
program enforces limits on how many requests a user can make within a
certain period of time.

When planning this system, we considered using Java as the
implementation language. We decided to go with Perl, but we really
missed Java's nice exception-handling features. Luckily, Graham Barr's
Error module from CPAN supplies similar capabilities in Perl.

Perl already has support for trapping runtime errors and passing
exception objects, but the Error module adds some nice syntactic sugar.
The following code sample is typical of how we used the module:

The module allows you to create your own exception classes and trap for
specific types of exceptions.

One nice benefit of this is the way it works with DBI. If you turn
on DBI'sRaiseError flag and use try blocks in places where you
want to trap exceptions, the Error module can turn DBI errors
into simple Error objects.

This code shows a condition where an error would indicate that we
should roll back a database transaction. In practice, most DBI
errors indicate something unexpected happened with the database and the
current action can't continue. Those exceptions are allowed to
propagate up to a top-level try{} block that encloses the whole
request. When errors are caught there, we log a stacktrace and send a
friendly error page back to the user.

Both the HTML and the formatting logic for merging application data
into it is stored in the templates. They use a CPAN module called
Template Toolkit, which provides a simple but powerful syntax for
accessing the Perl data structures passed to them by the application.
In addition to basics such as looping and conditional statements, it
provides extensive support for modularization, allowing the use of
includes and macros to simplify template maintenance and avoid
redundancy.

We found Template Toolkit to be an invaluable tool on this project.
Our HTML coders picked it up quickly and were able to do nearly
all of the templating work without help from the Perl coders. We
supplied them with documentation of what data would be passed to each
template and they did the rest. If you have never experienced the joy
of telling a project manager that the HTML team can handle his
requested changes without any help from you, then you are seriously missing
out!

Template Toolkit compiles templates into Perl bytecode and caches
them in memory to improve efficiency. When template files change on
disk they are picked up and re-compiled. This is similar to how other
mod_perl systems like Mason and Apache::Registry work.

By varying the template search path, we made it possible to assign
templates to particular sections of the site, allowing a customized
look and feel for specific areas. For example, the page header template
in the bookstore section of the site can be different from the one in
the video game store section. It is even possible to serve the same
data with a different appearance in different parts of the site,
allowing for co-branding of content.

This is a sample of what a basic loop looks like when coded in
Template Toolkit:

In addition to the things you see here, there are a few interesting
details about the ESF::Control base class. All requests are
dispatched to the ESF::Control->run() method first, wrapping
them in a try{} block before calling the appropriate handler()
method. It also provides the process_template() method, which runs
Template Toolkit and then sends the results with appropriate
HTTP headers. If the Controller specifies it, then the headers can include
Last-Modified and Expires for control of page caching by the
proxy servers.

This is a simple Model object. Most Model objects would have some
database and cache interaction. They would include a load() method
that accepts an ID and loads the appropriate object state from the
database. Model objects that can be modified by the application would
also include a save() method.

Note that because of Perl's flexible OO style, it is not necessary to
call new() when loading an object from the database. The load()
and new() methods can both be constructors for use in different
circumstances, both returning a blessed reference.

The load() method typically handles cache management as well as
database access. Here's some pseudo-code showing a typical load()
method:

The save method would use the same approach in reverse, saving first to
the cache and then to the database.

One final thing to notice about our Model class is the view()
method. This method exists to give the object an opportunity to shuffle
it's data around or create a separate data structure that is easier for
use with a template. This can be used to hide a complex implementation
from the template coders. For example, remember the partitioning of the
product inventory data that we did to allow for separate cache
expiration times? The product Model object is really a façade
for several underlying implementation objects, but the view() method
on that class consolidates the data for use by the templates.

To finish our Hello World example, we need a template to render the
view. This one will do the job:

Since Perl code executes so quickly under mod_perl, the performance
bottleneck is usually at the database. We applied all the documented
tricks for improving DBD::Oracle performance. We used bind
variables, prepare_cached(), Apache::DBI, and adjustments to the
RowCache buffer size.

The big win of course is avoiding going to the database in the first
place. The caching work we did had a huge impact on performance.
Fetching product data from the Berkeley DB cache was about 10 times
faster than fetching it from the database. Serving a product page from
the proxy cache was about 10 times faster than generating it on the
application server from cached data. Clearly, the site would never have
survived under heavy load without the caching.

Partitioning the data objects was also a big win. We identified several
different subsets of product data that could be loaded and cached
independently. When an application needed product data, it could
specify which subset was required and skip loading the unnecessary data
from the database.

Another standard performance technique we followed was avoiding
unnecessary object creation. The Template object is created the
first time it's used and then cached for the life of the Apache
process. Socket connections to search servers are cached in a way
similar to what Apache::DBI does for database connections. Resources
that are used frequently within the scope of a request, such as
database handles and session objects, were cached in mod_perl's
$r->pnotes() until the end of the request.

When trying out a new technology like the Error module, there are
bound to be some things to watch out for. We found a certain code
structure that causes a memory leak every time it is executed. It
involves nested try{} blocks, and looks like this:

It's not Graham Barr's fault that this leaks; it is simply a byproduct
of the fact that the try and catch keywords are implemented using
anonymous subroutines. This code is equivalent to the following:

my $foo;
$subref1 = sub {
$subref2 = sub {
$foo++;
};
};

This nested subroutine creates a closure for $foo and will make a
new copy of the variable each time it is executed. The situation is
easy to avoid once you know to watch out for it.

One of the big wins in our architecture was the use of Berkeley DB.
Since most people are not familiar with it's more advanced features,
we'll give a brief overview here.

The DB_File module is part of the standard Perl distribution.
However, it only supports the interface of Berkeley DB version 1.85,
and doesn't include the interesting features of later releases. To get
those, you'll need the BerkeleyDB.pm module, available from CPAN.
This module can be tricky to build, but comprehensive instructions are
included.

Newer versions of Berkeley DB offer many features that help
performance in a mod_perl environment. To begin with, database files
can be opened once at the start of the program and kept open, rather
than opened and closed on each request. Berkeley DB will use a
shared memory buffer to improve data access speed for all processes
using the database. Concurrent access is directly supported with
locking handled for you by the database. This is a huge win over
DB_File, which requires you to do your own locking. Locks can be at
a database level, or at a memory page level to allow multiple
simultaneous writers. Transactions with rollback capability are also
supported.

This all sounds too good to be true, but there are some downsides. The
documentation is somewhat sparse, and you will probably need to refer
to the C API if you need to understand how to do anything complicated.

A more serious problem is database corruption. When an Apache process
using Berkeley DB dies from a hard kill or a segfault, it can
corrupt the database. A corrupted database will sometimes cause
subsequent opening attempts to hang. According to the people we talked
to at Sleepycat Software (which provides commercial support for
Berkeley DB), this can happen even with the transactional mode of
operation. They are working on a way to fix the problem. In our case,
none of the data stored in the cache was essential for operation so we
were able to simply clear it out when restarting an application server.

Another thing to watch out for is deadlocks. If you use the page-level
locking option, then you have to handle deadlocks. There is a daemon
included in the distribution that will watch for deadlocks and fix
them, or you can handle them yourself using the C API.

After trying a few different things, we recommend that you use
database-level locking. It's much simpler, and cured our problems. We
didn't see any significant performance hit from switching to this mode
of locking. The one thing you need to watch out for when using
exclusive database level write locks are long operations with cursors
that tie up the database. We split up some of our operations into
multiple writes in order to avoid this problem.

If you have a good C coder on your team, you may want to try the
alternate approach that we finally ended up with. You can write your
own daemon around Berkeley DB and use it in a client/server style
over Unix sockets. This allows you to catch signals and ensure a safe
shutdown. You can also write your own deadlock handling code this way.

If you plan to do any serious Perl development, then you should really take
the time to become familiar with some of the available development
tools. The debugger in particular is a lifesaver, and it works with
mod_perl. There is a profiler called Devel::DProf, which also
works with mod_perl. It's definitely the place to start when
performance tuning your application.

We found the ability to run our complete system on individual's
workstations to be extremely useful. Everyone could develop on his own
machine, and coordinate changes using CVS source control.

For object modeling and design, we used the open-source Dia program
and Rational Rose. Both support working with UML and are great for
generating pretty class diagrams for your cubicle walls.

Do Try This at Home

Since we started this project, a number of development frameworks that
offer support for this kind of architecture have come out. We don't
have direct experience using these, but they have a similar design and
may prove useful to you if you want to take an MVC approach with your
system.

Apache::PageKit is a mod_perl module available from CPAN that
provides a basic MVC structure for Web applications. It uses the
HTML::Template module for building views.

OpenInteract is a recently released Web application framework in
Perl, which works together with the persistence layer SPOPS. Both
are available from CPAN.

The Application Toolkit from Extropia is a comprehensive set of Perl
classes for building Web apps. It has excellent documentation and takes
good advantage of existing CPAN modules. You can find it on
http://www.extropia.com/.

If you want a ready-to-use cache module, take a look at the Perl-cache
project on http://sourceforge.net/. This is the next generation of
the popular File::Cache module.

The Java world has many options as well. The Struts framework, part
of the Jakarta project, is a good open-source choice. There are also
commercial products from several vendors that follow this sort of
design. Top contenders include ATG Dynamo, BEA WebLogic, and
IBM WebSphere.

By building on the open-source software and community, we were able to
create a top-tier Web site with a minimum of cost and effort. The
system we ended up with is scalable to huge amounts of traffic. It runs
on mostly commodity hardware making it easy to grow when the need
arises. Perhaps best of all, it provided tremendous learning
opportunities for our developers, and made us a part of the larger
development community.

We've contributed patches from our work back to various open-source
projects, and provided help on mailing lists. We'd like to take this
opportunity to officially thank the open-source developers who
contributed to projects mentioned here. Without them, this would not
have been possible. We also have to thank the hardworking Web
developers at eToys. The store may be closed, but the talent that built
it lives on.

If you have questions about this material, you can contact us at the
following e-mail addresses: