Some secure programs accept data from one untrusted user (the attacker)
and pass that data on to a different user’s application (the victim).
If the secure program doesn’t protect the victim, the
victim’s application (e.g., their web browser)
may then process that data in a way harmful to the victim.
This is a particularly common problem for web applications using HTML or XML,
where the problem goes by several names including “cross-site scripting”,
“malicious HTML tags”, and “malicious content.”
This book will call this problem “cross-site malicious content,”
since the problem isn’t limited to scripts or HTML, and its cross-site nature
is fundamental.
Note that this problem isn’t limited to web applications, but since
this is a particular problem for them, the rest of this discussion
will emphasize web applications.
As will be shown in a moment, sometimes an attacker can cause a victim
to send data from the victim to the secure program, so the secure program
must protect the victim from himself.

Let’s begin with a simple example.
Some web applications are designed to
permit HTML tags in data input from users that will later
be posted to other readers (e.g., in a guestbook or “reader comment” area).
If nothing is done to prevent it,
these tags can be used by malicious users to attack other users by inserting
scripts,
Java references (including references to hostile applets), DHTML tags,
early document endings (via </HTML>), absurd font size requests,
and so on.
This capability can be exploited for a wide range of effects,
such as exposing SSL-encrypted connections, accessing restricted web
sites via the client, violating domain-based security policies,
making the web page unreadable,
making the web page unpleasant to use (e.g., via annoying banners
and offensive material),
permit privacy intrusions (e.g., by inserting a web bug to learn exactly
who reads a certain page),
creating
denial-of-service attacks (e.g., by creating an “infinite” number
of windows), and even very destructive attacks (by inserting
attacks on security vulnerabilities such as scripting languages or
buffer overflows in browsers).
By embedding malicious FORM tags at the right place, an intruder
may even be able to trick users into revealing sensitive information
(by modifying the behavior of an existing form).
Or, by embedding scripts, an intruder can cause no end of problems.
This is by no means an exhaustive list of problems, but
hopefully this is enough to convince you that this is a serious problem.

Most “discussion boards” have already discovered this problem,
and most already take steps to prevent it in text intended to be part of
a multiperson discussion.
Unfortunately, many web application developers don’t
realize that this is a much more general problem.
Every data value that is sent from one
user to another can potentially be a source for cross-site
malicious posting, even if it’s not an “obvious” case of an area
where arbitrary HTML is expected.
The malicious data can even be supplied by the user himself, since the
user may have been fooled into supplying the data via another site.
Here’s an example (from CERT) of an HTML link that causes the user to
send malicious data to another site:

In short, a web application cannot accept input (including any form data)
without checking, filtering, or encoding it.
You can’t even pass that data back to the same user in many cases
in web applications, since another user may have surreptitiously
supplied the data.
Even if permitting such material won’t hurt your system, it will
enable your system to be a conduit of attacks to your users.
Even worse, those attacks will appear to be coming from your system.

Fundamentally, this means that all web application output impacted
by any user must be
filtered (so characters that can cause this problem are removed),
encoded (so the characters that can cause this problem are encoded in
a way to prevent the problem), or
validated (to ensure that only “safe” data gets through).
This includes all output derived from
input such as URL parameters, form data, cookies,
database queries, CORBA ORB results, and data from users stored in files.
In many cases,
filtering and validation should be done at the input, but
encoding can be done during either input validation or output generation.
If you’re just passing the data through without analysis, it’s probably
better to encode the data on input (so it won’t be forgotten).
However, if your program processes the data,
it can be easier to encode it on output instead.
CERT recommends that filtering and encoding be done during data output;
this isn’t a bad idea, but there are many cases where it makes sense to do it
at input instead.
The critical issue is to make sure that you cover all cases for
every output, which is not an easy thing to do regardless of approach.

Warning - in many cases these techniques can be subverted unless you’ve also
gained control over the character encoding of the output.
Otherwise, an attacker could use an “unexpected” character encoding
to subvert the techniques discussed here.
Thankfully, this isn’t hard;
gaining control over output character encoding is discussed in
Section 9.5.

One minor defense, that’s often worth doing, is the "HttpOnly" flag for
cookies.
Scripts that run in a web browser cannot access cookie values
that have the HttpOnly flag set (they just get an empty value instead).
This is currently implemented in
Microsoft Internet Explorer, and I expect
Mozilla/Netscape to implement this soon too.
You should set HttpOnly on for any cookie you send, unless you have
scripts that need the cookie, to counter certain kinds of cross-site
scripting (XSS) attacks.
However, the HttpOnly flag can be circumvented in a variety of ways,
so using as your primary defense is inappropriate.
Instead, it’s a helpful secondary defense that may help save you in
case your application is written incorrectly.

The first subsection below discusses how to identify special
characters that need to be filtered, encoded, or validated.
This is followed by subsections describing
how to filter or encode these characters.
There’s no subsection discussing how to validate data in general,
however, for input validation in general see Chapter 5,
and if the input is straight HTML text or a URI, see
Section 5.13.
Also note that your web application can receive malicious cross-postings,
so non-queries should forbid the GET protocol
(see Section 5.14).

Here are the special characters for a variety of circumstances
(my thanks to the CERT, who developed this list):

In the content of a block-level element (e.g.,
in the middle of a paragraph of text in HTML or a block in XML):

"<" is special because it introduces a tag.

"&" is special because it introduces a character entity.

">" is special because some browsers treat it as special,
on the assumption that the author of the page really meant
to put in an opening "<", but omitted it in error.

In attribute values:

In attribute values enclosed with double quotes, the
double quotes are special because they mark the end of the attribute value.

In attribute values enclosed with single quote, the single
quotes are special because they mark the end of the attribute value.
XML’s definition allows single quotes, but I’ve been told that some
XML parsers don’t handle them correctly, so you might avoid
using single quotes in XML.

Attribute values without any quotes make the white-space
characters such as space and tab special.
Note that these aren’t legal in XML either, and
they make more characters special.
Thus, I recommend against unquoted attributes if you’re using
dynamically generated values in them.

"&" is special when used in conjunction with
some attributes because it introduces a character entity.

In URLs, for example, a search engine might provide a link within
the results page that the user can click to re-run the search. This
can be implemented by encoding the search query inside the URL. When
this is done, it introduces additional special characters:

Space, tab, and new line are special because they mark the
end of the URL.

"&" is special because it introduces a character
entity or separates CGI parameters.

Non-ASCII characters (that is, everything above 128 in the
ISO-8859-1 encoding) aren’t allowed in URLs, so they are all
special here.

The "%" must be filtered from input anywhere parameters
encoded with HTTP escape sequences are decoded by server-side
code. The percent must be filtered if input such as
"%68%65%6C%6C%6F" becomes "hello" when it appears on the web
page in question.

Within the body of a <SCRIPT> </SCRIPT>
the semicolon, parenthesis, curly braces, and new line
should be filtered in situations where text could be inserted
directly into a preexisting script tag.

One approach to handling these special characters is simply
eliminating them (usually during input or output).

If you’re already validating your input for valid characters
(and you generally should), this is easily done by simply omitting the
special characters from the list of valid characters.
Here’s an example in Perl of a filter that only accepts legal
characters, and since the filter doesn’t accept any special characters
other than the space, it’s quite acceptable for use in areas such as
a quoted attribute:

# Accept only legal characters:
$summary =~ tr/A-Za-z0-9\ \.\://dc;

However, if you really want to strip away only
the smallest number of characters, then you could create a subroutine
to remove just those characters:

An alternative to removing the special characters is to encode them
so that they don’t have any special meaning.
This has several advantages over filtering the characters,
in particular, it prevents data loss.
If the data is "mangled" by the process from the user’s point of view,
at least when the data is encoded it’s possible to reconstruct the
data that was originally sent.

HTML, XML, and SGML all use the ampersand ("&") character as a
way to introduce encodings in the running text; this encoding
is often called “HTML encoding.”
To encode these characters, simply transform the special characters
in your circumstance. Usually this means
“<” becomes “&lt;”,
“>” becomes “&gt;”,
“&” becomes “&amp;”, and
“"” becomes “&quot;”.
As noted above, although in theory “>”
doesn’t need to be quoted,
because some browsers act on it (and fill in a “<”)
it needs to be quoted.
There’s a minor complexity with the double-quote character,
because “&quot;” only needs to be
used inside attributes, and some extremely old browsers don’t
properly render it.
If you can handle the additional complexity, you can try to encode '"'
only when you need to, but it’s easier to simply encode it and ask
users to upgrade their browsers.
Few users will use such ancient browsers, and the double-quote character
encoding has been a standard for a long time.

Scripting languages may consider implementing specialized auto-quoting types,
the interesting approach developed in the web application framework
Quixote.
Quixote includes a "template" feature which allows easy mixing of HTML text
and Python code; text generated by a template is passed back to the web browser
as an HTML document.
As of version 0.6, Quixote has two kinds of text (instead of a single
kind as most such languages).
Anything which appears in a literal, quoted string is of type "htmltext,"
and it is assumed to be exactly as the programmer wanted it to be
(this is reasoble, since the programmer wrote it).
Anything which takes the form of an ordinary Python string, however,
is automatically quoted as the template is executed.
As a result, text from a database or other external source is
automatically quoted, and cannot be used for a cross-site scripting attack.
Thus, Quixote implements a safe default -
programmers no longer need to worry about quoting every bit of text
that passes through the application (bugs involving too much quoting
are less likely to be a security problem, and will be obvious in testing).
Quixote uses an open source software license, but because of its
venue identification it is probably GPL-incompatible, and is used by
organizations such as the
Linux Weekly News.

This approach to HTML encoding
isn’t quite enough encoding in some circumstances.
As discussed in Section 9.5,
you need to specify the output character encoding (the “charset”).
If some of your data is encoded using a different character encoding
than the output character encoding, then you’ll need to do something so your
output uses a consistent and correct encoding.
Also, you’ve selected an output encoding other than
ISO-8859-1, then you need to
make sure that any alternative encodings for special characters
(such as "<") can’t slip through to the browser.
This is a problem with several character encodings, including popular ones
like UTF-7 and UTF-8; see Section 5.11
for more information on how to prevent “alternative” encodings of characters.
One way to deal with incompatible character encodings is to
first translate the characters internally to ISO 10646 (which has
the same character values as Unicode), and then
using either numeric character references or character entity
references to represent them:

A numeric character reference looks like
“&#D;”, where D is a decimal number, or
“&#xH;” or “&#XH;”,
where H is a hexadecimal number.
The number given is the ISO 10646 character id (which has the same character
values as Unicode).
Thus &#1048; is the Cyrillic capital letter "I".
The hexadecimal system isn’t supported in the SGML standard (ISO 8879),
so I’d suggest using the decimal system for output.
Also, although SGML specification
permits the trailing semicolon to be omitted in
some circumstances, in practice many systems don’t handle it - so
always include the trailing semicolon.

A character entity reference does the same thing but
uses mnemonic names instead of numbers.
For example, "&lt;" represents the < sign.
If you’re generating HTML, see the
HTML specification which
lists all mnemonic names.

Either system (numeric or character entity)
works; I suggest using character entity references for
“<”, “>”,
“&”, and “"”
because it makes your code (and output)
easier for humans to understand. Other than that, it’s not clear
that one or the other system is uniformly better.
If you expect humans to edit the output by hand later, use the
character entity references where you can, otherwise I’d use the
decimal numeric character references just because they’re easier to program.
This encoding scheme can be quite inefficient for some languages
(especially Asian languages); if that is your primary content, you
might choose to use a different character encoding (charset), filter
on the critical characters (e.g., "<")
and ensure that no alternative encodings for critical characters are allowed.

URIs have their own encoding scheme, commonly called “URL encoding.”
In this system, characters not permitted in URLs are represented using
a percent sign followed by its two-digit hexadecimal value.
To handle all of ISO 10646 (Unicode), it’s recommended to first translate
the codes to UTF-8, and then encode it.
See Section 5.13.4 for more about validating URIs.