FORM submission and i18n

(Ad-hoc tutorial-ish
notes)

This is a complex area, not made any easier by
browser bugs and oddities.
This is one of the i18n topics briefly covered in the W3C's
i18n presentation to the 1999 Unicode conference, starting at Forms: The i18n problem.
The present page, although not making any claim to being a
complete survey, deals with a number of practical issues,
and looks at some of the principles behind them.

By now (2005) the robust approach is to send
out forms pages encoded in utf-8, expecting the forms input to be
submitted back using that encoding. This has been in practical use
for a couple of years now (e.g at Google) and can be expected to
work with any current HTML4-compatible browser.
However, there are other browsers still in use which don't fit
that description, so it still seems relevant to look at the theory
and compare it with observations.

Snippets
reported about browser behaviour and search engine options etc. have
been collected at various different times, and there tend to be
frequent changes. So please use them only as specimens of what
can happen (aka what can go wrong).

The present page only aims to cover issues which are of practical
relevance and utility at the time of writing;
it makes no attempt to cover future developments: those interested
might want to read the Working Draft for Web Forms 2.0.

Background to i18n in FORM

As you see, the default content-type is
application/x-www-form-urlencoded, and this is the only
content-type available when the method is GET.

When the method is POST on the other hand, a second
content-type will be available (except on antique browsers), namely
multipart/form-data.
Support for both is mandatory in
browsers which support current versions of HTML.
(Support for other form-submission content types is optional, and
therefore shouldn't be relied on by authors).

In principle, the server can inform the
client of which character
codings it is willing to support in submitted forms, by using the Accept-charset attribute on the HTML
form tag. (Browser support for this still
seems to be quite patchy.
Do not confuse it with the issue of a client
which sends an HTTP Accept-charset header to the server
to tell it what document character encodings it is willing to receive:
there is certainly a fair amount of browser support for that,
but it isn't the issue under discussion here.)

We comment later on implications for the compatibility of these
options with some general features of WWW protocols, such as
idempotence.
For now, we concentrate on i18n issues.

application/x-www-form-urlencoded

Theory

This is the only content-type available when the method is
GET; it is also the default content-type for method
POST.

According to the HTML4.01 specification, the only characters that
you are entitled to rely on in this situation are those
of us-ascii, i.e the 7-bit repertoire.

Realistically, however, browsers and other client agents do not
enforce this restriction, and will typically handle
characters outside of that repertoire
by applying the same %xx hex coding that
they apply to unsafe characters of the us-ascii repertoire.
But this is not unproblematical, as we will see.
Nevertheless, as an author, this isn't under your control:
readers can and will submit extended characters - there's
nothing you can do to stop them - so your server-side
scripts need to be able to do something with them - if only
to recognise them and politely refuse them (but preferably something
more constructive).

The theoretical problem is that when the form is submitted,
the server normally receives
no indication of which character encoding (charset)
the client thinks it is using.
Thus the server will receive some %xx-coded
representations of octets (bytes):
but without knowing what charset to apply,
it cannot unambiguously interpret these codes in terms of characters.
But see the next sub-section about actual practice.
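
To make the ambiguity concrete, here is a minimal Python sketch (the
sample query value is my own invention, chosen purely for illustration).
It applies the %xx-decoding to get octets, and then reads the very same
octets under three different charsets:

```python
from urllib.parse import unquote_to_bytes

# A raw value as it might arrive in a query string, with no charset attached.
raw_value = "%D4%C5%D3%D4"

# Step 1: %xx-decoding yields octets, not characters.
octets = unquote_to_bytes(raw_value)

# Step 2: the same octets give different text under each candidate charset.
interpretations = {
    charset: octets.decode(charset)
    for charset in ("koi8-r", "windows-1251", "iso-8859-1")
}

for charset, text in interpretations.items():
    print(f"{charset:13} -> {text}")
# koi8-r        -> тест
# windows-1251  -> ФЕУФ
# iso-8859-1    -> ÔÅÓÔ
```

All three decodings succeed without error, so nothing in the data itself
tells the server which reading was intended.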

This seems to be inevitable with method GET, for which
the protocol does not provide a defined way for
a charset attribute to be conveyed with the
submitted form contents.
With method POST, on the other hand, the request contains an
entity body, and the HTTP protocol (see RFC 2616 section 7.2.1) says the request should have a
Content-type header. Klaus Weide cites RFC2070
as the basis for the client putting a charset attribute value here,
as in this example:

Content-type: application/x-www-form-urlencoded; charset=koi8-r

See RFC2070 section 5.2 for details.
However, experience shows that many poorly-written server-side
scripts would be confused by this: a typical compromise
chosen by browser developers is not to send
the charset attribute if it's identical to the
encoding of the page from which the form is being submitted.
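
A server-side script that wants to cope with both cases might pick out
the charset attribute when it is present, and otherwise fall back to the
encoding of the page that it sent out. The following is only a sketch:
the function name and the fallback argument are my own, and it leans on
the header parser from Python's standard email package rather than on
anything HTTP-specific.

```python
from email.message import Message

def charset_from_content_type(header_value, fallback=None):
    """Return the charset attribute of a Content-type value, or the fallback."""
    message = Message()
    message["Content-Type"] = header_value
    # get_content_charset() lower-cases the value and returns the
    # fallback when no charset attribute is present.
    return message.get_content_charset(fallback)

print(charset_from_content_type(
    "application/x-www-form-urlencoded; charset=koi8-r"))  # koi8-r
print(charset_from_content_type(
    "application/x-www-form-urlencoded", fallback="iso-8859-1"))  # iso-8859-1
```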

Some authors, when designing a form which could be sent out
with different encodings, will include in their form a "hidden
value" which will be submitted as part of the form data, to remind
the server of which character coding was in effect.
But of course this assumes that the browser is always submitting
with the same character encoding as was used for the original page:
if the user managed to override that (e.g by manually setting a
different character encoding in their browser), then that kind of
hidden field would be worse than useless.

Practice

In practice, browsers normally display the contents of text fields
according to the character encoding (charset) that applies
for the HTML page as a whole; and when they submit the text fields,
the data are effectively in this same coding.
Thus if the server sent out the page containing the form with
a definite charset specification, it could normally
assume that the submitted data can be interpreted in accordance
with the same charset, and this is, at heart, what
actually happens.
There are however anomalies of various kinds,
some of which have been seen and understood by the author of this
note, some of which have been seen and not understood, and some
of which are only anecdotal.

In addition to these considerations, some users may be typing-in or
pasting-in text from an application that uses their local character
coding (practical examples being macRoman on a Mac;
or MS-DOS CP850 being copied out of a DOS window on an MS Windows PC),
into a text field of a document that used the author's -
different - character encoding (let's say for the simplest
example, iso-8859-1): the user might then submit the form,
disregarding that what they are seeing in the text field is not what they
intended to send. From anecdotal evidence it appears that some folks
analyzing survey responses expected %xx-representations
of 8-bit-coded characters, but sometimes got clusters of
%xx-representations which turned out to be utf-8 instead:
whether this would have been evident or not to
the person doing the submitting was unclear.

Another commonly observed behaviour on Windows platforms arises with
a form which is in an iso-8859 coding, where
the user pastes in characters
(such as clever-quotes, trademark, euro sign etc.) which only exist in the
corresponding Windows coding; e.g for Latin-1 the codings would
be respectively iso-8859-1 and Windows-1252.
In the iso-8859 encodings, these character positions do not
represent displayable characters (they are in a
range reserved for control functions). Some browsers disregard the
mismatch and simply submit the character as the corresponding
%xx code in the range %80-%9F,
as if the browser thought it was
handling the Windows coding instead. Others replace these inappropriate
characters by some kind of substitute, which may be useful (e.g clever-quote
replaced by plain quote) or useless (e.g all unrepresentable characters
replaced by question-mark). For MSIE5's surprising behaviour
see later in this page.
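
One pragmatic server-side response to the first of those behaviours is
to treat octets in the range 80-9F, arriving from a form that was sent
out as iso-8859-1, as if they were Windows-1252. This is a sketch of
that idea only (the function name is mine), and it is a heuristic
rather than anything the specifications sanction:

```python
def decode_latin1_form_octets(octets):
    """Decode octets from an iso-8859-1 form, tolerating Windows-1252 extras."""
    if any(0x80 <= octet <= 0x9F for octet in octets):
        # Genuine iso-8859-1 text has only control functions here, so
        # assume the browser submitted Windows-1252. A few positions are
        # unassigned in Python's cp1252 table, hence errors="replace".
        return octets.decode("windows-1252", errors="replace")
    return octets.decode("iso-8859-1")

print(decode_latin1_form_octets(b"\x93quoted\x94"))  # “quoted”
print(decode_latin1_form_octets(b"caf\xe9"))         # café
```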

So...

It would therefore have to be concluded that, all other things being
equal, this form submission content-type should be avoided for
serious i18n work.
However, all things probably aren't equal, and in particular it's
better to perform searches and other idempotent transactions using
the GET method.
So you find various heuristic methods being used.
Fortunately, in reasonably recent browsers (this written
as of 2005), in practical terms the GET method can be used successfully
if the form is sent using utf-8 encoding, even though,
theoretically, this lies well outside of what the HTML specification
says we're entitled to rely on.
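
As a rough illustration of that utf-8 approach, the sketch below builds
the query string that a current browser would be expected to send for a
utf-8 form, and then decodes it strictly on the server side so that
input which is not utf-8 gets noticed rather than silently mangled. The
field name "q" and the helper name are assumptions of mine.

```python
from urllib.parse import urlencode, parse_qs

# Client side: the expected submission for a utf-8 form using GET.
query_string = urlencode({"q": "caf\u00e9 \u017e"}, encoding="utf-8")
print(query_string)  # q=caf%C3%A9+%C5%BE

def parse_utf8_query(raw_query):
    """Decode a query string as utf-8, or return None if it is not utf-8."""
    try:
        return parse_qs(raw_query, keep_blank_values=True,
                        encoding="utf-8", errors="strict")
    except UnicodeDecodeError:
        return None  # not utf-8: some other strategy is needed

print(parse_utf8_query(query_string))  # {'q': ['café ž']}
print(parse_utf8_query("q=caf%E9"))    # None
```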

HTML sent out without a charset attribute?

N.B it's strongly recommended in all instances to send out
HTML (and other text-type media) with the correct charset
attribute on their media type ("Content-type" header in HTTP).
(See also the
notes to CA-2000-02.)
This is even more important when dealing with forms.
Nothing that is said here should be interpreted
as encouraging the omission of the attribute: the
discussion here is only about the possible consequences
of its omission.

If a document is sent out without an explicit charset
specified, then it will (typically) be handled by the browser using whatever
default character encoding has been selected by the
reader.
In the various browser/versions that I have tried, this is true not
only for displaying the document to the reader, but also in processing
any forms input.

An even more exciting possibility is that the browser defaults to
guessing the encoding, based on the page's
contents.
From some anecdotal evidence, there's a suggestion that MSIE
can revise its guess, depending on what the user pastes into
their form submission fields!

Of course, as you'll see from the discussion elsewhere, the server
usually gets no direct indication of
what character encoding this was.
Clearly, it is inadvisable to work in this way:
the content should be sent with its character encoding
(charset) explicitly defined. Except in situations
where there is only one natural character encoding needing to be
supported, there should preferably also be some kind of mechanism to allow
users to get documents in their preferred coding in order to
facilitate proper forms input (unless you decide to support only
utf-8 submission, with a polite note to users of
older browsers).

multipart/form-data

In this coding, the browser constructs multipart packages of
the "successful form controls", according to the principles of
MIME encoding.
This gives the opportunity for formal support of the full
unicode character repertoire; and for the server to notify the
client of which character encodings it is willing to accept
(accept-charset) and the client to include, in
the MIME-packaged submission, details of which character encoding
it is using for the submission.

At this point I must admit I have not conducted an extensive
study of browser coverage of these features.
Ian Graham comments in a usenet posting that browsers don't actually
specify the charset of the MIME parts.
Other discussion suggests that the situation is similar to what we
described above for application/x-www-form-urlencoded:
there are so many poorly-coded server-side scripts out there which
can't cope with the presence of this attribute, that browser
implementers are inclined to leave it off if it's the same as
the HTML page from which the form was submitted.
They reserve the explicit sending of an encoding (charset=)
for when a form is submitted in a different encoding - for which the
form implementer has to take some specific action, e.g specifying
an Accept-charset on the HTML form,
which may be understood as a signal that they are willing and able
to deal with the consequences in their server-side script.

By the way, this is also the advertised way to support
file uploading,
which again can (and probably should) involve a correct
specification of the character encoding when text-files are
involved.

Techniques (discussion)

Hidden "buzzword"

The author can add into the form a carefully-crafted
"hidden" field which contains a number of diagnostic characters.
When this field is submitted, the server can investigate the format of
what has been submitted, and reach some conclusions as to what coding
the client software was using.
This technique, which I knew as the "buzzword"
in other contexts before,
has been suggested by a number of people in relation to the current
problem, for example Jukka Korpela (in email) and the W3C i18n
tutorial already cited.

Ingenious indeed, and this should be able to compensate for a
range of possible behaviour within the client software itself.
It cannot, of course, do anything about mis-handling of characters
that are being pasted into the form from elsewhere.
But it seems to me to be well worth considering, especially in
contexts where several different encodings are in active use
(an example might be Russian Cyrillic, where at least three
different 8-bit encodings have been widely used).
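
For the Cyrillic case, the server-side half of the technique might look
something like this sketch. The two-letter buzzword and the list of
candidate encodings are my own choices for illustration; the function
takes the raw (still %xx-coded) value of the hidden field.

```python
from urllib.parse import unquote_to_bytes

# Hypothetical diagnostic string placed in the hidden field: Cyrillic
# "zhe" and "ya". Two letters are used because the first one has the
# same octet (D6) in both koi8-r and iso-8859-5.
BUZZWORD = "\u0436\u044f"
CANDIDATES = ("utf-8", "koi8-r", "windows-1251", "iso-8859-5")

# Map each possible octet sequence back to the encoding that produces it.
SIGNATURES = {BUZZWORD.encode(charset): charset for charset in CANDIDATES}

def detect_charset(raw_buzzword_value):
    """Return the candidate encoding matching the submitted buzzword, or None."""
    return SIGNATURES.get(unquote_to_bytes(raw_buzzword_value))

print(detect_charset("%D6%D1"))        # koi8-r
print(detect_charset("%E6%FF"))        # windows-1251
print(detect_charset("%D0%B6%D1%8F"))  # utf-8
print(detect_charset("%3F%3F"))        # None (browser substituted "??")
```

A result of None is itself useful information: it says that the client
did something which none of the expected encodings explains.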

Heuristic recognition of utf-8?

Valid utf-8 characters consist of either individual
bytes/octets with their
top bit unset, representing a single us-ascii character; or a
sequence of n bytes with their top bit
set, in which the value of n can be
determined by inspecting the upper bits of the first byte
of the sequence.
There are some more-detailed checks that can be done on the
values of the octets (read more about utf-8 in Markus
Kuhn's pages or in RFC2279.)
If the text does not fit this pattern, then it cannot be utf-8:
if it does fit this pattern then it might or might not be utf-8,
but the more (non-ASCII) data you get which still
fits the pattern, the less likely it is to have happened by chance.
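
In Python, a strict utf-8 decode performs this pattern check (including
the more-detailed checks on octet values), so a sketch of the heuristic
can be quite short. Note that Python's decoder follows the current,
stricter definition of utf-8 rather than RFC2279 itself, and that the
function name and return convention here are my own.

```python
def utf8_evidence(octets):
    """Return None if the octets cannot be utf-8; otherwise return the
    number of non-ASCII characters found, as a rough measure of how much
    evidence there is that the text really is utf-8."""
    try:
        text = octets.decode("utf-8", errors="strict")
    except UnicodeDecodeError:
        return None
    return sum(1 for character in text if ord(character) > 0x7F)

print(utf8_evidence(b"caf\xe9 au lait"))     # None: a lone E9 is not utf-8
print(utf8_evidence(b"caf\xc3\xa9 au lait")) # 1
print(utf8_evidence(b"plain text"))          # 0: fits, but proves nothing
```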

The above is true (you'll see it mentioned in the just-cited
RFC2279 too) for any writing system, not only for Latin-based
scripts.

In "Latin" script, at least, it would be rare in
normal text to have clusters of accented letters: it's very
likely that, somewhere in the text, an individual accented
letter will appear with a us-ascii character on each side of
it. Such a sequence in an 8-bit coding could not be mistaken
for utf-8.

If the characters are from the Latin-1 repertoire, then their
utf-8 sequence will consist of two octets, of which the
first will appear to be either Â (A-circumflex) or Ã
(A-tilde) if they are (mis)interpreted as iso-8859-1.
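
That observation is easy to confirm for the whole of the upper half of
Latin-1. A small sketch (the helper name is mine):

```python
def misread_as_latin1(character):
    """Encode a character as utf-8, then misread the octets as iso-8859-1."""
    return character.encode("utf-8").decode("iso-8859-1")

print(misread_as_latin1("\u00e9"))  # Ã©  (e-acute, octets C3 A9)
print(misread_as_latin1("\u00a9"))  # Â©  (copyright sign, octets C2 A9)

# Every character from U+0080 to U+00FF starts with one of the two.
leading = {misread_as_latin1(chr(cp))[0] for cp in range(0x80, 0x100)}
print(sorted(leading))  # ['Â', 'Ã']
```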

If the text contains no bytes with the high bit set at all,
then it can be treated as us-ascii, and it matters not whether we
label it utf-8 or iso-8859-anything, since they're all the same
as far as us-ascii texts are concerned.

Beyond that, the only thing one can say for sure from the
character-encoding point of view is that if even a single unit
fails the utf-8 check then the document cannot be a valid utf-8
encoded document.
If it passes the check, it's no absolute proof that the
document is utf-8 encoded: in the absence of some authoritative
reason to believe that it's utf-8, this would have to be assured by
heuristics, such as verifying that the content makes some kind of
sense in its intended language etc.
Some of the browsers (e.g Mozilla) have rather good routines for
guessing character encodings, given an appropriate source of material
to work on; but if they are fed something less suitable, they can
come up with bizarre answers.
And if the submitted material is not under proper control, it
could have been cobbled together from sources that were
in different character encodings, meaning that basically "all bets
are off".
At least it should be evident to whoever pasted text into the
textarea for submission, that something is wrong with their input
before they submit it (we're not talking about uploaded files here,
of course, which are a different matter).

In that analysis, I've disregarded utf-7 format (which would be wrongly
identified as us-ascii), as being inappropriate for use in an HTTP
context.
One might mention, however, that when MSIE is set to auto-detect
character encodings, it has been known to mis-identify some
us-ascii pages, claiming them to be in utf-7.

The _charset_
"hidden field" browser feature

Jon Warbrick calls my attention to a long-standing feature
of MSIE, of which I was previously unaware,
and which has evidently also been implemented in Mozilla
(but not, it seems, in Opera 8.5).
If the form contains a hidden field which has been named
_charset_ (note the leading and trailing underscore
characters as part of the name), then the browser will fill-in
the submitted character encoding as the field's value.

Some of the web pages which discuss this feature suggest that
it will only be actioned if the form specifies
an Accept-charset attribute, but this doesn't seem
to be accurate.
At any rate, of course, this "feature" is not a defined part of
the current form submission protocol, and so it would be
unwise to rely on it, but it could nevertheless be
incorporated as a useful clue, to be used if found to be present,
but including some other strategy for browsers which don't do it.
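
Used in that spirit, as a clue with a fallback, the server-side handling
might be sketched as follows. The function names are mine, the fallback
is simply the encoding in which the form page was sent out, and repeated
field names are not catered for.

```python
import codecs
from urllib.parse import unquote_to_bytes

def split_fields(raw_query):
    """Split a urlencoded string into (name, value) pairs of raw octets."""
    pairs = []
    for part in raw_query.split("&"):
        if part:
            name, _, value = part.partition("=")
            pairs.append((unquote_to_bytes(name.replace("+", " ")),
                          unquote_to_bytes(value.replace("+", " "))))
    return pairs

def decode_form(raw_query, page_charset):
    """Decode a submission, trusting _charset_ only if it names a known codec."""
    pairs = split_fields(raw_query)
    charset = page_charset  # fallback for browsers which don't fill it in
    for name, value in pairs:
        if name == b"_charset_" and value:
            try:
                charset = codecs.lookup(value.decode("ascii")).name
            except (LookupError, UnicodeDecodeError):
                pass  # unknown or garbled label: keep the fallback
    return {name.decode(charset, "replace"): value.decode(charset, "replace")
            for name, value in pairs}

print(decode_form("q=%D4%C5%D3%D4&_charset_=koi8-r", "iso-8859-1"))
# {'q': 'тест', '_charset_': 'koi8-r'}
```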

General WWW issues

Idempotence

The method GET is defined to be apt for idempotent
transactions (transactions which may be repeated without causing
harm), and it is
recommended explicitly in the HTML specification as "ideal" for
use with applications such as searches.
Unfortunately there are other considerations to take into account,
for example implementation limits on the length of URLs,
site-providers' desire not to have the parameters of a
query visible in the URL window, and so on.

Thus, in spite of the "other things being equal" advice, which is
definitely good, of using method GET when the transaction is
idempotent, there may sometimes be supervening reasons which indicate
the use of method POST.

Search Engines

These examples have been pointed out by Andreas Prilop, who is
maintaining some pointers to multilingual search engine facilities on his
Multilingual
Macintosh Resources page.
However, the details change with time, so some of the details shown
here may have got out of date by the time that you read them.

By 2005, support for utf-8 encoded forms submission is much improved,
both in terms of browser support and in terms of indexing at the search
sites.
The need to make queries in a specific 8-bit character code,
chosen according to language group or region,
is fading away, except for obsolete browsers
such as Netscape 4.* versions.

All the Web used to offer from their advanced search
menu a wide choice of browser character-encoding
settings (denoted "character set" on their menu).
However, as of 2005 these options had vanished
from their web query page, and
I could find no mention of them in their query guide.
On their News search page, on the other hand,
an encoding option was present (but without any annotation or
explanation), and, when a particular query encoding was selected, the
new query page reflected the selected encoding in its own encoding,
as you'd expect from the preceding discussion.

However, Andreas had evidence that to be successfully found by
AllTheWeb, documents need to have their coding specified by a
META...charset specification
contained in the document (a pity,
because other considerations strongly favour the use of a real
HTTP header for this purpose, rather than a META
within the document).

AltaVista offers its users the opportunity to use a
customised search page. Refer to this
introductory page at AltaVista
or follow the "Custom Settings" option from their main search page.

When this note was originally written,
the URL for a character-encoding-specific query page looked like
http://www.altavista.com/?enc=iso88592
(in this case for iso-8859-2), and you could bookmark several
different codings which you wanted to use.
As an inspection of the customised search page showed, Altavista then
used a normal GET query, accompanied by a "hidden field"
to remind Altavista of the character encoding which you had
configured.

Andreas reported that Altavista handled these codings properly, in
the sense that:

a search term specified in iso-8859-2
finds also pages coded in Windows-1250, where "s with caron" is
not xB9 but x9A.

Finding strings in utf-8 coded documents is also supported.

(At that time, some other search engines indexed the data only as
a "bunch of bytes", meaning that one needed to search several
times with different encodings - a nightmare for Cyrillic, where
several incompatible codings are widely used.
By 2005, most of the successful search engines have resolved this
problem: a single query can find all occurrences of the terms, no
matter how they were encoded in the indexed documents -
provided, of course, that they were encoded correctly, and the
correct encoding was sent out from their server when the indexing
bot retrieved them.)

By 2005, the details of character coding had disappeared from
their custom search configuration, in favour of general use of utf-8.

The URL http://www.altavista.com/
has been observed to produce different results according to where
the client appears to be located: the server may be guessing at
the user's preferred language and/or responding to the user's
language preference setting -
the details appear to change from time to time, so I'm not
attempting to describe their behaviour in detail here.
The URL http://altavista.com/ also responds (with
a redirection), but the redirection may or may not produce the
same result as the http://www.altavista.com/ URL
itself.
It's all rather confusing.

Language tools are available
(in this case the English version of them).

Google's earlier support for a variety of 8-bit codings seems to have
faded away by now (2005), in favour of general use of utf-8.
There's some indication that users of older browsers may be offered a
different user interface. Again, these details change with time and I'm
not able to keep these notes continuously updated.

Writing direction (rtl, ltr)

I have a separate page about
text-direction; on this page I just make some points about
forms submission.
Well, I don't say anything myself, but I quote the comments from
A.Prilop, who includes links to searches in an rtl
language in his Hebrew links page.
He writes:

Comments on searches

The fact that this kind of mechanism is being offered by
several search facilities, seems to indicate that
the providers feel that this works well enough, in the browsers
that their audience will be using.
Some features don't work in Netscape 4.* browsers, which is no
surprise on account of their behaviour described elsewhere on this page.

When their multi-language query page was supporting alternative
query encodings,
both Altavista and "All the Web" put this selection close to
a language-selection filter, as if they were coupled.
However, neither of them really required you to make a
language selection as such, if you only wanted to specify
a query encoding.
But, as I say, their use of alternative character codings seems
to have faded away now, in favour of using utf-8.

Browsers

The rest of this note describes some tests with
browsers.
I'm afraid there hasn't been time to keep them continuously
up to date, so the selection of browser versions is quite erratic.

Win MSIE (various versions)

Some fun with Win MSIE5

This report is all based on submitting using method GET and the
default form submission encoding. The same results were found using
INPUT TYPE=TEXT as with TEXTAREA.
Tests were with MSIE5 on Win
platforms (I didn't see any difference between Win95 and NT4).

It's already been pointed out that the published specifications
only define the behaviour for
us-ascii characters, so, strictly speaking, no-one can complain about
what happens. But nevertheless.

If I pasted the Windows matched-quotes into a form within an HTML
document that had charset=windows-1252, then they went into the raw
query string as %91 %92 %93 and %94 , which indeed are the %xx-codings
of the matched quotes in codepage 1252. So far, so good.

If I did the same thing with the HTML document in charset=utf-8, then
what got submitted were %E2%80%98, %E2%80%99, %E2%80%9C, %E2%80%9D,
which are indeed the %xx-codings of the correct octet sequences for a
utf-8 representation of the unicode characters U+2018 U+2019 U+201C
U+201D. So that's behaving as expected.
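
These two well-behaved cases can be reproduced independently of any
browser. The sketch below just encodes the four characters in the
page's charset and %xx-codes the resulting octets; the helper name is
mine.

```python
from urllib.parse import quote

# The four Windows "matched quotes": U+2018 U+2019 U+201C U+201D
MATCHED_QUOTES = "\u2018\u2019\u201c\u201d"

def urlencode_text(text, charset):
    """Encode text to octets in the given charset, then %xx-code them."""
    return quote(text, safe="", encoding=charset, errors="strict")

print(urlencode_text(MATCHED_QUOTES, "windows-1252"))
# %91%92%93%94
print(urlencode_text(MATCHED_QUOTES, "utf-8"))
# %E2%80%98%E2%80%99%E2%80%9C%E2%80%9D
```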

However, the fun starts if I try submitting a form that's in
charset=iso-8859-1 with this browser.
What then turns up in the raw submitted string is this
(taking just one example from the four):

%26%238220%3B

Applying the %xx-decoding to that, we find that it reads

&#8220;

in other words, a completely unsolicited HTML-isation has been
performed on this input character. The result of submitting that
single character is then totally indistinguishable from what happens
if one types the character string "&#8220;" (without the quotes of
course) into the text field. Both of them produce %26%238220%3B in
the raw submitted string.
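
The following sketch mimics the observed behaviour, using Python's
"xmlcharrefreplace" error handler to stand in for whatever the browser
does internally (that mapping is my assumption, not a description of
the browser's code), and shows why a server-side script that tries to
undo the HTML-isation is necessarily guessing. It does not model the
&entity; variant.

```python
import re
from urllib.parse import quote, unquote

def htmlised_submit(text, page_charset="iso-8859-1"):
    """Unrepresentable characters become &#number; before %xx-coding."""
    return quote(text, safe="", encoding=page_charset,
                 errors="xmlcharrefreplace")

_DECIMAL_NCR = re.compile(r"&#([0-9]{1,7});")

def undo_htmlification(value):
    """Turn &#number; back into characters: a guess, not a certainty."""
    def replace(match):
        code_point = int(match.group(1))
        if code_point > 0x10FFFF or 0xD800 <= code_point <= 0xDFFF:
            return match.group(0)  # not a usable code point: leave as-is
        return chr(code_point)
    return _DECIMAL_NCR.sub(replace, value)

pasted = htmlised_submit("\u201c")   # user pasted the single character
typed = htmlised_submit("&#8220;")   # user typed the seven-character string
print(pasted, typed)                 # %26%238220%3B %26%238220%3B

received = unquote(pasted, encoding="iso-8859-1")
print(received, "->", undo_htmlification(received))  # &#8220; -> “
```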

My argument against this piece of DWIM-ery is that the specification
of the forms url-encoding format contains no reference whatever to
HTML notations. Url-encoded forms submissions
might be used for submitting plain text, CSV data,
or all manner of other stuff: it's mere happenstance that it's sometimes
also used as a means of submitting HTML source code.
Some correspondents have argued that what MS is doing here (and, as
we will see, Mozilla went on to do the same) is to provide
useful extra functionality for a situation that lies outside of the
existing specifications, and, without which, these characters would
have to be rejected from the submission.

Well, "so far, so good". But my argument (if I hadn't already
"missed the boat" on this) would be that once such HTML-ification
has occurred, it's impossible to know whether the submission is
an attempt to submit a single Unicode character, or an attempt
to submit the character string &#number;.
My argument would be that the existing urlencoding specification is
based upon %xx encoding (xx being two hexadecimal digits),
and is defined in an unambiguous
way, since, if the % character is meant to be taken
literally, the character itself gets %xx-encoded for submission.
I would argue that, if an extension was wanted, it would be better
to base it on this same mechanism.
For example, by defining a
(hypothetical!) format %{xxxxxx}
for encoding Unicode characters by means of a variable number of
hex digits.
Under the existing specifications, that string never gets sent to
the server: so such an extension could be comfortably defined
without ambiguity. In order to send such a character string as
data, then the normal url-encoding rules already do
call for the % character to be url-encoded in the
usual way, and that would not change.
However, as I say, this kind of proposal appears to have missed the
boat, since the actual browsers out there have been doing what they
do, for quite some time already, while the specifications (or rather,
the gaps in the specifications) on which they are based, have not been
developed to address this issue as such.

Note:

An email correspondent writes to point out that after applying
the security fix bundle Q824145,
MSIE5.5 was found to have stopped applying the above "HTML-ification":
the change may affect other IE versions too, but he had only tried 5.5.
Of course, in a WWW context you cannot rely on users to apply
fixes promptly, nor even "at all", not even "security fixes"; so
server-side scripts would still need to do something sensible
in both cases.

"Always send urls as utf-8"

MSIE5.* (maybe others too) have on their "Tools-> Internet Options->
Advanced" menu an option shown as "Always send urls as utf-8".
As far as I can make out, this option relates to sending URLs which
contain non-ascii characters, e.g from a URL dialog box or HTML source
code. It does not appear to have any effect on
forms submission of text strings, which still behave as described above
when the option is turned on, at any rate
in the browser versions tested.

Ed Batutis writes to comment on this point:

This applies to the 'resource' part of the URL only, not the query part. URLs
for many Asian-language sites are a horrible mess - it is easy to find links
with raw multibyte characters in URLs (not url-encoded). If you thought that
just form data could be totally screwed up by character encoding issues, in
these cases you can't even navigate the site if things go awry! Typically the
only way things can hang together in this arena is if the browser schleps the
bytes through without changing them in any way. But you can imagine the
problems that arise. So, "Always send urls as utf-8" attempts to cut the
Gordian knot - and it works as long as the server is expecting this. It
probably should be a server-specific setting, however, since the server has to
have code that figures out what is going on. It seems to work nicely on IIS and
some versions of Apache...

But the short answer is that this section isn't really relevant to
the formatting of forms submissions, so it was a bit out of place here.

Yet more fun with MSIE6

MSIE6 seems to me to continue the pattern set by IE5.5:
for characters which cannot be represented in the relevant
character encoding, it submitted the %xx-encoded
representation of &#number; (decimal),
except for Latin-1 characters, where it
submitted the %xx-encoded
representation of &entity;.
However, there are reports which suggest this isn't always
the case.

For a form that doesn't specify an encoding, just plain
method="GET" (although the HTTP headers given by the
server say utf-8), IE6 uses "%u017E" in the url rather than
%C5%BE. I think they're just trying to be difficult...

Well, I then did a web search for related symptoms, but found relatively
little. This SecurityFocus article makes mention of the format;
this www-i18n mailing list thread seems to talk about IE6 but without,
as far as I can see, mentioning the use of %u format.
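
A server-side script that wants to tolerate the %uXXXX form alongside
ordinary %xx coding could decode both in a single pass, along the lines
of this sketch. The function name is mine; I assume exactly four hex
digits after %u, and make no attempt to combine surrogate pairs.

```python
import re
from urllib.parse import unquote_to_bytes

# Either one %uXXXX token, or a run of ordinary %xx tokens.
_TOKEN = re.compile(r"%u([0-9A-Fa-f]{4})|((?:%[0-9A-Fa-f]{2})+)")

def decode_value(raw_value, charset="utf-8"):
    """Decode a urlencoded value which may contain %uXXXX tokens."""
    def replace(match):
        if match.group(1) is not None:
            return chr(int(match.group(1), 16))
        return unquote_to_bytes(match.group(2)).decode(charset, "replace")
    # "+" means space in form data; a literal plus arrives as %2B.
    return _TOKEN.sub(replace, raw_value.replace("+", " "))

print(decode_value("%u017E"))       # ž
print(decode_value("%C5%BE"))       # ž
print(decode_value("a+b%25u017E"))  # a b%u017E  (literal text, left alone)
```

Doing it in one pass matters: if the %xx decoding were applied first, a
literal "%u017E" typed by the user (which arrives as %25u017E) would
then be mistaken for the non-standard token.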

Further IE bugs

A Usenet posting (in German)
reports a situation in which pasting a Euro character into a text field
resulted in the browser omitting one of the other fields
(a hidden field, in this case) from the submission.
I was able to confirm this (mis)behaviour myself in Win IE5.5 using the
same source but a different reporting mechanism (to rule out any
problem with the original PHP reporting).
The posting sets out the details of the environment in which this
misbehaviour was observed; as yet we don't know how widely the
misbehaviour could be reproduced.
The document coding was iso-8859-1, which of course does
not contain the Euro character: but, as the poster remarks, it's not
possible to limit the characters which one's users are typing or pasting
into a document.

And an email correspondent who had located the previous paragraph,
informs me in Feb.2004 that a very similar
misbehaviour was observed with all of Win IE5, 5.5 and 6 "with the most
recent hotfixes applied", under a particular range of circumstances.
It's confirmed that the problem is specific to the
multipart/form-data submission enctype.

In June 2004 I got an email from Christian Gosch, describing a complicated
problem observed with IE6 (SP1) in relation to
multipart/form-data submission format.
Under certain circumstances, "the first boundary and the
directly following static text block were missing", as indeed had
been correctly reported by their server process on receiving this
defective form submission from IE.
My informant listed a number of issues found to be implicated
in the problem, despite the fact that they should have been irrelevant:
"non-ISO-8859-1 characters" being present in the submission,
the presence of hidden fields at the end of the form, and so on; and
changing these details caused the problem to disappear and reappear
without obvious rhyme or reason.
I'm hesitant to go into any great detail here, as it might
turn this page into a detailed bug report rather than the overview
which it's aiming to be.
At any rate, the observation could be seen as an alert that all is not
well with the browser's behaviour in this area.
A search for the subject containing
Fehlerhafter POST request mit EUR-Symbol
in microsoft.* usenet archives for June 2004 should bring
forth a thread (in two parts in Google
Groups) discussing this bug in German.

IE6 was tested for submission of Plane-1 characters
(i.e those above 65535), but it was
impossible to paste them into the form field,
so the test could not be carried out.
This test should be tried again when the Plane-1 fix
described here by i18nguy has been applied.

This browser of course supports the ad hoc _charset_
hidden field (it seems to have been originated by MS, after all).
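
For a server-side script, the point of the _charset_ field is that the browser fills it in with the name of the encoding it actually used. The following is only a minimal sketch in Python of how a script might make use of that, assuming an application/x-www-form-urlencoded submission from a form containing a hidden input named _charset_; the function name decode_form and the utf-8 fallback are my own assumptions, not anything prescribed by a browser or a specification.

```python
import codecs
from urllib.parse import parse_qs

def decode_form(raw_query: str, fallback: str = "utf-8") -> dict:
    """Decode an x-www-form-urlencoded string, honouring a _charset_ field.

    decode_form and its fallback argument are hypothetical names used
    for illustration only.
    """
    # First pass: latin-1 maps every byte to a character, so it cannot fail,
    # and it leaves the ASCII-only charset name readable.
    probe = parse_qs(raw_query, keep_blank_values=True, encoding="latin-1")
    charset = probe.get("_charset_", [fallback])[0] or fallback
    try:
        codecs.lookup(charset)      # reject names Python does not know
    except LookupError:
        charset = fallback
    # Second pass: decode the %xx bytes with the encoding the browser named.
    return parse_qs(raw_query, keep_blank_values=True,
                    encoding=charset, errors="replace")

fields = decode_form("_charset_=windows-1252&price=%80100")
print(fields["price"][0])   # the Euro sign followed by 100
```

A browser which does not support the field will submit it empty (or as whatever the author put in the value attribute), so the fallback still matters.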

Mozilla (and Firefox)

In response to earlier discussion, Ian Graham called attention to RFC2718,
and noted some incompatibility with the existing
heuristic behaviour of browsers.
He cited a Bugzilla
report relating to this issue.

However, that has now been overtaken by events, and Mozilla
(I tried version 1.1) is behaving like MSIE5 used to do (whereas
MSIE5.5 went on to do something subtly different, as already noted).
Mozilla's changed behaviour has been
reported as bug 135762, by Markus Kuhn indeed;
furthermore the discussion has revealed a shortcoming of Bugzilla,
in that it omits to define a character encoding for its own report
pages, tries to store its data as raw bytes, and thus cannot be
trusted to display to the reader the same information as was
submitted by the bug-reporter: oops.

Messy, isn't it?
This whole area needs to be kept under continuous review.
It also has significant impact on the search engine services
such as Google, when i18n content is involved.
As we see, however, Google has moved rapidly to exploit utf-8
both for submission of queries and for presentation of results.
Similar developments are seen in other search engines too.

Mozilla seems to submit a no-break space as a normal space (i.e
as a + sign in the raw submitted string).

Mozilla version 1.7.12 was tested for Plane-1 characters, and
it submitted them consistently for utf-8 (e.g %F0%90%80%80)
as well as for various 8-bit encodings (e.g %26%2365536%3B).
The same applies, not surprisingly, to Firefox 1.5.
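
The two specimen strings can be reproduced with a few lines of Python. This is only a sketch which mimics the observed output: submit_as is a hypothetical helper of mine, and it says nothing about how the browser arrives at the result internally.

```python
from urllib.parse import quote_plus

def submit_as(text: str, charset: str) -> str:
    """Mimic the observed submission of one form value (hypothetical helper).

    Characters outside the repertoire of `charset` are replaced by their
    decimal &#number; reference before the usual %xx encoding is applied.
    """
    raw = text.encode(charset, errors="xmlcharrefreplace")
    return quote_plus(raw, safe="")

first_plane1 = "\U00010000"                   # code point 65536
print(submit_as(first_plane1, "utf-8"))       # %F0%90%80%80
print(submit_as(first_plane1, "iso-8859-1"))  # %26%2365536%3B
```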

Both of the just-mentioned browsers supported the ad hoc
_charset_ hidden field.

The above observations on Windows Mozilla 1.7.12 were later confirmed
with Mozilla 1.7.12 on Scientific Linux 3.0.3, including successful
use of Plane-1 characters (even though no font was available for them!)

Opera 6

Forms input in utf-8 was handled fine.

Submitting with 8-bit-coded forms
basically worked, provided that the characters were within
the repertoire of the selected coding.
If the character can't be represented in the indicated coding,
Opera 6 transmits %3F (i.e a coded question-mark),
with the following exception.

If the page is using iso-8859-1 coding, then characters
in the "Windows area" (128 to 159 decimal)
got submitted as if the coding was Windows-1252 instead,
i.e it transmitted representations like %93:
I am told that this wasn't really their intention, and that
it would be fixed in a subsequent release.
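
As a rough model of what was observed (not of Opera's actual code, which I haven't seen), the behaviour can be imitated in Python as follows. The function name opera6_like is hypothetical, and treating an iso-8859-1 page wholesale as windows-1252 is my simplification of the "Windows area" exception.

```python
from urllib.parse import quote_plus

def opera6_like(text: str, page_charset: str) -> str:
    """Imitate the observed Opera 6 output for one form value (a sketch)."""
    # Observed quirk: on an iso-8859-1 page, characters in the 128-159 area
    # were sent as if the page had been labelled windows-1252.
    if page_charset.lower() == "iso-8859-1":
        charset = "windows-1252"
    else:
        charset = page_charset
    # errors="replace" substitutes "?" for anything outside the repertoire.
    return quote_plus(text.encode(charset, errors="replace"), safe="")

print(opera6_like("\u201c", "iso-8859-1"))   # left "clever" quote: %93
print(opera6_like("\u0416", "iso-8859-2"))   # Cyrillic letter: %3F
```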

I was subsequently sent some details of the internal workings
and can thus add:

Opera supports the accept-charset attribute on the
form element;
if utf-8 is in the list, then it is used; otherwise,
at present, the first charset on the list will be used.

In the absence of an accept-charset on the form,
the coding used for
submission is the same as the coding of the page itself,
as is normal in other browsers.
On Windows, if no coding is specified, then Windows-1252 is
used by default, rather than iso-8859-1; this can be
changed in the Preferences/Languages dialog.
(Linux Opera, on the other hand, uses iso-8859-1 as
its default.)
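
The selection rules just described can be written down as a short Python sketch. The function and parameter names are hypothetical, and the code only restates what I was told about Opera; other browsers may well choose differently.

```python
from typing import Optional

def pick_submission_charset(accept_charset: Optional[str],
                            page_charset: Optional[str],
                            platform_default: str = "windows-1252") -> str:
    """Choose the coding for a form submission, following the rules above."""
    # accept-charset is a list delimited by spaces and/or commas.
    names = (accept_charset or "").replace(",", " ").split()
    if names:
        for name in names:
            if name.lower() == "utf-8":
                return "utf-8"          # utf-8 wins wherever it is listed
        return names[0]                 # otherwise the first one listed
    # No accept-charset: use the page's own coding, or the platform default
    # (windows-1252 on Windows, iso-8859-1 on Linux) if none was specified.
    return page_charset or platform_default

print(pick_submission_charset("iso-8859-2, utf-8", "iso-8859-1"))  # utf-8
print(pick_submission_charset("koi8-r windows-1251", "utf-8"))     # koi8-r
```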

When POST is used, the posted content will only have a
charset added if it is different from the charset of the
page itself.

Opera 8.5

With utf-8 encoding, Opera 8.5 submitted characters as expected,
including tests of characters in Plane 1.

The behaviour of Opera 8.5 with respect to 8-bit encodings
is very similar to Mozilla or MSIE5:
if the character can be represented
in the 8-bit coding then it is sent as its %xx
representation; if it cannot be represented, then it is
submitted as %26%23number%3B, the urlencoded
representation of its &#number; numerical character
reference in HTML.
This did not seem to work for the test characters in
Plane 1, however, which were submitted as %3F%3F (pairs of
question marks).
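
The pair of question marks would be consistent with a browser which handles text internally as UTF-16, and replaces each half of a surrogate pair separately. That is purely my guess at the cause, not something I've been told. Under that assumption, the observed results can be imitated in Python like this (opera85_like is a hypothetical name):

```python
from urllib.parse import quote_plus

def opera85_like(text: str, charset: str) -> str:
    """Imitate the observed Opera 8.5 output for an 8-bit coding (a sketch)."""
    out = bytearray()
    for ch in text:
        try:
            out += ch.encode(charset)          # representable: sent as-is
        except UnicodeEncodeError:
            if ord(ch) > 0xFFFF:
                # Guess: one "?" per UTF-16 surrogate, hence two of them.
                out += b"??"
            else:
                # Otherwise the decimal &#number; reference is sent.
                out += f"&#{ord(ch)};".encode("ascii")
    return quote_plus(bytes(out), safe="")

print(opera85_like("\u20ac", "iso-8859-1"))      # %26%238364%3B
print(opera85_like("\U00010000", "iso-8859-1"))  # %3F%3F
```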

Unlike recent MSIE versions, it does not submit
the character entity names of Latin-1 characters.

It submits no-break space as %A0.

It does not support the ad hoc _charset_ hidden field.

Konqueror

Version: 3.1.3-5.8 Red Hat, on Scientific Linux 3.0.3.

utf-8 encoding: behaved pretty much like Mozilla.

8-bit encoding: characters which could be represented in the encoding
were submitted as %xx; characters which could not be
represented in the encoding were submitted as %3F i.e
as question-mark.

No-break space was submitted as itself, not converted to a normal space.

It did not support the _charset_ hidden field.

Plane-1 characters could not be pasted into the submission field, and
thus could not be tested.

From a page coded in iso-8859-1, as soon as I
paste "clever" quotes into the form they turn into regular quotes,
and the submission (irrespective of the charset of the form) contains
of course the %27 and %22 representations of the plain ascii
characters. In a WWW context this seems to me to be quite reasonable.

Text areas and utf-8

Netscape 4.* versions don't seem to be able to use forms text-input
in any meaningful fashion when utf-8 coding is in use.
It's true that Latin-1 characters can be typed-in (or pasted
in from other windows that are using iso-8859-1 or windows-1252
coding), but that isn't particularly useful, after all, because if
you only wanted Latin-1, you wouldn't be likely to choose utf-8 coding.

The descriptions below are couched in rather sloppy terms, but
should just give a flavour of how wrongly it all behaves.
(Specific examples below relate to the Windows versions of NN4.*,
but I've no reason to suppose the Mac and X versions are any better
in this regard - maybe even worse.)

Paste from utf-8 coded NN into utf-8-encoded form

If I copy some normally-displayed text from a utf-8 coded
Netscape window into a utf-8 coded text area in Netscape, then what
I see in that text area before submitting the form is that each
non-us-ascii character is turned into a bunch of Latin-1
characters. Or to put it another way, the byte-sequence (two or
three bytes) representing each utf-8 coded character is displayed
as if each byte were really a Windows-1252 character, rather than being
one of the bytes in a utf-8 byte-sequence.

So far, so bad! But on submitting the text, it gets even worse, because
each of those bytes then gets coded-up according to the rules of
utf-8 coding. In a sense they have now been "doubly utf-8 coded".
On receipt at the server, provided they were interpreted as utf-8-encoded
characters, they would be interpreted as that sequence of coded
Windows-1252 characters which we saw displayed before submitting
the form.
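
The "double coding" is easy to reproduce in Python, which may make the description less sloppy. This is a sketch of the effect only; the function names are hypothetical, and the example character was picked so that all of its utf-8 bytes are defined in Windows-1252 (a few byte values, such as 0x90, are not).

```python
def nn4_paste(text: str):
    """Model the paste-and-submit behaviour described above (a sketch)."""
    # Each utf-8 byte is displayed as if it were a Windows-1252 character...
    shown = text.encode("utf-8").decode("windows-1252")
    # ...and those characters are then utf-8 coded again on submission.
    submitted = shown.encode("utf-8")
    return shown, submitted

def undo_double_utf8(submitted: bytes) -> str:
    """Reverse the damage, for someone who knows exactly what happened."""
    return submitted.decode("utf-8").encode("windows-1252").decode("utf-8")

shown, submitted = nn4_paste("\u0416")   # Cyrillic capital Zhe: bytes D0 96
print(submitted.hex())                   # c390e28093
print(undo_double_utf8(submitted) == "\u0416")
```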

This is clearly useless! Although one could (knowing the
circumstances) deduce what the original characters were, it would
be pointless to try to deploy this, since the user of the
form has no way of verifying what they are typing into the text area.

Paste from MSIE unicode window into NN4.* utf-8-encoded form

The non-Latin-1 characters get displayed by NN4 as "?", and
submitted as such. Of no practical use whatever.

8-bit non-Latin-1 keyboard input to utf-8-encoded form

Suppose, for example, we switch the keyboard into Russian locale,
and start typing-away into a text area of a utf-8 form in NN4.*.
What we see in the text area are Latin-1 characters!
On submission these are, not surprisingly in terms of what was said
before, then 'properly' coded into utf-8, and at the server (when
decoded) will appear to be precisely those Latin-1 characters that
were seen in the text area (not the Cyrillic characters that
were typed on the keyboard).
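
To put a concrete face on that, here is a Python sketch which assumes that the Russian keyboard delivers windows-1251 bytes (my assumption, not something I verified), and that NN4 takes each byte to be a Latin-1 character. The function name is hypothetical.

```python
def nn4_keyboard_input(typed: str, keyboard_charset: str = "windows-1251"):
    """Model 8-bit keyboard input into a utf-8 form in NN4.* (a sketch)."""
    key_bytes = typed.encode(keyboard_charset)   # what the keyboard delivers
    shown = key_bytes.decode("iso-8859-1")       # what the text area displays
    submitted = shown.encode("utf-8")            # what the server receives
    return shown, submitted

shown, submitted = nn4_keyboard_input("Привет")
print(shown)                                     # Ïðèâåò
print(submitted.decode("utf-8") == shown)        # True: Latin-1 at the server
```

The server could only get back to the Cyrillic text by guessing which 8-bit coding the user's keyboard had been set to.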

Again, this seems to be of no practical use in a WWW context.

Conclusion for NN 4.* versions

For someone developing, say, a multi-script bulletin board,
it is clearly not feasible to use NN4.* in this
way as an input medium via a utf-8-encoded form.
Although NN 4.* is perfectly capable, when properly
configured, of displaying the
utf-8-encoded content, it could not be used in this way for input.

NN 4.* is reputed to be usable (with some caveats which we won't
tangle with here) for input of non-Latin-1 scripts when the form
uses an appropriate 8-bit coding, and (with appropriate code mappings
being used at the server) this could be used by a
suitably-motivated developer to support the input of portions of
text in different scripts (but only one 8-bit repertoire per
submission). The resulting mixed documents could perfectly well
be displayed on NN4.*.
One could be excused however for concluding that this browser
version is inadequate for the purpose, and not worth
the effort of supporting in such an application.

emacs-w3 oddity - a report

RISCOS Oregano

As an example of the sort of thing that can go wrong with minority
platforms, I'm including a summary of a report from Matthew Somerville.

The browser behaves as if it's submitting Latin-1, no matter what
character encoding the page itself is in (for example character 192
will be A-grave regardless, and will be submitted as such).
If "extended" Acorn characters (e.g 148) are input, they aren't
displayed properly, but they are submitted.
(These extended characters are reportedly not
identical to those used
by Windows-1252; clearly it would be unwise as a user to submit these,
but as a script implementer one should be aware that they can
nevertheless appear in submitted data).

Forms in a page in us-ascii

I have to admit that in the original tests, I had not thought
to try submitting the form from an HTML page whose character
encoding (charset) had been explicitly given as
us-ascii.
I later remedied this, for the then-available versions of
Mozilla and MSIE6, but older browsers haven't been tried.

Mozilla and MSIE6 behaved just as described above for other
character encodings: submitted characters which were outside of
the range of the character encoding (i.e in this case,
us-ascii) were represented as %xx-encoded
representations of &#number; (decimal), except that
IE6 represented Latin-1 characters using &entity;
instead.
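
A script receiving such a submission could turn the references back into characters along these lines. This is a sketch in Python using only the standard library; decode_ascii_submission is a hypothetical name. Note that a user who literally typed the text of a reference into the form is indistinguishable from a browser-generated one, so the conversion is inherently a guess.

```python
import html
from urllib.parse import unquote_plus

def decode_ascii_submission(raw_value: str) -> str:
    """Decode one %xx-encoded value submitted from a us-ascii page (a sketch)."""
    # Undo the %xx and + coding; the result should be pure us-ascii.
    text = unquote_plus(raw_value, encoding="ascii", errors="replace")
    # Resolve both &#number; references and &entity; names.
    return html.unescape(text)

print(decode_ascii_submission("caf%26eacute%3B+%26%238364%3B5"))   # café €5
```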

Other resources

The Wikipedia Help pages include a presentation of some character
submission encoding problems which they experienced in various browsers.

As has been noted in the earlier discussion of Bugzilla bug #135762,
the Bugzilla database has been
allowed to develop with a mix of different submitted
encodings, without any kind of labelling in the database, which
seems to mean that no kind of automatic rescue of the existing data
would now be possible: a solution can be devised for future submissions,
sure, but if anything is to be done for the existing content, it would
need a tedious and error-prone editorial trawl through all of the
data.

The lesson to be drawn by anyone who is proposing to set up an
i18n-capable forum on any kind of scale, should be fairly obvious:
get this sorted out in the original design
before you start accumulating content -
don't leave it until the faults become evident, and it proves to
be impractical to repair the previously-submitted content.

Recommendation as of 2005

It is evident that, with reasonably current browsers available
as of 2005, the best results are achieved by submitting forms from
an HTML page whose encoding is utf-8, and this is
confirmed by its widespread usage in search engine query pages etc.
at this time.
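
On the server side, a script following this approach would do well to check that what arrives really is utf-8, rather than silently storing whatever bytes an older browser happened to send. A minimal sketch in Python, assuming an application/x-www-form-urlencoded submission; the function name is hypothetical, and what to do with a rejected submission is left open.

```python
from typing import Optional
from urllib.parse import parse_qs

def parse_utf8_form(raw_query: str) -> Optional[dict]:
    """Parse a submission strictly as utf-8; return None if it isn't (a sketch)."""
    try:
        return parse_qs(raw_query, keep_blank_values=True,
                        encoding="utf-8", errors="strict")
    except UnicodeDecodeError:
        # The %xx bytes are not valid utf-8: the browser evidently submitted
        # in some other coding, so don't store the data unlabelled.
        return None

print(parse_utf8_form("q=%C3%A9"))   # {'q': ['é']}
print(parse_utf8_form("q=%E9"))      # None
```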

However, this doesn't work with certain older browsers, most
notoriously Netscape 4.* versions for characters outside of
(broadly speaking) the Latin-1 repertoire; if anything more
ambitious were needed for those older browsers,
it would be necessary to offer users
a choice of relevant 8-bit encoding(s), with users guided
to the choice of an encoding appropriate to the language script
which they intend to submit.
Pretty much, in fact, what the search engine query pages were doing,
some years back.
But, considering that the popular search engines no longer seem
to consider it worthwhile to offer this kind of option (not even
when they are called from NN4), you might feel it wasn't worth
the effort either. Your choice, really.