HTML5

This is a work in
progress! For the latest updates from the HTML WG, possibly
including important bug fixes, please look at the editor's draft instead.

2.7 Fetching resources

When a user agent is to fetch a resource or
URL, optionally from an origin origin,
and optionally with a synchronous flag, a manual redirect
flag, a force same-origin flag, and/or a block cookies
flag, the following steps must be run. (When a URL is
to be fetched, the URL identifies a resource to be obtained.)

Let document be the appropriate
Document as given by the following list:

Remove any <fragment>
component from the generated address of the resource from which
Request-URIs are obtained.

If the origin of the appropriate
Document is not a scheme/host/port tuple, then the
Referer (sic) header must be
omitted, regardless of its value.
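
As a rough illustration of the two Referer requirements above, the following minimal Python sketch strips the fragment from a document address and omits the header when the origin is not a scheme/host/port tuple. The function name referer_for and the origin_is_tuple parameter are assumptions made for this example; they are not defined by this specification.

```python
from typing import Optional
from urllib.parse import urldefrag


def referer_for(document_address: str, origin_is_tuple: bool = True) -> Optional[str]:
    """Return a Referer value for a request, or None if the header is omitted.

    origin_is_tuple is a hypothetical flag standing in for "the origin of the
    appropriate Document is a scheme/host/port tuple".
    """
    if not origin_is_tuple:
        # The header is omitted regardless of what its value would have been.
        return None
    # Remove any fragment component from the generated address.
    address, _fragment = urldefrag(document_address)
    return address
```

For example, referer_for("http://example.com/a?b=1#frag") gives "http://example.com/a?b=1".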

If the algorithm was not invoked with the synchronous
flag, perform the remaining steps asynchronously.

This is the main step.

If the resource is identified by an absolute URL,
and the resource is to be obtained using an idempotent action
(such as an HTTP GET or
equivalent), and it is already being downloaded for other
reasons (e.g. another invocation of this algorithm), and this
request would be identical to the previous one (e.g. same Accept and Origin headers), and the user agent is
configured such that it is to reuse the data from the existing
download instead of initiating a new one, then use the results of
the existing download instead of starting a new one.

Otherwise, if the resource is identified by an absolute
URL with a scheme that does not define a mechanism to
obtain the resource (e.g. it is a mailto:
URL) or that the user agent does not support, then act as if the
resource was an HTTP 204 No Content response with no other
metadata.

Otherwise, if the resource is identified by the
URL about:blank, then the
resource is immediately available and consists of the empty
string, with no metadata.

Otherwise, at a time convenient to the user and the user agent,
download (or otherwise obtain) the resource, applying the
semantics of the relevant specifications (e.g. performing an HTTP
GET or POST operation, or reading the file from disk, dereferencing javascript: URLs,
etc).

For the purposes of the Referer (sic) header, use the
address of the resource from which Request-URIs are
obtained generated in the earlier step.

For the purposes of the Origin
header, if the fetching algorithm was
explicitly initiated from an origin, then the origin that initiated the HTTP request is origin. Otherwise, this is a request from
a "privacy-sensitive" context. [ORIGIN]
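
The order of the alternatives in the main step can be sketched as follows. This is only an illustrative Python outline under stated assumptions: the Resource class, the SUPPORTED_SCHEMES set, the in_flight dictionary, and the download callback are all hypothetical names, and keying in-flight downloads by URL alone is a simplification (it does not compare request headers such as Accept and Origin).

```python
from dataclasses import dataclass, field
from typing import Callable, Optional
from urllib.parse import urlsplit

# Hypothetical set of schemes this toy user agent can obtain resources for.
SUPPORTED_SCHEMES = {"http", "https", "file", "about"}


@dataclass
class Resource:
    status: Optional[int] = None          # HTTP-equivalent status, if any
    body: bytes = b""
    metadata: dict = field(default_factory=dict)


def main_step(url: str,
              download: Callable[[str], Resource],
              in_flight: Optional[dict] = None,
              idempotent: bool = True) -> Resource:
    in_flight = in_flight if in_flight is not None else {}
    # 1. Reuse an existing download for an identical, idempotent request.
    if idempotent and url in in_flight:
        return in_flight[url]
    # 2. Unsupported scheme, or one with no retrieval mechanism (e.g. mailto:):
    #    act as if the resource was an HTTP 204 response with no metadata.
    if urlsplit(url).scheme not in SUPPORTED_SCHEMES:
        return Resource(status=204)
    # 3. about:blank is immediately available as the empty string.
    if url == "about:blank":
        return Resource()
    # 4. Otherwise, actually obtain the resource.
    return download(url)
```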

If the algorithm was not invoked with the block cookies
flag, and there are cookies to be set, then the user agent
must run the following substeps:

If the force same-origin flag is set and the
URL of the target of the redirect does not have the
same origin as the URL for which the
fetch algorithm was invoked

Abort these steps and return failure from this algorithm, as
if the remote host could not be contacted.

If the manual redirect flag is set

Continue, using the fetched resource (the redirect) as the
result of the algorithm.

Otherwise

First, apply any relevant requirements for redirects (such as
showing any appropriate prompts). Then, redo the main step,
but using the target of the redirect as the resource to fetch,
rather than the original resource.

The HTTP specification requires that 301, 302,
and 307 redirects, when applied to methods other than the safe
methods, not be followed without user confirmation. That would
be an appropriate prompt for the purposes of the requirement in
the paragraph above. [HTTP]
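
The three redirect cases above could be expressed as a small decision function. The Python sketch below is illustrative only: origin_of is a simplified stand-in for the origin of a URL (it only knows default ports for http and https), and the returned strings are hypothetical labels rather than anything this specification defines.

```python
from urllib.parse import urlsplit

# Simplifying assumption: only these default ports are known.
DEFAULT_PORTS = {"http": 80, "https": 443}


def origin_of(url: str):
    """Approximate a URL's origin as a (scheme, host, port) tuple."""
    parts = urlsplit(url)
    port = parts.port or DEFAULT_PORTS.get(parts.scheme)
    return (parts.scheme, parts.hostname, port)


def handle_redirect(original_url: str, target_url: str,
                    force_same_origin: bool = False,
                    manual_redirect: bool = False) -> str:
    # Case 1: cross-origin redirect while the force same-origin flag is set.
    if force_same_origin and origin_of(target_url) != origin_of(original_url):
        return "failure"          # as if the remote host could not be contacted
    # Case 2: the caller wants to handle the redirect itself.
    if manual_redirect:
        return "return-redirect"  # the redirect is the result of the algorithm
    # Case 3: apply redirect requirements, then redo the main step.
    return "follow"
```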

If the algorithm was not invoked with the synchronous
flag: When the resource is available, or if there is an error
of some description, queue a task that uses the
resource as appropriate. If the resource can be processed
incrementally, as, for instance, with a progressively interlaced
JPEG or an HTML file, additional tasks may be queued to process
the data as it is downloaded. The task source for
these tasks is the
networking task source.

Otherwise, return the resource or error information to the
calling algorithm.

If the user agent can determine the actual length of the resource
being fetched for an instance of this
algorithm, and if that length is finite, then that length is the
file's size. Otherwise, the
subject of the algorithm (that is, the resource being fetched) has
no known size. (For
example, the HTTP Content-Length header might
provide this information.)

The user agent must also keep track of the number of bytes downloaded for
each instance of this algorithm. This number must exclude any
out-of-band metadata, such as HTTP headers.
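
One way to keep this bookkeeping is sketched below in Python. The FetchProgress class and its attribute names are assumptions for illustration; the sketch takes the size from a Content-Length header when one is present and well-formed, and counts only body bytes.

```python
from typing import Mapping, Optional


class FetchProgress:
    """Tracks the size and bytes downloaded for one fetch (hypothetical helper)."""

    def __init__(self, headers: Mapping[str, str]):
        raw = headers.get("Content-Length")
        # The size is known only if the header carries a plain decimal length.
        self.size: Optional[int] = (
            int(raw) if raw is not None and raw.strip().isdigit() else None
        )
        self.bytes_downloaded = 0

    def receive(self, chunk: bytes) -> None:
        # Only body bytes are counted; headers never pass through here.
        self.bytes_downloaded += len(chunk)
```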

The navigation
processing model handles redirects itself, overriding the
redirection handling that would be done by the fetching
algorithm.

Whether the type
sniffing rules apply to the fetched resource depends on the
algorithm that invokes the rules — they are not always
applicable.

2.7.1 Protocol concepts

User agents can implement a variety of transfer protocols, but
this specification mostly defines behavior in terms of HTTP. [HTTP]

The HTTP GET
method is equivalent to the default retrieval action of the
protocol (for example, RETR in FTP). Such actions are idempotent and
safe, in HTTP terms.

The HTTP response
codes are equivalent to statuses in other protocols that have
the same basic meanings. For example, a "file not found" error is
equivalent to a 404 code, a server error is equivalent to a 5xx
code, and so on.

The HTTP
headers are equivalent to fields in other protocols that have
the same basic meaning. For example, the HTTP authentication
headers are equivalent to the authentication aspects of the FTP
protocol.

2.7.2 Encrypted HTTP and related security concerns

Anything in this specification that refers to HTTP also applies
to HTTP-over-TLS, as represented by URLs
that use the https scheme.

User agents should report certificate errors to
the user and must either refuse to download resources sent with
erroneous certificates or must act as if such resources were in fact
served with no encryption.

User agents should warn the user that there is a potential
problem whenever the user visits a page that the user has previously
visited, if the page uses less secure encryption on the second
visit.

Not doing so can result in users not noticing man-in-the-middle
attacks.

If a user connects to a server with a self-signed certificate,
the user agent could allow the connection but just act as if there
had been no encryption. If the user agent instead allowed the user
to override the problem and then displayed the page as if it was
fully and safely encrypted, the user could be easily tricked into
accepting man-in-the-middle connections.

If a user connects to a server with full encryption, but the
page then refers to an external resource that has an expired
certificate, then the user agent will act as if the resource was
unavailable, possibly also reporting the problem to the user. If
the user agent instead allowed the resource to be used, then an
attacker could just look for "secure" sites that used resources
from a different host and only apply man-in-the-middle attacks to
that host, for example taking over scripts in the page.

If a user bookmarks a site that uses a CA-signed certificate,
and then later revisits that site directly but the site has started
using a self-signed certificate, the user agent could warn the user
that a man-in-the-middle attack is likely underway, instead of
simply acting as if the page was not encrypted.
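
The downgrade warning described above could be approximated as follows. This Python sketch rests on several assumptions: the Security ranking, the VisitHistory class, and the choice to compare against the strongest level previously seen for a page are all hypothetical design decisions, not requirements of this specification.

```python
from enum import IntEnum


class Security(IntEnum):
    # Hypothetical ranking of how strongly a page was protected on a visit.
    NONE = 0
    SELF_SIGNED = 1
    CA_SIGNED = 2


class VisitHistory:
    def __init__(self):
        self._best_seen = {}

    def visit(self, page_url: str, level: Security) -> bool:
        """Record a visit; return True if the user should be warned."""
        previous = self._best_seen.get(page_url)
        # Remember the strongest protection ever seen for this page.
        self._best_seen[page_url] = level if previous is None else max(previous, level)
        # Warn when a previously visited page now uses weaker protection.
        return previous is not None and level < previous
```

With this sketch, a page first visited with a CA-signed certificate and later served with a self-signed one would trigger a warning on the second visit.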

2.7.3 Determining the type of a resource

ISSUE-125 (charset-vs-quotes) and ISSUE-126 (charset-vs-backslashes) block progress to Last Call

The Content-Type metadata of a
resource must be obtained and interpreted in a manner consistent
with the requirements of the Media Type Sniffing
specification. [MIMESNIFF]

The sniffed type of a
resource must be found in a manner consistent with the
requirements given in the Media Type Sniffing
specification for finding the sniffed-type of the relevant
sequence of octets. [MIMESNIFF]

The rules for sniffing
images specifically and the rules for distinguishing if a resource is text or
binary are also defined in the Media Type Sniffing
specification. Both sets of rules return a MIME type as
their result. [MIMESNIFF]

It is imperative that the rules in the
Media Type Sniffing specification be followed
exactly. When a user agent uses different heuristics for content
type detection than the server expects, security problems can
occur. For more details, see the Media Type Sniffing
specification. [MIMESNIFF]

The algorithm for extracting an encoding from a
Content-Type, given a string s, is as
follows. It either returns an encoding or nothing.

Let position be a pointer into s, initially pointing at the start of the
string.

Loop: Find the first seven characters in s after position that are an
ASCII case-insensitive match for the word "charset". If no such match is found, return nothing
and abort these steps.

Skip any U+0009, U+000A, U+000C, U+000D, or U+0020
characters that immediately follow the word "charset" (there might not be any).

If the next character is not a U+003D EQUALS SIGN ('='),
then move position to point just before that
next character, and jump back to the step labeled
loop.

Skip any U+0009, U+000A, U+000C, U+000D, or U+0020
characters that immediately follow the equals sign (there might not
be any).

Process the next character as follows:

If it is a U+0022 QUOTATION MARK ('"') and there is a later U+0022 QUOTATION MARK ('"') in s

If it is a U+0027 APOSTROPHE ("'") and there is a later U+0027 APOSTROPHE ("'") in s

Return the encoding corresponding to the string between this character and the next earliest occurrence of this character.

If it is an unmatched U+0022 QUOTATION MARK ('"')

If it is an unmatched U+0027 APOSTROPHE ("'")

If there is no next character

Return nothing.

Otherwise

Return the encoding corresponding to the string from this
character to the first U+0009, U+000A, U+000C, U+000D, U+0020, or
U+003B character or the end of s, whichever
comes first.

This requirement is a willful violation
of the HTTP specification (for example, HTTP doesn't allow the use
of single quotes and requires supporting a backslash-escape
mechanism that is not supported by this algorithm), motivated by the need for
backwards compatibility with legacy content. [HTTP]
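
The extraction steps above can be sketched in Python as follows. This is a minimal, non-normative illustration: the function name extract_charset_label is an assumption, and instead of resolving the extracted string to an encoding the sketch simply returns the label (or None for "nothing").

```python
import re
from typing import Optional

# ASCII case-insensitive match for the word "charset".
_CHARSET = re.compile("charset", re.IGNORECASE | re.ASCII)
# U+0009, U+000A, U+000C, U+000D, U+0020
_SKIPPED = "\t\n\f\r "


def _skip_whitespace(s: str, i: int) -> int:
    while i < len(s) and s[i] in _SKIPPED:
        i += 1
    return i


def extract_charset_label(s: str) -> Optional[str]:
    position = 0
    while True:
        # Loop: find the next "charset" at or after position.
        match = _CHARSET.search(s, position)
        if match is None:
            return None
        i = _skip_whitespace(s, match.end())
        if i < len(s) and s[i] == "=":
            break
        # Not an equals sign: resume the search just before that character.
        position = i
    i = _skip_whitespace(s, i + 1)
    if i >= len(s):
        return None                      # there is no next character
    first = s[i]
    if first in "\"'":
        closing = s.find(first, i + 1)
        # A matched quote yields the text between the quotes; unmatched yields nothing.
        return s[i + 1:closing] if closing != -1 else None
    # Unquoted value: runs up to whitespace, a semicolon, or the end of s.
    end = i
    while end < len(s) and s[end] not in _SKIPPED + ";":
        end += 1
    return s[i:end]
```

For example, extract_charset_label('text/html; charset = "ISO-8859-1"') returns 'ISO-8859-1', while an unmatched quote such as 'text/html; charset="utf-8' returns None.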