Larry Aronson describes how the Web works, provides a bit of Web history for context, explains the client/server architecture of the Web and how it is hyperlinked, presents the Web Bestiary of acronyms and definitions and discusses the philosophy and implications of HTML5.

HTML is the framework of the Web. This chapter describes how the Web works and provides a bit of Web history for context. You will learn about the client/server architecture of the Web and how it is hyperlinked. I'll present the Web Bestiary of acronyms and definitions and discuss the philosophy and implications of HTML5.

Although this chapter is about the Web and HTML, it actually contains very little HTML. If you want to get right into learning the HTML language, skip this chapter and go to the next. You can come back here later to help consolidate what you have learned.

HTML: The Language of the Web

HyperText Markup Language (HTML) is the language of the Web. If you could listen to the conversation between your computer and the websites you visit, you would hear HTML spoken. Web servers accept requests from your browser as you visit and interact with the sites they host. In reply, the servers return marked-up content that your browser formats into the web page you see. Web servers also send requests to each other, gathering and exchanging data that power search engines and make a rich variety of social and commercial transactions possible.

HTML is not a programming language like C, Perl, or Ruby. HTML is a semantic language for marking up text. The markup provides a description of the content that Web browsers use to construct the corresponding web page. Links are defined in HTML. This ability to have active references in a document to other documents, no matter where they are physically located, is very powerful. All of the Web's resources are addressable using a Uniform Resource Locator (URL). Any information can be easily located and linked with related content, creating frictionless connectivity.
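To make the idea of addressable resources concrete, here is a minimal sketch of how a user agent might take a URL apart and resolve a link found in a page. It assumes Python's standard urllib.parse module, which is not part of this chapter. The addresses are hypothetical and used only for illustration.

```python
from urllib.parse import urljoin, urlsplit

# Hypothetical address, used only for illustration.
page_url = "http://www.example.com/books/html5/chapter1.html#history"

parts = urlsplit(page_url)
print(parts.scheme)    # protocol used to fetch the resource
print(parts.netloc)    # host that serves it
print(parts.path)      # location of the document on that host
print(parts.fragment)  # named spot inside the document

# Links in a page are often written relative to the page itself;
# a user agent resolves them against the page's own URL.
next_chapter = urljoin(page_url, "chapter2.html")

# A link that is already a complete URL points wherever it says,
# no matter which page it appears in.
other_site = urljoin(page_url, "http://www.example.org/index.html")

print(next_chapter)
print(other_site)
```

Either kind of link ends up as a complete URL, which is why a document can refer to another document regardless of where it is physically stored.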

The Web hosts many protocols and practices, but HTML is the foundation, providing the basic language to mark up text content into a structured document by describing the roles and attributes of its various elements. A companion technology, Cascading Style Sheets (CSS), lets you select document elements and apply styling rules for presentation. CSS rules can be mixed into the HTML code or can reside in external files that can be employed across an entire website. This keeps content creators and site designers from stepping all over each other's work. HTML describes the page's content elements, and CSS tells the browser how they should look (or sound). The browser can override the CSS instructions or ignore them.

Example 1.1 creates a very simple web page. You can copy this HTML code into a plain text file on your computer and open it in any browser. Give it a filename ending in the extension .html.

Example 1.1: HTML for a very simple web page

<!DOCTYPE html>
<html>
<head>
  <title>Example 1.1</title>
  <style type="text/css">
    h1 { text-align: center; }
  </style>
</head>
<body>
  <h1>Hello World Wide Web</h1>
  <p>
    Welcome to the first of many webpages.
    I promise they will get more interesting than this.
  </p>
</body>
</html>

The code in Example 1.1 (shown in boldface) consists of two parts: a document body containing the page's content, preceded by a head section that contains information about the document. In this example, the head section contains the document's title and a CSS style rule to center the page's heading. The body consists of a level 1 heading followed by a paragraph. The result should look something like Figure 1.1.
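One way to see this two-part structure is to let a program read the markup instead of a browser. The following is a minimal sketch, assuming Python's standard html.parser module. The class name OutlinePrinter is made up for this illustration, and the markup is a copy of Example 1.1.

```python
from html.parser import HTMLParser

# A copy of the markup from Example 1.1.
EXAMPLE_1_1 = """<!DOCTYPE html>
<html>
<head>
<title>Example 1.1</title>
<style type="text/css">h1 { text-align: center; }</style>
</head>
<body>
<h1>Hello World Wide Web</h1>
<p>
Welcome to the first of many webpages.
I promise they will get more interesting than this.
</p>
</body>
</html>"""


class OutlinePrinter(HTMLParser):
    """Records each element name, indented by how deeply it is nested."""

    def __init__(self):
        super().__init__()
        self.depth = 0
        self.outline = []

    def handle_starttag(self, tag, attrs):
        self.outline.append("  " * self.depth + tag)
        self.depth += 1

    def handle_endtag(self, tag):
        self.depth -= 1


printer = OutlinePrinter()
printer.feed(EXAMPLE_1_1)
print("\n".join(printer.outline))
```

The printed outline shows html at the top level with head and body nested inside it. The title and style elements sit under head, and the heading and paragraph sit under body.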

This brings up a fundamental principle about how the Web works: Web authors should not make assumptions about their readers, the characteristics of their display devices, or their formatting preferences. This is especially important with mobile Web users and people with visual disabilities. A Web author or developer shouldn't even assume that a site visitor is human! Websites are constantly visited by automated programs that gather and catalog information about the Web. The general term user agent is used to describe any software application or program that can talk to a web server. A modern website regards visits from all user agents with the same importance as human visitors using Web browsers. The best approach is to keep the HTML simple so that it provides a semantic description of the various content elements and leaves the presentation details to the reader.

The other major player on the Web programming team is JavaScript, a programming language that runs inside a browser and manipulates HTML page elements in response to user actions and other events. Other scripting languages exist, but JavaScript is the most popular, and its syntax and terms are used in the HTML5 specification. Like CSS, JavaScript code can be embedded within the HTML source code of a web page or can be imported from a separate file. User agents other than browsers generally ignore JavaScript and other embedded executable code, because running code from an unknown page can be dangerous for robots.

Robots?!

Robots are a very important class of Web user. They are automated computer programs that run on Internet servers and visit web pages the same way people do using a browser. But instead of presenting the page, the robot analyzes it, stores information about the page in a database, and decides what page to visit next using that information. This is how Google, Yahoo!, Bing, and other search engines work. Other robots perform similar data collection for marketing and academic purposes. Robots are often called "spiders" because of how they seem to "crawl" over the Web from one link to the next. Also, there are malicious robots. These automatic programs leave spam comments on blogs or look for security loopholes to gain control of resources with which they should not be messing. Bad robots!
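The visit, analyze, store, and move-on cycle described above can be sketched in a few lines. This is a toy illustration under stated assumptions, not how any real search engine works. It uses Python's standard html.parser and urllib.parse modules, and the names PageAnalyzer, crawl, and PAGES are hypothetical. The "Web" here is a two-page dictionary, so the sketch runs without a network connection.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class PageAnalyzer(HTMLParser):
    """Collects the title and outgoing links of one page, as a robot might."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.title = ""
        self.links = []
        self._in_title = False

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self._in_title = True
        elif tag == "a":
            href = dict(attrs).get("href")
            if href:
                # Resolve relative links against the page's own URL.
                self.links.append(urljoin(self.base_url, href))

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        if self._in_title:
            self.title += data


def crawl(pages, start_url):
    """Visit pages link by link; return {url: title} as a stand-in database."""
    database = {}
    to_visit = [start_url]
    while to_visit:
        url = to_visit.pop(0)
        if url in database or url not in pages:
            continue  # already cataloged, or outside our pretend Web
        analyzer = PageAnalyzer(url)
        analyzer.feed(pages[url])
        database[url] = analyzer.title
        to_visit.extend(analyzer.links)
    return database


# A pretend two-page Web with hypothetical addresses.
PAGES = {
    "http://www.example.com/index.html":
        "<html><head><title>Home</title></head>"
        '<body><a href="about.html">About</a></body></html>',
    "http://www.example.com/about.html":
        "<html><head><title>About</title></head>"
        '<body><a href="index.html">Home</a></body></html>',
}

catalog = crawl(PAGES, "http://www.example.com/index.html")
print(catalog)
```

Because the two pages link to each other, the check against the database is what keeps the robot from crawling in circles.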

When creating content for the Web, you generally are not concerned with any of this. Most of the HTML structure that deals with browsers, robots, and widgets is supplied by the Web editing software you use or by server-side scripts and template systems. If you are editing content directly online, all you need to understand is how to mark up the content with simple HTML elements. Web developers (that is, programmers as opposed to authors) need to fully understand how the three principal components, HTML, CSS, and scripting, work together to form the framework of the Web (see Figure 1.2).

By the way, did I mention that all of this is essentially free? It is free in two senses of the word. It's free because there is no acquisition cost, and free because you can use it for your own purposes. With only minor limitations, all the HTML, CSS, and scripting that go into a Web page are available for you to examine, copy, and reuse. Tim Berners-Lee, the inventor of HTML, the URL, and HTTP (the protocol that web servers and user agents use to talk to each other), put all these components into the public domain. Working at CERN, the European Organization for Nuclear Research, he was trying to find a better way for large teams of researchers, working in different countries with different word processors, to quickly publish research papers. Patent rights and Nobel Prizes were at stake. In a post to the alt.hypertext newsgroup on August 6, 1991, which was effectively the Web's birth announcement, Berners-Lee wrote:

The WWW project was started to allow high energy physicists to share data, news, and documentation. We are very interested in spreading the web to other areas, and having gateway servers for other data. Collaborators welcome!

Twenty years later, Berners-Lee is still very much involved in the evolution of the Web as head of the World Wide Web Consortium (W3C). I stress "evolution" here to point out that, while the Web has transformed society, freeing us to work and play in a global sea of information, a lot of that happened by accident. HTML is still a work in progress.