Chapter 3: Learn to Program HTML in 21 Minutes

The TV show Hard Copy featured a 10-year-old boy one night. His psychotic mother wouldn't take her meds and was beating him up. He wanted to live with his father but the judge wouldn't change his custody arrangement. So the 10-year-old kid built a Web site to encourage Internetters to contact the judge in support of a change in custody.

I tell this story to my friends who ask me for help in building their static HTML Web sites: "The abused 10-year-old got his site to work; I think you can, too."

If they persist, I tell them to find a page that they like, choose Save As in Netscape, then edit the text with the editor of their choice. I've never known anyone who couldn't throw together a simple page in under 21 minutes.

"One of Our Local Webmasters"

Having said all of that, now I'm going to explain how to write HTML. Why? It all started the day that Jim Clark, chairman of Netscape, came to MIT to give a Laboratory for Computer Science Distinguished Lecture. In previous years, the lecturers had been grizzled researchers who'd toiled anonymously for decades at places like Bell Labs and Stanford. In 1996, we had two billionaires: Bill Gates and Jim Clark. Before the lecture, Michael Dertouzos, the director of our lab, was touring Clark around the CS building. Clark stopped in the hallway outside my office to ask some questions about a framed photograph. The official hosts came into my office and dragged me away from my terminal to tell Jim Clark how I'd taken a picture of a waterfall in New Hampshire.

This was my big moment; I was being introduced to the one man in the world with enough power to fix everything wrong with the Web standards. I was sure that Dertouzos would tell him about what a computer science genius I was. He was going to talk about my old idea to add semantic tags to HTML documents, about the work I'd done to make medical record databases talk to each other to support Internet-wide epidemiology, about all the collaboration systems I'd built that hundreds of Web sites around the world were using.

"This is Philip Greenspun," the lab director began, "one of our local webmasters."

Yeah.

Anyway, part of a real webmaster's job is to assist new users in getting their pages together. So in that spirit, here is my five-minute HTML tutorial.

You May Already Have Won $1 Million

Then again, maybe not. But at least you already know how to write legal HTML:

My Samoyed is really hairy.

That is a perfectly acceptable HTML document. Type it up in a text editor, save it as index.html, and put it on your Web server. A Web server can serve it. A user with Netscape Navigator can view it. A search engine can index it.

Suppose you want something more expressive. You want the word really to be in italic type:

My Samoyed is <I>really</I> hairy.

HTML stands for Hypertext Markup Language. The <I> is markup. It tells the browser to start rendering words in italics. The </I> closes the <I> element and stops the italics. If you want to be more tasteful, you can tell the browser to emphasize the word really:

My Samoyed is <EM>really</EM> hairy.

Most browsers use italics to emphasize, but some use boldface, and browsers for ancient ASCII terminals (e.g., Lynx) have to ignore this tag or come up with a clever rendering method. A picky user with the right browser program can even customize the rendering of particular tags.

There are a few dozen more tags in HTML. You can learn them by
choosing View Source from Netscape Navigator when visiting sites whose
formatting you admire. This is usually how I learn markup. You can learn
them by visiting Web Tools Review and clicking
through to one of the comprehensive online HTML guides. Or you can buy
HTML: The Definitive Guide
(Musciano and Kennedy,
O'Reilly, 1996).

Document Structure

Armed with a big pile of tags, you can start strewing them among your words more or less at random. Though browsers are extremely forgiving of technically illegal markup, it is useful to know that an HTML document officially consists of two pieces: the head and the body. The head contains information about the document as a whole, such as the title. The body contains information to be displayed by the user's browser.

Another structural issue is that you should try to close every element that you open. So if your document has a <BODY> then it should have a </BODY> at the end. If you start an HTML table with a <TABLE> and don't have a </TABLE>, Netscape Navigator may display nothing. Elements can be nested, but you should close the most recently opened element before the ones that enclose it. For example, for something both boldface and italic:

My Samoyed is <B><I>really</I></B> hairy.

Something that confuses a lot of new users is that the <P> element used to surround a paragraph has an optional closing tag </P>. Browsers by convention assume that an open <P> element is implicitly closed by the next <P> element. This leads a lot of publishers (including lazy old me) to use <P> elements as paragraph separators.
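For example, the following two fragments render the same way in most browsers; the first uses <P> as a separator, the second closes each paragraph explicitly:

```html
First paragraph about my Samoyed.
<P>
Second paragraph about my Samoyed.

<P>First paragraph about my Samoyed.</P>
<P>Second paragraph about my Samoyed.</P>
```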

Figure 3-2 shows the source code with which I usually start a Web document. Though saving this code in a file named "something.html" will cause my Web server program to tell browsers that this is an HTML document, the <HTML> element at the top provides insurance. Note that this element is closed at the end of the document.

I put in a <HEAD> element mostly so that I can legally use the <TITLE> element to give this document a name. Whatever text I place between <TITLE> and </TITLE> will appear at the top of the user's Netscape window, on his Go menu, and in his bookmarks menu should he bookmark this page. After closing the head with a </HEAD>, I open the body of the document with a <BODY> element, to which I've added some parameters to manually set the background to white and the text to black. Some Web browsers default to a gray background, and the resulting lack of contrast between background and text offends me so much that I abandon most of my principles and change the colors manually.

Just below the body, I have a headline, size 2, wrapped in an <H2> element. This will be displayed to the user at the top of the page. I probably should use <H1> but browsers typically render that in a font too huge even for my bloated ego. Underneath the headline, I'll often put "by Philip Greenspun" or something else indicating authorship, perhaps with a link to the full work. After that comes a horizontal rule tag: <HR>. The one really good piece of advice I've gotten from a graphic designer was Dave Siegel's admonition against overuse of horizontal rules. I use <H3> headlines in the text to separate sections and only put an <HR> at the very bottom of the document.

Underneath the last <HR>, I sign my document with "philg@mit.edu" (my email address, unchanged since 1976). The <ADDRESS> element usually results in an italics rendering. The <A HREF= says "this is a hyperlink." If the reader clicks anywhere from here up until the </A>, the browser should send him to http://philip.greenspun.com/. I think that all documents on the Web should have a signature like this. Readers expect that they can scroll to the bottom of a browser window and find out who is responsible for what they've just read.
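Putting the pieces described above together, the skeleton looks something like this (a sketch of the kind of code in Figure 3-2, not a verbatim copy; the title text is made up):

```html
<HTML>
<HEAD>
<TITLE>Dogs, Cars, and Waterfalls</TITLE>
</HEAD>
<BODY BGCOLOR=#ffffff TEXT=#000000>
<H2>Dogs, Cars, and Waterfalls</H2>

by <A HREF="http://philip.greenspun.com/">Philip Greenspun</A>

<HR>

... body text, with <H3> headlines separating sections ...

<HR>
<ADDRESS>
<A HREF="http://philip.greenspun.com/">philg@mit.edu</A>
</ADDRESS>
</BODY>
</HTML>
```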

Note: I could make the signature an <a href="mailto:philg@mit.edu"> link so that the reader's browser would pop up a new e-mail message window. I did it that way for a few years, but then I realized that a lot of users were asking, "Where can I find x on your site?" So instead I direct them to my home page, which functions as a sort of table of contents for my site and which also has a link to my full-text search engine. I guess I should probably change my signature to read "Philip Greenspun" instead of "philg@mit.edu", but I haven't.

Now That You Know How to Write HTML, Don't

"Owing to the neglect of our defences and the mishandling of the German problem in the last five years, we seem to be very near the bleak choice between War and Shame. My feeling is that we shall choose Shame, and then have War thrown in a little later, on even more adverse terms than at present."

-Winston Churchill in a letter to Lord Moyne, 1938 [Gilbert 1991]

HTML represents the worst of two worlds. We could have taken a formatting language and added hypertext anchors so that users had beautifully designed documents on their desktops. We could have developed a powerful document structure language so that browsers could automatically do intelligent things with Web documents. What we actually have with HTML is a hybrid: ugly documents without formatting or structural information.

Eventually the Web will work like a naïve user would expect it to. You ask your computer to find you the cheapest pair of blue jeans being hawked on the World Wide Web and ten seconds later you're staring at a photo of the product and being asked to confirm the purchase. You see an announcement for a concert and click a button on your Web browser to add the date to your calendar; the information gets transferred automatically. More powerful formatting isn't far off, either. Eventually there will be browser-independent ways to render the average novel readably.

None of this will happen without radical changes to HTML, however. We'll need semantic tags so that publishers can say, in a way that a computer can understand, "this page sells blue jeans," and "the price of these jeans is $25 U.S." Whether we need them or not, we are sure to get new formatting tags with every new generation of browser. (Personally I can't wait to be able to caption photographs and figures, an idea that was new in word processing programs of the 1960s.)

If the information that you are publishing is at all structured, it doesn't make sense to store it in HTML files. You are throwing all of that structure away. When HTML version 7.3 comes out, you'll have to manually edit 1,000 files to take advantage of the new features.

What you need is a database. You don't need a scary huge relational database management system like I discuss later in the book. If you aren't updating your data in real-time, an ordinary text file is fine. You just don't want it to be formatted in HTML.

For example, suppose that you are putting a company phone directory on the Web. You can define a structured format like this:

first name|last name|department|office number|home number|location

There is one line for each person in the directory. Fields are separated by vertical bars. So a file at MIT might look like this:

and so on. From this one file, a 20-line Perl or Tcl script can generate

a public Web page showing names and office phone numbers

a public Web page for each department showing names and office phone numbers

a private Web page for each department showing names and home phone numbers

If you decide to start using a new HTML feature, you don't have to edit all these pages manually. You just need to change a few lines in the Perl or Tcl script and then rerun it to regenerate the HTML pages.
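To make this concrete, here is a sketch of such a generator script. The book suggests a 20-line Perl or Tcl script; Python is used below purely for illustration, and the directory entries, names, and phone numbers are all made up:

```python
# generate_directory.py -- sketch: turn a pipe-delimited phone directory
# into an HTML page. All data below is hypothetical.

DIRECTORY = """\
Alice|Smith|LCS|3-1234|555-0101|NE43-101
Bob|Jones|LCS|3-5678|555-0102|NE43-102
Carol|Lee|AI Lab|3-9999|555-0103|NE43-201
"""

FIELDS = ["first", "last", "dept", "office", "home", "location"]

def parse(text):
    """Return a list of dicts, one per line of the pipe-delimited file."""
    people = []
    for line in text.strip().splitlines():
        people.append(dict(zip(FIELDS, line.split("|"))))
    return people

def public_page(people):
    """Public page: names and office numbers only; home numbers stay out."""
    rows = [f"<LI>{p['first']} {p['last']} ({p['dept']}): {p['office']}"
            for p in people]
    return ("<HTML><HEAD><TITLE>Phone Directory</TITLE></HEAD>\n"
            "<BODY BGCOLOR=#ffffff TEXT=#000000>\n<UL>\n"
            + "\n".join(rows) + "\n</UL>\n</BODY></HTML>")

if __name__ == "__main__":
    print(public_page(parse(DIRECTORY)))
```

When a new HTML feature comes along, only `public_page` changes; the data file is untouched.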

The high level message here is that you should think about the structure of the information you are publishing first. Then think about the best way to build an investment in that structure and preserve it. Finally, devote a bit of time to the formatting of the final HTML that you generate.

It's Hard to Mess Up a Simple Page

People with limited time, money, and experience usually build fairly usable Web sites. However, there is no publishing concept so simple that money, knowledge of HTML arcana, and graphic design can't make it slow, confusing, and painful for users. After you've tarted up your site with frames, graphics, and color, check the server log to see how much traffic has fallen. Then ask yourself whether you shouldn't have thought about user interface stability.

CD-ROMs are faster, cheaper, more reliable, and a more engaging audio/visual experience than the Web. Why then do they sit on the shelf while users greedily surf the slow, unreliable, expensive Web? Stability of user interface.

There are many things wrong with HTML. It is primitive as a formatting language and it is almost worthless for defining document structure. Nonetheless, the original Web/HTML model has one big advantage: All Web pages look and work more or less the same. You see something black, you read it. You see something gray, that's the background. You see something blue (or underlined), you click on it.

When you use a set of traditional Web sites, you don't have to learn anything new. Every CD-ROM, on the other hand, has a sui generis user interface. Somebody thought it would be cute to put a little navigation cube at the bottom right of the screen. Somebody else thought it would be neat if you clicked on the righthand page of an open book to take you to the next page. Meanwhile, you sit there for 15 seconds feeling frustrated, with no clue that you are supposed to do anything with that book graphic on the screen. The CD-ROM goes back on the shelf.

The beauty of Netscape 2.0 and more recent browsers is that they
allow the graphic designers behind Web sites to make their sites just as
opaque and hard to use as CD-ROMs. Graphic designers are not user
interface designers. If you read a book like the Macintosh Human
Interface Guidelines (Apple Computer, Inc.; Addison-Wesley, 1993),
you will appreciate what kind of thought goes into a well-designed user
interface. Most of it has nothing to do with graphics and
appearance. Pull-down menus are not better than pop-up menus because
they look prettier; they are better because you always know exactly
where to find the Print command.

Some of the bad things a graphic designer can do with a page were possible even way back in the days of Netscape 1.1. A graphic designer might note that most of the text on a page was hyperlinks and decide to just make all the text black (text=#000000, link=#000000, vlink=#000000). Alternatively, he or she may choose a funky color for a background and then three more funky colors for text, links, and visited links. Either way, users have no way of knowing what is a hyperlink and what isn't. Often designers get bored and change these colors even on different pages of the same site.
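Concretely, the all-black trick is just a matter of <BODY> attributes (a hypothetical page, reusing this chapter's running example):

```html
<!-- Every kind of text is forced to black: the user cannot
     distinguish hyperlinks and visited links from ordinary prose. -->
<BODY BGCOLOR=#ffffff TEXT=#000000 LINK=#000000 VLINK=#000000>
My <A HREF="dogs.html">Samoyed</A> is really hairy.
</BODY>
```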

Frames are probably the worst Netscape innovation yet. The graphic designer, who has no idea what size or shape screen you have, is blithely chopping it up. Screen space is any user's most precious resource, and frames give the publisher the tools to waste most of it with ads, navigation "aids," and other items extraneous to the document that the user clicked on. What's worse, with Netscape Navigator 2.0, when the user clicked on the Back button to undo his last mouse click, Navigator would undo hundreds of mouse clicks and pop him out of the framed site altogether. Newer Web browsers handle frames a little more gracefully, but none of them handle scrolling as well as NCSA Mosaic did in 1993. In the old days, any Web site that brought up scroll bars could be scrolled down with a press of the space bar. With frames, even if there is only one scroll bar on screen, the space bar does nothing until you click the mouse in the subwindow that owns the scroll bar.
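For reference, the screen-chopping mechanism is a <FRAMESET> document that replaces the page body (a sketch; the frame names and file names are made up):

```html
<HTML>
<HEAD><TITLE>Greedy Corp.</TITLE></HEAD>
<!-- Carves the user's screen into three fixed regions,
     regardless of how big that screen actually is. -->
<FRAMESET ROWS="80,*,60">
  <FRAME SRC="ad-banner.html" NAME="ads" SCROLLING="no">
  <FRAME SRC="content.html" NAME="main">
  <FRAME SRC="nav-bar.html" NAME="nav" SCROLLING="no">
</FRAMESET>
</HTML>
```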

I'm not saying that there isn't a place in this world for pretty Web sites. Nor even that frames cannot sometimes be useful. However, the prettiness and utility of frames must be weighed against the cold shock of unfamiliar user interface that will greet the user. This comparison is very seldom done and that's a shame.

Why Graphic Designers Just Don't Get It

Graphic designers get interfaces so wrong because they never figured out that they aren't building CD-ROMs. With a CD-ROM, you can control the user's access to the content. Borrow a copy of David Siegel's Creating Killer Web Sites and note that he urges you to have an "entry tunnel" of three pages with useless slow-to-load GIFs on them. Then there should be an "exit tunnel" with three more full-page GIFs. In between, there are a handful of "content" pages that constitute the site per se.

Siegel is making some implicit assumptions: that there are no users with text-only browsers; that users have a fast enough Net connection that they won't have to wait 45 seconds before getting to the content of a site; that there are no users who've turned off auto image loading; that there is some obvious place to put these tunnels on a site with thousands of pages. Even if all of those things are true, if the internal pages do indeed contain any content, AltaVista will roar through and wreck everything. People aren't going to enter the site by typing in "http://www.greedy.com" and then let themselves be led around by the nose by you. They will find the site using a search engine, by typing a query string that is of interest to them. The search engine will cough up a list of URLs that it thinks are of interest to them. AltaVista does not think a Dave Siegel "entry tunnel" is "killer". In fact, it might not even bother to index a page that is just one GIF.

AltaVista is going to send a user directly to the URL on your server that has the text closest to the user's query string. AltaVista doesn't care that your $125/hour graphic designer thought this URL should be one-third of a frame. AltaVista doesn't care that the links to your home page and related articles are in another subwindow.

So, if you intend to get radical by putting actual content on your Web server, then it is probably a good idea to make each URL stand on its own. Throw in a link to the page author, the service home page, and the next page in a sequence if the URL is part of a linear work. Remember, the Web is not there so that you can impose what you think is cool on readers. Each reader will have his own view of the Web. Maybe that view was barfed up by a search engine. Maybe that view is links from his friend's home page. Maybe that view is a link from a personalization service that sweeps the Internet every night to find links and stories that fit his interest profile. Your task as a Web publisher is to produce works that will fit seamlessly not into the Web as you see it, but into the many Webs that your readers see.

Summary

Here's what you might have learned in this chapter:

learning basic HTML shouldn't take more than a few minutes

the more HTML you know, the uglier and harder to use your site is likely to be

HTML is not powerful enough to express the most interesting structural characteristics of your documents; consider using a database of some kind instead and generating your HTML pages programmatically

In the next chapter we'll discuss building and presenting libraries of photographs.

Reader's Comments

Using HTML now is like using machine language programming before compilers were invented. Programs such as Claris Home Page have put hand-coded HTML out of business. You don't have to learn any markup; just type and select options, as you would with any text editor.

If you think that HTML is bad as a formatting
language due to the poor quality documents that
result, think about it from the web server/client
point of view. It's even worse.

There is so much logic that has to go into parsing
an HTML page that web programs (servers, clients,
CGI scripts) have to be incredibly complex. This
is error prone. Because HTML was designed as a
human-readable formatting language (based upon
SGML, which is the original evil) it is inherently
complicated from the computer's point of view.
People forget formatting tags, or use incorrect
formatting tags, or type in the formatting tags
using incorrect syntax. There are a million
problems of this type that HTML parsers have to
deal with. More fundamentally, HTML is just a
stream of text and lacks any fundamental structure
which makes the parsing even more difficult. Plus
there are tons of ambiguities in the HTML specs
coupled with archaic conventions which must be
adhered to even though they are not specified
fully in any spec.

HTML is just a big lose, no matter which way you
look at it. I think that someone ought to come up
with a new web formatting language, and develop a
web server/client which can handle both HTML and
this new, more intelligent markup language.
Hopefully this would allow HTML to die the slow
and painful death it deserves.

discussion:
HTML is an application of SGML. SGML is good.
Part of the definition of HTML (in SGML) specifies:
MINIMIZE
OMITTAG YES
SHORTTAG YES
In SGML-speak, this specifies two useful features.
First, any tag which could be deduced from the
structure of the document may be omitted. This
includes not only </P>, but also tags such as
</LI> and </TD>, and many others. The second
feature, SHORTTAG, allows markup to be expressed
more concisely, e.g., with empty end tags as in
<EM>Shortened End Tags for Fun and Profit</>.
Unfortunately, since very few Web browsers
actually implement HTML as it is formally defined,
these features cannot be relied upon. However,
it is a Simple Matter of Programming to combine
existing SGML tools with `make' and a copy of the
HTML DTD to automatically process Web documents
through an SGML normalizer which transforms HTML
into a common subset understood by all browsers.
At the same time, other advanced SGML features
could also be employed to advantage, such as
user-defined entities. (See the FreeBSD
Project's Web pages for an example.)
For my personal Web publishing efforts, I use
PSGML, an SGML mode for Emacs, which understands
DTDs to a limited extent and is capable of
performing most sorts of normalization
automatically. It also has editing features such
as an element attribute editor, so you don't have
to look up whether IMG takes a SRC or an HREF
attribute---it will just present the full list,
and you pick the ones you want.

I agree wholeheartedly with most of your views on
website design, with all its excruciatingly
slow-loading pages and unnecessary gimmicks.
However, your comments on graphic designers are a
bit off-target. Any PROFESSIONAL designer
considers users' loading speed, bandwidth, size of
screen and so on, the same as would be done with
any printed product (public needs first). You seem
to lump together graphic designers and computer
graphics people. The first will examine the
audience's needs before thinking of ANY concept or
design idea, this being a very important part of
their training. The latter know how to visually
masturbate with Photoshop and all, with little
consideration to how it'll all come out in the
end. They are responsible for most of the
terrible sites you talk about.
Also, you seem to forget that a well-designed
(read: pretty) site will beat out in popularity
any site containing the exact same information.
People are used to "eye candy", and any crappy
page screams "amateur" to many people.
Good-looking sites, designed by experienced
graphic designers, are what make the Internet so
enjoyable today. As long as they load quickly :)

The part about people's misuse of HTML is on-target, but the bashing of HTML itself strikes a sour note.

The reason for the WWW's popularity is precisely the laxity and imperfection in HTML as opposed to SGML. It provides a low barrier to entry for the hobbyist or non-technical person. Someone feeding comma-separated lists into a database would not sustain their interest long enough to put their accumulated knowledge online. Someone doing "Save As" and finding some specialized tag set would not be able to transfer that coding to their own web page on a different topic. If browser writers had to validate and reject all non-conforming documents we'd still be on Netscape 1.0 and nobody without a CS or engineering degree would have a home page.

I've observed many struggle with SGML -- waiting 1/2 hour for a document to validate, abandoning WYSIWYG editors in favour of Notepad, typing verbose tags to convey meaningless distinctions -- this is not the kind of thing that could or should "take over the world".

Let's look at the example of "<em> preferable to <i>". Italics is a typesetter's convention to indicate emphasis. <em> is a contrived convention, takes an extra character, and conveys nothing extra. 99.9% of the population decides to italicize based on a few simple rules -- such as "is this phrase like anything else that is normally italicized".

We've heard for years, "just mark everything up with maniacal granularity and someday wonderful search engines will make life easy". Yet nobody has the inclination or funding to write search engines that separate out the 10 different reasons why one might want to emphasize a phrase. It just adds to the cognitive load of the writer. I've marked up many thousands of "program keyword" tags when everyone else was using "bold" but nobody ever wrote a search engine to reward these efforts. It's a mug's game.

The only such distinction that might be worthwhile is for book titles. One could add an automatic linking capability. Yet even then it turns out to be easier in practice to use Perl and search for a set of known strings between <i> tags and treat them as book titles, than to look for <cite> and trust that some omniscient author will encode link-target information.