I've been thinking about markdown tables a bit lately. I've had in mind to
follow up on my definition list
proposal with a second proposal for the creation and editing of simple
tables in Markdown. For better or for worse,
an aside on the
markdown-discuss mail list led to
a longish thread about a syntax for continuing lines in tables (not
to mention a long aside on the use of monospaced fonts, but I digress),
wherein I realized, after
an open-minded post
from MultiMarkdown's Fletcher Penney, that I needed to
set to working up this request for comments sooner rather than later.

Requirements

All of which is to say that this blog entry is a request for comments on a
proposed sytnax for simple tables in Markdown. The requirements for such a
feature, to my thinking, are:

By “simple tables” in that first bullet, I mean that they should look good
in 78 character-wide monospaced plain text. Anything more complicated should
just be done in XHTML. My goal is to be able to handle the vast majority of
simple cases, not to handle every kind of table. That's not to say that one
won't be able to use the syntax to create more complicated tables, just that
it might not be appropriate to do so, and many more advanced features of
tables will just have to be done in XHTML.

And by “implicit formatting” in the second bullet, I mean that the syntax
should use the bare minimum number of punctuation characters to provide hints
about formatting. Another way to think about it is that formatting hints
should be completely invisible to a casual reader of the Markdown text.

Most of the rest of the requirements I borrowed
from MultiMarkdown, with the last
bullet thrown in just to cover anything I might have missed. The MultiMarkdown
syntax appears to be a superset of
the PHP Markdown Extra syntax, so that's covered, too.

Prior Art: Databases

When I think about the display of tables in plain text, the first piece of
prior art I think of is the output from command-line database clients.
Database developers have been thinking about tables since, well, the
beginning, so it makes sense to see what they're doing. So I wrote a bit of
SQL and ran it in three databases. The SQL builds a table with an integer, a
short name, a textual description, and a decimal number. Here's the code:

My goal here was to see how the database client would format a variety of
data formats, as well as a textual column (“description”) with newlines in it
(and a Markdown list, no less!), as the newlines will force the output to
appear on multiple lines for a single row. This is one of the features that is
missing from the existing Markdown implementations, which all require that the
text all be on a single line.

The first database client in which I ran this code
was psql 8.3, the interactive
terminal for PostgreSQL 8.3. Its output looks like this:

As you can see, PostgreSQL properly right-aligned the integer and numeric
columns. It also has a very nice syntax for demonstrating continuing lines for
a given column: the colon. The colon is really nice here because it looks kind
of like a broken pipe character, which is an excellent mnemonic for a string
of text that breaks over multiple lines. Really, this is just a very
nice output format overall.

The next database client I tried was mysql 5.0, the command-line client for
MySQL 5.0. Its output looks like this:

Once again we have very good alignment of the numeric data types.
Furthermore, MySQL uses exactly the same syntax as PostgreSQL to represent the
separation between column headers and column rows, although the PostgreSQL
version is a bit more minimalist. The MySQL version just hast a little
more stuff in it

Where the MySQL version fails, however, is in the representation of the
continuing lines for the “dojigger” row. First of all, it set the width of the
“description” column to the longest value in that column, but since that
longest value includes newlines, it actually ends up being much too long—much
longer than PostgreSQL's representation of the same column. And second, as a
symptom of that problem, nothing special is done with the wrapped lines. The
newlines are simply output like any other character, with no attempt to line
up the column. This has the side effect of orphaning the price for the
“dojiggger” after the last line of the continuing description. So its
alignment is shot, too.

To be fair, PostgreSQL's display featured almost exactly the same handling
of continuing columns prior to version 8.2. But I do think that their solution
featuring the colons is a good one.

The last database client I tried was SQLite 3.6. This client is the most different of all. I set
.header ON and .mode column and got this output:

There are a few interesting features to this syntax, including support for
multiple lines of headers, multicolumn cells alignment queues, and captions. I
like nearly everything about this syntax, except for two things:

There is no support for multiline cell values.

The exlicit alignment queues are, to my eye, distracting.

The first issue can be solved rather nicely with PostgreSQL's use of the
colon to indicate continued lines. I think it could even optionally use colons
to highlight all rows in the output, not just the continuing one,
as suggested
by Benoit Perdu on the markdown-discuss list:

I think I prefer the colon only in front of the continuing cell, but see no
reason why both couldn't be supported.

The second issue is a bit more subtle. My problem with the alignment hints,
embodied by the colons in the header line, is that to the reader of the
plain-text Markdown they fill no obvious purpose, but are provided purely for
the convenience of the parser. In my opinion, if there is some part of the
Markdown syntax that provides no obvious meaning to the user, it should be
omitted. I take this point of view not only for my own selfish purposes, which
are, of course, many and rampant, but from John Gruber's original design goal
for Markdown, which was:

The overriding design goal for Markdown’s formatting syntax is to make it
as readable as possible. The idea is that a Markdown-formatted document
should be publishable as-is, as plain text, without looking like it’s been
marked up with tags or formatting instructions. While Markdown’s syntax has
been influenced by several existing text-to-HTML filters, the single biggest
source of inspiration for Markdown’s syntax is the format of plain text
email.

To me, those colons are formatting instructions. So, how else could we
support alignment of cells but with formatting instructions? Why, by
formatting the cells themselves, of course. Take a look again at the
PostgreSQL and MySQL outputs. both simply align values in their cells. There
is absolutely no reason why a decent parser couldn't do the same on a
cell-by-cell basis if the table Markdown follows these simple rules:

For a left-aligned cell, the content should have no more than one space
between the pipe character that precedes it, or the beginning of the
line.

For a right-aligned cell, the content should have no more than one space
between itself and the pipe charater that succeeds it, or the end of the
line.

For a centered cell, the content should have at least two characters
between itself and both its left and right borders.

If a cell has one space before and one space after its content, it is
assumed to be left-aligned unless the cell that precedes it or, in the
case of the first cell, the cell that succeeds it, is right-aligned.

What this means, in effect, is that you can create tables wherein you line
things up for proper display with a proportional font and, in general, the
Markdown parser will know what you mean. A quick example, borrowing from the
PostgreSQL output:

The table headers are all center-aligned, because they all have 2 or more
spaces on each side of their values

The contents of the “id” column are all right-aligned. This includes
1024, which ambiguously has only one space on each side of it, so it makes
the determination based on the preceding line.

The contents of the “name” column are all left-aligned. This includes
“thingamabob”, which ambiguously has only one space on each side of it, so
it makes the determination based on the preceding line.

The contents of the “description” column are also all left-aligned. This
includes first row, which ambiguously has only one space on each side of
it, so it makes the determination based on the succeeding
line.

And finally, the contents of the “price” column are all right-aligned.
This includes 102.98, which ambiguously has only one space on each side of
it, so it makes the determination based on the preceding line.

And that's it. The alignments are perfectly clear to the parser and highly
legible to the reader. No further markup is required.

Proposed Syntax

So, with this review, I'd like to propose the following syntax. It is
inspired largely by a combination of PostgreSQL and MySQL's output, as well as
by MultiMarkdown's syntax.

A table row is identifiable by the use of one or more pipe
(|) characters in a line of text, aside from those found in a
literal span (backticks).

Table headers are identified as a table row with the
immediately-following line containing
only -, |, +, :or
spaces. (This is the same as the MultiMarkdown syntax, but with the
addition fo the plus sign.)

Columns are separated by |, except on the header underline,
where they may optionally be separated by +, and on continuing
lines (see next point).

Lines that continue content from one or more cells from a previous line
must use : to separate cells with continued content. The
content of such cells must line up with the cell width on the first line,
determined by the number of spaces (tabs won't work). They may optionally
demarcate all cells on continued lines, or just the cells that contain
continued content.

Alignment of cell content is to be determined on a cell-by-cell basis,
with reference to the same cell on the preceding or succeeding line as
necessary to resolve ambiguities.

To indicate that a cell should span multiple columns, there should be
additional pipes (|) at the end of the cell, as in
MultiMarkdown. If the cell in question is at the end of the row, then of
course that means that pipes are not optional at the end of that row.

You can use normal Markdown markup within the table cells, including
multiline formats such as lists, as long as they are properly indented and
denoted by colons on succeeding lines.

Captions are optional, but if present must be at the beginning of the
line immediately preceding or following the table, start
with [ and end with ], as in MultiMarkdown. If
you have a caption before and after the table, only the first match will
be used.

If you have a caption, you can also have a label, allowing you to create
anchors pointing to the table, as in MultiMarkdown. If there is no label,
then the caption acts as the label.

Cells may not be empty, except as represented by the appropriate number
of space characters to match the width of the cell in all rows.

As in MultiMarkdown. You can create multiple <tbody>
tags within a table by having a single empty line between rows of the
table.

Sound like a lot? Well, if you're acquainted with MultiMarkdown's syntax,
it's essentially the same, but with these few changes:

Implicit cell alignment

Cell content continuation

Stricter use of space, for proper alignment in plain text (which all of
the MultiMarkdown examples I've seen tend to do anyway)

Allow + to separate columns in the header-demarking lines

A table does not have to start right at the beginning of a line

I think that, for purposes of backwards compatibility, we could still allow
the use of : in the header lines to indicate alignment, thus also
providing a method to override implicit alignment in those rare cases where
you really need to do so. I think that the only other change I would make is
to eliminate the requirement that the first row be made the table header row
if now header line is present. But that's a gimme, really.

Takin the original MultiMarkdown example and rework it with these changes
yields:

Comments?

I think I've gone on long enough here, especially since it ultimately comes
down to some refinements to the MultiMarkdown syntax. Ultimately, what I'm
trying to do here is to push MultiMarkdown to be just a bit more
Markdownish (by which I mean that it's more natural to read as plain text), as
well as to add a little more support for some advanced features. The fact that
I'll be able to cut-and-paste the output from my favorite database utilities
is a handy bonus.

As it happens, John Gruber today
posted
a comment to the markdown-discuss mail list in which he says (not for the
first time, I expect):

A hypothetical official table syntax for Markdown will almost certainly
look very much, if not exactly, like Michel's table syntax in PHP Markdown
Extra.

I hope that he finds this post in that vein, as my goals here were to
embrace the PHP Markdown Extra and MultiMarkdown formats, make a few tweaks,
and see what people think, with an eye toward contributing toward a (currently
hypothetical) official table syntax.

So what do you think? Please leave a comment, or comment on
the markdown-discuss list, where I'll post a synopsis of my proposal
and a link to this entry. What have I missed? What mistakes have I made? What
do you like? What do you hate? Please do let me know.

I realize that greater minds than mind have likely given a lot of thought
to how best to implement a “natural” syntax for definition lists
in Markdown. The best I’ve seen, however, is that
implemented
by PHP Markdown Extra, which is
also supported
by MultiMarkdown. Given
the prevalence of these two libraries, I’m assuming that the syntax become the
de-facto standard for definition lists in Markdown. But, to my mind at least,
it leaves something to be desired. Here’s an extended example, taken from the
PHP Markdown Extra documentation, including multiple definitions, multiple
paragraphs, and nested formatting (lists, code block):

Term 1
: This is a definition with two paragraphs. Lorem ipsum
dolor sit amet, consectetuer adipiscing elit. Aliquam
hendrerit mi posuere lectus.
Vestibulum enim wisi, viverra nec, fringilla in, laoreet
vitae, risus.
: Second definition for term 1, also wrapped in a paragraph
because of the blank line preceding it.
Term 2
: This definition has a code block, a blockquote and a list.
code block.
> block quote
> on two lines.
1. first list item
2. second list item

This format has a lot going for it, in that it covers most of the
requirements for definition lists. In particular, it allows a term to have
multiple definitions, or for multiple terms to share a definition, and for a
definition to have multiple paragraphs, nested lists, code blocks, and other
formatting. There’s only one problem with it, as far as I’m concerned: I would
never write a definition list like this in an email.

I started thinking about alternates, first by thinking about how
I would write a definition list in plain text. It would likely be
something like this:

Term 1
z002d;z002d;z002d;z002d;z002d;z002d;
This is a definition with two paragraphs. Lorem ipsum
dolor sit amet, consectetuer adipiscing elit. Aliquam
hendrerit mi posuere lectus.
Vestibulum enim wisi, viverra nec, fringilla in, laoreet
vitae, risus.
Second definition for term 1, also wrapped in a paragraph
because of the blank line preceding it.
Term 2
z002d;z002d;z002d;z002d;z002d;z002d;
This definition has a code block, a blockquote and a list.
code block.
> block quote
> on two lines.
1. first list item
2. second list item

This is much more like what I’ve actually written in the past. From the
point of view of Markdown, however, there are precedents that make it
problematic, namely:

There is no way to tell whether the paragraphs for a given term
constitute a single definition with multiple paragraphs, multiple
definitions, or some combination.

This last item never would have occurred to me, since I have never used
more than one definition per term, but have often used multiple paragraphs in
a single definition. However, when I think about the literal use of
definitions--you know, to define a term, I think about a dictionary, which of
course will offer many definitions for a single term. So clearly, there needs
to be a way to distinguish definitions from paragraphs.

So I started thinking about it some more, trying to figure out why I don’t
like the PHP Markdown Extra syntax, since it solves this problem by using “:”
to identify a term. But then it hit me: Definitions are just a list, and the
“:” is the bullet that identifies a list item. PHP Markdown Extra actually
reinforces this interpretation, since it in all ways makes definitions conform
with basic list syntax. This is a good symmetry and easy to remember.

So why do I hate the syntax? Once I realized this bit about the list, I
immediately knew what I hated about it: “:” is a shitty bullet. As a native
speaker and writer of American English, I no doubt bring my cultural biases to
the table, but I would never use a colon at the beginning of
something, only at the end. It just doesn’t belong there, hanging out in
space. It’s too subtle, conveys no meaning that I can see, and thus have no
obvious mnemonics to make it memorable.

So I started hunting around my keyboard for an alternate, and stumbled almost
at once on the tilde, “~”. Consider this example, which in all ways is just like
the PHP Markdown Extra syntax, except that the colon is replaced with a tilde:

Term 1:
~ This is a definition with two paragraphs. Lorem ipsum
dolor sit amet, consectetuer adipiscing elit. Aliquam
hendrerit mi posuere lectus.
Vestibulum enim wisi, viverra nec, fringilla in, laoreet
vitae, risus.
~ Second definition for term 1, also wrapped in a paragraph
because of the blank line preceding it.
Term 2:
~ This definition has a code block, a blockquote and a list.
code block.
> block quote
> on two lines.
1. first list item
2. second list item

Well, okay I did add the trailing colon to the terms, but that’s
just more natural to me, and could be optional. But otherwise, it’s the tilde
that’s different. This to me is much more natural. I’d be perfectly
willing to write a definition list in email messages this way (and I think I
will from now on). It works well with shorter definition lists, too, of
course:

Apple:
~ Pomaceous fruit of plants of the genus Malus in the family Rosaceae.
~ An american computer company.
Orange:
~ The fruit of an evergreen tree of the genus Citrus.

See how nice that is? So, you might ask, why the tilde rather than the
colon? As I said before, the colon doesn’t look right out there in front, it’s
not a “natural” way to write definitions because it’s not a natural choice for
a bullet. The tilde, however, is perfectly comfortable hanging out at the
beginning of a line as a bullet, resembling as it does the dash, already used
for unordered lists in Markdown. Furthermore, it’s already used in
dictionaries. According to Wikipedia, The swung
dash is often used in dictionaries to represent a word that was mentioned
before and is understood, to save space. Not an exact parallel, but at
least the tilde’s cousin the swung dash has to do with definitions! Not only
that, but in mathematics, according to Wikipedia again, the
tilde is often used to denote an
equivalence relation between two objects. That works: a definition is a
series of words that define a term; that is, they are a kind of
equivalent!

So I would like to modestly propose to the Markdown community that the
PHP Markdown Extra definition list syntax be adopted as a standard with this
one change. What do you think?