ProleText Specification

ClariNet's "Invisible Formatting" language, known as ProleText, provides a
mechanism to embed format control information in documents through the use
of normally unseen characters, namely spaces and tabs at the end of lines
and in apparently "blank" lines.

Spaces and tabs, so long as they don't take a line past the 80 column mark,
do not affect the display of ordinary text documents viewed in a fixed-width
font on most computer screens and on printers. As such, documents can be
prepared that look perfectly normal and can be viewed with almost every
file viewer in existence. However, the documents contain hidden information
on paragraph arrangement, headers, lists and formatting, which can be used
by an intelligent viewing program to display them in a more pleasant or
structured way.

In addition, this specification formalizes certain visible codes which
affect formatting within lines, to allow the formal specification of bold
face and italics/underlining as well as the insertion of hyperlinks and
graphics.

The functionality of the "language" is a subset of HTML. Documents
prepared in Proletext can always be mapped well and easily to HTML.
In general, they can also be mapped to most other rich text file formats,
including Microsoft RTF and many formatter languages, unless they use the
"raw HTML" tag which allows the insertion of unfiltered HTML into the
document.

In general, the semantics of formatting in Proletext match the semantics
of formatting for the corresponding HTML. The same principles of leaving
many layout decisions to the formatting program apply.

Many documents can also be mapped from HTML or other languages to this format
with some loss of information.

In this document, trailing spaces on lines are known as tags. The most
common tags have been assigned the shortest encodings. For example, a
"continuation of paragraph/block" encoding is a single trailing space.

There is one special case. The encoding for "Show this line verbatim, in
a fixed-width font" is to have no trailing space. This means that even
if segments of a document have their trailing space removed, they still
display in a readable way.

Reference Implementation

A freely available reference implementation translates Proletext to HTML.
It is known as inform and is available in inform.tar.Z on my FTP site.

Lines

Formatted lines in a Proletext document must be kept to no more than 60
characters, allowing up to 19 columns of tag. In general, column 80 should
not be used as many displays wrap in this event. Non-formatted lines
may be up to 79 columns long, and may even be up to 160 characters long if
they are unconcerned about the effect of wrapping. This is particularly
true if they contain links. The behaviour of processors on lines over 160
characters is undefined, though "undefined" never means that the processor
will trap, abort, or perform memory overwrites or illegal instructions.

Tag Encoding

In theory, over 300 codes can be defined on lines that are 60 columns long
and have 19 columns for the tag. Tabs, an essential part of tag encoding,
are assumed to be placed every 8 characters, i.e. at columns 1, 9, 17 and so
on (1-origin). As such, a tab may not be used past column 72.
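
The column arithmetic above can be checked mechanically. The following Python sketch is not part of the reference implementation; it expands tabs at 8-column stops and reports the rightmost column a line occupies. The names last_column and fits are illustrative, and the limit of 79 is taken from the Lines section.

```python
TAB_WIDTH = 8      # tab stops at columns 1, 9, 17, ... (1-origin)
MAX_COLUMN = 79    # column 80 is avoided because many displays wrap there

def last_column(line: str) -> int:
    """Return the 1-origin column of the last cell the line occupies."""
    return len(line.rstrip("\n").expandtabs(TAB_WIDTH))

def fits(line: str) -> bool:
    """True if the line, tag included, stays within column 79."""
    return last_column(line) <= MAX_COLUMN
```

A tab typed in column 72 ends at column 72, while a tab typed in column 73 would run through column 80, which is why tabs are not used past column 72.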

Almost a billion encodings are possible on blank lines (line-tags). In
theory, if we wished to make the requirement that tabs be set in every column,
far more encodings could be used. The use of control characters, many of which
do not display in any viewer, could expand the space vastly -- so vastly that
any line-based encoding feature could be put in ProleText -- but this is
not the goal of this system.

Lines shorter than 60 columns could of course have many more encodings.
However, the need is not great. A future version of the language might well
shorten the line length to gain more encodings.

For the purposes of this specification, the tags are mapped to integers
by encoding each unit (a run of spaces and a possible trailing tab) into
4 bits. Each 4-bit nybble encodes the number of spaces, plus 1. A nybble
of 0 indicates that the tag is over, and that no tab is present after the
spaces of the previous unit. The low-order nybble is the first sequence
of spaces.

This encoding into integers is here only for the purpose of the reference
implementations and the definitions in the header file invis.h. This
encoding has holes, and does not encode all possible tags, and not all
integers are tags. However, it encodes more than enough tags for current
and foreseen purposes.

Tags are thus defined in the header file with tuples. For example,
the tag (3,2,1) means 3 spaces, a tab, 2 spaces, a tab, one space.
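
As an illustration of the tuple notation and the nybble packing, here is a minimal Python sketch. It follows the description in this section literally (each nybble is the number of spaces plus 1, low-order nybble first); the actual values in invis.h may differ, and the function names are assumptions made for this example.

```python
def tag_to_whitespace(tag):
    """Render a tag tuple as trailing whitespace: space runs joined by tabs."""
    return "\t".join(" " * n for n in tag)

def whitespace_to_tag(ws):
    """Recover the tuple of space-run lengths from trailing whitespace."""
    if ws == "":
        return ()          # no trailing whitespace: no tag at all
    return tuple(len(run) for run in ws.split("\t"))

def tag_to_int(tag):
    """Pack a tag tuple into an integer, one nybble per unit, low-order first.

    Each nybble holds (number of spaces + 1); the zero nybble above the
    last unit acts as the terminator.
    """
    code = 0
    for i, n in enumerate(tag):
        if not 0 <= n <= 14:
            raise ValueError("a nybble can hold at most 14 spaces plus 1")
        code |= (n + 1) << (4 * i)
    return code
```

Under these assumptions the tag (3,2,1) is the whitespace "3 spaces, tab, 2 spaces, tab, 1 space" and packs to the nybbles 4, 3, 2, that is, the integer 0x234.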

A very few tags (mostly line-tags) have arguments, and in this case the nybble
encoding is important, since any tag with the proper low order nybbles is
considered an instance of the tag, and the high order nybbles provide the
arguments.

Start/End line-tags

All Proletext blocks must start with a HEADER line-tag. This is a
special tag that has an argument. All lines beginning with (2,2,0)
(space-space-tab-space-space-tab-[optional tab and more tuples]) are
header tags. They indicate the start of a Proletext block. While normally
an entire document will be in Proletext and as such this line will be the
first line, it is possible to have a document slip in and out of Proletext,
and more notably, in and out of different versions of Proletext.

It is possible that the header line-tag may appear out of band, in message
headers. In this case, the software would start processing a document
assuming it was in Proletext from the start, with the appropriate version
information. How to do this is beyond the scope of this spec. It was
decided to include the start line-tag so that Proletext documents could
exist outside of any type headers, including Mime type headers. (Sadly,
while the Mime spec states that unknown Mime "text" types should be treated
as plain text, some popular software does not do this.) This way Proletext
documents can be kept in ordinary files, mailed or posted to news easily.
The worst case is that they simply get viewed as ordinary text.

The header line-tag comes with 3 additional elements in the tuple.
A full header tag is (2,2,0,docmajor,docminor,minversion).

docmajor is the major component of the Proletext version number
for this document. This means currently only 15 major version numbers are
possible. docminor is a minor component of this version number.

The value minversion is the level of the earliest processor that
can handle this document. It is not expected that this will commonly be
more than 0. For it to change, there must be such a major revision of the
language so that older processors will simply be unable to handle the new
documents at all. Remember that normally old processors can handle new
features by just displaying them as plain text. However, should a new
functionality, such as forms, be added, in a way that doesn't match the
extensible operators feature, it is possible that documents in this format
would declare to old processors that they should not attempt to format
the document. This could also be used if the fundamentals of the language
are changed.
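
A processor's handling of the header tuple might look like the following Python sketch. It works on the tuple form of the tag rather than on the integer encoding. The parameter processor_level (the level of the processor doing the parsing) and the choice to read missing arguments as 0 are assumptions of this example, not part of the specification.

```python
HEADER_PREFIX = (2, 2, 0)

def parse_header(tag, processor_level=0):
    """Return (docmajor, docminor, may_format) for a HEADER tuple, else None."""
    if tuple(tag[:3]) != HEADER_PREFIX:
        return None
    # Missing arguments are read as 0 here; that default is an assumption.
    args = list(tag[3:6])
    args += [0] * (3 - len(args))
    docmajor, docminor, minversion = args
    # A processor older than minversion should not attempt to format.
    return docmajor, docminor, processor_level >= minversion
```

For example, parse_header((2, 2, 0, 1, 0, 0)) yields (1, 0, True), while a document declaring minversion 3 yields may_format False on a level 1 processor.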

The TRAILER line-tag indicates the end of Proletext. If this is the
last line in a document, it is not needed. Lines after this should not
be formatted until another HEADER appears. Note that this could be
used to have different parts of a document be formatted in different
versions of Proletext.

Text Blocks & Continuation

Like most text formatting languages, text is processed in blocks or paragraphs,
a series of lines that will be joined together. In Proletext 1.0, the
first line in any block contains the tag for the block. Any lines to be
joined to that first line (continuation lines) get a tag of CONTINUATION
(1). This allows processing software to know from the first line what it
is going to do with a block or paragraph.

A very small number of blocks treat the different lines of a multi-line
block differently.

Truly blank lines indicate a paragraph break.
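
The grouping rule can be sketched in Python as follows. This is a simplified illustration, not the reference implementation: lines with no visible text (blank lines and line-tags) merely end the current block here, and the function names are assumptions.

```python
CONTINUATION = (1,)

def split_line(line):
    """Split a line into (visible text, trailing-whitespace tag tuple)."""
    body = line.rstrip("\n")
    text = body.rstrip(" \t")
    ws = body[len(text):]
    tag = tuple(len(run) for run in ws.split("\t")) if ws else ()
    return text, tag

def group_blocks(lines):
    """Group text lines into (tag, [texts]) blocks using CONTINUATION tags."""
    blocks = []
    open_block = False
    for line in lines:
        text, tag = split_line(line)
        if not text:
            open_block = False   # blank lines and line-tags end the block
            continue
        if tag == CONTINUATION and open_block:
            blocks[-1][1].append(text)
        else:
            blocks.append((tag, [text]))   # first line carries the block tag
            open_block = True
    return blocks
```

Because the block's tag is on its first line, a processor built this way knows what to emit before it has read the continuation lines.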

General Tags

PARA (2)

This block is an ordinary paragraph. HTML: <P> [Block] </P>

BREAK (0,0)

This is ordinary text (formatted as desired) with a line break after it.
HTML: [Block] <BR>

MONO (No tag)

This text is to be output verbatim, in a fixed width font. It may be
tabular material. In HTML, blocks of this text should be bounded with
<PRE> ... </PRE> or <XMP> ... </XMP> tags.

CENTHEAD (1,0)

The block is to be centered, and is a header of some sort.

H1 (2,0)

The block is a first level header. HTML: <H1> [Block] </H1>

H2 ... H5 (3,0) ... (6,0)

The blocks are lower level headers, levels 2 through 5, as the HTML
header tags.

TITLE (7,0)

The block is a document title. Same semantics as HTML <TITLE> [Block] </TITLE>

H1TITLE (8,0)

The block is both top level header and document title. However, only the
first line of a multi-line block is made the document title. The entire
block is a top level header.

LI (3)

The block is an element in a list, or the definition part of a term/definition
pair in a definition block. HTML is either <LI> [Block] or
<DD> [Block] depending on context. If the first few characters of
the block, after initial whitespace are "* ", "o " or "nnn)" or "nnn."
where nnn is a decimal number or one or two character alphanumeric string,
these characters should be removed, or optionally interpreted, to be
replaced by the list enumeration method being used.

POINT (8)

The block is the term portion of a term/definition pair in a definition block.
HTML: <DT> [Block]

RAW (4)

This block contains raw HTML. Non-HTML display programs should do their
best to display this block, but the behaviour of this block in non-HTML
programs is undefined.

COMMENT (5)

This block should not be included in the formatted output.

LINK (6)

This block contains a hypertext link to a URL. The URL should be repeated
as the selection text for the user. If the URL begins, not with a protocol,
but with the string "www.", it should be prefaced with the protocol string
"http://" as it is presumed to be a web server.

LINK2 (9)

This is a special form of the LINK tag. The block should be multi line.
The first line is the URL. Subsequent lines are the text that should be
presented to the viewer to indicate the link. If the block is single line,
it is treated like a LINK.

IMAGE (7)

The block contains the URL of an image file to be inserted into the document,
along with any options for that insertion. In HTML: <IMG src=[Block] >

HR (0,1)

The block -- usually single line -- represents a horizontal line or other
separator in the document. Usually in HTML the contents of the block will
be discarded and replaced with an <HR> HTML tag. However, processors may
analyse the block to decide which type of rule is appropriate at their
discretion.

NOTE (1,1)

This block should be treated as a note, to be somehow set off specially
from the main text. In HTML: <NOTE> [Block] </NOTE>
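
Two of the tags above involve small text transformations: LI asks for leading list markers to be removed, and LINK and LINK2 ask for "www." URLs to be prefixed with "http://". The Python sketch below illustrates both. It is not taken from the reference implementation; the function names, the regular expression, and the choice to show the prefixed URL as the link text are assumptions.

```python
import re
from html import escape

# Leading whitespace, then "* ", "o ", or a decimal number / one or two
# alphanumerics followed by ")" or ".".
_MARKER = re.compile(r"^\s*(?:[*o] |(?:\d+|[A-Za-z0-9]{1,2})[.)])\s*")

def strip_list_marker(text):
    """Remove a leading list marker from the first line of an LI block."""
    return _MARKER.sub("", text, count=1)

def normalize_url(url):
    """Prefix "http://" when the URL starts with "www." instead of a protocol."""
    url = url.strip()
    return "http://" + url if url.startswith("www.") else url

def render_link(lines):
    """Render a LINK or LINK2 block (a list of text lines) as an HTML anchor.

    The first line is the URL; any further lines are the link text.
    With a single line, the URL itself is used as the link text.
    """
    url = normalize_url(lines[0])
    label = " ".join(part.strip() for part in lines[1:]) or url
    return '<A href="%s">%s</A>' % (escape(url, quote=True), escape(label))
```

Note that the marker rule as written also matches a short word followed by a period at the start of a block (for example "No."), so a careful processor may want a stricter test.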

Undefined Tags

Should an unknown tag be detected, the lower 3 bits of its integer encoding
in the reference implementation should be examined. In general, this is
the number of spaces at the start of the tag, from 0 to 7. Based on this
number:

4

Treat the block as a verbatim, fixed-width block, but append some notification
to the lines that they are probably formatted incorrectly. A hypertext link
to the URL http://proletext.clari.net/prole/lineform.html, with the link
text "[Bad Format]", is recommended.

5

Treat the block as a BREAK block.

6

Treat the block as a PARA block.

7

Treat the block as a COMMENT block.

Others (0 .. 3)

Treat the block as a MONO block.
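
The fallback rules above amount to a small dispatch on the low 3 bits of the integer encoding. A minimal Python sketch follows; the returned strings are illustrative labels chosen for this example, not identifiers from invis.h.

```python
def fallback_for_unknown_tag(code):
    """Map an unknown tag's integer encoding to a known block treatment."""
    low = code & 0x7                 # lower 3 bits of the integer encoding
    if low == 4:
        return "MONO+BADFORMAT"      # verbatim, plus a "[Bad Format]" notice
    if low == 5:
        return "BREAK"
    if low == 6:
        return "PARA"
    if low == 7:
        return "COMMENT"
    return "MONO"                    # values 0 .. 3
```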

Line-tags

The following line-tags affect more global behaviour. Unlike HTML, some
context is kept. Some of these tags start text regions with globally
different behaviour. Instead of having start and end tags for each
type of text region, many of these tags cause the old formatting
attributes to be pushed on a stack, so that they can be restored by an
END line-tag. This is not like HTML, but like many other formatting
languages. It can easily be mapped to HTML.

BLANK (0)

A truly blank line-tag simply implies a paragraph break. However, if
there was a paragraph break in the previous text block, a paragraph break
should not be generated. Multiple BLANK tags should cause multiple
paragraph breaks after this one elision, however.

END (1)

End a text region, restore to the state before the region began.

END2 (2)

End two text regions at once.

END3 (3)

End three text regions at once.

END4 (4)

End four text regions at once.

TABLE (1,2)

This is not currently defined. However, for the future, it is planned that
the text will be a table, and that fancy processors will parse the table in
its ASCII, monospace form, and render it nicely. Current processors should
just render the table in fixed-width.

UL (3,1)

The text region is an unordered list, with each element marked with a LI
tag.

OL (3,2)

The text region is an ordered list, with each element marked with a LI
tag.

DIR (3,3)

The text region is a directory of items, with each element marked with a LI
tag. Items should be short, and can be put in columns.

RAW (3,4)

PRE (1,1)

QUOTE (3,5)

The text region is a block quote, usually intended to be indented or specially
represented.

CENTER (3,6)

All text in the text region should be centered, if this is appropriate.

DEFL (3,7)

The text region is a definition list, consisting of "terms" (blocks with
a POINT tag) and "definitions" (blocks with an LI tag).

MAGLINKS (4,1)

At this point in the document, the processor is encouraged, but not required,
to put in a hypertext link to provide help on the concept of invisibly
formatted documents. The URL
http://proletext.clari.net/prole/help.html is
available for this purpose. A link is not required. Help may also exist
on menus or in some other location. This line-tag simply indicates the
document authors felt help might be particularly useful on this document.

PLAINLINK (5,1)

At this point in the document, the processor is strongly encouraged
to put in a hypertext link or other facility to allow the user to view the
document as plain text, without any formatting processing. Document authors
will use this line-tag when they are converting large numbers of documents,
and fear that there may be formatting errors -- particularly the rendering
of tabular data as ordinary text to be line-wrapped. As this renders the
tabular data unreadable, a link or menu item to allow the viewing of a
document in plain text will allow the table to be viewed. Viewers for
Proletext are actually encouraged to always have a facility to turn
Proletext viewing off, in the rare event that an unformatted document
accidentally contains a HEADER tag.

ANCHOR (4,2)

The processor is to place a hypertext anchor at this point, with the name
"a#" where "#" is the index number of the anchor, i.e. the first anchor is
"a0" and the second is "a1" and so on.

TRAILER (2,3,0)

This marks the end of Proletext processing. It is not needed at the end
of a document. Subsequent lines should be treated as plain text, until
a Proletext HEADER line-tag.

EMPTY (2,5,0)

All tags starting with this 3-tuple are EMPTY line tags. The 3-part
2-sp-tab-5sp-tab-0sp is removed, and the remaining tag is used as an
ordinary (text line) tag, applied to a blank line. This allows the
creation of list elements, titles, etc. that are blank. If the plain
tag does not make sense with a blank line, the actions of this tag are
undefined.
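
The stack behaviour shared by the region line-tags and the END family can be sketched as below. This Python class is an illustration only; the class and method names are assumptions, and regions are identified by their tag names for readability.

```python
class RegionStack:
    """Track nested text regions opened by line-tags such as UL or QUOTE."""

    def __init__(self):
        self._stack = []

    def begin(self, region):
        """Push a region name (e.g. "UL") when its line-tag is seen."""
        self._stack.append(region)

    def end(self, count=1):
        """Close up to `count` regions, innermost first; return their names."""
        closed = []
        for _ in range(min(count, len(self._stack))):
            closed.append(self._stack.pop())
        return closed

    @property
    def current(self):
        """The innermost open region, or None outside any region."""
        return self._stack[-1] if self._stack else None
```

When mapping to HTML, the names returned by end() tell the processor which closing tags to emit, so END2 inside a UL nested in a QUOTE closes the list first and then the quote.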

Line-Tag definition

Documents that use new tags but which wish them to be handled by earlier
processors and viewers may use a simple macro facility to give the
earlier processors some idea about how to handle the tag.

Any line-tag whose integer representation in the reference implementation
has bit 0x80000000 (hexadecimal) set is a definition for a new line-tag.
If the processor already knows how to handle such a line-tag, it should
ignore the definition.

The definition comes on the next line, which should otherwise be ignored.
The line-tag of the next line becomes the definition for the new line-tag.
If the new line-tag is encountered, the old processor should implement it
with the old tag it has been mapped to.

This allows documents to define a new line-tag and say, "old processors,
please just treat this as PRE" or any other existing tag. Definitions
should nest, so that in theory a document could describe a chain of mappings,
and the processor should use the highest line-tag in the chain that it knows.

Mapping chains may not be more than 10 levels deep.
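
Resolving such a chain is a bounded lookup. The Python sketch below assumes the processor has already collected the definitions into a mapping from new line-tag to the tag it is defined as; that mapping, the set of known tags, and the function names are assumptions of this example. Tags are shown as names for readability, although a real processor would use the integer codes.

```python
DEFINITION_BIT = 0x80000000
MAX_CHAIN = 10

def is_definition(code):
    """True if a line-tag's integer encoding marks it as a definition."""
    return bool(code & DEFINITION_BIT)

def resolve_line_tag(tag, definitions, known):
    """Follow definition mappings until a line-tag the processor knows.

    Returns the first known tag in the chain, or None if the chain ends,
    loops, or runs deeper than MAX_CHAIN levels.
    """
    for _ in range(MAX_CHAIN + 1):
        if tag in known:
            return tag
        if tag not in definitions:
            return None
        tag = definitions[tag]
    return None
```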

Of course, this can result in lots of "meaningless" blank lines at the
start of a document in the extreme case. It is hoped that this will not
become too common.

Undefined line-tags

If a processor detects an undefined line-tag, it should examine once again
the lower 3 bits of its reference implementation encoding, which is to
say, the number of spaces starting the line-tag, from 0 to 7. Depending
on this number:

Consider also inserting
a link or other mechanism to allow the document to be viewed unformatted.

7

Treat as a RAW text region, to be presented unformatted. Push this state
on the stack, so the next END clears it.

Inline substitutions

Text in general lines is considered plain, so if mapping to HTML, characters
such as "<" and "&" should be properly treated for viewing.

However, certain special mappings are defined to insert codings within lines.
These of course can't be entirely invisible.

Bold-STRONG

Bold text should be bounded with " *" to turn on bold and "* " to turn it
off. The space in the latter case may be the end of line, and in particular
must not be present at the end of a line to avoid interfering with tags.

The character on the other side of the star must not be a space or a
star, so that " * " has no special meaning, and " **" and "** " have no
special meaning.

It is recommended that the HTML <STRONG> tag be used instead of bold.

Bold text does not extend past the end of a plain text line.
It is automatically turned off at the end of the line and must be turned
on again at the start of the next line.

Italics

The semantics for italics/underlining are the same as for bold-strong, but
the character "_" is used instead of "*".
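
The on/off rules for bold and italics can be expressed as a single left-to-right scan of one line. The Python sketch below is one possible reading: it treats the start and end of the line as if they were spaces, uses <I> for italics (the specification does not name an HTML tag for them), and ignores the "#" escapes and HTML quoting. All of these choices, and the function name, are assumptions.

```python
def render_emphasis(line, mark="*", tag="STRONG"):
    """Convert ' *text* ' style markers on one line into HTML tags."""
    out, on, n = [], False, len(line)
    for i, c in enumerate(line):
        if c == mark:
            prev = line[i - 1] if i > 0 else " "
            nxt = line[i + 1] if i + 1 < n else " "
            # Turn on: space before, and neither space nor mark after.
            if not on and prev == " " and nxt not in (" ", mark):
                out.append("<%s>" % tag)
                on = True
                continue
            # Turn off: space (or end of line) after, non-space before.
            if on and nxt == " " and prev not in (" ", mark):
                out.append("</%s>" % tag)
                on = False
                continue
        out.append(c)
    if on:
        out.append("</%s>" % tag)   # emphasis never extends past the line
    return "".join(out)
```

Calling render_emphasis(line, "_", "I") applies the same rules to italics.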

Special escapes

#*

Outputs a "*" for those who just must encode " *" without turning on bold.

#_

Outputs a "_" for those who just must encode " _" without turning on italics.

#&

Outputs a "#", the odd encoding so that runs of ####### can be used at their
proper size.

#<

Outputs "<a href=" -- and if the following text starts with "www." also
includes the string "http://" after the "href=" portion. In non HTML
processors, this is the sign that a URL is beginning.

#> or #}

Outputs a raw > into the HTML stream. Non-HTML processors should mark this
as the end of a URL or Image inclusion, whichever is currently pending.

#:

Outputs a raw </A> into the HTML stream. Non-HTML processors should consider
this the end of the text contents of a link. Thus a block will include
text like #< www.clari.net #> ClariNet home page #: and the processor
is expected to duplicate the HTML semantics of:
<A href=http://www.clari.net> ClariNet home page </A>

#{

Outputs <img src= into the raw HTML text stream, indicating an image URL
to be inserted into the text, along with options, to be terminated by
a # escape sequence.
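
For an HTML processor, the escapes above are close to a table lookup. The following Python sketch performs the substitution in one pass so that the output of one escape is never re-read as another. Dropping the whitespace between "#<" and a following "www." is an assumption made to match the example output shown for "#:"; the function name and table are likewise illustrative.

```python
import re

_SIMPLE = {
    "#*": "*", "#_": "_", "#&": "#",
    "#<": "<a href=", "#>": ">", "#}": ">",
    "#:": "</A>", "#{": "<img src=",
}

# Either "#<" followed (after optional spaces) by "www.", or any escape.
_ESCAPE = re.compile(r"(?P<www>#<\s*(?=www\.))|(?P<esc>#[*_&<>}:{])")

def expand_escapes(text):
    """Expand the "#" escape sequences in one line into raw HTML."""
    def repl(m):
        if m.group("www"):
            return "<a href=http://"
        return _SIMPLE[m.group("esc")]
    return _ESCAPE.sub(repl, text)
```

A run such as "#######" contains no escape and passes through unchanged, which is the reason "#" itself is written "#&".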

URL escapes

A clear URL in the text not being handled in another way should be mapped
to <a href="URL">URL</a>, the way that most web browsers handle such URLs
in plain text articles they display. A URL must begin with a protocol;
internal document URLs will not be detected this way.
The character before the URL affects how it should be parsed. If it is a
single or double quote, the URL should be parsed to the closing quote or
end of line (no URL may take more than one line.) If the character before
the URL is an open bracket/paren/anglebracket/brace, then the URL terminates
on the appropriate closing character or whitespace. Whitespace terminates
any URL not enclosed in quotes.
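
The termination rules can be sketched as a scan over one line. In the Python example below, "begins with a protocol" is interpreted as a scheme name followed by "://", which is an assumption (it would miss forms such as "mailto:"); the function name is also illustrative. Trailing punctuation is not trimmed, since the text above does not call for it.

```python
import re

_CLOSERS = {"(": ")", "[": "]", "<": ">", "{": "}"}
_START = re.compile(r"[A-Za-z][A-Za-z0-9+.-]*://")

def find_urls(line):
    """Return the URLs found in one line, using the preceding character
    to decide where each URL ends."""
    urls = []
    pos = 0
    while True:
        m = _START.search(line, pos)
        if not m:
            return urls
        start = m.start()
        before = line[start - 1] if start > 0 else ""
        end = len(line)                      # default: runs to end of line
        if before in ("'", '"'):
            closing = line.find(before, m.end())
            if closing != -1:
                end = closing                # quoted: up to the closing quote
        else:
            stop = " \t" + _CLOSERS.get(before, "")
            for j in range(m.end(), len(line)):
                if line[j] in stop:          # whitespace or matching closer
                    end = j
                    break
        urls.append(line[start:end])
        pos = end
```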

Better inline syntax

It is admitted that a better syntax for inline processing is desired.
At present, it was not desired to require processors to have a sophisticated
parser that could detect complex but pretty looking multi-line inline
syntaxes for links, images and other special attributes and tags. Future
versions of the system may support this, as well as table processing.