Paragraph detection

AscToHTM can automatically detect paragraphs in your document.
Normally this is done by detecting blank lines between paragraphs,
but when there are no blank lines other features such as short lines
at the end of a paragraph and an offset at the start of each new
paragraph may also be taken into account.
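
As an illustration only, the blank-line rule and the first-line offset fallback could be sketched in Python as below. The function name `split_paragraphs` and the exact fallback rule are assumptions made for this example; they are not AscToHTM's actual algorithm, which also weighs short last lines.

```python
def split_paragraphs(text):
    """Split text into paragraphs, each returned as a list of stripped lines.

    A blank line always ends a paragraph. If the text has no blank lines
    at all, a line indented further than the line before it is treated as
    the offset first line of a new paragraph (an assumed fallback rule).
    """
    lines = text.splitlines()
    has_blank = any(not line.strip() for line in lines)
    paragraphs, current = [], []
    prev_indent = None
    for line in lines:
        if not line.strip():
            # Blank line: close the paragraph in progress, if any.
            if current:
                paragraphs.append(current)
                current = []
            prev_indent = None
            continue
        indent = len(line) - len(line.lstrip(" "))
        if (not has_blank and current and prev_indent is not None
                and indent > prev_indent):
            # No blank lines anywhere: use the first-line offset instead.
            paragraphs.append(current)
            current = []
        current.append(line.strip())
        prev_indent = indent
    if current:
        paragraphs.append(current)
    return paragraphs
```

With blank lines present only the first rule applies; the offset rule is used solely for documents that contain no blank lines.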

Indentation detection

AscToHTM performs statistical analysis on the document to determine
at what character positions indentations occur. The software attempts to
honour your indentation pattern by using <BLOCKQUOTE> .. </BLOCKQUOTE>
markup; however, the effect can only ever be approximate.

In calculating the indent positions AscToHTM first converts all
tabs to spaces. This may result in unexpected indent positions,
but shouldn't normally be a problem. If it is, adjust the
Tab size policy.

AscToHTM may reject indentations that appear too close together,
so as to keep the number of indent levels manageable.
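
The steps described above (tabs to spaces, counting indent positions, rejecting positions that are too close together) might look roughly like this in Python. The name `indent_positions` and the parameters `tab_size`, `min_gap` and `min_count` are hypothetical and chosen for this sketch; the thresholds AscToHTM really uses are not documented here.

```python
from collections import Counter


def indent_positions(lines, tab_size=8, min_gap=2, min_count=2):
    """Return the indent columns that occur often enough to be kept.

    Tabs are expanded first, then the leading-space count of every
    non-blank line is tallied. Positions seen fewer than min_count times
    are ignored, and a position closer than min_gap columns to the last
    accepted one is rejected to keep the number of levels manageable.
    """
    counts = Counter()
    for line in lines:
        expanded = line.expandtabs(tab_size)
        if expanded.strip():
            counts[len(expanded) - len(expanded.lstrip(" "))] += 1
    positions = []
    for pos in sorted(p for p, n in counts.items() if n >= min_count):
        if not positions or pos - positions[-1] >= min_gap:
            positions.append(pos)
    return positions
```

Changing `tab_size` in this sketch moves the detected positions, which is the effect the Tab size policy is described as having.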

You can override the analysis by specifying your own indentation
policy. This can sometimes be useful to add an extra indentation
level, or to better match up bullet paragraphs with non-bullet
paragraphs.

Should the analysis fail, you can override any and all of these via the
analysis bullet policies.

AscToHTM will use <UL>..<LI>..</UL> or <OL>..<LI>..</OL> markup for
bullets. This has the effect of putting the bulleted text one level of
indentation to the right of the current text.

Bullet paragraphs

AscToHTM will attempt to detect bullet paragraphs, that is, paragraphs
that belong to the bullet point. To do this it attempts to match
the indentation of follow-on lines with that past the bullet
character(s) on the bullet line itself.

Currently this detection only stretches to the paragraph containing
the bullet.

Possible problems

Numbered bullets may sometimes get confused with numbered
sections. This can be corrected by switching off numbered
sections (if there aren't any), replacing the numbered bullets
by letters or roman numerals, or by moving the numbered bullets
to a different indentation level from the section numbers.

AscToHTM currently only detects the first paragraph
belonging to a bullet. If the bullet has several paragraphs
there may be alignment problems, as the positioning of the second
and subsequent paragraphs will depend on the indentation policy.
Sometimes careful balancing of the indentations and the indentation
policies can sort the problem.

Bullet chars

Bullet chars are lines of the type

- this is a bullet line
- this is a bullet paragraph
  because it carries over onto
  more lines

That is, a single character followed by the bullet line. AscToHTM
can determine via statistical analysis which character, if any, is being
used in this way. Special attention is paid to the '-' and 'o'
characters.
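
A rough sketch of how such a statistical guess could be made is shown below. The function `guess_bullet_char`, the `min_count` threshold and the rule that only 'o' is accepted among alphanumeric characters are assumptions for illustration, not the program's real analysis.

```python
from collections import Counter


def guess_bullet_char(lines, min_count=2):
    """Guess which single character, if any, is used as a bullet.

    A candidate line starts (after indentation) with one character, a
    space, and then some text. Letters and digits are ignored, except
    for 'o', which is commonly used as a bullet.
    """
    counts = Counter()
    for line in lines:
        stripped = line.lstrip()
        if len(stripped) > 2 and stripped[1] == " " and stripped[2:].strip():
            ch = stripped[0]
            if ch == "o" or not ch.isalnum():
                counts[ch] += 1
    if not counts:
        return None
    ch, n = counts.most_common(1)[0]
    return ch if n >= min_count else None
```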

Numbered bullet detection

AscToHTM can spot numbered bullets. These can sometimes be confused
with section headings in some documents. This is one area where
the use of a document policy really pays dividends in sorting the
sheep from the goats.

Alphabetic bullet detection

AscToHTM detects upper and lower case alphabetic bullets.

Roman Numeral bullet detection

AscToHTM detects upper and lower case roman numeral bullets.

Contents list generation

AscToHTM can detect the presence of a contents list in the original
document, or it can generate a contents list for you from the headings
that it observes. There are a number of policies that give you control
over how and where a contents list is generated (see Contents list policies).

There are four different situations in which contents lists may, or
may not, be generated. These are :-

Contents lists in default conversions

By default AscToHTM will not generate a contents list for a file unless
it already has one.

If it should detect a contents list in the document, then that list
is changed into hyperlinks to the named sections. This only works
currently for files with numbered headings.

Where an existing list is detected, headings shown in the contents
list are converted into links, and the link text is that in the
original contents list, and not the text in the actual heading (often
they are different).

Note:

AscToHTM currently only detects numbered contents lists, and
is occasionally prone to error when they are present. If
you experience problems, either delete the contents list
and get AscToHTM to generate one for you, or mark up the
existing list using the contents pre-processor commands
(see Pre-processor section delimiters).

If an existing contents list is present, it is deleted from the
output. Normally it's best to either use the existing contents list,
or to delete it from the source text and request a generated list.

Contents lists placement

By default the contents list is placed at the top of the output
file. In earlier versions of AscToHTM the contents list was always
placed in a separate file.

You can cause contents lists to be placed wherever you want by using
the CONTENTS_LIST preprocessor command. If you do this, then contents
lists are placed only where you place CONTENTS_LIST markers.

Generating a contents list in a separate file

If you select the Generate external contents file policy
the contents list is placed in a separate file, and a hyperlink
to that file called "Contents List" is placed at the top of the HTML
page generated from the document.

You can choose the name of the external file using the
External contents list filename policy. If omitted, the file
is called "Contents_<filename>", where <filename> is the name of
the document being converted.

Contents lists in conversions to multiple HTML files

AscToHTM can be made to split the output into many files. At present
this is only possible at detected section headings. Each generated
page usually has a navigation bar, which includes a hyperlink back
to the following section in any contents list.

In this case:

The options to generate an external contents list in a separate
file are no longer available.

If the contents list is being generated, it is now placed at the
foot of the first document, rather than at the top (unless the
CONTENTS_LIST preprocessor command is used).

This is usually before the first heading (which now starts the
second document), and after any document preamble.

Note:

Where the original contents list is used when splitting files
it is possible that not every file is directly accessible
from the contents list, and that the back links to the contents
list may not function as expected. In such cases you can
go from the contents list to a major section, and then use
the navigation bars to page through to the minor section.

Contents lists in conversions to frames

Contents list generation for the main document will proceed as described
in the previous sections.

When making a set of frames, you can elect to have a contents
frame generated (the default behaviour), and this will have a
generated list placed in a frame on the left. This can mean you
have a contents list in the contents frame on the left, and also
at the top of the first page in the main document. For this reason the
main frame often starts by displaying the second page.

The number of levels shown in the contents frame list can be controlled
by policy. Alternatively you can replace the whole contents of the
contents frame by defining a CONTENTS_FRAME HTML fragment.

Definition detection

AscToHTM will search for definitions. Definitions consist of a
definition term and then the definition description.

One-line definitions

A definition line is a single line that appears to be defining
something. Usually this is a line with either a colon (:) or an
equals sign (=) in it. For example

IMHO = In my humble opinion
Address : Somewhere over the rainbow.

AscToHTM attempts to determine what definition characters are used
and whether they are "strong" (only ever used in a definition) or
"weak" (only sometimes used in a definition).

AscToHTM marks up definition lines by placing a line break on the end
of the line to preserve the original line structure. Where this
decision is made incorrectly, unexpected breaks can appear in the text.

AscToHTM offers the option of marking up the definition term in
bold. This is not the default behaviour however.
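
One way to picture the "strong" versus "weak" distinction is the sketch below. It assumes, purely for illustration, that a line counts as a definition when the text before the character is a short term of at most `max_term_words` words and some description follows; the function name and this rule are hypothetical and are not taken from AscToHTM.

```python
def classify_definition_chars(lines, candidates=(":", "="), max_term_words=3):
    """Label each candidate character as "strong" or "weak".

    "strong": every line containing the character looks like a definition.
    "weak":   only some of the lines containing it look like definitions.
    Characters never seen in a definition-like line are left out.
    """
    result = {}
    for ch in candidates:
        total = defs = 0
        for line in lines:
            if ch not in line:
                continue
            total += 1
            term, _, desc = line.partition(ch)
            if term.strip() and desc.strip() and len(term.split()) <= max_term_words:
                defs += 1
        if defs:
            result[ch] = "strong" if defs == total else "weak"
    return result
```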

Definition paragraphs

AscToHTM also recognises the use of definition paragraphs such
as :-

Note: This is a definition paragraph whereby the whole
paragraph is defining the term shown on the first line.
Unfortunately AscToHTM currently only copes with single
paragraphs (i.e. not with continuation paragraphs), and
only with single word definitions.

AscToHTM can detect such definitions, subject to the current
limitations:

Only one-word definition terms are detected.

Only the first definition paragraph is detected. Whether or
not subsequent paragraphs are aligned correctly will depend on
the indentation policy applied to them.

These limitations will hopefully be removed in later versions.

Where definition paragraphs are detected the definition can be marked
up in <DL> ... <DT>..</DT> <DD>..</DD> </DL> and (optionally) can have
the definition term highlighted in <B> ... </B> markup.

Text formatting

In addition to various types of formatted text layouts, the software
can detect a number of special types of text formatting, including the
following.

If the line appears centred (and meets a few other conditions) then it will
be rendered centred in the output.

This option is normally left switched off, as it is fairly prone to errors, not
least because the calculation is sensitive to getting the page width calculation
correct. When it goes wrong you are liable to find the document centres lines
that shouldn't be.

Quoted line detection

AscToHTM recognises that, especially in Internet files, it is
increasingly common to quote from other text sources such as e-mail.
The convention used in such cases is to insert a quote character
such as ">" at the start of each line.

Consequently, AscToHTM adds a line break at the end of such lines to
preserve the line structure of the original, and marks them up in
italics to differentiate the quoted text.
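
A minimal sketch of this behaviour follows, assuming ">" as the only quote character and <I> and <BR> as the tags used. The helper name `mark_up_quoted` is invented for the example.

```python
import html


def mark_up_quoted(lines, quote_char=">"):
    """Wrap quoted lines in italics and end each with a line break.

    Every line is HTML-escaped, so the ">" itself becomes "&gt;".
    Lines that do not start with the quote character pass through
    with escaping only.
    """
    out = []
    for line in lines:
        text = html.escape(line)
        if line.lstrip().startswith(quote_char):
            out.append("<I>" + text + "</I><BR>")
        else:
            out.append(text)
    return out
```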

Emphasis detection

AscToHTM can look for text emphasised by placing asterisks (*) or
underscores (_) either side of it. AscToHTM will convert the enclosed
text to bold and italic respectively, using bold and italic tags.

AscToHTM will also look for combinations of asterisks and underscores
which will be placed in bold italic. The asterisks
and underscores should be properly nested.

The emphasised word or phrase should span no more than a few lines, and
in particular should not span a blank line. If the phrase is longer,
or if AscToHTM fails to match opening and closing emphasis marks, the
characters are left unconverted.

Tests are made to ignore double asterisks and underscores, and sometimes
adjacent punctuation will prevent the text being marked up.

Only markup that occurs in matched pairs over 2-3 lines will
be converted, so _this and that* won't be converted.

For example the following two paragraphs :-

Here are *bold* and _italic_ words. The phrase _The Guardian Newspaper_
would appear in italics. The words *_this_* and _*that*_ would appear in
bold italics.

The program can cope with phrases such as _Alice in
Wonderland_ which span more than one line.

Becomes

Here are bold and italic words. The phrase The
Guardian Newspaper
would appear in italics. The words this and
that would appear in
bold italics.

The program can cope with phrases such as Alice in
Wonderland which span more than one line.
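
The matching rules above can be approximated with regular expressions, as in the following sketch. The 200-character limit stands in for "a few lines" and, like the names `_INNER`, `_PATTERNS` and `convert_emphasis`, is an assumption of this example rather than a description of AscToHTM's own tests.

```python
import re

# Emphasised text: no '*' or '_' inside, never crosses a blank line, does not
# start or end with white space, and is at most 200 characters long.
_INNER = r"(?!\s)((?:(?!\n\s*\n)[^*_]){1,200}?)(?<!\s)"
_BEFORE = r"(?<![\w*_])"   # the opening mark must not follow a word, '*' or '_'
_AFTER = r"(?![\w*_])"     # the closing mark must not precede one either

_PATTERNS = [
    (re.compile(_BEFORE + r"\*_" + _INNER + r"_\*" + _AFTER), r"<B><I>\1</I></B>"),
    (re.compile(_BEFORE + r"_\*" + _INNER + r"\*_" + _AFTER), r"<B><I>\1</I></B>"),
    (re.compile(_BEFORE + r"\*" + _INNER + r"\*" + _AFTER), r"<B>\1</B>"),
    (re.compile(_BEFORE + r"_" + _INNER + r"_" + _AFTER), r"<I>\1</I>"),
]


def convert_emphasis(text):
    """Convert *bold*, _italic_ and nested *_bold italic_* marks to tags.

    Unmatched or mismatched marks (for example "_this and that*") and
    doubled marks (for example "**") are left unconverted.
    """
    for pattern, replacement in _PATTERNS:
        text = pattern.sub(replacement, text)
    return text
```

The nested forms are tried first so that `*_this_*` is not half-consumed by the plain bold or italic patterns.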

Unix emphasis character detection

AscToHTM also tries to handle use of Ctrl-H in Unix documents.
In such documents Ctrl-H can be used to overstrike characters.
Common effects are double printing and underlining. Where
detected AscToHTM will use bold and underlining markup.

Examples could include:-

The word this^H^H^H^H____ is underlined.
The word that^H^H^H^Hthat is bold (overwritten twice).
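
To show how overstrikes of this kind can be interpreted, the sketch below replays a line column by column, treating Ctrl-H (the backspace character, `"\b"` in Python) as moving the print position one column left. The function name `convert_overstrikes` and the choice of <U> and <B> tags are assumptions; the sketch does not escape HTML special characters.

```python
from itertools import groupby


def convert_overstrikes(line):
    """Turn backspace overstrikes into <U> and <B> markup.

    A column printed with both a character and '_' is underlined; a
    column printed twice with the same character is bold.
    """
    cells = []  # cells[i] holds every character printed at column i
    col = 0
    for ch in line:
        if ch == "\b":
            col = max(col - 1, 0)
            continue
        if col == len(cells):
            cells.append([])
        cells[col].append(ch)
        col += 1

    styled = []
    for printed in cells:
        glyphs = [c for c in printed if c != "_"] or ["_"]
        ch = glyphs[-1]
        if ch != "_" and "_" in printed:
            style = "U"
        elif len(glyphs) > 1 and glyphs[-1] == glyphs[-2]:
            style = "B"
        else:
            style = None
        styled.append((style, ch))

    # Merge neighbouring columns that share a style into one tagged run.
    parts = []
    for style, run in groupby(styled, key=lambda cell: cell[0]):
        text = "".join(ch for _, ch in run)
        parts.append(f"<{style}>{text}</{style}>" if style else text)
    return "".join(parts)
```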

Adding hyperlinks

You can control which features get hyperlinks added by modifying the
available hyperlink policies.

Contents List detection

AscToHTM can detect the presence of a contents list in the original
document, or it can generate a contents list for you from the headings
that it observes. There are a number of policies that give you control
over how and where a contents list is generated (see Contents list policies).

AscToHTM will attempt to detect a contents list in a number of ways :

By detecting "table of contents", "end contents" or something similar
in the text.

By spotting the numbering sequence has been repeated twice. AscToHTM will
assume the first set is the contents list.
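
The second approach, spotting a numbering sequence that occurs twice, could be sketched as follows. The regular expression for a leading section number and the function name `find_contents_lines` are assumptions for this example, and the check is far simpler than a real analysis pass would need to be.

```python
import re

# A leading section number such as "2", "2." or "2.1" followed by some text.
_NUMBERED = re.compile(r"^\s*(\d+(?:\.\d+)*)\.?\s+\S")


def find_contents_lines(lines):
    """Return the indexes of lines that look like contents list entries.

    If at least two section numbers each occur exactly twice, and every
    first occurrence comes before every second occurrence, the first set
    is taken to be the contents list. Otherwise nothing is returned.
    """
    seen = {}
    for i, line in enumerate(lines):
        match = _NUMBERED.match(line)
        if match:
            seen.setdefault(match.group(1), []).append(i)
    repeated = [idx for idx in seen.values() if len(idx) == 2]
    if len(repeated) < 2:
        return []
    firsts = sorted(idx[0] for idx in repeated)
    seconds = sorted(idx[1] for idx in repeated)
    return firsts if firsts[-1] < seconds[0] else []
```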

Cross-reference detection

AscToHTM can convert cross-references to other sections into hyperlinks
to those sections. Unfortunately this is currently only possible for
second, third, fourth... level numeric headings (n.n, n.n.n, n.n.n.n etc).

This is because the error rate becomes too high on single numbers/letters
or roman numerals. This may be refined in future releases, although
it's hard to see how that would work.

It is possible to use AscToHTM tags though, for example the
GOTO command and POPUP command can create links
to named sections.

URL detection

AscToHTM can convert any URLs in the document to hyperlinks. This
includes http and FTP URLs and any web addresses beginning with
www.

The domain name part of the URL will be checked against the
known domain name structures and country codes to check it falls
within an allowed group. So www.somewhere.thing won't be allowed as
".thing" isn't a proper top level domain.

URLs that use IP addresses or some more obscure methods of specifying
domain names will also be recognised, but the link will be changed
wherever possible to either a domain name or an IP address. This will
de-obfuscate any obscure references so beloved by spammers.
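
The domain check can be illustrated with the short sketch below. The set `KNOWN_TLDS` is a deliberately tiny, made-up subset, and `link_urls` is a hypothetical helper; AscToHTM's real list of domain structures and its handling of IP addresses are not reproduced here.

```python
import re
from urllib.parse import urlsplit

# Illustrative subset only; a real list would be much longer.
KNOWN_TLDS = {"com", "org", "net", "edu", "gov", "uk", "de", "fr"}

_URL = re.compile(r"\b(?:(?:https?|ftp)://|www\.)[^\s<>\"']+", re.IGNORECASE)


def link_urls(text):
    """Wrap http, ftp and www. addresses in <A HREF> tags.

    A candidate is converted only if its host name ends in a top level
    domain from KNOWN_TLDS; anything else is left as plain text.
    """
    def replace(match):
        raw = match.group(0).rstrip(".,;:!?)")   # drop sentence punctuation
        trailing = match.group(0)[len(raw):]
        href = raw if "://" in raw else "http://" + raw
        try:
            host = urlsplit(href).hostname or ""
        except ValueError:
            return match.group(0)
        tld = host.rsplit(".", 1)[-1].lower()
        if tld not in KNOWN_TLDS:
            return match.group(0)
        return f'<A HREF="{href}">{raw}</A>{trailing}'

    return _URL.sub(replace, text)
```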

Usenet Newsgroup detection

AscToHTM can convert any newsgroup names it spots into hyperlinks to
those newsgroups. Because this is prone to error, AscToHTM currently
only converts newsgroups in known USENET hierarchies such as
rec.gardens by default.

E-mail address detection

AscToHTM can convert any email addresses into hypertext mailto: links.
As with URL detection, the domain name is checked to see it falls into
a recognised group.

User-specified keywords

AscToHTM can convert user-specified keywords into hyperlinks. The words
or phrase to be converted must lie on a single line in the source
document. Care should be taken to ensure keywords are unambiguous.
Normally I mark my keywords in [] brackets if authoring for conversion
by AscToHTM.

Headings and section titles

AscToHTM recognises various types of headings. Where headings are found,
and deemed to be consistent with the prevailing document policy (correct
indentation, right type, in numerical sequence etc), AscToHTM will use
the standard "Heading n" styles.

In addition to this, AscToHTM will insert a named bookmark to allow
hyperlink jumps to this point. These bookmarks are used for example
in any cross-reference hyperlinks that AscToHTM generates, and
also by any GOTO tags.

Numbered heading detection

Sections of type N.N.N can be checked for consistency, and
references to them can be spotted and converted into hyperlinks.

At present more exotic numbering schemes using roman numerals and
letters of the alphabet are not fully supported.

Capitalised heading detection

AscToHTM can treat wholly capitalised lines as headings. It also
allows for such headings to be spread over more than one line.

Underlined heading detection

AscToHTM can recognize underlined text (e.g. a row of minus signs),
and optionally promote the preceding line to be a section header.

The "underlining" line should have no gaps in it, and should be a
similar length to the preceding heading. If these conditions aren't
met you'll probably get a horizontal rule instead.

If you're authoring a file from scratch, it is probably best to use
underlined headings for ease of use.

The level of heading associated with an underlined heading depends on the
underline character as follows:-

Underline characters      Heading level

'****'                    level 1
'====','////'             level 2
'----','____','~~~~'      level 3
'....'                    level 4

The actual markup that each heading gets may depend on your policies.
In particular level 3 and level 4 headings may be given the same size
markup to prevent the level 4 heading becoming smaller than the text it
is heading. However the logical difference will be maintained, e.g.
in a generated contents list, or when choosing the level of heading at which
to split large files into many HTML pages.
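
The underline rules for this kind of heading (no gaps, similar length, level chosen by character) can be expressed as a small lookup, sketched below. The `tolerance` of two characters for "similar length" is an assumption, as are the names `UNDERLINE_LEVELS` and `underlined_heading_level`.

```python
# Heading level for each underline character, as listed in this section.
UNDERLINE_LEVELS = {"*": 1, "=": 2, "/": 2, "-": 3, "_": 3, "~": 3, ".": 4}


def underlined_heading_level(heading, underline, tolerance=2):
    """Return the heading level implied by an underline, or None.

    The underline must be a gap-free run of one recognised character and
    its length must be within `tolerance` characters of the heading text.
    """
    text = heading.strip()
    rule = underline.strip()
    if not text or not rule:
        return None
    ch = rule[0]
    if ch not in UNDERLINE_LEVELS or rule != ch * len(rule):
        return None
    if abs(len(rule) - len(text)) > tolerance:
        return None
    return UNDERLINE_LEVELS[ch]
```

A result of `None` corresponds to the cases where the line would more likely be treated as a horizontal rule or ordinary text.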

Embedded heading detection

The program can look for headings "embedded" in the first paragraph.
Such headings are expected to be a complete sentence or phrase in
UPPER CASE at the start of a paragraph. Where detected the heading
will be marked up in bold, rather than <Hn> markup, although it will
still be added to, and accessible from any hyperlinked contents list
you generate for the document.

Key phrase headings

The program can now look for lines that start with particular words
or phrases (such as "Chapter", "Part", "Title") of your choice and
treat these lines as headings. Previously this only worked in a
limited way if the heading line was also numbered ("Chapter 1" etc).

Numbered paragraph detection

Some types of documents use what look like section numbers to number
paragraphs (e.g. legal documents, or sets of rules).

AscToHTM can recognize this, and mark up such lines by placing
the number in bold, and not using the "Heading n" style on the whole
line.

Mail and USENET headers

Some documents, especially those that were originally email or USENET
posts, come with header lines, usually in the form of a number of
lines with a keyword followed by a colon and then some value.

AscToHTM can recognize these (to a limited extent). Where these are
detected the program will parse the header lines to extract the Subject,
Author and Date of the article concerned. A heading containing this
information will then be generated to replace all the unsightly header
lines.
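
Python's standard `email` package can be used to illustrate the idea of replacing raw header lines with a single generated heading. The function `summarise_headers`, the <H2> layout and the placeholder texts are assumptions made for this sketch and do not reflect the markup AscToHTM actually generates.

```python
import html
from email import message_from_string


def summarise_headers(raw):
    """Replace mail or USENET header lines with one generated heading.

    Returns (heading_html, body_text). Only Subject, From and Date are
    used; every other header line is simply dropped.
    """
    msg = message_from_string(raw)
    subject = html.escape(msg.get("Subject", "(no subject)"))
    author = html.escape(msg.get("From", "(unknown author)"))
    date = html.escape(msg.get("Date", "(no date)"))
    heading = f"<H2>{subject}</H2>\n<P><I>{author}, {date}</I></P>"
    return heading, msg.get_payload()
```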

Heading policies in a policy file

AscToHTM has the following section heading policies that will normally be
correctly calculated on the analysis pass :-

AscToHTM will read in such lines from a policy text file, but does not
yet fully support editing these via the Windows interface.

The syntax is explained below, but this will probably change in future
releases. You can edit these lines in your policy file, and through
the policy options in Windows.

The lines are currently structured as follows

Line component   Value

xxxx             Either "Heading" or "Contents" according
                 to the part of the policy being described.

Level n          Level number, starting at 0 for chapters,
                 1 for level 1 headings etc.

"Some_word"      Any text that may be expected to occur before
                 the heading number. E.g. "Chapter" or "Section"
                 or "[". The case is unimportant.

N.Nx             The style of the heading number. This will
                 ultimately (in later versions) be read
                 as a series of number/separator pairs.

                 The proposed format is
                   "N" = number
                   "i" / "I" = lower/upper case roman numeral
                 with an 'x' at the end signalling that trailing
                 letters may be expected (e.g. 5.6a, 5.6b)

at indent n      The indentation that this heading is expected
                 at. This is important in helping to eliminate
                 false candidates.

Pre-formatted text

The software can detect various forms of pre-formatted text. This
is text laid out in such a way that the spacing used is critical.
Spacing is not normally preserved in conversion to HTML, so the
correct detection and handling of these special types of text is
quite important.

You can adjust the sensitivity of AscToHTM to pre-formatted text by
setting the minimum number of lines required for a pre-formatted region
using the Minimum automatic <PRE> size policy.

HTML ignores all white space in the source document, thus any hand-crafted
layout information would normally get lost. When AscToHTM detects
such regions it marks them up in fixed width font which tells HTML
this region is pre-formatted.

When tables are detected, AscToHTM will attempt to generate the correct
HTML table.

When AscToHTM gets the detection wrong you can use the AscToHTM
pre-processor to mark up regions of your document
you wish preserved.

Table detection

Tables are marked out by their use of white space, and a regular pattern
of gaps or vertical bars being spotted on each line. AscToHTM will
attempt to spot the table, its columns, its headings, its cell alignment
and entries that span multiple columns or rows.
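
The "regular pattern of gaps" can be pictured as columns that are blank on every line of the candidate region. The sketch below finds such gaps; the name `column_gaps` is hypothetical, and real table detection (headings, alignment, spanning cells, vertical bars) involves much more than this.

```python
def column_gaps(lines):
    """Return (start, end) column ranges that are blank on every line.

    Blank lines are ignored. Each range is half-open, so (5, 7) means
    columns 5 and 6 are spaces in all rows; such gaps suggest where one
    table column ends and the next begins.
    """
    rows = [line.rstrip() for line in lines if line.strip()]
    if not rows:
        return []
    width = max(len(row) for row in rows)
    padded = [row.ljust(width) for row in rows]
    blank = [all(row[i] == " " for row in padded) for i in range(width)]

    gaps, start = [], None
    for i, is_blank in enumerate(blank):
        if is_blank and start is None:
            start = i
        elif not is_blank and start is not None:
            gaps.append((start, i))
            start = None
    return gaps
```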

Should AscToHTM wrongly detect the extent of a table, you can mark up
a section of text by using the TABLE pre-processor
markup (see the Tag manual). Alternatively you can try adding
blank lines before and after, as the analysis uses white space to
delimit tables.

Code sample detection

AscToHTM attempts to recognize code fragments in technical documents.
The code is assumed to be "C++" or "Java"-like, and key indicators
are, for example, the presence of ";" characters on the end of lines.
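
A very small heuristic in this spirit is sketched below. Only the ";" indicator comes from the text above; counting lines that end in "{" or "}" and the 50% `threshold` are assumptions added for the example, and `looks_like_code` is an invented name.

```python
def looks_like_code(lines, threshold=0.5):
    """Guess whether a block of lines is a C++ or Java-like code fragment.

    The block is treated as code when at least `threshold` of its
    non-blank lines end in ';', '{' or '}'.
    """
    non_blank = [line.rstrip() for line in lines if line.strip()]
    if not non_blank:
        return False
    hits = sum(1 for line in non_blank if line.endswith((";", "{", "}")))
    return hits / len(non_blank) >= threshold
```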

Should AscToHTM wrongly detect the extent of a code fragment, you can
mark up a section of text by using the CODE pre-processor
markup.

ASCII art and diagram detection

AscToHTM attempts to recognize ASCII art and diagrams in documents.
Key indicators include large numbers of non-alphanumeric characters
and the use of white space.

However, some diagrams use the same mix of line and alphabetic
characters as tables, so the two sometimes get confused.

Should AscToHTM wrongly detect the extent or type of a diagram,
you can mark up a section of text by using the DIAGRAM
pre-processor markup.

Text block detection

(New in version 5.0)

If AscToHTM detects a block of text at a large indent, it will now
place that text in such a way as to preserve as faithfully as
possible the original indent.

Other formatted text

If AscToHTM detects formatted text, but decides that it is neither
table, code nor art (and it knows what it likes), then the text may be
put out "as normal", but with the original line structure preserved.

In such regions other markup (such as bullets) may not be processed
as it would be elsewhere.