Sources and tagging

The texts have been automatically downloaded from the internet and
converted from HTML to SGML (TEI). The conversion programs were written
by Heiki-Jaan Kaalep.

Each file contains one law, regulation or similar document. Non-textual parts (e.g. pictures) have been omitted.
The texts often contain parts in various languages.

All rendition information (e.g. italic, bold) has been deleted.
Superscript and subscript are marked as <hi rend="sup"> and
<hi rend="sub">. Unicode entities of the form &#number; have been
converted to SGML entities.
The conversion of various forms of the Estonian letters s and z with
caron, Icelandic letters, Greek letters etc. has produced many errors.
Original HTML lists have been converted to ordinary text with numbers at
the beginning of paragraphs (if the original was a numbered list) or a
hyphen at the beginning of paragraphs (if the original was a bulleted
list). The texts contain no corrections or hyphenation. The entity
&quest; stands for symbols whose correct original form is unknown.
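
For reading the texts as plain Unicode, the entities can be decoded
again. The following is a minimal sketch, not part of the corpus tools.
It assumes that the SGML entity names used in the files (e.g. &scaron;,
&ldquo;) coincide with the HTML5 names known to Python's standard
library, which should be checked against the actual files. The function
name decode_entities and the choice of U+FFFD as a placeholder for
&quest; are illustrative assumptions.

```python
import html

# Assumption: show symbols of unknown origin as U+FFFD instead of "?".
UNKNOWN_SYMBOL = "\ufffd"


def decode_entities(text: str) -> str:
    """Decode named and numeric entities in a line of corpus text.

    &quest; is replaced first, because html.unescape would otherwise
    turn it into an ordinary question mark and the "unknown symbol"
    information would be lost.
    """
    text = text.replace("&quest;", UNKNOWN_SYMBOL)
    return html.unescape(text)


if __name__ == "__main__":
    print(decode_entities("&ldquo;&Scaron;okolaad&rdquo; &quest;"))
```

Tags such as <hi rend="sup"> are left untouched by this function; only
entities are decoded.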

The opening quotation mark is the entity &ldquo; and the closing
quotation mark is the entity &rdquo;. The division of the texts into
paragraphs follows the original HTML files exactly. Each paragraph, i.e.
one unit between <p> and </p>, is on one line. The text inside
paragraphs has been processed by a program called estyhmm; as a result,
punctuation marks are separated from word forms by a space (except
those punctuation marks that are an integral part of the token, e.g. an
abbreviation or an ordinal number) and sentences are tagged with <s>
and </s>. Apart from paragraphs and sentences, the structure of the
texts (e.g. headings, sections, signatures, appendices, footnotes) is
not tagged.
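
Because each paragraph is on one line and tokens are separated by
spaces, a paragraph can be split into sentences and tokens with very
little code. The sketch below is an illustration under that assumption
only; the function name paragraph_sentences and the sample line are
hypothetical and not taken from the corpus.

```python
import re

# Matches the content of one <s>...</s> element, non-greedily.
SENTENCE_RE = re.compile(r"<s>(.*?)</s>", re.DOTALL)


def paragraph_sentences(line: str) -> list[list[str]]:
    """Return the sentences of one paragraph line as lists of tokens.

    Tokens are obtained by splitting on whitespace, so punctuation
    marks that were separated by a space become tokens of their own,
    while an ordinal number such as "1." stays in one piece.
    Limitation: inline tags containing a space, e.g. <hi rend="sup">,
    would be split into two pieces and need separate handling.
    """
    return [sentence.split() for sentence in SENTENCE_RE.findall(line)]


if __name__ == "__main__":
    # Hypothetical sample paragraph, not an excerpt from the corpus.
    sample = "<p><s> Seadus jõustub 1. jaanuaril . </s></p>"
    for tokens in paragraph_sentences(sample):
        print(tokens)
```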

Every file starts with a <teiHeader> documenting the file
contents, size, tags used etc.
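
When only the text is needed, the header can be separated from the
rest of the file. The sketch below assumes that the header is closed by
a literal </teiHeader> tag, as is usual in TEI; this has not been
verified for these files, and the function name split_header is
illustrative.

```python
def split_header(document: str) -> tuple[str, str]:
    """Split a file's content into (header, rest).

    Assumption: the header ends with the closing tag </teiHeader>.
    If no such tag is found, the header is returned as an empty
    string and the whole input is treated as text.
    """
    end_tag = "</teiHeader>"
    position = document.find(end_tag)
    if position == -1:
        return "", document
    cut = position + len(end_tag)
    return document[:cut], document[cut:]
```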