a section uttered by a particular speaker with a reference to the speaker

time

time expressions

time@from

starting time for a stretch of time

time@to

end time for a stretch of time

time@when

time in question, normalized to the format HH:mm, e.g. 16:30

w

tag to delimit a word, used when two tokens are spelled with no space, e.g. cannot

Guidelines

Errors and spelling variation

Obvious typos and errors should be surrounded by the sic tag but not corrected. Later in lemmatization, they will receive correctly spelled lemmas. Note that British spelling is not considered an error and should not be marked up in any special way.

I know <sic>th</sic> way.
Your coat is a lovely colour.

Dates

Dates are marked up using the date element, usually with the @when attribute, in the yyyy-mm-dd format. It is possible to annotate dates fully, if they are known from context, even if the text mentions a partial date, e.g.:

On <date when="2015-05-05">Tuesday</date>

Dates may have rendering, and free standing dates are considered independent <s> units:

<s><date when="2015-05-07" rend="bold">Thursday, May 7, 2015</date></s>

Partial dates are possible, such as years or months of years: <date when="2016">2016</date> <date when="2016-03">March 2016</date>

Names for ranges of dates are supplied using @from and @to, e.g. <date from="1990" to="1999">The 90s</date>.

If a date range is given explicitly, use two tags with @when: <date when="1990">1990</date> - <date when="2000">2000</date>

Years before 0000 (i.e. BC) receive a minus sign, but still have four digits: <date when="-0128">128 BC</date>
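
The @when formats described above (yyyy, yyyy-mm, yyyy-mm-dd, with an optional leading minus sign for BC years) can be checked mechanically. The following Python sketch is one possible check. The name is_valid_when is hypothetical and not part of any existing tool, and the function only tests the shape of the value plus rough month and day ranges, not real calendar validity.

```python
import re

# Four-digit year with optional leading minus, then optional -mm and -dd parts.
WHEN_PATTERN = re.compile(r"(-?\d{4})(?:-(\d{2})(?:-(\d{2}))?)?")


def is_valid_when(value: str) -> bool:
    """Return True if value looks like yyyy, yyyy-mm or yyyy-mm-dd."""
    match = WHEN_PATTERN.fullmatch(value)
    if match is None:
        return False
    _year, month, day = match.groups()
    # Rough range checks only; this does not know about month lengths.
    if month is not None and not 1 <= int(month) <= 12:
        return False
    if day is not None and not 1 <= int(day) <= 31:
        return False
    return True
```

Under these assumptions, "2016", "2016-03", "2015-05-05" and "-0128" are accepted, while "128" and "2016-3" are rejected because they lack the required number of digits.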

Lists

Numbered lists and unnumbered bullet points are considered parts of structural markup, and both are a type of <list>.

The list element has a @type attribute to distinguish the two.

Each list item in an ordered list carries an attribute @n to designate the number. When @n is used, there is no need to make the number into a token as well: the number is considered a part of the styling, and not a token.

List items typically contain one or more paragraphs (<p> elements). Unlike headings, even if the list contains only one paragraph, a <p> element is used to distinguish its text flow (indentation, separation from rest of text), and for consistency in cases where a single list item has multiple paragraphs.

The following example illustrates markup for a numbered list:

<list type="ordered"><item n="1"><!-- the number 1 is not a token, even though it appeared in the text --><p>This is the first step</p></item></list>

Figures

Figures are surrounded by <figure> tags. Although the figures themselves are not preserved in the corpus, they can be described in the attribute @rend of the figure element. Descriptions are only made in the @rend attribute and are not added to the tokens of the text itself (for this reason, the alternative TEI method of using <figDesc> is NOT used).

Figures that have a caption enclose a <caption> element, and the caption itself contains tokens that are annotated as usual (since they actually appear in, and are part of, the text):

<figure rend="Picture of Queen Elizabeth II"><caption>The Queen in Beijing last year</caption></figure>

Figures may contain tokenizable text, e.g. if the figure is meant to be read.

<figure rend="list of suspects in the case and their mug shots">CEO - secretary - ambassador</figure>

Figures without captions or other tokenizable text are left empty, but enclosed by figure tags:

<figure rend="picture of a valley"></figure>

Pop-up image descriptions or tooltips (in HTML, attributes such as 'alt' or 'title') are not considered running tokens of the text. They may optionally be included in @rend if desired.

Values for rend

Typographical information for spans of text in hi@rend should be single words where possible, often derived from corresponding CSS vocabulary. For example, we use 'bold', 'italic', and 'large'. Multiple values are possible and should be separated by spaces:

<hi rend="bold italic large">The Big Picture</hi>
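
Because multiple hi@rend values are separated by spaces, they can be split and recombined with ordinary string operations. The Python sketch below illustrates this. The helper names rend_values and join_rend are hypothetical, and the sketch applies only to the single-word typographical values of hi@rend, not to the free-text descriptions used in figure@rend.

```python
def rend_values(rend: str) -> list[str]:
    """Split a space-separated @rend string into its individual values."""
    return rend.split()


def join_rend(values: list[str]) -> str:
    """Join values into one @rend string, dropping duplicates but keeping order."""
    return " ".join(dict.fromkeys(values))
```

For instance, rend_values("bold italic large") yields the three values 'bold', 'italic' and 'large', and join_rend turns such a list back into a single attribute value.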

Quotation marks

Literal quotes are surrounded by <quote> tags, regardless of whether quotation marks are used. Other uses of quotation marks are surrounded by <q>. Compare the following two uses:

Caesar said <quote>veni, vidi, vici</quote>. You could say that was his <q>"motto"</q>.

Footnotes

Footnotes with running text (not bibliographical references realized using numbers hyperlinked to the bibliography) are placed immediately after the paragraph that contains the numbered reference. The number is surrounded by ref tags, and the note is enclosed in note:

<p>
Some long text.<ref>1</ref> Paragraph continues. At the end of this paragraph we'll insert the note.
</p><note place="foot" n="1">This is the footnote, which physically appeared at the bottom of the page, which was the middle of the next paragraph.</note><p>
Next paragraph. This one is split across pages, but the footnote does not appear in the middle of it, even though it was there graphically.
</p>
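
The pairing of ref numbers with footnotes described above lends itself to a simple consistency check. The Python sketch below uses the standard library's xml.etree.ElementTree to list ref numbers that have no matching note. The function name refs_without_notes and the dummy <text> wrapper element are assumptions made for illustration, and the sketch assumes that ref elements in the fragment are used only for footnote numbers.

```python
import xml.etree.ElementTree as ET


def refs_without_notes(xml_fragment: str) -> set[str]:
    """Return the ref numbers that have no <note place="foot"> with the same @n."""
    # Wrap the fragment in a dummy root so several top-level elements parse.
    root = ET.fromstring(f"<text>{xml_fragment}</text>")
    ref_numbers = {(ref.text or "").strip() for ref in root.iter("ref")}
    note_numbers = {
        note.get("n") for note in root.iter("note") if note.get("place") == "foot"
    }
    return ref_numbers - note_numbers


fragment = (
    "<p>Some long text.<ref>1</ref> Paragraph continues.</p>"
    '<note place="foot" n="1">This is the footnote.</note>'
    "<p>Next paragraph.</p>"
)
print(refs_without_notes(fragment))  # prints set(), since ref 1 has its note
```

An empty result means every numbered reference in the fragment has a footnote; the check does not verify that the note directly follows the right paragraph.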

Reference to deleted speakers

If a deleted comment on Reddit is not replied to within the context included in the document, it may be ignored. However, if the comment is part of a broken thread of responses, its existence can be encoded using an empty sp tag with the speaker set to DELETED, which can then be referred to in the reply:

<sp who="#DELETED"/><sp who="#kim" whom="#DELETED">
I agree with you.
</sp>

Reference to multiple speakers

If two characters in a work of fiction say the same thing at the same time, tag both speakers in alphabetical order, separated by a comma (without a space), in the sp@who attribute:

<p><sp who="#Fairy,#Narrator">
“No!”
</sp>
we both said at once.
</p>
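
The rule for the sp@who value (alphabetical order, comma-separated, no spaces) can be applied automatically when generating markup. The Python sketch below is one way to do so. The function name who_value is hypothetical, speaker IDs are assumed to be passed without the leading '#', and sorting case-insensitively is an assumption, since the guideline only says "alphabetical".

```python
def who_value(speakers: list[str]) -> str:
    """Build an sp@who value: '#'-prefixed IDs, alphabetical, joined by commas."""
    # Sort case-insensitively, falling back to the raw name to break ties.
    ordered = sorted(set(speakers), key=lambda name: (name.lower(), name))
    return ",".join(f"#{name}" for name in ordered)
```

With these assumptions, who_value(["Narrator", "Fairy"]) produces the value used in the example above, "#Fairy,#Narrator".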

Tokens with no intervening spaces

Some tokens that are spelled together cannot be trivially recognized as such after tokenization. Whereas n't or 'll are easy, can + not can be spelled can not or cannot. To mark the latter case, we wrap cannot in the tag <w> (for 'word'):