NLP-progress

Repository to track the progress in Natural Language Processing (NLP), including the datasets and the current state-of-the-art for the most common NLP tasks.

Temporal Processing

Document Dating (Time-stamping)

Document Dating is the problem of automatically predicting the date of a document based on its content. The date of a document, also referred to as the Document Creation Time (DCT), is at the core of many important tasks, such as information retrieval, temporal reasoning, text summarization, event detection, and analysis of historical text.

For example, in the following document, the correct creation year is 1999. This can be inferred from the presence of the terms "1995" and "Four years after".

Swiss adopted that form of taxation in 1995. The concession was approved by the govt last September. Four years after, the IOC….
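
The inference in this example can be sketched as simple arithmetic: take an explicit year as the anchor and add the offset from a relative expression. The snippet below is only a toy heuristic under that assumption, not a method from any published document dating system; the function name `guess_creation_year` and the small number-word lexicon are hypothetical.

```python
import re

# Small illustrative lexicon of number words (assumption: not exhaustive).
NUMBER_WORDS = {"one": 1, "two": 2, "three": 3, "four": 4, "five": 5,
                "six": 6, "seven": 7, "eight": 8, "nine": 9, "ten": 10}


def guess_creation_year(text):
    """Toy heuristic: latest explicit year plus an 'N years after' offset."""
    years = [int(y) for y in re.findall(r"\b(1[89]\d{2}|20\d{2})\b", text)]
    if not years:
        return None  # nothing to anchor on
    anchor = max(years)
    match = re.search(r"\b(\w+) years? (?:after|later)\b", text, flags=re.IGNORECASE)
    if match is None:
        return anchor
    word = match.group(1).lower()
    offset = NUMBER_WORDS.get(word)
    if offset is None and word.isdigit():
        offset = int(word)
    return anchor + (offset or 0)


doc = ("Swiss adopted that form of taxation in 1995. The concession was "
       "approved by the govt last September. Four years after, the IOC...")
print(guess_creation_year(doc))  # 1999
```

Real systems generally learn from much richer evidence than a single anchor year; the sketch only makes the reasoning behind the example explicit.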

Temporal Information Extraction

Temporal information extraction is the identification of chunks or tokens corresponding to temporal intervals, and the extraction and determination of the temporal relations between them. The entities extracted may be temporal expressions (timexes), eventualities (events), or auxiliary signals that support the interpretation of an entity or relation. Relations may be temporal links (tlinks), which describe the order of events and times; subordinate links (slinks), which describe modality and other subordinative activity; or aspectual links (alinks), which capture the influence of aspect on event structure.

The markup scheme used for temporal information extraction is described in detail in the ISO-TimeML standard, and also at www.timeml.org.

<?xml version="1.0" ?>
<TimeML xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:noNamespaceSchemaLocation="http://timeml.org/timeMLdocs/TimeML_1.2.1.xsd">
<TEXT>
PRI20001020.2000.0127
NEWS STORY
<TIMEX3 tid="t0" type="TIME" value="2000-10-20T20:02:07.85">10/20/2000 20:02:07.85</TIMEX3>
The Navy has changed its account of the attack on the USS Cole in Yemen.
Officials <TIMEX3 tid="t1" type="DATE" value="PRESENT_REF" temporalFunction="true" anchorTimeID="t0">now</TIMEX3> say the ship was hit <TIMEX3 tid="t2" type="DURATION" value="PT2H">nearly two hours </TIMEX3>after it had docked.
Initially the Navy said the explosion occurred while several boats were helping
the ship to tie up. The change raises new questions about how the attackers
were able to get past the Navy security.
<TIMEX3 tid="t3" type="TIME" value="2000-10-20T20:02:28.05">10/20/2000 20:02:28.05</TIMEX3>
<TLINK timeID="t2" relatedToTime="t0" relType="BEFORE"/>
</TEXT>
</TimeML>
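
As a rough illustration of how such markup can be read programmatically, the sketch below pulls the TIMEX3 and TLINK elements out of a shortened, well-formed copy of the fragment above using Python's standard `xml.etree.ElementTree`. The helper names `extract_timexes` and `extract_tlinks` are hypothetical, and the sample omits the namespace attributes for brevity.

```python
import xml.etree.ElementTree as ET

# Shortened copy of the fragment above (assumption: namespace attributes dropped).
SAMPLE = """<TimeML><TEXT>
<TIMEX3 tid="t0" type="TIME" value="2000-10-20T20:02:07.85">10/20/2000 20:02:07.85</TIMEX3>
Officials <TIMEX3 tid="t1" type="DATE" value="PRESENT_REF" temporalFunction="true" anchorTimeID="t0">now</TIMEX3> say the ship was hit <TIMEX3 tid="t2" type="DURATION" value="PT2H">nearly two hours </TIMEX3>after it had docked.
<TLINK timeID="t2" relatedToTime="t0" relType="BEFORE"/></TEXT></TimeML>"""


def extract_timexes(xml_text):
    """Return (tid, type, value, surface text) for each TIMEX3, in document order."""
    root = ET.fromstring(xml_text)
    return [(el.get("tid"), el.get("type"), el.get("value"), (el.text or "").strip())
            for el in root.iter("TIMEX3")]


def extract_tlinks(xml_text):
    """Return (source id, relation type, target id) for each time-to-time TLINK."""
    root = ET.fromstring(xml_text)
    return [(el.get("timeID"), el.get("relType"), el.get("relatedToTime"))
            for el in root.iter("TLINK")]


for timex in extract_timexes(SAMPLE):
    print(timex)
print(extract_tlinks(SAMPLE))
```

Full TimeML documents also contain EVENT, SIGNAL, SLINK and ALINK elements, and TLINKs may relate events as well as times; this sketch reads only the attributes that appear in the fragment.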

To avoid leaking knowledge about temporal structure, train, dev and test splits must be made at document level for temporal information extraction.
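
A document-level split can be sketched as follows: group the examples by document, shuffle the document identifiers, and assign whole documents to each split. This is a minimal sketch; the function name `split_by_document`, the `(doc_id, item)` input shape, and the default fractions are assumptions made for illustration.

```python
import random
from collections import defaultdict


def split_by_document(examples, dev_frac=0.1, test_frac=0.1, seed=0):
    """examples: iterable of (doc_id, item). Returns {split name: list of examples}."""
    by_doc = defaultdict(list)
    for doc_id, item in examples:
        by_doc[doc_id].append((doc_id, item))

    # Shuffle document ids, not individual examples, so no document is split up.
    doc_ids = sorted(by_doc)
    random.Random(seed).shuffle(doc_ids)

    n_test = int(len(doc_ids) * test_frac)
    n_dev = int(len(doc_ids) * dev_frac)
    assignment = {
        "test": doc_ids[:n_test],
        "dev": doc_ids[n_test:n_test + n_dev],
        "train": doc_ids[n_test + n_dev:],
    }
    return {name: [ex for d in ids for ex in by_doc[d]]
            for name, ids in assignment.items()}


# Hypothetical data: 10 documents with 3 sentences each.
data = [(f"doc{i}", f"sentence {j}") for i in range(10) for j in range(3)]
splits = split_by_document(data)
print({name: len(items) for name, items in splits.items()})
```

Because every sentence of a document lands in the same split, relations between events and times in one document cannot be seen at training time and then tested on.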

TimeBank

TimeBank, based on the TIMEX3 standard embedded in ISO-TimeML, is a benchmark corpus containing 64K tokens of English newswire, annotated for all aspects of ISO-TimeML, including temporal expressions. TimeBank is freely distributed by the LDC: TimeBank 1.2

Evaluation covers entity chunking, attribute annotation, and temporal relation accuracy, typically measured with F1, although this metric is not sensitive to inconsistencies in the annotated relations or to free wins from interval logic induction over the whole set.
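
For reference, strict-match F1 over a set of annotations can be computed as in the sketch below. The representation of an entity as a `(doc_id, start, end)` tuple and the function name `precision_recall_f1` are assumptions for illustration, not the scoring used by any particular evaluation script. Note that the sketch compares annotations one by one and performs no temporal closure, which is exactly why plain F1 has the blind spots mentioned above.

```python
def precision_recall_f1(gold, predicted):
    """Strict-match precision, recall and F1 over two collections of annotations."""
    gold, predicted = set(gold), set(predicted)
    true_positives = len(gold & predicted)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    return precision, recall, 2 * precision * recall / (precision + recall)


# Hypothetical spans: (document id, start token, end token).
gold = {("doc1", 0, 1), ("doc1", 5, 8), ("doc2", 3, 4)}
pred = {("doc1", 0, 1), ("doc1", 5, 7), ("doc2", 3, 4), ("doc2", 9, 10)}

p, r, f1 = precision_recall_f1(gold, pred)
print(f"P={p:.3f} R={r:.3f} F1={f1:.3f}")  # P=0.500 R=0.667 F1=0.571
```

The same function applies to relations if each one is encoded as a hashable tuple such as `(source_id, relation_type, target_id)`.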

TempEval-3

The TempEval-3 corpus accompanied the TempEval-3 shared task at SemEval 2013. The task uses a timeline-based metric to assess temporal relation structure. The corpus is newer and somewhat more varied than TimeBank, though markedly smaller. TempEval-3 data

Timex normalisation

Temporal expression normalisation is the grounding of a lexicalisation of a time to a calendar date or other formal temporal representation.

Example:

10/18/2000 21:01:00.65

Dozens of Palestinians were wounded in
scattered clashes in the West Bank and Gaza Strip, Wednesday,
despite the Sharm el-Sheikh truce accord.

Chuck Rich reports on entertainment every Saturday
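
To make the grounding step concrete, the sketch below resolves the bare weekday "Wednesday" from the example against its document creation time and returns an ISO 8601 date. It is a minimal sketch under one loud assumption: a bare weekday name is taken to mean the most recent such day on or before the DCT. Real normalisers use tense and context to choose a direction, and the function name `normalise_weekday` is hypothetical.

```python
from datetime import datetime, timedelta

WEEKDAYS = ["monday", "tuesday", "wednesday", "thursday",
            "friday", "saturday", "sunday"]


def normalise_weekday(expression, dct):
    """Ground a bare weekday name to the latest matching date on or before dct.

    Assumption: the expression looks backwards in time from the DCT.
    """
    target = WEEKDAYS.index(expression.strip().lower())
    days_back = (dct.weekday() - target) % 7  # Monday is 0 in datetime.weekday()
    return (dct - timedelta(days=days_back)).date().isoformat()


# Document creation time taken from the example above.
dct = datetime.strptime("10/18/2000 21:01:00.65", "%m/%d/%Y %H:%M:%S.%f")
print(normalise_weekday("Wednesday", dct))  # 2000-10-18
```

Recurring expressions such as "every Saturday" do not ground to a single date and need a set-valued representation instead, which this sketch does not cover.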

TimeBank

TimeBank, based on the TIMEX3 standard embedded in ISO-TimeML, is a benchmark corpus containing 64K tokens of English newswire, annotated for all aspects of ISO-TimeML, including temporal expressions. TimeBank is freely distributed by the LDC: TimeBank 1.2