[September 18, 2001] A research project coordinated through the Tokyo Cyber Assist Research Center has developed an XML vocabulary and DTD for linguistic annotation of web documents. The Global Document Annotation Initiative research team has proposed this XML-based tag set to help computing machines "automatically infer the underlying semantic/pragmatic structure of documents. The tag set is being developed so as to be easy to embed into TEI, EAGLES, and HTML vocabularies. The GDA tag set is designed so that the GDA-annotation reduces the ambiguity in mapping a document to a sort of entity-relation graph (or semantic network) representing the underlying semantic structure. The tag set does not directly encode such graphs, though it should be straightforward to encode them with RDF or related tag sets such as DAML. A chief goal of the GDA initiative is to support AI applications such as machine translation, information retrieval, information filtering, data mining, consultation, expert systems, and so on."

Features of Automatic Text Summarization: "(1) A domain/style-free algorithm using spreading activation on an intra-document network of text segments connected via syntactic, semantic, and rhetorical relations and coreference relations. (2) Linguistic manipulation such as coreferential substitution and parse-tree pruning. (3) A flexible summary generator which can dynamically generate summaries of various sizes. (4) A personalization mechanism which can reflect readers' interests and preferences."
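
The spreading-activation idea in feature (1), together with the variable summary size in feature (3), can be illustrated with a minimal sketch. This is not the GDA project's implementation: the function names, the `decay` and `iterations` parameters, the weight normalization, and the use of a single undirected weighted link type are all assumptions made for illustration. Reader preferences from feature (4) are modeled here simply as the initial `seeds` activation.

```python
from collections import defaultdict


def spread_activation(edges, seeds, decay=0.5, iterations=20):
    """Propagate activation over an undirected network of text segments.

    edges: iterable of (segment_a, segment_b, weight), weights assumed > 0.
    seeds: dict mapping segment -> initial activation (e.g. reader interest).
    All names and parameters here are hypothetical, not taken from GDA.
    """
    neighbors = defaultdict(list)
    for a, b, weight in edges:
        neighbors[a].append((b, weight))
        neighbors[b].append((a, weight))
    # Total link weight per segment, used to split its activation among links.
    out_weight = {n: sum(w for _, w in links) for n, links in neighbors.items()}

    activation = {n: seeds.get(n, 0.0) for n in neighbors}
    for _ in range(iterations):
        # Each segment keeps its seed value and receives a decayed share
        # of the activation of every segment it is linked to.
        activation = {
            n: seeds.get(n, 0.0)
            + decay * sum(activation[m] * w / out_weight[m] for m, w in links)
            for n, links in neighbors.items()
        }
    return activation


def top_segments(activation, document_order, size):
    """Pick the `size` most active segments and return them in document order."""
    ranked = sorted(document_order, key=lambda s: activation.get(s, 0.0), reverse=True)
    chosen = set(ranked[:size])
    return [s for s in document_order if s in chosen]
```

Calling `top_segments` with different `size` values on the same activation map yields summaries of different lengths without rerunning the propagation step.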

From the Introduction to the tag set:

To optimize the benefit per cost of tagging, we try to design as simple a tag set as possible which captures enough contents for practical applications. The semantic and pragmatic content of an utterance might be unlimitedly complex due to the complexity of the context. An appropriate degree of complexity of the tag set could be identified, however, because the present technology concerning natural language can effectively process only limited sorts of information. For instance, tagging for metonymy may not be very useful. The tag set should go along with the contemporary state of the art. We can refine the tag set when more detailed tags become useful as technology advances.

We do not restrict ourselves to any single NLP/AI application, but try to address as many aspects of language which seem useful in one of translation, retrieval, summarization, question answering, case-based reasoning, presentation, and so forth. Users interested in only some of these applications may want to use subsets of the tag set. In this connection, the GDA tags are almost entirely optional, as application technologies do not normally require exhaustive tagging. In fact, many relatively simple untagged sentences can be analyzed right by the current technology. So we have tried to design the GDA tag set in such a way that more minute annotation entails more information; in particular, if you do not annotate then you do not commit yourself to any specific interpretation.

The GDA tag set is not specific to any particular language, though the example passages below are mostly in English. The usage of the tags is subject to some customization for particular languages, but we want to use the same vocabulary for the sake of coordination across different languages. Of course different tagging manuals are necessary for different languages. However, we hope to design the tag set so that it is easy for you to write such a manual once you have understood the idea behind the tag set.

The tag set is not a linguistic theory. It encodes semantic and pragmatic structures of documents, remaining somewhat neutral among linguistic theories. Encoding semantic or pragmatic structures and capturing linguistic generalizations are different issues. In particular, we will sacrifice syntactic generalization very often, because syntax is not our primary concern but used as a partial aid for encoding semantics and pragmatics. This could be justified because people probably have better intuition about semantics and pragmatics than about syntax. Of course linguistic theories are very helpful in designing the tag set, but what is important is that the tag set can represent the semantic and pragmatic structure of a wide range of documents, but not that it captures linguistic generalizations. Needless to say, we will attempt to capture as much linguistic generalization as possible as far as we do not sacrifice clarity and ease of encoding semantic and pragmatic structures.

In principle, null annotation entails no information in the GDA tag set. This is to allow partial annotation. In particular, nothing is meant by the absence of a tag or an attribute. For instance, lack of specification of the scope of a plural noun phrase (as an alleged quantifier) does not mean that the noun phrase has no scope. The GDA tag set sometimes allows you to entail some information by lack of annotation, but the tag set is designed in such a way that you should be aware that you are meaning something with null annotation in such cases.
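
One way a consuming application might honor this principle is to keep "not annotated" distinct from any explicit value. The sketch below is an illustration only: the `np` element and `scope` attribute are hypothetical stand-ins, not names confirmed from the GDA DTD, and `scope_claims` is an invented helper.

```python
import xml.etree.ElementTree as ET

# Sentinel for "the annotator made no claim"; it must never be read as "no scope".
UNSPECIFIED = None


def scope_claims(xml_text):
    """Return (phrase text, scope) pairs for every hypothetical <np> element.

    A missing `scope` attribute is reported as UNSPECIFIED rather than being
    replaced by a default, so downstream code cannot mistake silence for a claim.
    """
    root = ET.fromstring(xml_text)
    return [(np.text, np.attrib.get("scope", UNSPECIFIED)) for np in root.iter("np")]


# Hypothetical, partially annotated sentence: only the first phrase carries a scope.
sample = '<s><np scope="wide">all students</np> read <np>two books</np></s>'

for phrase, scope in scope_claims(sample):
    if scope is UNSPECIFIED:
        print(f"{phrase}: scope left open by the annotator")
    else:
        print(f"{phrase}: scope annotated as {scope}")
```

The design choice is to avoid default attribute values in the reader, so that adding annotation later can only add information and never contradicts an interpretation the application silently assumed.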