Both the Link Grammar parser and the RelEx parser generate this format. The Link Grammar parser performs syntactic parsing of input sentences. The RelEx parser post-processes the Link Grammar output to simplify its structure and to add further markup, such as coarse-grained parts of speech and word lemmas.

Usage

This output format has been in use by the OpenCog NLP subsystem since March 2008, and a large variety of code, written in Java, Perl, C++ and Scheme, assumes this particular format. Please do not invent a new format, and please avoid modifying this one if at all possible. Note that not all of the code that uses this format is in the OpenCog source; some is in the RelEx package, and some is in the LexAt package.

There are four primary ways of generating this output:

Using the LgParseLink atom to invoke the Link Grammar parser. This is currently the recommended way of generating this format and placing it into the AtomSpace.

Using the link-grammar-server.sh shell script, which provides a network server that converts Link Grammar parses into the format described on this page.

Using the opencog-server.sh shell script, which provides a network server that performs Link Grammar parsing, followed by RelEx processing. This server will wait for TCP/IP connections, parse input sentences on the fly, and return the output documented below. The actual formatting is done in the src/java/relex/output/OpenCogScheme.java file.

Using the RelEx src/perl/cff-to-opencog.pl perl script. This is a batch processing script; it will take previously parsed text, stored in the CFF format, and turn it into the format documented here. This is ideal for bulk processing of things like Wikipedia pages, since the high cost of parsing is paid just once, and converting from CFF to OpenCog is very fast.

Note that while new MindAgents could use the OpenCog hypergraphs described here directly, it probably makes more sense to run input text through the triples, seme and relex-to-frame processing code first, in order to get slightly more abstract and manageable hypergraphs. On the other hand, as of May 2009, the triples, seme and relex-to-frame code is under construction and only partially functioning. Caveat emptor.

Components

The representation of parsed text in OpenCog introduces a number of new Node and Link types. In principle, new node and link types are not really needed; however, by introducing these, it becomes a lot easier to traverse the hypergraph of a parsed sentence, and find the needed/desired information. In addition, the hypergraph representing a parse becomes much smaller.

Most links are given a SimpleTruthValue with a strength of 1.0 and a confidence of 1.0. The ParseNode is given a SimpleTruthValue with a strength of 1.0 but a smaller confidence, as assigned by a simple parse-ranking algorithm.

A fully parsed sentence, "Humans have two feet", is given at the bottom, with examples taken from that parse.

SentenceNode, ParseNode, ParseLink

A SentenceNode serves as an anchor for parses associated with a particular sentence. There is only one SentenceNode per sentence. For example:

(SentenceNode "sentence@2d19c7e7-2e02-4d5e-9cbe-6772174f3f4d")

The name of the sentence node is a string meant to uniquely identify this sentence; it has no other meaning. Currently, it has the form sentence@UUID, where the UUID is a 128-bit MD5 hash. The UUID is this large in order to avoid birthday-paradox collisions when tagging items, and it is printed in ASCII in order to make it human-readable (and grep-able, etc.).

A ParseNode serves as an anchor for word instances associated with a particular parse. The ParseNode has a SimpleTruthValue associated with it that provides the parse ranking for that parse. It is expressed as a numerical value for the confidence of that parse. For example:
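As a hedged illustration of the parse anchor described above, the fragment below sketches how a ParseNode and its ParseLink might look for the sentence node shown earlier. The `_parse_0` suffix, the confidence value, the `stv` shorthand for a SimpleTruthValue and the order of the atoms inside the ParseLink are all assumptions made for illustration; none of them is confirmed by the text on this page.

```scheme
; Illustrative sketch only; assumes the OpenCog NLP atom types are loaded.
; The "_parse_0" suffix and the confidence of 0.9417 are hypothetical.
(ParseNode "sentence@2d19c7e7-2e02-4d5e-9cbe-6772174f3f4d_parse_0"
   (stv 1.0 0.9417))

; Anchor the parse to its sentence (argument order is an assumption).
(ParseLink (stv 1.0 1.0)
   (ParseNode "sentence@2d19c7e7-2e02-4d5e-9cbe-6772174f3f4d_parse_0")
   (SentenceNode "sentence@2d19c7e7-2e02-4d5e-9cbe-6772174f3f4d"))
```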

Note that there may be multiple parses for any given sentence. The parse-node name is numbered, from most-likely to least-likely parse; this is purely for debugging convenience, and should not be relied upon by any code.

It might someday be useful to group sentences together into paragraphs, or documents into collections, or to indicate embedded media (tables, graphics, footnotes, margin notes, sounds, movies, etc.). At this time, there is no such markup defined. See, however, the file seme/README for notes on how input sentences are tagged with the name of a speaker during IRC chatbot sessions, and/or other text sources.

WordInstanceNode, WordInstanceLink, WordSequenceLink

Word instances are unique, individual instances of words occurring in a given parse. The WordInstanceNode is used to represent these. For example:

(WordInstanceNode "feet@bf71826c-487e-42df-a941-0ecd3c942a76")

These are created with unique names because the same word may occur multiple times within one sentence, or one document, and one must be able to tell them apart. Word instances are tagged with feature data, such as tense, number, part-of-speech. As a result, each word instance is associated with a particular parse, since different parses may assign different feature data to different word instances.

The format of the name is word@UUID. The word is there to make the name human-readable, and thus easier to debug by hand. The UUID is there to make sure that every word instance is uniquely identified. The UUID is large in order to avoid the birthday-paradox collisions that can occur when tagging items with unique labels. Although the UUID can be anything, in practice a 128-bit MD5 hash is used. It is spelled out in ASCII to make it human-readable, and thus easier to debug.

There is no guarantee that the issued numberings are sequential over all time; instead, they come from a counter that is restarted whenever the RelEx server is restarted. If long-term, large-scale sequential ordering is needed, a different mechanism should be invented.

Word instances are anchored to parses by means of the WordInstanceLink. Given a ParseNode, it becomes very easy to find all word instances associated with that parse. For example:
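The anchoring described above might be sketched as follows, together with a small helper that walks from a parse to its word instances. This is a minimal sketch under stated assumptions: the module names `(opencog)` and `(opencog nlp)`, the parse-node name, the order of atoms inside the WordInstanceLink, and the helper name `parse-word-instances` are all hypothetical and depend on the local installation.

```scheme
; Sketch only. The module names below are assumptions about the local install.
(use-modules (opencog) (opencog nlp))

; Hypothetical parse name, built from the sentence node shown earlier.
(define parse
   (ParseNode "sentence@2d19c7e7-2e02-4d5e-9cbe-6772174f3f4d_parse_0"))

; Anchor one word instance to the parse (argument order is an assumption).
(WordInstanceLink (stv 1.0 1.0)
   (WordInstanceNode "feet@bf71826c-487e-42df-a941-0ecd3c942a76")
   parse)

; Collect the word instances of a parse by walking its incoming set and
; keeping the first member of every WordInstanceLink found there.
(define (parse-word-instances parse-node)
   (map (lambda (lnk) (car (cog-outgoing-set lnk)))
      (filter (lambda (lnk) (eq? (cog-type lnk) 'WordInstanceLink))
         (cog-incoming-set parse-node))))

(parse-word-instances parse)
```

The helper relies only on `cog-incoming-set`, `cog-outgoing-set` and `cog-type`, so it does not depend on any pattern-matching machinery.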

Although, in principle, it might have been possible to get the name of the WordInstanceNode, look for the @ sign in it, and take the word to be everything before the @ sign, in practice this is not reasonable. In particular, it is convenient to refer to WordNodes directly in ImplicationLinks; thus, WordNodes are needed.
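The association between a word instance and its WordNode could be sketched as below. The use of a ReferenceLink is an assumption, since the link type is not named in the text on this page; the order of the two atoms is likewise assumed.

```scheme
; Illustrative sketch; assumes the OpenCog NLP atom types are loaded.
; ReferenceLink is an assumed link type for tying an instance to its word.
(ReferenceLink (stv 1.0 1.0)
   (WordInstanceNode "feet@bf71826c-487e-42df-a941-0ecd3c942a76")
   (WordNode "feet"))
```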

In the above, a SimpleTruthValue was added to indicate that it is true that this is the word associated with this word-instance, and that we are completely sure of it.

In addition to knowing the word, it can often be important to know the word lemma. This is accomplished with the LemmaLink. The lemma of feet is foot, and so, for example:
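A hedged sketch of such a LemmaLink follows. Representing the lemma as a WordNode, and the order of the two atoms, are assumptions made for illustration.

```scheme
; Illustrative sketch; assumes the OpenCog NLP atom types are loaded.
; The lemma is shown as a WordNode, which is an assumption.
(LemmaLink (stv 1.0 1.0)
   (WordInstanceNode "feet@bf71826c-487e-42df-a941-0ecd3c942a76")
   (WordNode "foot"))
```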

Note that this links together word instances; this is required, as different parses will have different Link Grammar linkages. The distinct node type LinkGrammarRelationshipNode is used so as to make it easy to distinguish these links from other EvaluationLinks.
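A Link Grammar link between two word instances of the sample sentence might be sketched as follows. The link label "Dmc", the enclosing ListLink, and the all-zero UUID used for the word instance "two" are placeholders chosen for illustration; only the "feet" instance name is taken from this page.

```scheme
; Illustrative sketch; assumes the OpenCog NLP atom types are loaded.
; "Dmc" is a hypothetical link label, and the UUID for "two" is a placeholder.
(EvaluationLink (stv 1.0 1.0)
   (LinkGrammarRelationshipNode "Dmc")
   (ListLink
      (WordInstanceNode "two@00000000-0000-0000-0000-000000000000")
      (WordInstanceNode "feet@bf71826c-487e-42df-a941-0ecd3c942a76")))
```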

PartOfSpeechLink

Word-instance features are marked with DefinedLinguisticConceptNode, with the exception of part-of-speech, which uses the PartOfSpeechLink. For example:
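The two kinds of markup described above can be sketched as below. The feature values "noun" and "plural", and the use of an InheritanceLink to attach the number feature, are assumptions made for illustration.

```scheme
; Illustrative sketch; assumes the OpenCog NLP atom types are loaded.
; Part of speech uses its own link type.
(PartOfSpeechLink (stv 1.0 1.0)
   (WordInstanceNode "feet@bf71826c-487e-42df-a941-0ecd3c942a76")
   (DefinedLinguisticConceptNode "noun"))

; Other features: InheritanceLink is an assumed choice of link type.
(InheritanceLink (stv 1.0 1.0)
   (WordInstanceNode "feet@bf71826c-487e-42df-a941-0ecd3c942a76")
   (DefinedLinguisticConceptNode "plural"))
```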

These are given SimpleTruthValues of 1,1 to indicate that, for this given parse, the feature assignment is true, with complete confidence.

An alternative to using DefinedLinguisticConceptNodes in the above is to create NounNumberLinks, VerbTenseLinks, etc. At this time, there does not appear to be any pressing need for such an alternate format.

DefinedLinguisticConceptNode

Represents linguistic concepts. At the moment, these are generated by RelEx and R2L.

Again, a simple truth value of 1,1 is used to indicate that, for this given parse, we are completely certain that this relationship occurs.

AnchorNode

The AnchorNode is used to tell OpenCog MindAgents that this given sentence has just been input, for the very first time, into the AtomSpace. The AnchorNode is a very simple mechanism for putting things in the AtomSpace where other processes can find them. So, if you have "SomeAtoms" and don't want to lose track of them, you create the following:
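A minimal sketch of such an anchor, assuming a ListLink with the AnchorNode in first position; the ConceptNode is only a stand-in for "SomeAtoms".

```scheme
; Illustrative sketch; the ConceptNode is a hypothetical stand-in
; for whatever atoms are to be kept track of.
(ListLink
   (AnchorNode "my-stuff")
   (ConceptNode "some-atom"))
```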

Then as long as you can remember that your stuff is called "my-stuff", you just have to look at the incoming set for the AnchorNode, which finds the ListLink, which links your stuff. In the case of RelEx,
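For RelEx output, the attachment might look like the sketch below. The anchor name is the one given on this page, and the sentence node is the sample one shown earlier; the truth value and the order of atoms in the ListLink are assumptions.

```scheme
; Illustrative sketch; assumes the OpenCog NLP atom types are loaded.
(ListLink (stv 1.0 1.0)
   (AnchorNode "# New Parsed Sentence")
   (SentenceNode "sentence@2d19c7e7-2e02-4d5e-9cbe-6772174f3f4d"))
```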

Thus, MindAgents in charge of dealing with recently input text can scan for links to this particular AnchorNode (it will always have the name "# New Parsed Sentence"). These MindAgents can do as they wish with this link; typically, once processing is complete, the ListLink is deleted, freeing the sentence from this anchor. The output of a MindAgent will usually be attached to some other AnchorNode, so as to pass off processing to other MindAgents.
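The scan-and-release cycle described above could be sketched in Scheme as follows. This is a sketch under stated assumptions, not the code used by any existing MindAgent: the helper names are hypothetical, the sentence is assumed to be the second member of each ListLink, and `cog-extract!` is the spelling used by recent AtomSpace releases (older releases used a different name for the same operation).

```scheme
; Sketch only; helper names are hypothetical.
(use-modules (opencog))

(define new-sentence-anchor (AnchorNode "# New Parsed Sentence"))

; ListLinks currently attached to the anchor.
(define (anchor-links)
   (filter (lambda (lnk) (eq? (cog-type lnk) 'ListLink))
      (cog-incoming-set new-sentence-anchor)))

; Atoms waiting on the anchor: the second member of each ListLink.
(define (pending-sentences)
   (map (lambda (lnk) (cadr (cog-outgoing-set lnk))) (anchor-links)))

; Detach everything once processing is done. Only the ListLinks are
; removed; the sentences themselves stay in the AtomSpace.
(define (release-sentences)
   (for-each cog-extract! (anchor-links)))
```

Removing only the ListLink, rather than the sentence, keeps the parsed data available to later processing stages that find it through other anchors.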