Overview of ConText Linguistics

ConText linguistics analyzes the content of English-language documents. It creates different views of a document's contents so that users can quickly review the essential content and determine the document's relevance.

Because the linguistic services are separate and distinct from text and theme indexing, you can incorporate linguistic analysis and functionality in a text application independently of the text/theme indexing process.

ConText linguistics can generate the following forms of linguistic output for documents:

Output Type                     Description

Themes                          The main concepts of a document.

Gist                            Paragraph or paragraphs in a document that best represent what the document is about as a whole.

Theme Summary                   Paragraph or paragraphs in a document that best represent a given theme in the document.

Sentence-Level Gist             Sentence or sentences in a document that best represent the themes in the document as a whole.

Sentence-Level Theme Summary    Sentence or sentences in a document that best match a single theme in the document.

You obtain linguistic output by submitting a linguistic request using the CTX_LING PL/SQL package. Linguistic requests can only be processed by ConText servers running with the Linguistic personality.

Requirements

The requirements for using ConText linguistics are:

text stored in a column (either directly or indirectly through a pathname to files)

a policy for the column

ConText server running with Linguistic personality

Note:

The setup requirements of having text in a column and having a policy for the column apply to ConText indexes (text/theme) as well as ConText linguistics. The procedures for storing text and creating policies are not discussed in this manual. For more information about storing text in columns and creating policies for the columns, see Oracle ConText Option Administrator's Guide.

Linguistic Personality

To process requests for linguistic output (themes and Gists), a ConText server with the Linguistic personality must be running. A ConText server with the Linguistic personality can also have other personalities in its personality mask.

Starting up ConText servers is the task of the ConText administrator, through the CTXSYS Oracle user.

Services Queue

The Services Queue is used for managing ConText linguistic requests. Such a request is cached in memory until the requestor submits the request, at which time the request is added to the Services Queue. If more than one request is cached in memory when the user submits the requests, ConText stores all of the requests as a single batch job.

If a ConText server has the Linguistic personality, the server monitors the Services Queue for requests and processes the next request in the queue.

Note:

If no ConText servers with the Linguistic (L) personality are running, the Services Queue still accepts requests and holds them for the next available ConText server with the appropriate personality.
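The request flow described above (requests cached in memory, submitted together as one batch, and held until a server with the Linguistic personality polls the queue) can be illustrated with a small Python model. This is only a sketch of the behavior as described in this section, not ConText code: the class names, method names, and handle numbering are hypothetical.

```python
from collections import deque


class ServicesQueue:
    """Hypothetical model of the Services Queue (not a ConText API)."""

    def __init__(self):
        self._batches = deque()
        self._next_handle = 1

    def enqueue(self, batch):
        # The queue accepts work even when no server is running.
        handle = self._next_handle
        self._next_handle += 1
        self._batches.append((handle, batch))
        return handle

    def next_for(self, personalities):
        # Only a server whose personality mask includes 'L' picks up work.
        if "L" not in personalities or not self._batches:
            return None
        return self._batches.popleft()


class Session:
    """Hypothetical client session that caches requests until submit()."""

    def __init__(self, queue):
        self.queue = queue
        self.pending = []

    def request(self, kind, doc_id):
        # Cached in memory only; nothing reaches the queue yet.
        self.pending.append((kind, doc_id))

    def submit(self):
        # All cached requests are added to the queue as a single batch.
        batch = tuple(self.pending)
        self.pending.clear()
        return self.queue.enqueue(batch)
```

In this model, two calls to `request` followed by one call to `submit` produce a single queue entry, which stays in the queue until `next_for` is called with a personality set that contains `"L"`.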

The ConText administration tool can be used to perform all administration functions on the Services Queue, such as cleaning up entries. In addition, the CTX_SVC PL/SQL package can be used to perform this administration from the command line.

Creating Linguistic Output

You can generate linguistic output in batch during the text indexing process or generate it as needed. Because the generation of linguistic output is independent of the text-indexing process, ConText places no restrictions on when you can create themes and Gists.

Application Program Interface (API)

Linguistic and queue management functions are invoked by using PL/SQL procedures called or executed within the programming language in which the application is developed. If the application is developed in PL/SQL, these procedures may be invoked directly as PL/SQL execute statements. If the application is developed in another language, such as C, the PL/SQL procedures for linguistic and queue management functions are accessed through the Oracle Call Interface (OCI).

ConText provides the following PL/SQL packages for generating linguistic output and managing the Services Queue, respectively:

CTX_LING

CTX_SVC

CTX_LING Package

The stored procedures in CTX_LING are used to request linguistic output and submit the requests to the Services Queue. CTX_LING also provides procedures for specifying user settings for generating linguistic output and enabling logging of parse information generated during the processing of a request.

The model for submitting requests and querying the linguistic output is similar to the two-step query model (CONTAINS procedure) provided within the ConText framework for content-based text retrieval.

For example, to generate themes for a document, you first create a table to store the results of the theme generation, then call the CTX_LING.REQUEST_THEMES procedure followed by the CTX_LING.SUBMIT function. ConText stores the results in the theme table. To view the results, issue a SELECT statement against the theme table.

CTX_SVC Package

The stored procedures in CTX_SVC are used to monitor the Services Queue for the status of specific requests. CTX_SVC can be used to check the status of pending requests and to display errors encountered. You can also cancel a request if it has not been picked up for processing by a ConText server, or clear a request that encountered an error.

Linguistic Core

The linguistic core is made up of the following components:

lexicon

knowledge catalog

parsing engine

Lexicon

The lexicon is a static knowledge base that provides word and phrase information for the parsing engine. The lexicon recognizes over one million English words and phrases and defines hundreds of lexical characteristics for each word.

Note:

The lexicon is specific to the English language, but it recognizes the difference between American and British usage and spelling.

Linguistic information about words in the lexicon is divided into the following types:

Information Type    Description

Syntax              Syntax flags provide surface-level assessments of a word or phrase isolated from its grammatical context.

Theme               Theme flags identify the thematic qualities of a word (e.g., weak noun/needs support, strong verb). The parser uses these flags to determine how a word contributes to the thematic construction of the sentence as a whole.

Knowledge Catalog

The knowledge catalog is a language-independent organization of industries, fields of study, special terms and jargon, and abstract concepts. It creates a classification scheme that defines ConText's semantic view of the world.

ConText uses the knowledge catalog to generate linguistic output, to classify documents by theme during theme indexing, and to normalize theme queries.

Parsing Engine

The parsing engine identifies paragraph, sentence, and token (word) boundaries, as well as phrases and clauses. It then passes the tokens to the lexicon where grammar and theme flags are attached and linguistic analysis begins.

Once the lexicon identifies the grammatical function of each word in a sentence, using the word's placement in the sentence and its relationship to the surrounding words, the parsing engine determines the thematic function of the word in the sentence.

As the parsing engine encounters successively larger text blocks (sentences, paragraphs, and the whole document), it expands the analysis to add new information about the text to its knowledge base.

If case-conversion is enabled, the parsing engine converts all the text to lowercase and processes the text through the case-sensitivity routines to determine proper capitalization.

Note:

Case conversion does not affect the original text of the documents being processed; only the output of the parsing engine is stored in mixed-case.

Text Input Requirements

ConText linguistics has the following requirements and restrictions for text input:

punctuation

paragraph separation

document size

writing styles

case-sensitivity

Punctuation

Each word and sentence should be clearly identified using standard conventions such as blank spaces and recognized punctuation. Complete sentences produce the best results, but are not required. ConText can process incomplete sentences as well as text in headers and lists.

Paragraph Separation

To successfully process text, ConText requires documents to be separated into paragraphs. The method by which the paragraph delimiters are recognized depends on whether the text is formatted.

Formatted Text

In formatted text, the filters used to extract the text must provide paragraph delimiters that can be recognized by ConText.

The internal filters provided by ConText automatically recognize the paragraph delimiters used in the document format for the filter. Similarly, any external filters used for filtering text must recognize the paragraph delimiters used in the document format for the filter.

Plain (ASCII) Text

With plain (ASCII) text, paragraph delimiters are determined on a per-document basis. ConText samples the first 8 kilobytes of text in a document to identify the method most commonly used to mark the beginning and end of paragraphs in the sample. That method is then applied to the rest of the document.
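The sample-then-apply approach can be illustrated with a short Python sketch. The actual detection rules used by ConText are not documented in this section, so the heuristic below (counting blank-line breaks against indented first lines) is an assumption chosen purely for illustration, and the function names are hypothetical.

```python
import re

SAMPLE_BYTES = 8 * 1024  # sample size stated in the text above


def detect_paragraph_style(text):
    """Guess the delimiter style from the first 8 KB (assumed heuristic)."""
    sample = text.encode("utf-8")[:SAMPLE_BYTES].decode("utf-8", errors="ignore")
    blank_lines = len(re.findall(r"\n[ \t]*\n", sample))
    indents = len(re.findall(r"\n[ \t]+\S", sample))
    return "blank_line" if blank_lines >= indents else "indent"


def split_paragraphs(text):
    """Apply the style detected in the sample to the whole document."""
    if detect_paragraph_style(text) == "blank_line":
        parts = re.split(r"\n[ \t]*\n", text)
    else:
        # Break before any line that starts with leading whitespace.
        parts = re.split(r"\n(?=[ \t]+\S)", text)
    return [p.strip() for p in parts if p.strip()]
```

The point of the sketch is the two-phase design: the style decision is made once from the sample, then applied uniformly, so a document that changes conventions after the first 8 kilobytes would be split according to the convention seen in the sample.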

Document Size

ConText linguistics can process documents of any size, up to a maximum size of 5 megabytes for a single document.

Note:

If a ConText linguistics request is submitted for a document larger than 5 megabytes, ConText returns an error and does not generate output for the document.
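An application can avoid submitting requests that are certain to fail by checking the document size first. The following Python sketch is a hypothetical client-side check, not part of ConText; it assumes that 1 megabyte means 1,048,576 bytes, which this section does not specify.

```python
# Assumption: 1 megabyte = 1024 * 1024 bytes (not stated in the source).
MAX_DOC_BYTES = 5 * 1024 * 1024


def check_document_size(data: bytes) -> None:
    """Raise ValueError if the document exceeds the 5-megabyte limit."""
    if len(data) > MAX_DOC_BYTES:
        raise ValueError(
            f"document is {len(data)} bytes; limit is {MAX_DOC_BYTES} bytes"
        )
```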

Writing Styles

ConText can analyze written material of all styles and subject matter. This includes technical manuals, literature of all types, newspapers and magazines, encyclopedias, and electronic-mail messages.

ConText linguistics is not well-suited for processing transcriptions of unstructured, spoken words, such as colloquial dialogue or casual conversation. ConText linguistics also does not work well with non-natural languages such as computer programming languages.

Case-sensitivity

ConText linguistics depends on text that is properly capitalized, which helps indicate the beginning of sentences and identifies proper nouns. ConText linguistics can also process text that is not in mixed-case, which is especially useful for all-uppercase or all-lowercase text that may exist in legacy systems.

ConText processes text that is not in mixed case by first reducing the text to all lowercase, then analyzing each word to determine whether the word should be capitalized.

This internal case-conversion takes place only if the appropriate setting has been enabled in the setting configuration for the session.


Note:

While linguistic output is stored in mixed-case, the text of the source documents is not converted to mixed-case. The conversion is done internally and used only to facilitate the linguistic analysis.
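The lowercase-then-recapitalize idea can be illustrated in Python. This is a greatly simplified sketch, not the ConText case-sensitivity routines: the small `PROPER_NOUNS` dictionary is a hypothetical stand-in for the lexicon, and sentence starts are detected with a simple punctuation rule.

```python
import re

# Hypothetical stand-in for the lexicon's knowledge of proper nouns.
PROPER_NOUNS = {"oracle": "Oracle", "unix": "UNIX"}


def restore_case(text):
    """Return a mixed-case copy; the input string is left unchanged."""
    lowered = text.lower()

    def fix_word(match):
        word = match.group(0)
        return PROPER_NOUNS.get(word, word)

    # Step 1: restore capitalization of known proper nouns.
    cased = re.sub(r"[a-z]+", fix_word, lowered)
    # Step 2: capitalize the first letter of the text and of each sentence.
    return re.sub(
        r"(^|[.!?]\s+)([a-z])",
        lambda m: m.group(1) + m.group(2).upper(),
        cased,
    )
```

As in the note above, the function produces a converted copy for analysis and never modifies the source text.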

Linguistic Output

ConText linguistics produces the following output:

theme indexes

lists of themes

theme summaries

Gists

Theme Indexes

Theme indexes are created as a prerequisite for issuing theme queries. Given a theme policy, you can create a theme index for all documents in an entire text column using CTX_DDL.CREATE_INDEX.

List of Themes

You can generate a list of themes, or main concepts, of a document on a per-document basis. Because themes present a profile of the main subjects of a document, a list of themes provides a snapshot of what the document is about. You can generate up to 16 themes for each document using the CTX_LING.REQUEST_THEMES procedure. When you generate the themes for a document, each theme is assigned a relative weight.

Theme Weight

Each document theme is assigned a weight that measures the strength of the theme relative to the other themes in the document.

The cumulative weight of a theme also reflects the overall thematic content of the document. As such, theme weights can be used to compare a document theme to other themes within the same document or to other documents with the same theme.
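Once theme weights have been retrieved into an application, ranking and comparing them is straightforward. The Python sketch below is illustrative only: it assumes the themes and weights are available as a dictionary, and the function names and sample values are hypothetical. The only figure taken from this section is the limit of 16 themes per document.

```python
MAX_THEMES = 16  # per-document limit stated in this section


def top_themes(weights, limit=MAX_THEMES):
    """Rank themes by descending weight, breaking ties alphabetically."""
    ranked = sorted(weights.items(), key=lambda kv: (-kv[1], kv[0]))
    return ranked[:limit]


def relative_strength(weights):
    """Express each weight as a fraction of the document's total weight."""
    total = sum(weights.values())
    if total == 0:
        return {}
    return {theme: w / total for theme, w in weights.items()}
```

For example, with the hypothetical weights `{"UNIX": 30, "DOS": 30, "operating systems": 40}`, `top_themes` lists `operating systems` first, followed by `DOS` and `UNIX`.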

Theme Classification

The themes produced by ConText linguistics are essentially document classifications. Each theme provides information that can be used to classify the document into a semantic world view (classification structure) defined by the user. For this reason, ConText linguistics always normalizes the terms and phrases in the theme output to their noun and plural forms, if applicable.

In addition, the theme output is not always a direct result of the actual terms and phrases found in a document. Often the output reflects ConText's understanding of how themes are related.

For example, if a document provides a detailed discussion of MS-DOS and UNIX, ConText returns DOS and UNIX as themes for the document; however, ConText might also return operating systems as a theme, indicating that a relationship exists between DOS and UNIX. The document could be classified under DOS, UNIX, operating systems, or any combination of the three.

Theme Summaries

A theme summary for a document provides a short summary of the document from a specific point-of-view. You can generate two types of theme summaries:

paragraph-level

sentence-level

A paragraph-level theme summary consists of the paragraph or paragraphs that best represent a single document theme. A sentence-level theme summary consists of the sentence or sentences that best match a single document theme.

To create either paragraph-level or sentence-level theme summaries, use CTX_LING.REQUEST_GIST.

Because it provides a concise, focused summary for a particular theme in a document, a theme summary can be used to compare documents with similar themes.

You can control the size of sentence-level and paragraph-level theme summaries with linguistic settings.

Note:

The settings for theme summaries can only be modified by creating custom setting configurations in the GUI administration tool.

Gists

A Gist for a document provides a summary that reflects all of the themes in the document. You can generate two types of Gists:

paragraph-level

sentence-level

A paragraph-level Gist consists of the document paragraphs that best represent the themes in a document as a whole. A sentence-level Gist is the sentence or sentences that best represent the themes in a document as a whole.

To generate either a paragraph-level or sentence-level Gist, use CTX_LING.REQUEST_GIST.

Because a Gist is generally longer than a theme summary, it serves better as a document reading tool than a document selection tool. For example, it can be used to quickly scan a document and to extract the most meaningful thematic information.
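The difference between a Gist (scored against all themes in the document) and a theme summary (scored against a single theme) can be illustrated with a toy Python example. The scoring rule below, which counts occurrences of each theme term and multiplies by the theme weight, is an assumption made for illustration; ConText's actual selection method is not described in this section, and the function names are hypothetical.

```python
def score(paragraph, themes):
    """Sum of (theme weight x occurrences of the theme term); assumed rule."""
    text = paragraph.lower()
    return sum(w * text.count(term.lower()) for term, w in themes.items())


def best_paragraphs(paragraphs, themes, count=1):
    """Return the highest-scoring paragraphs, kept in document order."""
    ranked = sorted(
        range(len(paragraphs)),
        key=lambda i: -score(paragraphs[i], themes),
    )
    return [paragraphs[i] for i in sorted(ranked[:count])]


paragraphs = [
    "UNIX and DOS are operating systems.",
    "DOS uses drive letters. DOS has batch files.",
    "The weather was pleasant.",
]
all_themes = {"UNIX": 30, "DOS": 30, "operating systems": 40}

gist = best_paragraphs(paragraphs, all_themes)             # all themes
dos_summary = best_paragraphs(paragraphs, {"DOS": 30})     # one theme
```

Here the Gist selects the first paragraph, which touches every theme, while the summary for the single theme `DOS` selects the second paragraph, which concentrates on that theme.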

You can specify settings to control the size of the Gist.


Note:

The settings for Gist can only be modified by creating custom setting configurations in the GUI administration tool.

Linguistic Settings

You can perform linguistic processing of documents to generate themes and Gists only when a ConText server with the Linguistic personality is running. The type of processing is determined by the following options:

convert all-uppercase or all-lowercase text to case-sensitive text

generate Gists

use the full parser or the theme parser

process unknown terms (full theme parsing vs. limited theme parsing)

There is a default configuration, but you can also set these options by specifying a label with the CTX_LING.SET_SETTINGS_LABEL procedure. A label is a predefined configuration of settings.

Note:

You cannot change the predefined setting configurations that are shipped with ConText. However, you can use the administration tool to create custom setting configurations from the predefined setting configurations.