About Oracle Text Indexes

An Oracle Text index is an Oracle domain index.To build your query application, you can create an index of type CONTEXT and query it with the CONTAINS operator.

You create an index from a populated text table. In a query application, the table must contain the text or pointers to where the text is stored. Text is usually a collection of documents, but can also be small text fragments.

For better performance for mixed queries, you can create a CTXCAT index. Use this index type when your application relies heavily on mixed queries to search small documents or descriptive text fragments based on related criteria such as dates or prices. You query this index with the CATSEARCH operator.

To build a document classification application, you create an index of type CTXRULE. With such an index, you can classify plain text, HTML, or XML documents using the MATCHES operator. You store your defining query set in the text table you index.

If you are working with XMLtype columns, you can create a CTXXPATH index to speed up queries with ExistsNode.

You create a text index as a type of extensible index to Oracle using standard SQL. This means that an Oracle Text index operates like an Oracle index. It has a name by which it is referenced and can be manipulated with standard SQL statements.

The benefits of a creating an Oracle Text index include fast response time for text queries with the CONTAINS, CATSEARCH, and MATCHES Oracle Text operators. These operators query the CONTEXT, CTXCAT, and CTXRULE index types respectively.

Structure of the Oracle Text CONTEXT Index

Oracle Text indexes text by converting all words into tokens. The general structure of an Oracle Text CONTEXT index is an inverted index where each token contains the list of documents (rows) that contain that token.

For example, after a single initial indexing operation, the word DOG might have an entry as follows:

DOG DOC1 DOC3 DOC5

This means that the word DOG is contained in the rows that store documents one, three and five.

For more information, see optimizing the index in this chapter.

Merged Word and Theme Index

By default in English and French, Oracle Text indexes theme information with word information. You can query theme information with the ABOUT operator. You can optionally enable and disable theme indexing.

The Oracle Text Indexing Process

This section describes the Oracle Text indexing process.You initiate the indexing process with the CREATEINDEX statement. The goal is to create an Oracle Text index of tokens according to the parameters and preferences you specify.

Figure 2-1 shows the indexing process. This process is a data stream that is acted upon by the different indexing objects. Each object corresponds to an indexing preference type or section group you can specify in the parameter string of CREATEINDEX or ALTERINDEX. The sections that follow describe these objects.

Figure 2-1 Oracle Text Indexing process

Datastore Object

The stream starts with the datastore reading in the documents as they are stored in the system according to your datastore preference. For example, if you have defined your datastore as FILE_DATASTORE, the stream starts by reading the files from the operating system. You can also store you documents on the internet or in the Oracle database.

Filter Object

The stream then passes through the filter. What happens here is determined by your FILTER preference. The stream can be acted upon in one of the following ways:

No filtering takes place. This happens when you specify the NULL_FILTER preference type. Documents that are plain text, HTML, or XML need no filtering.

Formatted documents (binary) are filtered to marked-up text. This happens when you specify the INSO_FILTER preference type.

Text is converted from a non-database character set to the database character set. This happens when you specify CHARSET_FILTER preference type.

Sectioner Object

After being filtered, the marked-up text passes through the sectioner that separates the stream into text and section information. Section information includes where sections begin and end in the text stream. The type of sections extracted is determined by your section group type.

The section information is passed directly to the indexing engine which uses it later. The text is passed to the lexer.

Lexer Object

The lexer breaks the text into tokens according to your language. These tokens are usually words. To extract tokens, the lexer uses the parameters as defined in your lexer preference. These parameters include the definitions for the characters that separate tokens such as whitespace, and whether to convert the text to all uppercase or to leave it in mixed case.

When theme indexing is enabled, the lexer analyses your text to create theme tokens for indexing.

Indexing Engine

The indexing engine creates the inverted index that maps tokens to the documents that contain them. In this phase, Oracle uses the stoplist you specify to exclude stopwords or stopthemes from the index. Oracle also uses the parameters defined in your WORDLIST preference, which tell the system how to create a prefix index or substring index, if enabled.

Partitioned Tables and Indexes

You can create a partitioned CONTEXT index on a partitioned text table. The table must be partitioned by range. Hash, composite and list partitions are not supported.

You might create a partitioned text table to partition your data by date. For example, if your application maintains a large library of dated news articles, you can partition your information by month or year. Partitioning simplifies the manageability of large databases since querying, DML, and backup and recovery can act on single partitions.

Querying Partitioned Tables

To query a partitioned table, you use CONTAINS in the SELECT statement no differently as you query a regular table. You can query the entire table or a single partition. However, if you are using the ORDERBYSCORE clause, Oracle recommends that you query single partitions unless you include a range predicate that limits the query to a single partition.

Creating an Index Online

When it is not practical to lock up your base table for indexing because of ongoing updates, you can create your index online with the ONLINE parameter of CREATE INDEX. This way an application with heavy DML need not stop updating the base table for indexing.

There are short periods, however, when the base table is locked at the beginning and end of the indexing process.

Parallel Indexing

Oracle Text supports parallel indexing with CREATEINDEX.

When you issue a parallel indexing command on a non-partitioned table, Oracle splits the base table into partitions, spawns slave processes, and assigns a different partition to each slave. Each slave indexes the rows in its partition. The method of slicing the base table into partitions is determined by Oracle and is not under your direct control. This is true as well for the number of slave processes actually spawned, which depends on machine capabilities, system load, your init.ora settings, and other factors. The actual parallel degree may not match the degree of parallelism requested.

Since indexing is an I/O intensive operation, parallel indexing is most effective in decreasing your indexing time when you have distributed disk access and multiple CPUs. Parallel indexing can only affect the performance of an initial index with CREATEINDEX. It does not affect DML performance with ALTERINDEX, and has minimal impact on query performance.

Since parallel indexing decreases the initial indexing time, it is useful for

data staging, when your product includes an Oracle Text index

rapid initial startup of applications based on large data collections

application testing, when you need to test different index parameters and schemas while developing your application

Limitations for Indexing

Columns with Multiple Indexes

A column can have no more than a single domain index attached to it, which is in keeping with Oracle standards. However, a single Text index can contain theme information in addition to word information.

Indexing Views

Oracle SQL standards does not support creating indexes on views. Therefore, if you need to index documents whose contents are in different tables, you can create a data storage preference using the USER_DATASTORE object. With this object, you can define a procedure that synthesizes documents from different tables at index time.

Considerations For Indexing

You use the CREATEINDEX statement to create an Oracle Text index. When you create an index and specify no parameter string, an index is created with default parameters.

You can also override the defaults and customize your index to suit your query application. The parameters and preference types you use to customize your index with CREATEINDEX fall into the following general categories.

Type of Index

With Oracle Text, you can create one of four index types with CREATEINDEX. The following table describes each type, its purpose, and what features it supports:

Index Type

Description

Supported Preferences and Parameters

Query Operator

Notes

CONTEXT

Use this index to build a text retrieval application when your text consists of large coherent documents. You can index documents of different formats such as MS Word, HTML or plain text.

With a context index, you can customize your index in a variety of ways.

This index type requires CTX_DDL.SYNC_INDEX after DML to base table.

All CREATEINDEX preferences and parameters supported except for INDEXSET.

These supported parameters include the index partition clause, and the format, charset, and language columns.

CONTAINS

Grammar is called the CONTEXT grammar, which supports a rich set of operations.

The CTXCAT grammar can be used with query templating.

Supports all documents services and query services.

Supports indexing of partitioned text tables.

CTXCAT

Use this index type for better mixed query performance. Typically, with this index type, you index small documents or text fragments. Other columns in the base table, such as item names, prices and descriptions can be included in the index to improve mixed query performance.

This index type is transactional, automatically updating itself after DML to base table. No CTX_DDL.SYNC is necessary.

INDEXSET

LEXER (theme indexing not supported)

STOPLIST

STORAGE

WORDLIST (only prefix_index attribute supported for Japanese data)

Format, charset, and language columns not supported.

Table and index partitioning not supported.

CATSEARCH

Grammar is called CTXCAT, which supports logical operations, phrase queries, and wildcarding.

The CONTEXT grammar can be used with query templating.

The size of a CTXCAT index is related to the total amount of text to be indexed, number of indexes in the index set, and number of columns indexed. Carefully consider your queries and your resources before adding indexes to the index set.

The CTXCAT index does not support table and index partitioning, documents services (highlighting, markup, themes, and gists) or query services (explain, query feedback, and browse words.)

CTXRULE

Use CTXRULE index to build a document classification or routing application. The CTXRULE index is an index created on a table of queries, where the queries define the classification or routing criteria.

Only the BASIC_LEXER type supported for indexing your query set.

Queries in your query set can include ABOUT, STEM, AND, NEAR, NOT, and OR operators.

Location of Text

Your document text can reside in one of three places, the text table, the file system, or the world-wide web. When you index with CREATE INDEX, you specify the location using the datastore preference. Use the appropriate datastore according to your application.

The following table describes all the different ways you can store your text with the datastore preference type.

Datastore Type

Use When

DIRECT_DATASTORE

Data is stored internally in a text column. Each row is indexed as a single document.

Your text column can be VARCHAR2, CLOB, BLOB, CHAR, or BFILE. XMLType columns are supported for the context index type.

MULTI_COLUMN_DATASTORE

Data is stored in a text table in more than one column. Columns are concatenated to create a virtual document, one document per row.

DETAIL_DATASTORE

Data is stored internally in a text column. Document consists of one or more rows stored in a text column in a detail table, with header information stored in a master table.

FILE_DATASTORE

Data is stored externally in operating system files. Filenames are stored in the text column, one per row.

NESTED_DATASTORE

Data is stored in a nested table.

URL_DATASTORE

Data is stored externally in files located on an intranet or the Internet. Uniform Resource Locators (URLs) are stored in the text column.

USER_DATASTORE

Documents are synthesized at index time by a user-defined stored procedure.

Indexing time and document retrieval time will be increased for indexing URLs since the system must retrieve the document from the network.

Document Formats and Filtering

Formatted documents such as Microsoft Word and PDF must be filtered to text to be indexed. The type of filtering the system uses is determined by the FILTER preference type. By default the system uses the INSO_FILTER filter type which automatically detects the format of your documents and filters them to text.

Oracle can index most formats. Oracle can also index columns that contain documents with mixed formats.

No Filtering for HTML

If you are indexing HTML or plain text files, do not use the INSO_FILTER type. For best results, use the NULL_FILTER preference type.

Filtering Mixed Formatted Columns

If you have a mixed format column such as one that contains Microsoft Word, plain text, and HTML documents, you can bypass filtering for plain text or HTML by including a format column in your text table. In the format column, you tag each row TEXT or BINARY. Rows that are tagged TEXT are not filtered.

For example, you can tag the HTML and plain text rows as TEXT and the Microsoft Word rows as BINARY. You specify the format column in the CREATE INDEX parameter clause.

Custom Filtering

You can create your own custom filter to filter documents for indexing. You can create either an external filter that is executed from the file system or an internal filter as a PL/SQL or Java stored procedure.

For external custom filtering, use the USER_FILTER filter preference type.

Bypassing Rows for Indexing

You can bypass rows in your text table that are not to be indexed, such as rows that contain image data. To do so, create a format column in your table and set it to IGNORE. You name the format column in the parameter clause of CREATE INDEX.

Document Character Set

The indexing engine expects filtered text to be in the database character set. When you use the INSO_FILTER filter type, formatted documents are converted to text in the database character set.

If your source is text and your document character set is not the database character set, you can use the INSO_FILTER or CHARSET_FILTER filter type to convert your text for indexing.

Mixed Character Set Columns

If your document set contains documents with different character sets, such as JA16EUC and JA16SJIS, you can index the documents provided you create a charset column. You populate this column with the name of the document character set on a per-row basis. You name the column in the parameter clause of the CREATE INDEX statement.

Document Language

Oracle can index most languages. By default, Oracle assumes the language of text to index is the language you specify in your database setup.

You use the BASIC_LEXER preference type to index whitespace-delimited languages such as English, French, German, and Spanish. For some of these languages you can enable alternate spelling, composite word indexing, and base letter conversion.

Languages Features Outside BASIC_LEXER

With the BASIC_LEXER, Japanese, Chinese and Korean lexers, Oracle Text provides a lexing solution for most languages. For other languages such as Thai and Arabic, you can create your own lexing solution using the user-defined lexer interface. This interface enables you to create a PL/SQL or Java procedure to process your documents during indexing and querying.

You can also use the user-defined lexer to create your own theme lexing solution or linguistic processing engine.

Indexing Multi-language Columns

Oracle can index text columns that contain documents of different languages, such as a column that contains documents written in English, German, and Japanese. To index a multi-language column, you need a language column in your text table. Use the MULTI_LEXER preference type.

You can also incorporate a multi-language stoplist when you index multi-language columns.

Indexing Special Characters

When you use the BASIC_LEXER preference type, you can specify how non-alphanumeric characters such as hyphens and periods are indexed with respect to the tokens that contain them. For example, you can specify that Oracle include or exclude hyphen character (-) when indexing a word such as web-site.

These characters fall into BASIC_LEXER categories according to the behavior you require during indexing. The way the you set the lexer to behave for indexing is the way it behaves for query parsing.

Some of the special characters you can set are as follows:

Printjoins Character

Define a non-alphanumeric character as printjoin when you want this character to be included in the token during indexing.

For example, if you want your index to include hyphens and underscore characters, define them as printjoins. This means that words such as web-site are indexed as web-site. A query on website does not find web-site.

Skipjoins Character

Define a non-alphanumeric character as a skipjoin when you do not want this character to be indexed with the token that contains it.

For example, with the hyphen (-) character defined as a skipjoin, the word web-site is indexed as website. A query on web-site finds documents containing website and web-site.

Other Characters

Other characters can be specified to control other tokenization behavior such as token separation (startjoins, endjoins, whitespace), punctuation identification (punctuations), number tokenization (numjoins), and word continuation after line-breaks (continuation). These categories of characters have defaults, which you can modify.

Case-Sensitive Indexing and Querying

By default, all text tokens are converted to uppercase and then indexed. This results in case-insensitive queries. For example, separate queries on each of the three words cat,CAT, and Cat all return the same documents.

You can change the default and have the index record tokens as they appear in the text. When you create a case-sensitive index, you must specify your queries with exact case to match documents. For example, if a document contains Cat, you must specify your query as Cat to match this document. Specifying cat or CAT does not return the document.

To enable or disable case-sensitive indexing, use the mixed_case attribute of the BASIC_LEXER preference.

Base-Letter Conversion for Characters with Diacritical Marks

Some languages contain characters with diacritical marks such as tildes, umlauts, and accents. When your indexing operation converts words containing diacritical marks to their base letter form, queries need not contain diacritical marks to score matches. For example in Spanish with a base-letter index, a query of energía matches energía and energia in the index.

However, with base-letter indexing disabled, a query of energía matches only energía.

You can enable and disable base-letter indexing for your language with the base_letter attribute of the BASIC_LEXER preference type.

Alternate Spelling

Languages such as German, Danish, and Swedish contain words that have more than one accepted spelling. For instance, in German, the ä character can be substituted for the ae character. The ae character is known as the base letter form.

By default, Oracle indexes words in their base-letter form for these languages. Query terms are also converted to their base-letter form. The result is that these words can be queried with either spelling.

You can enable and disable alternate spelling for your language using the alternate_spelling attribute in the BASIC_LEXER preference type.

Composite Words

German and Dutch text contain composite words. By default, Oracle creates composite indexes for these languages. The result is that a query on a term returns words that contain the term as a sub-composite.

For example, in German, a query on the term Bahnhof (train station) returns documents that contain Bahnhof or any word containing Bahnhof as a sub-composite, such as Hauptbahnhof, Nordbahnhof, or Ostbahnhof.

You can enable and disable the creation of composite indexes with the composite attribute of the BASIC_LEXER preference.

Better Wildcard Query Performance

Wildcard queries enable you to issue left-truncated, right-truncated and doubly truncated queries, such as %ing, cos%, or %benz%. With normal indexing, these queries can sometimes expand into large word lists, degrading your query performance.

Wildcard queries have better response time when token prefixes and substrings are recorded in the index.

By default, token prefixes and substrings are not recorded in the Oracle Text index. If your query application makes heavy use of wildcard queries, consider indexing token prefixes and substrings. To do so, use the wordlist preference type. The trade-off is a bigger index for improved wildcard searching.

Document Section Searching

For documents that have internal structure such as HTML and XML, you can define and index document sections. Indexing document sections enables you to narrow the scope of your queries to within pre-defined sections. For example, you can specify a query to find all documents that contain the term dog within a section you define as Headings.

Sections must be defined prior to indexing and specified with the section group preference.

Oracle Text provides section groups with system-defined section definitions for HTML and XML. You can also specify that the system automatically create sections from XML documents during indexing.

Stopwords and Stopthemes

A stopword is a word that is not to be indexed. Usually stopwords are low information words in a given language such as this and that in English.

By default, Oracle provides a list of stopwords called a stoplist for indexing a given language. You can modify this list or create your own with the CTX_DDL package. You specify the stoplist in the parameter string of CREATE INDEX.

A stoptheme is a word that is prevented from being theme-indexed or prevented from contributing to a theme. You can add stopthemes with the CTX_DDL package.

You can search document themes with the ABOUT operator. You can retrieve document themes programatically with the CTX_DOC PL/SQL package.

Multi-Language Stoplists

You can also create multi-language stoplists to hold language-specific stopwords. A multi-language stoplist is useful when you use the MULTI_LEXER to index a table that contains documents in different languages, such as English, German, and Japanese.

At indexing time, the language column of each document is examined, and only the stopwords for that language are eliminated. At query time, the session language setting determines the active stopwords, like it determines the active lexer when using the multi-lexer.

Index Performance

There are factors that influence indexing performance including memory allocation, document format, degree of parallelism, and partitioned tables.

Index Creation

You can create three types of indexes with Oracle Text: CONTEXT, CTXCAT, and CTXRULE.

Procedure for Creating a CONTEXT Index

By default, the system expects your documents to be stored in a text column. Once this requirement is satisfied, you can create a text index using the CREATE INDEX SQL command as an extensible index of type context, without explicitly specifying any preferences. The system automatically detects your language, the datatype of the text column, format of documents, and sets indexing preferences accordingly.

Optionally, create your own custom preferences, section groups, or stoplists. See "Creating Preferences" in this chapter.

Create the Text index with the SQL command CREATE INDEX, naming your index and optionally specifying preferences. See "Creating an Index" in this chapter.

Creating Preferences

You can optionally create your own custom index preferences to override the defaults. Use the preferences to specify index information such as where your files are stored and how to filter your documents. You create the preferences then set the attributes.

Datastore Examples

The following sections give examples for setting direct, multi-column, URL, and file datastores.

Specifying DIRECT_DATASTORE

The following example creates a table with a CLOB column to store text data. It then populates two rows with text data and indexes the table using the system-defined preference CTXSYS.DEFAULT_DATASTORE.

Specifying File Data Storage

The following example creates a data storage preference using the FILE_DATASTORE. This tells the system that the files to be indexed are stored in the operating system. The example uses CTX_DDL.SET_ATTRIBUTE to set the PATH attribute of to the directory /docs.

MULTI_LEXER Example: Indexing a Multi-Language Table

You use the MULTI_LEXER preference type to index a column containing documents in different languages. For example, you can use this preference type when your text column stores documents in English, German, and French.

The first step is to create the multi-language table with a primary key, a text column, and a language column as follows:

Assume that the table holds mostly English documents, with some German and Japanese documents. To handle the three languages, you must create three sub-lexers, one for English, one for German, and one for Japanese:

Since the stored documents are mostly English, make the English lexer the default using CTX_DDL.ADD_SUB_LEXER:

ctx_ddl.add_sub_lexer('global_lexer','default','english_lexer');

Now add the German and Japanese lexers in their respective languages with CTX_DDL.ADD_SUB_LEXER procedure. Also assume that the language column is expressed in the standard ISO 639-2 language codes, so add those as alternate values.

Creating Section Groups for Section Searching

When documents have internal structure such as in HTML and XML, you can define document sections using embedded tags before you index. This enables you to query within the sections using the WITHIN operator. You define sections as part of a section group.

Example: Creating HTML Sections

The following code defines a section group called htmgroup of type HTML_SECTION_GROUP. It then creates a zone section in htmgroup called heading identified by the <H1> tag:

Using Stopwords and Stoplists

A stopword is a word that is not to be indexed. A stopword is usually a low information word such as this or that in English.

The system supplies a list of stopwords called a stoplist for every language. By default during indexing, the system uses the Oracle Text default stoplist for your language.

You can edit the default stoplist CTXSYS.DEFAULT_STOPLIST or create your own with the following PL/SQL procedures:

CTX_DDL.CREATE_STOPLIST

CTX_DDL.ADD_STOPWORD

CTX_DDL.REMOVE_STOPWORD

You specify your custom stoplists in the parameter clause of CREATE INDEX.

You can also dynamically add stopwords after indexing with the ALTER INDEX statement.

Multi-Language Stoplists

You can create multi-language stoplists to hold language-specific stopwords. A multi-language stoplist is useful when you use the MULTI_LEXER to index a table that contains documents in different languages, such as English, German, and Japanese.

To create a multi-language stoplist, use the CTX_DLL.CREATE_STOPLIST procedure and specify a stoplist type of MULTI_STOPLIST. You add language specific stopwords with CTX_DDL.ADD_STOPWORD.

Stopthemes and Stopclasses

In addition to defining your own stopwords, you can define stopthemes, which are themes that are not to be indexed. This feature is available for English only.

You can also specify that numbers are not to be indexed. A class of alphanumeric characters such a numbers that is not to be indexed is a stopclass.

You record your own stopwords, stopthemes, stopclasses by creating a single stoplist, to which you add the stopwords, stopthemes, and stopclasses. You specify the stoplist in the paramstring for CREATE INDEX.

PL/SQL Procedures for Managing Stoplists

You use the following procedures to manage stoplists, stopwords, stopthemes, and stopclasses:

Creating a CTXCAT Index

The CTXCAT indextype is well-suited for indexing small text fragments and related information. If created correctly, this type of index can give better structured query performance over a CONTEXT index.

CTXCAT Index and DML

A CTXCAT index is transactional. When you perform DML (inserts, updates, and deletes) on the base table, Oracle automatically synchronizes the index. Unlike a CONTEXT index, no CTX_DDL.SYNC_INDEX is necessary.

Note:

Applications that insert without invoking triggers such as SQL*Loader will not result in automatic index synchronization as described above.

About CTXCAT Sub-Indexes and Their Costs

A CTXCAT index is comprised of sub-indexes that you define as part of your index set. You create a sub-index on one or more columns to improve mixed query performance.

However, adding sub-indexes to the index set has its costs. The time Oracle takes to create a CTXCAT index depends on its total size, and the total size of a CTXCAT index is directly related to

total text to be indexed

number of sub-indexes in the index set

number of columns in the base table that make up the sub-indexes

Having many component indexes in your index set also degrades DML performance since more indexes must be updated.

Because of the added index time and disk space costs for creating a CTXCAT index, carefully consider the query performance benefit each component index gives your application before adding it to your index set.

Creating CTXCAT Sub-indexes

An online auction site that must store item descriptions, prices and bid-close dates for ordered look-up provides a good example for creating a CTXCAT index.

Like a combined b-tree index, the column order you specify with CTX_DDL.ADD_INDEX affects the efficiency and viability of the index scan Oracle uses to serve specific queries. For example, if two structured columns p and q have a b-tree index specified as 'p,q', Oracle cannot scan this index to sort 'order by q,p'.

Creating CTXCAT Index

The following example combines the examples above and creates the index set preference with the three sub-indexes:

Figure 2-2 shows how the sub-indexes A and B are created from the auction table. Each sub-index is a b-tree index on the text column and the named structured columns. For example, sub-index A is an index on the title column and the bid_close column.

Creating a CTXRULE Index

You use the CTXRULE index to build a document classification application. You create a table of queries and then index them. With a CTXRULE index, you can use the MATCHES operator to classify single documents.

Create a Table of Queries

The first step is to create a table of queries that define your classifications. We create a table myqueries to hold the category name and query text:

Example

The following code drops the preference my_lexer.

begin
ctx_ddl.drop_preference('my_lexer');
end;

Managing DML Operations for a CONTEXT Index

DML operations to the base table refer to when documents are inserted, updated or deleted from the base table. This section describes how you can monitor, synchronize, and optimize the Oracle Text CONTEXT index when DML operations occur.

Note:

CTXCAT indexes are transactional and thus updated immediately when there is an update to the base table. Manual synchronization as described in this section is not necessary for a CTXCAT index.

Viewing Pending DML

When documents in the base table are inserted, updated, or deleted, their ROWIDs are held in a DML queue until you synchronize the index. You can view this queue with the CTX_USER_PENDING view.

For example, to view pending DML on all your indexes, issue the following statement:

To understand index optimization, you must understand the structure of the index and what happens when it is synchronized.

CONTEXT Index Structure

The CONTEXT index is an inverted index where each word contains the list of documents that contain that word. For example, after a single initial indexing operation, the word DOG might have an entry as follows:

DOG DOC1 DOC3 DOC5

Index Fragmentation

When new documents are added to the base table, the index is synchronized by adding new rows. Thus if you add a new document (DOC 7) with the word dog to the base table and synchronize the index, you now have:

DOG DOC1 DOC3 DOC5
DOG DOC7

Subsequent DML will also create new rows:

DOG DOC1 DOC3 DOC5
DOG DOC7
DOG DOC9
DOG DOC11

Adding new documents and synchronizing the index causes index fragmentation. In particular, background DML which synchronizes the index frequently generally produces more fragmentation than synchronizing in batch.

Less frequent batch processing results in longer document lists, reducing the number of rows in the index and hence reducing fragmentation.

You can reduce index fragmentation by optimizing the index in either FULL or FAST mode with CTX_DDL.OPTIMIZE_INDEX.

Document Invalidation and Garbage Collection

When documents are removed from the base table, Oracle Text marks the document as removed but does not immediately alter the index.

Because the old information takes up space and can cause extra overhead at query time, you must remove the old information from the index by optimizing it in FULL mode. This is called garbage collection. Optimizing in FULL mode for garbage collection is necessary when you have frequent updates or deletes to the base table.

Single Token Optimization

In addition to optimizing the entire index, you can optimize single tokens. You can use token mode to optimize index tokens that are frequently searched, without spending time on optimizing tokens that are rarely referenced.

For example, you can specify that only the token DOG be optimized in the index, if you know that this token is updated and queried frequently.

An optimized token can improve query response time for the token.

To optimize an index in token mode, you can use CTX_DDL.OPTIMIZE_INDEX.

Viewing Index Fragmentation and Garbage Data

With the CTX_REPORT.INDEX_STATS procedure, you can create a statistical report on your index. The report includes information on optimal row fragmentation, list of most fragmented tokens, and the amount of garbage data in your index. Although this report might take long to run for large indexes, it can help you decide whether to optimize your index.