AsterixDB Support of Full-text search queries

Full-Text Search (FTS) queries are widely used in applications where users need to find records that satisfy an FTS predicate, i.e., where simple string-based matching is not sufficient. These queries are important when finding documents that contain a certain keyword is crucial. FTS queries are different from substring matching queries in that FTS queries find their query predicates as exact keywords in the given string, rather than treating a query predicate as a sequence of characters. For example, an FTS query that finds “rain” correctly returns a document when it contains “rain” as a word. However, a substring-matching query returns a document whenever it contains “rain” as a substring, for instance, a document with “brain” or “training” would be returned as well.

For example, we can execute the following query to find Chirp messages where the messageText field includes “voice” as a word. Please note that an FTS search is case-insensitive. Thus, “Voice” or “voice” will be evaluated as the same word.

The Expression1 is an expression that should be evaluable as a string at runtime as in the above example where msg.messageText is a string field. The Expression2 can be a string, an (un)ordered list of string value(s), or an expression. In the last case, the given expression should be evaluable into one of the first two types, i.e., into a string value or an (un)ordered list of string value(s).

The last FullTextOption parameter clarifies the given FTS request. If you omit the FullTextOption parameter, then the default value will be set for each possible option. Currently, we only have one option named mode. And as we extend the FTS feature, more options will be added. Please note that the format of FullTextOption is a record, thus you need to put the option(s) in a record {}. The mode option indicates whether the given FTS query is a conjunctive (AND) or disjunctive (OR) search request. This option can be either “all” (AND) or “any” (OR). The default value for mode is “all”. If one specifies “any”, a disjunctive search will be conducted. For example, the following query will find documents whose messageText field contains “sound” or “system”, so a document will be returned if it contains either “sound”, “system”, or both of the keywords.

The other option parameter,“all”, specifies a conjunctive search. The following examples will find the documents whose messageText field contains both “sound” and “system”. If a document contains only “sound” or “system” but not both, it will not be returned.

Currently AsterixDB doesn’t (yet) support phrase searches, so the following query will not work.

... where ftcontains(msg.messageText, "sound system", {"mode":"any"})

As a workaround solution, the following query can be used to achieve a roughly similar goal. The difference is that the following queries will find documents where msg.messageText contains both “sound” and “system”, but the order and adjacency of “sound” and “system” are not checked, unlike in a phrase search. As a result, the query below would also return documents with “sound system can be installed.”, “system sound is perfect.”, or “sound is not clear. You may need to install a new system.”

When there is a full-text index on the field that is being searched, rather than scanning all records, AsterixDB can utilize that index to expedite the execution of a FTS query. To create a full-text index, you need to specify the index type as fulltext in your DDL statement. For instance, the following DDL statement create a full-text index on the GleambookMessages.message attribute. Note that a full-text index cannot be built on a dataset with the variable-length primary key (e.g., string).

Apache AsterixDB, AsterixDB, Apache, the Apache
feather logo, and the Apache AsterixDB project logo are either
registered trademarks or trademarks of The Apache Software
Foundation in the United States and other countries.
All other marks mentioned may be trademarks or registered
trademarks of their respective owners.