AsterixDB Support of Full-text search queries

Motivation [Back to TOC]

Full-Text Search (FTS) queries are widely used in applications where users need to find records that satisfy an FTS predicate, i.e., where simple string-based matching is not sufficient. These queries are important when finding documents that contain a certain keyword is crucial. FTS queries are different from substring matching queries in that FTS queries find their query predicates as exact keywords in the given string, rather than treating a query predicate as a sequence of characters. For example, an FTS query that finds “rain” correctly returns a document when it contains “rain” as a word. However, a substring-matching query returns a document whenever it contains “rain” as a substring, for instance, a document with “brain” or “training” would be returned as well.

Syntax [Back to TOC]

The syntax of AsterixDB FTS follows a portion of the XQuery FullText Search syntax. Two basic forms are as follows:

    ftcontains(Expression1, Expression2, {FullTextOption})
    ftcontains(Expression1, Expression2)

For example, we can execute the following query to find Chirp messages where the messageText field includes “voice” as a word. Please note that an FTS search is case-insensitive. Thus, “Voice” or “voice” will be evaluated as the same word.

    use TinySocial;

    select element {"chirpId": msg.chirpId}
    from ChirpMessages msg
    where ftcontains(msg.messageText, "voice", {"mode":"any"});

The DDL and DML of TinySocial can be found in ADM: Modeling Semistructured Data in AsterixDB.

The Expression1 is an expression that should be evaluable as a string at runtime as in the above example where msg.messageText is a string field. The Expression2 can be a string, an (un)ordered list of string value(s), or an expression. In the last case, the given expression should be evaluable into one of the first two types, i.e., into a string value or an (un)ordered list of string value(s).

The following examples are all valid expressions.

   ... where ftcontains(msg.messageText, "sound")
   ... where ftcontains(msg.messageText, "sound", {"mode":"any"})
   ... where ftcontains(msg.messageText, ["sound", "system"], {"mode":"any"})
   ... where ftcontains(msg.messageText, {{"speed", "stand", "customization"}}, {"mode":"all"})

The last FullTextOption parameter clarifies the given FTS request. If you omit the FullTextOption parameter, then the default value will be set for each possible option. Currently, we only have one option named mode. And as we extend the FTS feature, more options will be added. Please note that the format of FullTextOption is a record, thus you need to put the option(s) in a record {}. The mode option indicates whether the given FTS query is a conjunctive (AND) or disjunctive (OR) search request. This option can be either “all” (AND) or “any” (OR). The default value for mode is “all”. If one specifies “any”, a disjunctive search will be conducted. For example, the following query will find documents whose messageText field contains “sound” or “system”, so a document will be returned if it contains either “sound”, “system”, or both of the keywords.

   ... where ftcontains(msg.messageText, ["sound", "system"], {"mode":"any"})

The other option parameter,“all”, specifies a conjunctive search. The following examples will find the documents whose messageText field contains both “sound” and “system”. If a document contains only “sound” or “system” but not both, it will not be returned.

   ... where ftcontains(msg.messageText, ["sound", "system"], {"mode":"all"})
   ... where ftcontains(msg.messageText, ["sound", "system"])

Currently AsterixDB doesn’t (yet) support phrase searches, so the following query will not work.

   ... where ftcontains(msg.messageText, "sound system", {"mode":"any"})

As a workaround solution, the following query can be used to achieve a roughly similar goal. The difference is that the following queries will find documents where msg.messageText contains both “sound” and “system”, but the order and adjacency of “sound” and “system” are not checked, unlike in a phrase search. As a result, the query below would also return documents with “sound system can be installed.”, “system sound is perfect.”, or “sound is not clear. You may need to install a new system.”

   ... where ftcontains(msg.messageText, ["sound", "system"], {"mode":"all"})
   ... where ftcontains(msg.messageText, ["sound", "system"])

Creating and utilizing a Full-text index [Back to TOC]

When there is a full-text index on the field that is being searched, rather than scanning all records, AsterixDB can utilize that index to expedite the execution of a FTS query. To create a full-text index, you need to specify the index type as fulltext in your DDL statement. For instance, the following DDL statement create a full-text index on the GleambookMessages.message attribute. Note that a full-text index cannot be built on a dataset with the variable-length primary key (e.g., string).

use TinySocial;

create index messageFTSIdx on GleambookMessages(message) type fulltext;