Leveraging SQLite Full-Text Search on iOS


Full-text search (FTS) is an SQLite extension that comes in two modules, FTS3/4 and FTS5, which can be enabled by passing in the requisite compile-time flags. FTS1 and FTS2 are deprecated and have known issues, and the SQLite developers do not recommend using them.

The FTS3/4 and FTS5 modules allow the creation of a virtual table that can contain large amounts of text and allow users to perform quick and complex searches across the entire dataset. In our SDK, we leverage this to build our indexed full-text search component.

Using the SQLite version bundled with iOS is pretty simple: You just add libsqlite3.dylib to the appropriate target in your app and import sqlite3/sqlite3.h wherever you require these APIs.

Creating and Querying an FTS Table via the C API

// First, we open the database connection.
sqlite3 *db = NULL;
int ret = sqlite3_open([NSTemporaryDirectory() stringByAppendingPathComponent:@"test.db"].UTF8String, &db);
NSAssert(ret == SQLITE_OK, @"Failed to create database connection");

// We then create the new table.
ret = sqlite3_exec(db, "CREATE VIRTUAL TABLE IF NOT EXISTS ftsTable USING fts5(stringId, text, tokenize='unicode61')", NULL, NULL, NULL);
NSAssert(ret == SQLITE_OK, @"Failed to create table");

// Next we populate the table with some data.
sqlite3_stmt *statement = NULL;
const char *sql = "INSERT OR REPLACE INTO ftsTable (stringId, text) VALUES (?, ?)";
ret = sqlite3_prepare_v2(db, sql, (int)strlen(sql), &statement, NULL);
NSAssert(ret == SQLITE_OK, @"Failed to create prepared statement");

// Text length is hardcoded here, but do not do this in production!
sqlite3_bind_text(statement, 1, "uniqueid_1", 10, NULL);
sqlite3_bind_text(statement, 2, "Some great text", 15, NULL);
sqlite3_step(statement);
sqlite3_reset(statement);

sqlite3_bind_text(statement, 1, "uniqueid_2", 10, NULL);
sqlite3_bind_text(statement, 2, "PSPDFKit is a fantastic framework", 33, NULL);
sqlite3_step(statement);

The FTS query to the right of the MATCH operator is pretty powerful, and it should be able to accommodate all your needs. The SQLite website documents this well. Note that it isn’t strictly necessary to use MATCH when querying an FTS table. A LIKE expression works as well but is much slower.

Customizability of FTS

FTS3 and FTS4 are nearly identical. They share most of their code with each other, and their interfaces are the same. For the purposes of this article, we’re only considering FTS4 and FTS5.

When inserting data into an FTS table, SQLite runs the text through an algorithm called the “tokenizer.” Unless a specific tokenizer is specified as part of the CREATE VIRTUAL TABLE statement used to create the FTS table, FTS3/4 uses the default tokenizer “simple.” This tokenizer lowercases ASCII letters and splits the input into terms at every ASCII character that is not alphanumeric, such as spaces and punctuation. Code points outside of ASCII (128 and higher) are not case-folded and are never treated as separators.

This is why the examples in this article use the newer unicode61 tokenizer, which accepts most Unicode characters and uses Unicode’s case-folding algorithm, while also recognizing more than just the ASCII space character for separating terms. SQLite also offers ICU tokenizers (if enabled at compile time), which can work with specific languages.

The porter tokenizer uses the same rules as the simple tokenizer to separate the input text into terms, including folding all terms into lowercase, but it also uses the Porter stemming algorithm to reduce related English-language words to a common root. For example, given the input document “Right now, they’re very frustrated”, the porter tokenizer extracts the following tokens: “right now thei veri frustrat”. Even though some of these terms are not even English words, in some cases using them to build the full-text index is more useful than the more intelligible output produced by the simple tokenizer. Using the porter tokenizer, the document not only matches full-text queries such as “MATCH 'Frustrated'”, but also queries such as “MATCH 'Frustration'”, as the term “Frustration” is reduced by the Porter stemming algorithm to “frustrat”, just as “Frustrated” is. So, when using the porter tokenizer, FTS is able to find not just exact matches for queried terms, but also matches against similar English-language terms.

At PSPDFKit, we use SQLite’s custom tokenizer API to create a Porter tokenizer that also works well with CJK characters, as they are not something the bundled tokenizers can handle.

Differences between FTS3/4 and FTS5

In earlier versions of FTS5 (like the one bundled with iOS 11), it was not possible to correctly filter a query’s results to a particular column, but that has since been rectified.

Overall, the FTS5 module has much better performance than its predecessors. However, it is missing some functionality when compared to FTS4, in that the auxiliary functions to extract information about matches aren’t as fully featured. Functions like matchinfo() and offsets() do not exist, and the snippet() function is not as useful as in FTS4.

Since we relied on FTS4’s offsets() function in our framework, we turned to FTS5’s auxiliary function API to implement it ourselves. Here’s a quick sample of how to use the API:

void offsets(const Fts5ExtensionApi *api, Fts5Context *ftsContext, sqlite3_context *sqlite3Context, int numberOfArguments, sqlite3_value **arguments) {
    // Implement your function here.
}

/// From https://sqlite.org/fts5.html
fts5_api *fts5APIFromDB(sqlite3 *db) {
    fts5_api *pRet = 0;
    sqlite3_stmt *pStmt = 0;

    // Disable the sizeof(pRet) call in clang analyzer because we actually want the size of a pointer.
#ifndef __clang_analyzer__
    if (SQLITE_OK == sqlite3_prepare_v2(db, "SELECT fts5()", -1, &pStmt, 0)
        && SQLITE_ROW == sqlite3_step(pStmt)
        && sizeof(pRet) == sqlite3_column_bytes(pStmt, 0)) {
        memcpy(&pRet, sqlite3_column_blob(pStmt, 0), sizeof(pRet));
    }
#endif
    sqlite3_finalize(pStmt);
    return pRet;
}

int addFTS5OffsetsFunction(sqlite3 *db) {
    fts5_api *api = fts5APIFromDB(db);
    int rc = SQLITE_FAIL;
    if (!api) {
        // Handle failure.
        return rc;
    }

    // This creates an auxiliary function called pdfc_offsets which is callable within an FTS5 query.
    // When called, it calls the `offsets` function defined by us.
    rc = api->xCreateFunction(api, "pdfc_offsets", NULL, offsets, NULL);
    return rc;
}

Checking If FTS Is Enabled

PSPDFKit for iOS supports using a customer-specified SQLite version, which we achieve by not hardlinking to SQLite and instead finding the symbols at runtime. This makes it important to know which version of FTS the customer’s SQLite supports. The best way to check this is to use the sqlite3_compileoption_used() function.

However, if you’re working with an SQLite version that supports loadable modules and FTS is added to a connection at runtime, checking the compile options is not enough, because they only describe what was built into the library. In this case, attempt to create a temporary FTS table and check whether that succeeds.

Limitations

A major limitation when using FTS tables with the Porter tokenizer is that you lose the ability to match words or phrases exactly. You can create a workaround by using SQL’s LIKE predicate in a query, but this is significantly slower than using FTS’s MATCH. Another alternative would be to create a separate table that uses a non-Porter tokenizer. This is just another tradeoff between size and speed that you have to make based on what sort of data is being stored in the database.

Another issue arises when debugging custom tokenizers: You can’t simply play around with them from a console, since they need to be loaded into SQLite via its programmatic API.

From iOS 11 onward, the watchdog is very proactive in killing apps that peg the CPU at high usage for extended periods of time. Since the amount of CPU time afforded to an app varies based on factors like the device, battery level, temperature, and so on, there’s no simple point at which long-running CPU-intensive work can be paused.

For this reason, PSPDFLibrary has a Boolean setting for pausing long-running work. When it is YES, PSPDFLibrary automatically pauses indexing if it has been running for a certain amount of time, and it resumes after about 30 seconds. If you expect your indexing operations to be long-running, you should consider implementing something similar in your own app.
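As a rough illustration of the idea, and not of how PSPDFLibrary implements it, the sketch below tracks run and pause phases so that an indexer can ask between batches whether it may continue. All names are hypothetical, and the caller is assumed to supply a monotonic timestamp in seconds.

```c
#include <stdbool.h>

// Hypothetical throttle state for a long-running indexing job.
typedef struct {
    double maxRunSeconds; // How long indexing may run before it has to pause.
    double pauseSeconds;  // How long each pause lasts.
    double phaseStart;    // Timestamp at which the current phase began.
    bool paused;          // Whether the job is currently in a pause phase.
} IndexingThrottle;

// Creates a throttle that starts in the running phase at time `now`.
IndexingThrottle throttleMake(double maxRunSeconds, double pauseSeconds, double now) {
    IndexingThrottle throttle = { maxRunSeconds, pauseSeconds, now, false };
    return throttle;
}

// Call between batches. Returns true if the next batch may run, or false
// if the job should stay idle and ask again later.
bool throttleMayContinue(IndexingThrottle *throttle, double now) {
    double elapsed = now - throttle->phaseStart;
    if (!throttle->paused && elapsed >= throttle->maxRunSeconds) {
        // The run budget is used up, so a pause phase begins.
        throttle->paused = true;
        throttle->phaseStart = now;
    } else if (throttle->paused && elapsed >= throttle->pauseSeconds) {
        // The pause is over, so a new run phase begins.
        throttle->paused = false;
        throttle->phaseStart = now;
    }
    return !throttle->paused;
}
```

A throttle created with throttleMake(10.0, 30.0, now) would allow ten seconds of work followed by a 30-second pause. The ten-second budget is an arbitrary value chosen for the example and would need tuning per device.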