How to Use Regular Expressions in Xojo (REALbasic)

Xojo, formerly known as REALbasic, includes a built-in RegEx class. Internally, this class is based on the open source PCRE library. What this means to you as a Xojo developer is that the RegEx class provides you with a rich flavor of Perl-compatible regular expressions. The regular expressions tutorial on this website does not explicitly mention Xojo. Everything said in the tutorial about PCRE's regex flavor also applies to Xojo. The only exceptions are the case insensitive and "multi-line" matching modes. In PCRE, they're off by default, while in Xojo they're on by default.

Xojo uses the UTF-8 version of PCRE. This means that if you want to process non-ASCII data that you've retrieved from a file or the network, you'll need to use Xojo's TextConverter class to convert your strings into UTF-8 before passing them to the RegEx object. You'll also need to use the TextConverter to convert the strings returned by the RegEx class from UTF-8 back into the encoding your application is working with.
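
The round trip described above might look like the following sketch. It assumes the incoming data is ISO Latin-1, which is purely an illustrative choice; substitute whatever encoding your data actually uses. The function name FirstTokenFromLatin1 and its parameter are hypothetical.

```xojo
Function FirstTokenFromLatin1(rawData As String) As String
  // Assumption: rawData holds ISO Latin-1 bytes read from a file or socket.
  Dim toUTF8 As TextConverter = GetTextConverter(Encodings.ISOLatin1, Encodings.UTF8)
  Dim fromUTF8 As TextConverter = GetTextConverter(Encodings.UTF8, Encodings.ISOLatin1)
  If toUTF8 = Nil Or fromUTF8 = Nil Then Return ""

  // Convert to UTF-8 before handing the string to the RegEx object.
  Dim subject As String = toUTF8.Convert(rawData)

  Dim re As New RegEx
  re.SearchPattern = "\S+"
  Dim match As RegExMatch = re.Search(subject)
  If match = Nil Then Return ""

  // Convert the UTF-8 result back to the encoding the application works with.
  Return fromUTF8.Convert(match.SubExpressionString(0))
End Function
```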

The RegEx Class

To use a regular expression, you need to create a new instance of the RegEx class. Assign your regular expression to the SearchPattern property. You can set various options in the Options property, which is an instance of the RegExOptions class.

To check if a regular expression matches a particular string, call the Search method of the RegEx object, and pass the subject string as a parameter. This method returns an instance of the RegExMatch class if a match is found, or Nil if no match is found. To find the second match in the same subject string, call the Search method again, without any parameters. Do not pass the subject string again, since doing so restarts the search from the beginning of the string. Keep calling Search without any parameters until it returns Nil to iterate over all regular expression matches in the string.
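
A minimal sketch of this loop follows. The pattern and subject string are made up for illustration, and System.DebugLog merely stands in for whatever you want to do with each match.

```xojo
Dim re As New RegEx
re.SearchPattern = "\d+"

// Pass the subject string on the first call only.
Dim match As RegExMatch = re.Search("10 apples, 20 pears, 30 plums")
While match <> Nil
  // SubExpressionString(0) is the text of the overall match.
  System.DebugLog(match.SubExpressionString(0))
  // No parameters: continue searching after the previous match.
  match = re.Search()
Wend
```

Passing the subject string again inside the loop would find the first match over and over, so the loop would never end.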

The RegExMatch Class

When the RegEx.Search method finds a match, it stores the match's details in a RegExMatch object. This object has three properties. The SubExpressionCount property returns the number of capturing groups in the regular expression plus one. E.g. it returns 3 for the regex (1)(2). The SubExpressionString property returns the substring matched by the regular expression or by a capturing group. SubExpressionString(0) returns the whole regex match, while SubExpressionString(1) through SubExpressionString(SubExpressionCount-1) return the matches of the capturing groups. SubExpressionStartB returns the byte offset of the start of the match of the whole regex or of one of the capturing groups, depending on the numeric index you pass as a parameter to the property.
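
As a sketch, the three properties can be walked through like this. The pattern and subject string are illustrative only.

```xojo
Dim re As New RegEx
re.SearchPattern = "(\w+)@(\w+)"

Dim match As RegExMatch = re.Search("mail joe@example now")
If match <> Nil Then
  // Two capturing groups, so SubExpressionCount is 3 and the valid indexes are 0 to 2.
  For i As Integer = 0 To match.SubExpressionCount - 1
    System.DebugLog(Str(i) + ": " + match.SubExpressionString(i) + _
      " starts at byte " + Str(match.SubExpressionStartB(i)))
  Next
End If
```

Keep in mind that SubExpressionStartB counts bytes rather than characters, so the offsets differ from character positions as soon as the subject contains non-ASCII text encoded as UTF-8.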

The RegExOptions Class

The RegExOptions class has nine properties to set various options for your regular expression.

Set CaseSensitive (False by default) to True to treat uppercase and lowercase letters as different characters. This option is the inverse of "case insensitive mode" or /i in other programming languages.

Set DotMatchAll (False by default) to True to make the dot match all characters, including line break characters. This option is the equivalent of "single line mode" or /s in other programming languages.

Set Greedy (True by default) to False if you want quantifiers to be lazy, effectively making .* the same as .*?. I strongly recommend against setting Greedy to False. Simply use the .*? syntax instead. This way, somebody reading your source code will clearly see when you're using greedy quantifiers and when you're using lazy quantifiers when they look only at the regular expression.
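
To illustrate the recommendation, this sketch leaves Greedy at its default and expresses laziness in the pattern itself. The subject string is made up.

```xojo
Dim subject As String = "<b>bold</b>"
Dim re As New RegEx

// Greedy quantifier: matches as much as possible, here the whole string.
re.SearchPattern = "<.+>"
Dim greedyMatch As RegExMatch = re.Search(subject)

// Lazy quantifier, visible in the regex itself: stops at the first ">", matching only "<b>".
re.SearchPattern = "<.+?>"
Dim lazyMatch As RegExMatch = re.Search(subject)

If greedyMatch <> Nil And lazyMatch <> Nil Then
  System.DebugLog("greedy: " + greedyMatch.SubExpressionString(0))
  System.DebugLog("lazy: " + lazyMatch.SubExpressionString(0))
End If
```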

The LineEndType option is the only one that takes an Integer instead of a Boolean. This option affects which character the caret and dollar treat as the "end of line" character. The default is 0, which accepts both \r and \n as end-of-line characters. Set it to 1 to auto-detect the host platform, which uses \n when your application runs on Windows or Linux, and \r when it runs on a Mac. Set it to 2 for Mac (\r), 3 for Windows (\n) and 4 for UNIX (\n). I recommend you leave this option at zero, which is most likely to give you the results you intended. This option is actually a modification to the PCRE library made in Xojo. PCRE supports only option 4, which often confuses Windows developers since it causes test$ to fail against test\r\n, as Windows uses \r\n for line breaks.

Set MatchEmpty (True by default) to False if you want to skip zero-length matches.

Set ReplaceAllMatches (False by default) to True if you want the RegEx.Replace method to search-and-replace all regex matches in the subject string rather than just the first one.

Set StringBeginIsLineBegin (True by default) to False if you don't want the start of the string to be considered the start of the line. This can be useful if you're processing a large chunk of data as several separate strings, where only the first string should be considered as starting the (conceptual) overall string.

Similarly, set StringEndIsLineEnd (True by default) to False if the string you're passing to the Search method isn't really the end of the whole chunk of data you're processing.

Set TreatTargetAsOneLine (False by default) to True to make the caret and dollar match at the start and the end of the string only. By default, they will also match after and before embedded line breaks. This option is the inverse of the "multi-line mode" or /m in other programming languages.
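
Options are set through the Options property of the RegEx object before searching. The following sketch switches to the defaults you may be used to from PCRE, namely case sensitive matching with the caret and dollar anchored to the string as a whole. The pattern and subject are illustrative.

```xojo
Dim re As New RegEx
re.SearchPattern = "^error$"
re.Options.CaseSensitive = True          // like turning /i off
re.Options.TreatTargetAsOneLine = True   // like turning /m off

// Two lines separated by a line feed.
Dim subject As String = "info" + Chr(10) + "error"

// With TreatTargetAsOneLine set, ^ can match only at the very start of the
// string, so no match is expected here. With the Xojo defaults, the second
// line would be expected to match.
If re.Search(subject) = Nil Then
  System.DebugLog("no match")
End If
```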

Searching and Replacing

In addition to finding regex matches in a string, you can replace the matches with another string. To do so, set the ReplacementPattern property of your RegEx object, and then call the Replace method. Pass the source string as a parameter to the Replace method. The method will return a copy of the string with the replacement(s) applied. The RegEx.Options.ReplaceAllMatches property determines if only the first regex match or if all regex matches will be replaced.

In the ReplacementPattern string, you can use $&, $0 or \0 to insert the whole regular expression match into the replacement. Use $1 or \1 for the match of the first capturing group, $2 or \2 for the second, etc.
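
A short sketch of a search-and-replace that reorders capturing groups. The date format and subject string are made up for the example.

```xojo
Dim re As New RegEx
re.SearchPattern = "(\d{4})-(\d{2})-(\d{2})"
// $1, $2 and $3 refer to the year, month and day groups.
re.ReplacementPattern = "$3/$2/$1"
// Without this, only the first date would be replaced.
re.Options.ReplaceAllMatches = True

Dim result As String = re.Replace("2024-01-31 and 2025-12-25")
System.DebugLog(result)
```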

If you want more control over how the replacements are made, you can iterate over the regex matches by calling Search repeatedly, as described in the section on the RegEx class, and call the RegExMatch.Replace method for each match. This method is a bit of a misnomer, since it doesn't actually replace anything. Rather, it returns the RegEx.ReplacementPattern string with all references to the match and capturing groups substituted. You can use this result to make the replacements on your own. This method is also useful if you want to collect a combination of capturing groups for each regex match.
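
As a sketch of the last use case, the loop below collects one formatted string per match instead of modifying the subject. The pattern, replacement text and the pairs array are hypothetical names chosen for illustration.

```xojo
Dim re As New RegEx
re.SearchPattern = "(\w+)=(\w+)"
re.ReplacementPattern = "$2 <- $1"

Dim pairs() As String
Dim match As RegExMatch = re.Search("a=1, b=2")
While match <> Nil
  // Returns the ReplacementPattern with $1 and $2 filled in for this match.
  // The subject string itself is left untouched.
  pairs.Append(match.Replace())
  match = re.Search()
Wend
```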
