James Montgomerie’s World Wide Web Log
Using the Mac or iPhone's Built In Regex Routines
jamie, 2010-01-30T09:29:33+00:00 (updated 2013-06-15T21:25:01+01:00)
<p class="tagline">In which a convenient method of using POSIX regular expressions from Objective-C is presented.</p>
<p>It&#8217;s a common complaint that the Mac and iPhone platforms don&#8217;t have native support for regular expressions, but that&#8217;s not entirely true. If you drop down to the UNIX core, there&#8217;s an implementation of the old (and only partially busted) POSIX regular expression interfaces. Here, I&#8217;ll show a simple Objective-C wrapper class for them that lets you use them conveniently in Mac or iPhone apps.</p>
<p>Before I start, some preemptive remarks: There&#8217;s a <em>lot</em> wrong with POSIX regexes to modern eyes. Firstly, and most glaringly, they work on <em>byte streams</em>, and know nothing of characters beyond ASCII. That means that if you&#8217;re not careful about your string encoding, and what your regexes specify, you might end up mangling your strings pretty badly. If you don&#8217;t understand what UTF-8 is beyond that it&#8217;s a &#8220;text encoding&#8221; go and <a href="http://en.wikipedia.org/wiki/UTF-8#Description">read up on how it works, and how characters in it relate to bytes</a> before you use these routines. If you&#8217;re aware of this, these are safe to use - just bear in mind that, for example, <samp>.{1,4}</samp> will match 1-4 <em>bytes</em>, not <em>characters</em>. Secondly, they&#8217;re slow. Like, really slow. Even using precompiled regexes, they were around 500 times slower than Perl&#8217;s regex routines the last time I benchmarked. Some of this can be said to be the &#8216;fault&#8217; of the POSIX standard, which specifies a more expressive regex concept than Perl&#8217;s, but a lot of it is just that the implementation Apple uses is crufty and old. There are <a href="http://laurikari.net/tre/">better implementations out there</a>. Anyway, suffice to say, you don&#8217;t want to be doing tons of text processing with these routines in performance sensitive code. Thirdly, their syntax is not the Perl-compatible syntax most of us are familiar with - you&#8217;ll need to think a bit differently to use them. In day-to-day use, this mostly just means that the character classes are specified differently (e.g. <samp>[[:digit:]]</samp> instead of <samp>\d</samp>). See <samp><a href="http://developer.apple.com/mac/library/documentation/Darwin/Reference/ManPages/man7/re_format.7.html">man re_format</a></samp> for more details.</p>
<p>Having said all that, why would you want to use these? Well, if you&#8217;re aware of the issues above, and all you want to do is use a couple of convenient regexes in your code - parse a couple of HTTP headers, or match a few strings now and again, say - these routines work just fine. Including the couple of ObjC files I present here is a lot lighter weight in code-size and complexity terms than including a whole regex library. I use it all over the place in <a href="http://eucalyptusapp.com/">Eucalyptus</a>, and it does a sterling job.</p>
<p>So, here&#8217;s the interface:</p>
<pre><code>@interface THRegex : NSObject
- (id)initWithPOSIXRegex:(NSString *)regexString;
- (id)initWithPOSIXRegex:(NSString *)regexString flags:(int)flags;
+ (id)regexWithPOSIXRegex:(NSString *)regexString;
+ (id)regexWithPOSIXRegex:(NSString *)regexString flags:(int)flags;
- (BOOL)matchString:(NSString *)string;
- (NSString *)match:(NSInteger)index;
@end
@interface NSString (THRegex)
- (THRegex *)matchPOSIXRegex:(NSString *)regexString;
- (THRegex *)matchPOSIXRegex:(NSString *)regexString flags:(int)flags;
- (NSString *)stringByEscapingPOSIXRegexCharacters;
@end
</code></pre>
<p>It&#8217;s pretty easy to use. At its simplest, you can just do things like:</p>
<pre><code>if([@"myString" matchPOSIXRegex:@"^.*Str.*$"]) {
    NSLog(@"Match!");
}
</code></pre>
<p>You can also pull out parenthesized matches (watch out - they start at 1, match 0 is the entire regex match):</p>
<pre><code>NSString *contentRange = [myHTTPHeaders objectForKey:@"content-range"];
THRegex *contentStartRegex = [contentRange matchPOSIXRegex:@"^[[:space:]]*bytes[[:space:]]+([[:digit:]]+)" flags:REG_EXTENDED|REG_ICASE];
if(contentStartRegex) {
    NSString *contentsStartString = [contentStartRegex match:1];
    NSLog(@"Received Content-Range: %@", contentsStartString);
}
</code></pre>
<p>If you&#8217;re going to be doing a few matches at once against the same regex, you can also construct a <samp>THRegex</samp> object and use it directly, instead of the <samp>NSString</samp> category (contrived example follows - this is not the best way to get a list of filenames matching a pattern):</p>
<pre><code>NSMutableArray *mySpreadsheetFiles = [NSMutableArray array];
THRegex *matchSpreadsheets = [[THRegex alloc] initWithPOSIXRegex:@".*\\.xls"];
for(NSString *filename in [[NSFileManager defaultManager] directoryContentsAtPath:myDocumentsPath]) {
    if([matchSpreadsheets matchString:filename]) {
        [mySpreadsheetFiles addObject:filename];
    }
}
[matchSpreadsheets release];
NSLog(@".xls files: %@", mySpreadsheetFiles);
</code></pre>
<p>And that&#8217;s about it. One final implementation detail: because the POSIX routines are so slow at compiling regular expressions, in my routines, the first time you use a regex, it&#8217;s compiled and cached so that it&#8217;ll be available with no delay the next time you use it. This makes sense if all the regexes you use are hard-coded - it&#8217;s only going to increase your memory usage by a small, constant factor. If you don&#8217;t want this behaviour - say, because you&#8217;re generating regexes on the fly, or because users can enter regexes in your UI (I <em>really</em> don&#8217;t recommend this by the way - it&#8217;s a certainty that your users will not know all the POSIX regex caveats I mentioned at the top of this post, and it&#8217;s unlikely they even know POSIX regex syntax), you can turn the cache off with a <samp>THREGEX_DONT_CACHE</samp> compile-time define (e.g. by specifying <samp>THREGEX_DONT_CACHE</samp> in the &#8220;Preprocessor Macros&#8221; setting for your target in Xcode).</p>
<p>To find out more about POSIX regular expression syntax, see <samp><a href="http://developer.apple.com/mac/library/documentation/Darwin/Reference/ManPages/man7/re_format.7.html">man re_format</a></samp>. <samp><a href="http://developer.apple.com/mac/library/documentation/Darwin/Reference/ManPages/man3/regex.3.html">man regex</a></samp> covers the flags (<samp>REG_EXTENDED</samp>, <samp>REG_ICASE</samp> etc.) that are also used here when constructing regex objects.</p>
<p>Here&#8217;s the code - it&#8217;s released under a simple BSD-style licence:</p>
<p><center><a href="http://www.montgomerie.net/downloads/THRegex.zip"><img alt="downloadZip.png" src="http://blog.montgomerie.net/user/files/downloadZip.png"></a></center></p>
PWD in the Title Bar (or, “A Regex Adventure in BASH”)
jamie, 2009-05-19T15:49:19+01:00, 2013-06-15T21:10:20+01:00, 2008-05-14T13:00:00+01:00
<p><img class="framedImage" width="505" alt="PWDInTheTitleBar.png" src="http://blog.montgomerie.net/user/files/PWDInTheTitleBar.png"></p>
<p class="tagline">In which a five-minute hack to put the current working directory in the title bar of the Terminal descends into hours learning the surprising arcanities of BASH.</p>
<p>Like most developers, I spend a lot of time at my computer living in Terminal (and the rest Xcode)*. I have what might be a strange tendency to open a new Terminal window for each new task, I think so that I don’t have to lose the context in my current window. This means that after I’ve finished doing something (or in the middle, when I look at the screen and my obsessive-compulsive side grabs me and says “Wow, that’s messy. I need to tidy up my desktop before I can do more work”) I often find myself with a vast number of Terminals open, with the majority sitting minimised in the Dock, and no idea which ones are actually still useful.</p>
<p>In the past, I found a cool script snippet that I added to my BASH <samp>.profile</samp> that would put the current working directory in the title bar of the Terminal. This is a huge help. It means that I can quickly choose the ‘right’ window for whatever I’m doing just by looking at the title bar, or by selecting it from Terminal’s <em>Window</em> menu, and when I’m on one of my cleaning frenzies I don’t have to go through the windows trying to work out what I’m using them for; I can just mouse over the minimised window, see the title and (almost always) close it with the context menu.</p>
<p>I didn’t have that snippet any more though (the hardware it’s possibly even still on belonged to Apple, not me, and I didn’t think about retrieving it in time), but I was sure that Google would provide the answer to me.</p>
<p>And it did, sort of. There are <a href="http://www.google.co.uk/search?q=bash%20prompt%20title%20bar">lots of pages</a> on the web about configuring your BASH prompt, and even a full ‘<a href="http://tldp.org/HOWTO/Bash-Prompt-HOWTO/index.html">HOWTO</a>’. None of them gave a cut-and-paste example of exactly what I wanted though. <a href="http://tldp.org/HOWTO/Bash-Prompt-HOWTO/x783.html">This section</a> of the HOWTO came pretty close, but didn’t quite do it. After reading it though, my demands increased. Seeing what others had done, it seemed like it would be pretty cool if, besides having the directory name in the title bar, it was also formatted nicely. I now wanted it to be truncated to size (with ellipses, not just any old truncation), and to replace my home directory with “~/”. The snippet given there almost does this, but it smelled a bit off to me. It uses <samp>grep</samp>, <samp>awk</samp>, <samp>wc</samp>, <samp>tr</samp>, <samp>sed</samp> - a veritable cornucopia of shell commands - to do its thing. Although I’m all for the UNIX philosophy of small tools doing what they do well, it doesn’t seem like so many external processes should be required just for a bit of string manipulation (from the page: “Relative speed: the first version takes about 0.45 seconds on an unloaded 486SX25. Risto’s version takes about 0.80 to 0.95 seconds.” - of course, on my modern machine they were instant, but it’s the principle&#8230;)</p>
<p>So, I did some research and found something that pleasantly surprised me. Did you know that BASH actually now supports in-process regexes? Maybe I’m late to the party, but I didn’t. You can (in the form here, since version 3.2) do this:</p>
<p><center><samp>[[ string =~ regex ]]</samp></center></p>
<p>And the regex will be executed for you, a <samp>0</samp> (which, in shell land, indicates success) returned if the string matches, a <samp>1</samp> if it doesn’t, and a <samp>2</samp> if you screwed up and gave it an invalid regex. Matches for any parenthesised parts of the regex are placed in an array called <samp>BASH_REMATCH</samp> (item <samp>0</samp> is the entire match, items <samp>1</samp>+ are the parenthesised expressions). It’s all even there in the BASH man page (look under “Compound Commands”). The regexes are not in the Perl format most people expect nowadays, they’re in POSIX format, as described by <samp><a href="http://developer.apple.com/documentation/Darwin/Reference/ManPages/man7/re_format.7.html">man re_format</a></samp>, but that’s not hard to pick up.</p>
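<p>As a minimal sketch of the three return values and of <samp>BASH_REMATCH</samp>, here is a small script that pulls the start and end out of a <samp>Content-Range</samp> style value. The header text and all of the variable names are made up for illustration. The regex is kept in a variable and expanded unquoted, which is a common precaution because BASH treats a quoted right-hand side as a literal string rather than a regex.</p>

```shell
#!/usr/bin/env bash
# Hypothetical header value, for illustration only.
header="bytes 500-999/1234"

# POSIX extended regex; stored in a variable so it can be expanded unquoted.
range_regex='^[[:space:]]*bytes[[:space:]]+([[:digit:]]+)-([[:digit:]]+)'

# A match returns 0 and fills in BASH_REMATCH.
[[ "$header" =~ $range_regex ]]
match_status=$?
whole_match=${BASH_REMATCH[0]}   # the entire matched text
range_start=${BASH_REMATCH[1]}   # first parenthesised expression
range_end=${BASH_REMATCH[2]}     # second parenthesised expression

# A string that doesn't match returns 1.
[[ "no digits here" =~ $range_regex ]]
nomatch_status=$?

# An invalid regex (here, an unclosed parenthesis) returns 2.
bad_regex='('
[[ "anything" =~ $bad_regex ]] 2>/dev/null
invalid_status=$?

printf 'start=%s end=%s statuses=%s,%s,%s\n' \
    "$range_start" "$range_end" "$match_status" "$nomatch_status" "$invalid_status"
```

<p>In real code you would normally put the <samp>[[ ... ]]</samp> test straight into an <samp>if</samp>; the statuses are captured separately here only to make them visible.</p>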
<p>Using my new-found shell-script-regex powers, and some more of BASH’s string manipulation routines, I wrote my own pwd-to-title-bar routine, taking inspiration from the second one on the <a href="http://tldp.org/HOWTO/Bash-Prompt-HOWTO/x783.html">HOWTO page I mentioned before</a>, but using all in-process string manipulation (hopefully, this will run faster on Risto’s 486SX25). Here it is:</p>
<pre><code>function directory_to_titlebar {
    local pwd_length=42  # The maximum length we want (seems to fit nicely
                         # in a default length Terminal title bar).

    # Get the current working directory.  We'll format it in $dir.
    local dir="$PWD"

    # Substitute a leading path that's in $HOME for "~"
    if [[ "$HOME" == ${dir:0:${#HOME}} ]] ; then
        dir="~${dir:${#HOME}}"
    fi

    # Append a trailing slash if it's not there already.
    if [[ ${dir:${#dir}-1} != "/" ]] ; then
        dir="$dir/"
    fi

    # Truncate if we're too long.
    # We preserve the leading '/' or '~/', and substitute
    # ellipses for some directories in the middle.
    if [[ "$dir" =~ (~){0,1}/.*(.{${pwd_length}}) ]] ; then
        local tilde=${BASH_REMATCH[1]}
        local directory=${BASH_REMATCH[2]}

        # At this point, $directory is the truncated end-section of the
        # path.  We will now make it only contain full directory names
        # (e.g. "ibrary/Mail" -&gt; "/Mail").
        if [[ "$directory" =~ [^/]*(.*) ]] ; then
            directory=${BASH_REMATCH[1]}
        fi

        # Can't work out if it's possible to use the Unicode ellipsis,
        # '…' (Unicode 2026).  Directly embedding it in the string does not
        # seem to work, and \u escape sequences ('\u2026') are not expanded.
        #printf -v dir "$tilde/\u2026$s", $directory"
        dir="$tilde/...$directory"
    fi

    # Don't embed $dir directly in printf's first argument, because it's
    # possible it could contain printf escape sequences.
    printf "\033]0;%s\007" "$dir"
}

if [[ "$TERM" == "xterm" || "$TERM" == "xterm-color" ]] ; then
    export PROMPT_COMMAND="directory_to_titlebar"
fi
</code></pre>
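<p>To try the formatting logic without touching a real title bar, the same steps can be pulled into a function that takes its inputs as arguments and prints the result. This is only a sketch: <samp>format_dir_for_title</samp> and its three parameters are hypothetical names, the regexes are moved into variables, and explicit <samp>^</samp> and <samp>$</samp> anchors are added for readability.</p>

```shell
#!/usr/bin/env bash
# Hypothetical helper: the same steps as directory_to_titlebar, but the path,
# home directory and maximum length are arguments, and the result is printed
# on its own line instead of being wrapped in title-bar escape codes.
format_dir_for_title() {
    local dir="$1" home="$2" max_length="$3"

    # Substitute a leading $home for "~".
    if [[ "$home" == "${dir:0:${#home}}" ]] ; then
        dir="~${dir:${#home}}"
    fi

    # Append a trailing slash if it's not there already.
    if [[ "${dir:${#dir}-1}" != "/" ]] ; then
        dir="$dir/"
    fi

    # If the path is too long, keep the leading "/" or "~/" and the last
    # $max_length characters, then trim those to whole directory names.
    local truncate_regex="^(~)?/.*(.{${max_length}})\$"
    local trim_regex='^[^/]*(.*)$'
    if [[ "$dir" =~ $truncate_regex ]] ; then
        local tilde=${BASH_REMATCH[1]}
        local tail=${BASH_REMATCH[2]}
        if [[ "$tail" =~ $trim_regex ]] ; then
            tail=${BASH_REMATCH[1]}
        fi
        dir="$tilde/...$tail"
    fi

    printf '%s\n' "$dir"
}

# Example paths are made up; 10 is used instead of 42 to force truncation.
short_result=$(format_dir_for_title "/Users/jamie/Documents" "/Users/jamie" 42)
long_result=$(format_dir_for_title "/usr/local/share/doc/very/long/path" "/Users/jamie" 10)
home_result=$(format_dir_for_title "/Users/jamie/Projects/app/src/main" "/Users/jamie" 10)
printf '%s\n%s\n%s\n' "$short_result" "$long_result" "$home_result"
```

<p>With those made-up inputs, the first path is short enough to come back untouched apart from the <samp>~</samp> substitution and trailing slash, while the other two lose their middle directories.</p>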
<p><samp>PROMPT_COMMAND</samp> is run by BASH every time it’s about to put up a prompt, so I set it (at the bottom of the snippet) to my title-bar-renaming function, <samp>directory_to_titlebar</samp>. The actual function gets the current working directory, replaces the home directory path, if it’s there, with “<samp>~/</samp>”, appends a slash and, if necessary to make the whole string fit in Terminal’s title bar nicely, snips out directories after the initial “<samp>/</samp>” or “<samp>~/</samp>”, replacing them with “<samp>&#8230;</samp>”. The escape codes in the final printf are what tell Terminal to use the enclosed string as the title instead of printing it to the console. The only thing I wanted but was unable to do was to get the string to use the Unicode ellipsis character instead of the three full-stops. It seems weird in these modern times but, unless I’m missing something, BASH’s string entry routines are not Unicode-aware (if it’s possible, please do leave a comment telling me how!)</p>
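<p>One possible workaround for the ellipsis, offered as a sketch rather than something confirmed against Terminal: BASH&#8217;s <samp>$'...'</samp> quoting accepts <samp>\xHH</samp> byte escapes, so the three bytes of the UTF-8 encoding of U+2026 can be spelled out by hand. (BASH 4.2 and later also understand <samp>\u2026</samp> inside <samp>$'...'</samp> and in <samp>printf</samp>.) Whether the title bar then shows the character correctly depends on the terminal treating the title as UTF-8, which is an assumption here. The <samp>tilde</samp> and <samp>directory</samp> values below are placeholders.</p>

```shell
#!/usr/bin/env bash
# UTF-8 encoding of U+2026 (horizontal ellipsis), written out byte by byte.
ellipsis=$'\xe2\x80\xa6'

# Placeholder values standing in for the ones computed in the function.
tilde="~"
directory="/Mail/"

dir="$tilde/$ellipsis$directory"

# Same escape codes as before; the bytes of $dir are passed through unchanged.
printf '\033]0;%s\007' "$dir"
```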
<p>If you want to use it too, just copy it and paste it into your <samp>~/.profile</samp> file. I believe it should work on any system that uses or emulates an XTerm, not just the Mac OS X Terminal.</p>
<p><img class="framedImage" align="left" width="372px" alt="TerminalSettings.png" src="http://blog.montgomerie.net/user/files/TerminalSettings.png">The last thing to do to get what I wanted was to set the Window preferences to show the rest of the information I thought would be useful. I’ve pasted a screenshot of Terminal’s prefs on the left (this is in the left-hand pane of the <em>Settings</em> tab of Terminal’s prefs). The <em>Title</em> setting (here reading “Terminal”) is what’s replaced by the string we’re ‘printing’ in the routine above, the <em>Active Process Name</em> setting enables you to see what’s currently running in that Terminal in the title bar, and the <em>Command Key</em> option puts the window’s command key in the title bar, meaning that you can tell at a glance which window you can bring forward by pressing Cmd-2 at any time. You can see how it all works together in the screenshot at the top of this page.</p>
<p>And with that (phew) my marathon of BASH was over, and I could get back to what I really should have been doing all along, but with my environment made just a little bit more pleasant.</p>
<p>* <em>with no time whatsoever spent in Safari, or writing complexly formatted blog entries when I should be working.</em></p>