I have two problems with running grep searches from a script written in JavaScript for InDesign CS4. The first one is more serious.

Problem #1:

I am using the following search pattern, to find e-mail addresses:

[[:word:]\-\.]+@([[:word:]\-]+\.)+[[:word:]\-]{2,4}

This pattern works perfectly to find all e-mail addresses when I use the grep feature of the Find/Change dialog box in the user interface, but fails when I use it as an argument of the findGrep() or changeGrep() functions in a script. Interestingly, the problem is the part before the "@" symbol. If I replace that part with literal text, like so:

richard@([[:word:]\-]+\.)+[[:word:]\-]{2,4}

then the script will correctly match any e-mail address where the part before the "@" symbol is the word "richard".

Can anyone shed light on this problem?

Problem #2:

This is more general and less serious, but it is related to problem #1, because it's about things that work in the user interface but not in a script.

Strangely, a lot of the standard regular expression shorthands for character classes (like \w, \d, etc.) do not work when I use them in scripts, but their POSIX equivalents ([:word:], [:digit:], etc.) work fine. Either notation -- \w or [:word:] -- works fine in the Find/Change dialog box of the user interface.

So this is not a serious problem, but I vastly prefer the shorter notation. I find it easier to read, and to write.
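
One possible explanation for both problems, assuming the patterns were typed into the script as ordinary string literals with single backslashes: the JavaScript parser drops unrecognised escapes before InDesign ever sees the pattern, so \w arrives as w and \-\. arrives as -. The sketch below is plain JavaScript with no InDesign calls, and the variable names are only illustrative.

```javascript
// Single backslashes: the string literal drops them before the grep engine runs.
var single = "[[:word:]\-\.]+@";        // becomes [[:word:]-.]+@

// Doubled backslashes: each "\\" survives as one backslash in the pattern.
var doubled = "[[:word:]\\-\\.]+@";     // becomes [[:word:]\-\.]+@

// The same thing happens to the shorthand classes.
var shorthandSingle = "\w+\d+";         // becomes w+d+
var shorthandDoubled = "\\w+\\d+";      // becomes \w+\d+

// A regex literal keeps backslashes as typed; .source gives the pattern text,
// which avoids the doubling altogether.
var fromLiteral = /\w+\d+/.source;      // also \w+\d+

// In InDesign, the doubled (or .source) form is the one to assign, e.g.
// app.findGrepPreferences.findWhat = doubled;
```

If this is what is happening, the class before the @ reaches the grep engine as [[:word:]-.], where the hyphen sits between two items and may be read as a range operator. After the @ the hyphen ends up last in its class, where it is harmless, which would fit the symptom that only the first part fails.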

was working in the first place. I thought maybe it was that some of those backslashes after the @ symbol are unnecessary anyway (I'm still getting a handle on regular expressions), but I took them out and it broke the script, so that can't be it.

Thank you Marc, that's very helpful. I like to think I understand this stuff theoretically, but I always have problems writing my own RegExps in practice. It's good to see examples.

I have another problem, and I am curious whether anyone can help me with it.

I am trying to match markdown web links. Markdown is a markup language that we use in our office because our editors (I work at a newspaper) find it easy to read and write. The format is

[text](link)

which corresponds to the html

<a href="link">text</a>

So I wrote the following:

{findWhat: "\\[[^][]+]\\([^)(]+\\)"}

and it strangely worked on my copy of CS4 at home, but here at work it doesn't seem to be working. It works up until the first left-round-bracket -- \\[[^][]+]\\( -- and then the rest of it doesn't match.

Oops. I posted too soon. There is no difference between my copies of CS4 at work and home. That would have been a little strange. What happened is that I was importing two different Word documents, and one worked and one didn't. My problem is still not solved, but I will examine the documents more closely to see if I can figure out what the problem with the script is.

I figured out the difference between the files. The one that failed to match was full of automatically generated Microsoft Word hyperlinks. The importing script I wrote is supposed to remove them completely before it does anything else to the text, and it appears to do that, but somehow there is something left over from the Word hyperlink which messes up regexp matches.

and somewhere along the line, Word automatically generates a hyperlink, starting at "http". Somehow, that section of text ends up being impossible to match, even if I just try to match the literal string "http". I'm not sure why, since I strip all the hyperlinks from the file when it comes into InDesign.

Unfortunately, it's unrealistic to try to get the writers to stop using Word.

Any ideas about what might be left over in the text from the Word hyperlink, that I cannot see?

I figured it out. When I delete all the hyperlinks, I also need to delete all the hyperlinkTextSources, otherwise the regexp engine won't be able to smoothly find a match across a block of text that includes a hyperlinkTextSource.
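
A minimal sketch of that cleanup step, assuming doc is an InDesign Document object (for example app.activeDocument). The function names are hypothetical, and the length guard is a precaution rather than something I have confirmed is required.

```javascript
// Hypothetical helper: remove every item in an InDesign collection.
function removeAll(collection) {
    // Skip empty collections so everyItem() is never called on nothing.
    if (collection.length > 0) {
        collection.everyItem().remove();
    }
}

// Remove the hyperlinks first, then the text sources they leave behind.
function stripWordHyperlinks(doc) {
    removeAll(doc.hyperlinks);
    removeAll(doc.hyperlinkTextSources);
}

// Usage inside InDesign (not runnable elsewhere):
// stripWordHyperlinks(app.activeDocument);
```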

[^][]+ means one or more characters from the class which includes any character except [ and ]. They are listed in reverse order (i.e. ][) in the regexp pattern because a right square bracket is only treated as a literal when it comes first in the class, if you don't want to use backslashes. I'm not sure why I did it that way -- I guess I thought if any text came in with either bracket it should be rejected, but I have changed it to your suggestion, because there's no reason to exclude the opening brackets.
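
A side note for anyone who moves this class from InDesign's grep into a JavaScript RegExp: a standard JavaScript engine reads it differently. There, [^] is a complete class meaning any character and [] is an empty class that never matches, so the unescaped form quietly fails. A small check in plain JavaScript (the sample text and variable names are just for illustration):

```javascript
var sample = "text](link)";

// Unescaped: parsed as [^] (any character) followed by []+ (never matches).
var unescapedClass = /[^][]+/;
var unescapedMatches = unescapedClass.test(sample);   // false

// Escaped: one class that excludes both ] and [.
var escapedClass = /[^\][]+/;
var escapedMatch = sample.match(escapedClass)[0];     // "text"
```

So in a JavaScript RegExp the backslash before ] is needed, whatever order the brackets are in.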

I took your suggestions and I wrote a working script which converts markdown to InDesign hyperlinks, which is a bit more involved than just converting it to html, because part of the original match stays in the text object and part of it gets assigned to a new hyperlink object as a string.

I have another question now (it's not urgent, if you're too busy), and for the sake of brevity I am using your markdown to html example. The question is, how exactly can you deal with escaped characters? Say, for instance, you had a sentence in an article where you're quoting someone and you say 'The next day was [her] worst day ever,' and you want '[her] worst day ever' to be a link. Or say you have a URL that has round brackets in it, which quite a few of them do on Wikipedia for some weird reason.

I thought you might be able to do this using lookbehind, but JavaScript's regexp flavor does not appear to support lookbehind, so I wrote the following code, which hides away all the escaped characters during the major processing, and then restores them. It works, but is there a better way? It doesn't seem very elegant, although it's certainly easy and fairly foolproof, which is probably good:
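
The original listing is not reproduced here. What follows is only a minimal sketch of the hide-and-restore idea in plain JavaScript, converting to html as in the example above. Apart from %_LEFT_SQUARE_BRACKET, the placeholder strings and all function names are hypothetical.

```javascript
// Escaped sequence -> placeholder -> literal character to restore.
var ESCAPE_TABLE = [
    { escaped: "\\[", placeholder: "%_LEFT_SQUARE_BRACKET",  literal: "[" },
    { escaped: "\\]", placeholder: "%_RIGHT_SQUARE_BRACKET", literal: "]" },
    { escaped: "\\(", placeholder: "%_LEFT_ROUND_BRACKET",   literal: "(" },
    { escaped: "\\)", placeholder: "%_RIGHT_ROUND_BRACKET",  literal: ")" }
];

// Replace every occurrence of one plain string with another (no regex needed).
function swapAll(text, from, to) {
    return text.split(from).join(to);
}

// Step 1: move the escaped characters out of the way.
function hideEscapes(text) {
    for (var i = 0; i < ESCAPE_TABLE.length; i++) {
        text = swapAll(text, ESCAPE_TABLE[i].escaped, ESCAPE_TABLE[i].placeholder);
    }
    return text;
}

// Step 3: put the literal characters back, without their backslashes.
function restoreEscapes(text) {
    for (var i = 0; i < ESCAPE_TABLE.length; i++) {
        text = swapAll(text, ESCAPE_TABLE[i].placeholder, ESCAPE_TABLE[i].literal);
    }
    return text;
}

// Step 2 sits in the middle: a simple link pattern that no longer has to
// worry about escaped brackets.
function convertWithPlaceholders(text) {
    var hidden = hideEscapes(text);
    var html = hidden.replace(/\[([^\]]+)\]\(([^)]+)\)/g, '<a href="$2">$1</a>');
    return restoreEscapes(html);
}
```

This assumes the placeholder strings never occur in real copy, which is the weak point of the approach.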

I have found a much shorter solution that requires a lot less code and uses regular expressions alone to accomplish the same task as the script in my last post: processing markdown hyperlinks into html hyperlinks, while taking into account certain escaped characters. I have also started using John Hawkinson's method of making the regular expressions a little easier to read, since there's now a serious proliferation of backslashes:
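
That listing is not reproduced here either. As a rough sketch of how a regex-only version can work (not necessarily the code from the original post), each part of the pattern can accept either a backslash plus any character, or any character that is neither a backslash nor the closing bracket. The names below are hypothetical, and the patterns are written as regex literals so the backslashes do not need doubling.

```javascript
// \\. consumes a backslash together with the character it escapes, so an
// escaped bracket can never end the link text or the URL early.
var ESCAPED_LINK = /\[((?:\\.|[^\]\\])+)\]\(((?:\\.|[^)\\])+)\)/g;

// A backslash followed by one of the characters we allow to be escaped.
var ESCAPED_CHAR = /\\([\[\]()\\])/g;

// Strip the backslash from each escaped character.
function dropEscapes(text) {
    return text.replace(ESCAPED_CHAR, "$1");
}

// Convert every markdown link, unescaping the text and the URL as we go.
function linksToHtml(text) {
    return text.replace(ESCAPED_LINK, function (match, label, url) {
        return '<a href="' + dropEscapes(url) + '">' + dropEscapes(label) + "</a>";
    });
}
```

Escaped characters outside a link are left alone in this sketch; a full version would probably run dropEscapes over the rest of the text as well.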

But strangely, I think the version in my previous post (with some slight modifications) might be more elegant and reliable in the general case, once I start adding support for a lot more markdown codes. In a way the previous one is simpler -- just get the escaped characters out of the way, deal with everything you have to deal with, and then put them back. Of course, I would probably change the placeholder text from long strings like "%_LEFT_SQUARE_BRACKET" to single Unicode characters that no one would ever use, like Linear B syllables or something, assigned to variable names like "LEFT_SQUARE_BRACKET".

By the way, Marc, I read some of your website and I have to thank you for your very clear and thorough explanations of some advanced topics in scripting InDesign. I will be studying the section on adding menu items particularly carefully. I often think that I could write a script to have lasers shoot out of my eyes or to automatically generate a new play in the style of William Shakespeare, and my employers would just be confused. But if I could accomplish these things via menu items, then they would be official. People would be impressed.

Yes, I've been reading up on that a bit and it seems that it might be a good idea to step outside the regex bubble to manage nested parentheses.
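
For what it's worth, a minimal sketch of what stepping outside the regex might look like: a plain counting loop that finds the bracket closing a URL, however many nested pairs it contains. The function name is hypothetical, and the sketch assumes openIndex points at the opening round bracket.

```javascript
// Return the index of the ")" that balances the "(" at openIndex,
// or -1 if the brackets never balance.
function findClosingParen(text, openIndex) {
    var depth = 0;
    for (var i = openIndex; i < text.length; i++) {
        var ch = text.charAt(i);
        if (ch === "(") {
            depth++;
        } else if (ch === ")") {
            depth--;
            if (depth === 0) {
                return i;
            }
        }
    }
    return -1;
}

// Usage: pull a URL with nested brackets out of a markdown link.
var link = "[text](http://example.com/a_(b))";
var open = link.indexOf("](") + 1;               // index of the opening "("
var close = findClosingParen(link, open);
var url = link.substring(open + 1, close);       // "http://example.com/a_(b)"
```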

I've never thought of supporting the entire markdown protocol, but it's certainly an interesting project. It's a pretty big leap between putting out specific fires at my workplace and writing a script that would be generally useful to people in other contexts, but perhaps if I get enough of it done, I might as well deal with the rest of markdown. I am a bit of a newbie at programming but I'm sure if I had a half-way working version I could show it to people and it could be fixed up and made more robust.

And yes, thank you again John for that simple trick. I've been using it almost exclusively since you pointed it out.

On second thought, a markdown-to-InDesign script would probably not be too much of a problem at all. I just took a look at the PHP code for converting markdown into html, and I could use that as a guide (I do have some experience with PHP). I'll get to work.