Is there anyway I could restructure the regex or the the regex method I'm using to make this cleaner?

Update

To clarify, my question is not about the correctness of my regex - I realize that it's limited. Instead I'm wondering if anyone had any comments on the method of substiting in links for the phone numbers - is there anyway I could use re.replace or something like that instead of the string hackery that I have?

A note: In my practice, I've not noticed much of a speedup pre-compiling simple regular expressions like the two I'm using, even if you're using them thousands of times. The re module may have some sort of internal caching - didn't bother to read the source and check.

Also, I replaced your method of checking each character to see if it's in string.digits with another re.sub() because I think my version is more readable, not because I'm certain it performs better (although it might).

First off, reliably capturing phone numbers with a single regular expression is notoriously difficult with a strong tendency to being impossible. Not every country has a definition of a "phone number" that is as narrow as it is in the U.S. Even in the U.S., things are more complicated than they seem (from the Wikipedia article on the North American Numbering Plan):

And that's just for the relatively trivial numbering plan of the U.S., and even there it certainly is not covering all subtleties. If you want to make it reliable you have to develop a similar beast for all expected input languages.

A few things that will clean up your existing regex without really changing the functionality:

Replace {0,1} with ?, [(] with (, [)] with ). You also might as well just make your [2-9] b e a \d as well, so you can make those patterns be \d{3} and \d{4} for the last part. I doubt it will really increase the rate of false positives.