In the case of <a name="foo"> it simply won't match, as the regexp includes href. And you wouldn't want it to match, as it's not a link. Whitespace around the equals sign (rare, but valid) is more problematic. There are other edge cases that behave differently from how you might want, too: note that the first subcapture allows ">" to occur within it, so a match can run past the end of the tag.
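The failure modes are easy to demonstrate. The pattern below is an assumption about the general shape of the module's regexp (the real one isn't quoted here), but it shows both behaviours: a bare anchor with no href doesn't match, while the first subcapture happily swallows ">" and runs into unrelated text.

```perl
use strict;
use warnings;

# An assumed pattern of roughly the shape under discussion, NOT the
# module's actual regexp: the first subcapture (.*?) is allowed to
# contain ">", so a match can escape the <a> tag entirely.
my $re = qr{<a\s+(.*?)href\s*=\s*["']([^"']+)["']}is;

my $html = q{<p>See <a class="x" href="http://example.com/">here</a></p>};
if ($html =~ $re) {
    print "captured URL: $2\n";   # http://example.com/
}

# <a name="foo"> has no href, so on its own it won't match:
my $anchor = q{<a name="foo">Section</a>};
print "no match\n" unless $anchor =~ $re;

# ...but followed by unrelated text containing href=, the first
# subcapture crosses ">" and produces a false positive:
my $tricky = q{<a name="foo">Section</a> discussing href="not-a-link"};
if ($tricky =~ $re) {
    print "false positive: $2\n";   # not-a-link
}
```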

But in practice, it's probably good enough for the majority of people.

The author may well accept a patch to parse the page properly using HTML::Parser, given that the module already depends on it (indirectly, via LWP::UserAgent).

Or if you can't wait for a fixed version to be released, just subclass it: it's really only that one method that needs fixing.
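A sketch of what the replacement extraction might look like, using HTML::LinkExtor (which ships with the HTML-Parser distribution). The method name to override in a WWW::Crawler::Lite subclass isn't shown here and would have to be taken from the module's source; extract_links below is a free-standing stand-in for it:

```perl
use strict;
use warnings;
use HTML::LinkExtor;   # part of the HTML-Parser distribution
use URI;

# Candidate replacement for the regexp-based URL scan. To use it, you'd
# subclass WWW::Crawler::Lite and override whichever method performs
# the link extraction (check the module's source for its actual name).
sub extract_links {
    my ($html, $base) = @_;
    my @urls;
    my $p = HTML::LinkExtor->new(sub {
        my ($tag, %attr) = @_;
        # Only real <a href="..."> links; <a name="foo"> has no href
        # attribute and is skipped automatically.
        push @urls, URI->new_abs($attr{href}, $base)->as_string
            if $tag eq 'a' && defined $attr{href};
    });
    $p->parse($html);
    $p->eof;
    return @urls;
}

my @links = extract_links(
    q{<a name="foo">Section</a> <a href="/page">go</a>},
    'http://example.com/',
);
print "$_\n" for @links;   # http://example.com/page
```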

As the author of WWW::Crawler::Lite, I am also appalled at the use of that regexp for URL detection! (What was I thinking?)

I am quite pressed for time at the moment, but I will put the module on GitHub and re-release it with the patches/updates already suggested on RT.

FWIW I use this module in several places (and have for some time now). While there are perhaps some more "robust" spiders/crawlers out there, I wasn't able to find one as simple to use and understand as W:C:L.

In the case of <a name="foo"> it simply won't match, as the regexp includes href.

And what makes you think the regex would limit itself to a single tag? In your example, the "<a " could be matched while the "href=" appears much further down in the document. In fact, there is no guarantee that this string is a tag attribute at all: it could just as easily be in plain HTML text ("PCDATA"), in JavaScript code, or even inside an HTML comment.
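To make that concrete, here is a sketch with an assumed pattern of the same general shape (not the module's actual regexp), where the "href=" that completes the match lives inside a <script> block, nowhere near the <a> tag:

```perl
use strict;
use warnings;

# Assumed pattern shape for illustration; with /s, .*? crosses newlines
# and tag boundaries without complaint.
my $re = qr{<a\s+.*?href\s*=\s*["']([^"']+)["']}is;

# The only href= here is inside a JavaScript string literal:
my $html = <<'HTML';
<a name="top">Top</a>
<script>var s = 'href="javascript-string"';</script>
HTML

if ($html =~ $re) {
    print "matched: $1\n";   # javascript-string
}
```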

A much more reliable approach is to use a parser (actually just a lexer; it could even be regex-based) to extract whole tags first, and then test each tag on its own.
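A minimal sketch of that two-pass approach. The tag-lexing pattern here is deliberately simple (it assumes no ">" inside attribute values), but because each candidate tag is tested in isolation, href= strings in comments or plain text can no longer produce false positives:

```perl
use strict;
use warnings;

my $html = <<'HTML';
<a name="foo">Section</a>
<!-- <a href="http://commented.example/">dead</a> -->
<a href="http://example.com/page">live</a>
HTML

# Strip comments first so commented-out markup can't match.
(my $clean = $html) =~ s/<!--.*?-->//gs;

my @urls;
# Pass 1: lex out whole tags. Pass 2: test each tag on its own.
for my $tag ($clean =~ /(<[^>]+>)/g) {
    next unless $tag =~ /^<a\b/i;                       # only <a ...> tags
    push @urls, $1
        if $tag =~ /\bhref\s*=\s*["']([^"']+)["']/i;    # with an href
}
print "$_\n" for @urls;   # http://example.com/page
```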