I understand that, to check the validity of a Uniform Resource Identifier, it is necessary to :

1) Firstly, change any percent-encoded character, below %80, which stands for an unreserved character, into its corresponding literal form. For example, %61 should be rewritten a and %7E should be rewritten ~

2) Secondly, percent-encode any character with code-point > \x007f, which is never an unreserved character, according to its UTF-8 byte sequence. For example, the À would be percent-encoded as %C3%80 and the ア ( KATAKANA LETTER A ) would be percent-encoded as %E3%82%A2

( At this point, every character of a URI should be a true ASCII character, with code-point < \x0080 ! )

3) Thirdly, verify that the resulting address is a valid URI, according to the different rules, below, that describe the generic syntax of a Uniform Resource Identifier ( URI )
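
Steps 1 and 2 above can be sketched in a few lines of Python. This is only a minimal illustration, not a full normalizer : the function name normalize_uri and the sample addresses are made up for the example, and it assumes the unreserved set is the letters, the digits and - . _ ~

```python
import re
from urllib.parse import quote

# Unreserved characters: letters, digits, hyphen, period, underscore, tilde.
UNRESERVED = set(
    "ABCDEFGHIJKLMNOPQRSTUVWXYZ"
    "abcdefghijklmnopqrstuvwxyz"
    "0123456789-._~"
)


def normalize_uri(uri: str) -> str:
    """Hypothetical helper: apply steps 1 and 2 to a URI string."""

    # Step 1: decode %00..%7F escapes, but only when they stand for
    # an unreserved character; every other escape is left untouched.
    def decode_unreserved(match):
        ch = chr(int(match.group(1), 16))
        return ch if ch in UNRESERVED else match.group(0)

    uri = re.sub(r"%([0-7][0-9A-Fa-f])", decode_unreserved, uri)

    # Step 2: percent-encode every non-ASCII character as its UTF-8 bytes.
    return re.sub(
        r"[^\x00-\x7f]",
        lambda match: quote(match.group(0), safe=""),
        uri,
    )


print(normalize_uri("http://example.com/%61%7E%2F"))
print(normalize_uri("http://example.com/À/ア"))
```

Note that %2F stays encoded, because / is a reserved character, and that quote() emits upper-case hexadecimal digits.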

Practically, the normal use of such a validation regex would be quite ridiculous ! So we need to find another, more practical, regex, in order to help Notepad++ to recognize and properly underline an Internet address !

Note that the default Notepad++ behaviour, about underlining Internet addresses, just follows the URI standards !

For instance, the first address that Александр Корженевский gave, in his first post :

The second address is totally underlined, due to the per-cent encoding of all these Cyrillic characters

However, we should force the first address to be a valid one, allowing any word character to be part of an address, as most addresses are not “well-formed” with the percent-encoding mechanism !

But this condition is NOT sufficient ! Indeed, contrary to what I said, in my previous post, we need to match NON-word characters, too ! Just imagine the simple address below :

https://www.google.fr/?gws_rd=ssl#newwindow=1&q=€

When you copy this address, in a new tab, Notepad++ underlines all this link, except for the single € sign. Nevertheless, if you select all that link, with the Euro sign, and paste it, for instance, in the Firefox address field, it does correctly display the Google results, for the Euro sign ( If, of course, the Google site is your default site, on opening Firefox )

So, my regex is wrong. In conclusion, Claudia, we could merge the exact regex for the Scheme, the first component of a URI, with your general regex, for the remainder of a URI, giving the final regex :

(?-s)[A-Za-z][A-Za-z0-9+.-]+://.*?(?=\s)

We could, also, use the more restrictive Wikipedia form :

(?-s)(https?|ftp|mailto|file|data|irc)://.*?(?=\s)

=> These regexes should be used, in N++ code, to underline any Internet address :-))

Notes :

The use of \s is the best syntax, as it matches any horizontal or vertical blank character, as the limit of the address

However, Claudia and All, just be aware that this regex is, really, NOT restrictive, for the last four components of a URI ( Authority, Path, Query and Fragment ) !!
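
The two regexes above can be tried out quickly in Python. This is just a rough sketch : Python's re module is not the regex engine used by Notepad++, the leading (?-s) is dropped because the dot does not match a line-break by default in Python, and the sample text is invented for the test.

```python
import re

# Generic form: any scheme, then everything up to the next blank character.
GENERIC = re.compile(r"[A-Za-z][A-Za-z0-9+.-]+://.*?(?=\s)")

# More restrictive form: only a short list of well-known schemes.
RESTRICTED = re.compile(r"(https?|ftp|mailto|file|data|irc)://.*?(?=\s)")

text = "see https://www.google.fr/?gws_rd=ssl#newwindow=1&q=€ for details\n"

# group(0) is the whole match, whatever capturing groups the regex has.
generic_hits = [m.group(0) for m in GENERIC.finditer(text)]
restricted_hits = [m.group(0) for m in RESTRICTED.finditer(text)]

print(generic_hits)
print(restricted_hits)
```

Both lists should contain the complete address, Euro sign included, since the only limit is the blank character that follows it.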

Best Regards,

guy038

P.S.

If you would like to break down a WELL-formed URI reference, in order to find its five components, you can use the S/R, below.

Why is this S/R so simple, compared to the enormous regex, used for Internet address validation ? Well, because we, simply, suppose that the matched address is a well-formed URI and that we just want to split it, in some parts !

So, the S/R, below, replaces any correct Internet address with the description of its five main components. Note that some parts may be undefined.
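
As a rough Python illustration of the same idea ( it is NOT the S/R itself ), the standard urlsplit function also splits a supposedly well-formed address into five parts, without validating it. The sample address is only an assumption for the demo, and Python calls the Authority part netloc.

```python
from urllib.parse import urlsplit

# Sample well-formed address, with the Euro sign already percent-encoded.
parts = urlsplit("https://www.google.fr/?gws_rd=ssl#newwindow=1&q=%E2%82%AC")

components = {
    "Scheme": parts.scheme,
    "Authority": parts.netloc,
    "Path": parts.path,
    "Query": parts.query,
    "Fragment": parts.fragment,
}

for name, value in components.items():
    # A missing component comes back as an empty string.
    print(f"{name:<10}: {value!r}")
```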

Thank you, guy, for doing this research, good as always.
Yesterday I did some checks and created a python script
which lets me test every unicode code point in the range
of 0x0 to 0x10FFFF.
I will redo the tests with the new regex and see how it behaves.

The following tests have been successfully passed
Tested within the range of 0x0-0x10FFFF.

start of file tests
-url at start of file (no additional text) (also end of file test)
-url at start of file (followed by tab)
-url at start of file (followed by space)
-url at start of file (followed by eol)
-url at start of file (followed by tab and text)
-url at start of file (followed by space and text)
-url at start of file (followed by eol and text)

end of file tests
-url at end of file (preceded by tab)
-url at end of file (preceded by space)
-url at end of file (preceded by eol)
-url at end of file (preceded by text and tab)
-url at end of file (preceded by text and space)
-url at end of file (preceded by text and eol)

in the middle of a file tests
-url in the middle of a file (preceded and followed by tab)
-url in the middle of a file (preceded and followed by space)
-url in the middle of a file (preceded and followed by eol)
-url in the middle of a file (preceded and followed by text and tab)
-url in the middle of a file (preceded and followed by text and space)
-url in the middle of a file (preceded and followed by text and eol)
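
A test loop of this kind could look roughly like the sketch below. It is not the script that was actually run: the helper names, the example.com address and the reduced set of layouts are assumptions, and Python's re engine (with \Z standing for the very end of the text) is used instead of the Notepad++ one, so results may differ.

```python
import re

# Regex under test, adapted to Python (no (?-s), \Z for end of text).
URL_RE = re.compile(r"[A-Za-z][A-Za-z0-9+.-]+://.*?(?=\s|\Z)")


def check_code_point(cp):
    """Return True if a url containing chr(cp) is matched whole everywhere."""
    url = "http://example.com/" + chr(cp) + "x"
    layouts = [
        url,                          # start and end of file
        url + "\t",                   # followed by tab
        url + " ",                    # followed by space
        url + "\n",                   # followed by eol
        url + " text",                # followed by space and text
        "text\t" + url,               # preceded by text and tab
        "text\n" + url + "\ntext",    # in the middle of a file
    ]
    for doc in layouts:
        match = URL_RE.search(doc)
        if match is None or match.group(0) != url:
            return False
    return True


def failing_code_points(start, stop):
    """List the code points in [start, stop) that cut the url short."""
    return [cp for cp in range(start, stop) if not check_code_point(cp)]


# Small range for a quick run; use (0x0, 0x110000) for the full sweep.
print([hex(cp) for cp in failing_code_points(0x0, 0x100)])
```

With this regex, the only code points expected to fail are the ones that the engine itself treats as blank characters, since they end the match early.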

From my point of view it looks ok.
I’m going to open an enhancement request at github.

Up to now, even with the recent Unicode 9.0 version, all the other planes, from 3 to 13, are NOT used and all the corresponding code-points, from U+30000 to U+DFFFF, are NOT assigned, except for the last two code-points of each plane, which are defined as NON-characters

From the second one : the values U+10085, U+12028, U+12029, U+20085, U+22028 and U+22029 ( 6 values )

From your last list : the range U+30085…U+102029 ( 42 values )

I built a test file, containing all these characters, preceded by the letter a and followed by the letter z

Then, I tried to determine all the 3-character strings aXz, which are matched by the regex a\sz. After some tests, I can affirm that the \s regex, in a file with UNICODE encoding, matches any single character of the following list, ONLY :

And, except for the MEDIUM MATHEMATICAL SPACE ( U+205F ), which is NOT matched by the \s regex, this list is identical to the list of characters that the UNICODE Consortium considers as White_Space characters. Refer to the link, below :

Finally, as most of these “White_Space” characters are quite exotic and very rarely used, in normal writing, the idea to use \s syntax, in a look-ahead, as a limit to an Internet address, seems quite pertinent !
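
The same a\sz experiment can be reproduced in Python, as sketched below. Be careful : Python's \s follows its own definition of white space, so the resulting list is not guaranteed to be identical to the one found with Notepad++. The function name and the default limit are just choices made for this example.

```python
import re

PATTERN = re.compile(r"a\sz")


def whitespace_code_points(limit=0x110000):
    """Return every code point X below limit for which aXz matches a\\sz."""
    return [
        cp for cp in range(limit)
        if PATTERN.fullmatch("a" + chr(cp) + "z")
    ]


# Scanning up to U+3000 (IDEOGRAPHIC SPACE) keeps the demo fast.
matched = whitespace_code_points(0x3001)
print(["U+%04X" % cp for cp in matched])
```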

Claudia, the new regex, to determine all the contents of an address, could, also, be written :

(?-s)[A-Za-z][A-Za-z0-9+.-]+://.*?(?=\s|\z)

Indeed, the case (?=\s) always happens, except when an Internet address would end the last line of a file, without any line-break ! And this specific case is just matched with the second (?=\z) syntax ;-)
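
The difference between the two look-aheads shows up as soon as an address ends the file. Here is a small Python sketch of that case ; it assumes Python's \Z as the counterpart of \z, and the sample text is invented.

```python
import re

# Look-ahead on a blank character only.
BLANK_ONLY = re.compile(r"[A-Za-z][A-Za-z0-9+.-]+://.*?(?=\s)")

# Look-ahead on a blank character OR the very end of the text.
BLANK_OR_END = re.compile(r"[A-Za-z][A-Za-z0-9+.-]+://.*?(?=\s|\Z)")

# Last line of a file, without any line-break after the address.
doc = "last line: https://example.org/page"

missed = BLANK_ONLY.search(doc)
found = BLANK_OR_END.search(doc)

print(missed)
print(found.group(0))
```

The first search finds nothing, because no blank character follows the address, while the second one returns the complete address.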

Best Regards,

guy038

P.S. :

Claudia, I haven’t found any spare time, yet, to have a look at your new version of the RegexTexter script, with the Time regex test option. Just be patient for a couple of days :-)