Detect broken URLs in the editor

Description

Chatted a bit with @azaozz. We could detect broken URLs in the editor when the user adds one by sending a request to the server and then doing a HEAD request and check the response code. If it's 4xx, we could mark the bad link in the editor and if the user clicks on it show the toolbar with a message. This is to prevent accidental user mistakes (wrong copy/paste, entering query by accident...)

The title attribute might be not the best idea in terms of a11y. Maybe we should use wp.a11y.speak()?

Yep I'd recommend to don't add new title attributes. From an a11y perspective when the link is broken there's no audible feedback and a barely noticeable visible feedback (the title attribute "tooltip").

To get the error message, screen reader users should (unlikely) go through the following process:

after the link is inserted, press Alt+F8 to move focus to the link toolbar

explore the link toolbar using the arrow keys (not the tab key since the displayed link is not focusable)

finally, only when the screen reader is reading out the link displayed in the toolbar they would get the error message in the title attribute, see below:

I'd consider @ocean90 suggestion about using wp.a11y.speak() to convey an audible warning message.

For sighted users also, the title attribute is barely noticeable and not so useful on touch devices. It would be great to try to find a good idea for a better visual feedback.

About colors, worth reminding the Editor background may be styled, for example:

Lastly, I'd suggest to review the error messages wording. As an average users I'd be probably be confused by a message saying something about "host name" or "HTTP error 404". I'd just need to know the link is broken :) As far as I see, the current error messages are:

'Invalid host name.'
'Not found, HTTP error 404.'

Maybe using a generic, simpler, sentence followed by the "technical" error code could work?

What about using contextual links within the url? say instead of entering ​http://mysite.org/page/subpage I just entered in /page/subpage would it automatically try to insert the home_url in front to check the validity of the url entered?

We return an error only when we are sure the URL is wrong: either the domain name cannot be resolved or it returns 404. This leaves several other common cases out, for example:

server is unreachable

bad request (400)

authentication error (401, 403)

Internal Server Error and the rest of the 5xx codes, etc.
Even if we add some of there, there is no good way to account for all (some may be temporary, etc.).

Many web servers block the WordPress user agent for different reasons. We probably could use something else in the UA, but... Doesn't sound right.

In these terms we will miss a lot of broken URLs. This makes it an unreliable check. Furthermore immediately after inserting a link the URL is visible in the inline link toolbar. If in doubt the user can test it very easy by clicking on it. Then none of the above restrictions will apply as it will be a standard browser request.

Perhaps it will be better to just do a quick regex test for malformed URLs.

We return an error only when we are sure the URL is wrong: either the domain name cannot be resolved or it returns 404. This leaves several other common cases out, for example:

IMO the current checks are enough, reliable and cover the most common errors. But more important, those can be fixed by users themselves. Producing a 400 error seems pretty hard, it's usually the result of sending bad data to an API service. Authentication errors should be ignored because the current user and their readers might have access to it. 5xx errors are also unlikely and these should (hopefully) only be temporary.

Many web servers block the WordPress user agent for different reasons. We probably could use something else in the UA, but... Doesn't sound right.

This is the first time I'm hearing this. :) Do you happen to know what the status code is? It's probably not 404.

In 36638.7.patch: also added wp.a11y.speak() as suggested above. However if we keep checking the URLs by AJAX, that speak() will be triggered few seconds after inserting a link. It "speaks" the URL but still a bit unsure if this won't be confusing for screen readers.

Do you happen to know what the status code is? It's probably not 404.

No, never seen 404. There is (perhaps) a chance that some servers will let the specific link check UA WordPress URL Test through, but still thinking it makes this unreliable.

Quickly tested 36638.7.patch​ especially the wp.a11y.speak() part. See screenshot below, using a small personal dev plugin to log messages in the console:

Technically everything works nicely but I'd recommend a few improvements.

use the assertive parameter: since it's an error, it makes sense to force screen readers to announce this message as soon as possible

when there is an error, the following message "Link inserted." should not be dispatched: testing with VoiceOver, it gets confused and announces just the last message (no error announced at all)

It "speaks" the URL but still a bit unsure if this won't be confusing for screen readers.

maybe there's no need to use the URL in the message: users just entered that URL, they know what they typed. Yes, they could have typed incorrectly but maybe they need just to be informed the link destination is not reachable and asked to check the link?

More related to wp_remote_get() but worth reminding some ISPs redirect to a "courtesy page" (for marketing purposes!) when you request a not existent URL. I usually change my DNS but I forgot to do that on the machine I was using for testing. Thus, with my ISP default DNS, any not existent URL I was intentionally typing was "good" and no error appeared. Specifically, I was getting a 302 followed by a 200:

I'm afraid there's no much we can do here other than blaming bad ISPs practices :(

About the JS part, I'm not sure I understand the checks for visible() for example in setLinkError() where toolbar.visible() is always false for me.

The only things changed in 36638.8.patch​ are the assertive parameter for the URL check message and the message string. I'd propose to simplify this message and try to describe what actually happened: the link was inserted but the URL check returned an error. We can't guarantee the link destination is really unreachable (could be temporarily unreachable for example) so I'd say to change the "error" in a more relaxed "warning":Warning: the link has been inserted but the link destination cannot be reached.

The assertive parameter should guarantee that even if the "Link inserted." message is dispatched before the URL check (and as far as I see this can happen depending for example on the time required by the AJAX response), anything that is being announced by screen readers int hat moment will be interrupted by the warning.

some ISPs redirect to a "courtesy page" (for marketing purposes!) when you request a not existent URL.
I'm afraid there's no much we can do here other than blaming bad ISPs practices :(

Yes, actually most ISPs do that now (AFAIK). Yet another reason the URL test will fail. If we don't follow the redirect, the response will be "undecided", if we follow it, we get 200. In both cases we don't flag the URL as "bad".

I'm not sure I understand the checks for visible() for example in setLinkError() where toolbar.visible() is always false for me.

setLinkError() is used in two places: one is after a regex check for well formed URL. As that is very fast, the toolbar.visible() will always be false. The other is after the AJAX response. The toolbar will usually be visible there depending on how long the AJAX takes (if it's very very fast, it may not be visible yet).

36638.8.patch looks good. Going to commit it now to get the accessibility improvements in.

As mentioned on Slack, the current regex needs to be improved because it only supports ASCII chars. URLs like http://кто.рф/eng/, http://münchen.de or http://香港大學.香港/about/ shouldn't be marked as invalid.

I'd just like to voice an opinion that performing a HTTP request to "validate" each URL isn't likely to be very workable in the long run.

The scenario's that I'm looking at are things like..

Editors may have a URL which is currently 403/404 which they wish to have in a draft, causing them to just ignore those notifications all the time

Linking to sites which intentionally block, or rate limit the WordPress UA would incorrectly fail (Leading to a bad user experience and inability to trust it going forward)

The link would be marked inaccessible for sites which are temporarily unavailable, slow, or inaccessible from the servers network

The WordPress UA might get rate limited or blocked on more sites/networks for being seen as making a lot of irrelevant requests. We already see this happening with blocking of the WordPress UA (as noted above) due to the XML-RPC attacks.

If the HTTP request is kept, then

a HEAD should be performed instead IMHO

The limit_response_size parameter should be set to avoid issues when linking to large documents.

The Relative URL handling could probably use WP_HTTP::make_absolute_url()

Personally I feel that anything more than a regular expression is probably overkill here for the limited benefit it provides.

I like the 8th and 11th regex from there (need to convert them to JS). They are listed as failing to detect some of the IPs but think this is a "feature". We don't want to flag URLs to private network segments as bad.