PHP URL Validation Functions

Validating URLs is an important part of form handling and data processing in PHP, and there are numerous ways to do it. This article looks at two of the most commonly used methods of validating URLs in PHP: the regex method and PHP’s built-in FILTER_VALIDATE_URL filter, which is used through filter_var().

Bear in mind that both methods have their own strengths and weaknesses, which are examined in this article. The objective is to recommend the best available way of validating URLs, based on the test results.

The Validation Functions to be Evaluated

Searching Google for "validating URL PHP" or "PHP URL validation" (without quotes) yields the following six results, which any developer can integrate into their own application:

First, the validating function is added to the top of the PHP script. Then all of the URLs to be tested are placed in an external text file (data.txt), which is placed in the same directory as the validating script. This is basically how the URLs to be tested are arranged inside data.txt:

Then this line:

$a = file('data.txt', FILE_IGNORE_NEW_LINES | FILE_SKIP_EMPTY_LINES);

grabs the contents of the text file (the URLs listed previously) and puts them into the array variable $a, one URL per element.

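Once all of the contents of the text file are in the array, a loop passes each URL to the validating function and prints whether it is marked “valid URL” or “invalid URL.”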
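A loop of this kind might look like the sketch below. It is not the original listing: isValidUrl() is a hypothetical stand-in for whichever validating function is under test (here it simply wraps filter_var with FILTER_VALIDATE_URL), and the numbered output format is also an assumption.

```php
<?php
// Hypothetical stand-in for the validating function under test.
function isValidUrl(string $url): bool
{
    return filter_var($url, FILTER_VALIDATE_URL) !== false;
}

// Build one result line per URL, numbered from 1 so that each line
// can be matched against its position in data.txt.
function labelUrls(array $urls): array
{
    $lines = [];
    foreach (array_values($urls) as $index => $url) {
        $verdict = isValidUrl($url) ? 'valid URL' : 'invalid URL';
        $lines[] = ($index + 1) . '. ' . $url . ' - ' . $verdict;
    }
    return $lines;
}

// In the test script, $a would come from file('data.txt', ...).
// Two sample entries are used here so the sketch runs on its own.
$a = ['http://www.example.com/', 'not a url'];

foreach (labelUrls($a) as $line) {
    // Escape the output because the URLs are untrusted input.
    echo htmlspecialchars($line), "<br>\n";
}
```

Keeping the labelling in its own function makes it easy to swap in a different validating function and compare the output for the same data.txt.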

This is how the results will look, for example, for regex #1: http://www.php-developer.org/urlvalidation/1.php

Here is how each of these validating functions is analyzed. First, the URLs numbered 41 to 67 are malformed URLs, and this regex fails to detect 3 of them (instead of being marked “invalid URL,” they are marked “valid URL”). So the %slippage can be computed:

%slippage = 3/27 = 11.11% (the lower the percentage, the better)

The URLs numbered 1 to 40 are acceptable URLs. This validating function over-judged 14 of them (instead of being marked valid, they are marked invalid). Therefore, %overjudgement = 14/40 = 35% (the lower the percentage, the better).

Finally, if you need to measure the overall performance of the validating function, you can simply average both %slippage and %overjudgement.
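The three figures can be computed with a few small helpers. This is a minimal sketch; the function names are made up for illustration, and the counts are the ones quoted above for validating function #1.

```php
<?php
// Share of malformed URLs that were wrongly marked valid, in percent.
function slippagePercent(int $missed, int $totalMalformed): float
{
    return 100 * $missed / $totalMalformed;
}

// Share of acceptable URLs that were wrongly marked invalid, in percent.
function overjudgementPercent(int $rejected, int $totalAcceptable): float
{
    return 100 * $rejected / $totalAcceptable;
}

// Overall score: the plain average of the two percentages.
function averagePercent(float $slippage, float $overjudgement): float
{
    return ($slippage + $overjudgement) / 2;
}

// Counts for validating function #1: 3 of 27 malformed URLs slipped
// through, and 14 of 40 acceptable URLs were over-judged.
$slippage      = slippagePercent(3, 27);
$overjudgement = overjudgementPercent(14, 40);

printf("%%slippage: %.2f%%\n", $slippage);
printf("%%overjudgement: %.2f%%\n", $overjudgement);
printf("%%average: %.2f%%\n", averagePercent($slippage, $overjudgement));
```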

Below is the summary for the rest of the validating functions:

Based on the results, the first validating function (#1) produces the lowest average of %slippage and %overjudgement. Validating function #4 has zero slippage, but it over-judges 87.5% of the time, so its average is poor. The FILTER_VALIDATE_URL filter, which is validating function #5, averages 31.62%. Its %overjudgement is low, but its %slippage is high.

There is also no perfect validating function with zero %slippage and zero %overjudgement. Each of these functions has its own strengths and weaknesses.

Recommendations

So what is the recommended validating function? Based on the evaluation results, you can select validating function #1, since it carries the lowest combined risk of slippage and overjudgement. There may well be functions not featured here that are more accurate, but of the ones that rank well in Google search, this one performed best. Feel free to test your own function and post the results here if you achieve a lower average. It will be beneficial to everyone.

So how can you increase the accuracy of validation? Below are some further recommendations.

First, you can blend the validating functions. Instead of relying on a single function, you can incorporate features from other functions, or add functionality of your own, to make it stronger. For example, validating function #1’s main weakness is that it rejects valid URLs that contain capital letters or IP addresses.

One way to improve the function is to convert each URL to lower case before passing it to the validating function. Use the lowercased copy for validation only, since the path part of a URL can be case-sensitive. You can also let FILTER_VALIDATE_URL (validating function #5) handle the validation when the host part of the URL is an IP address. Digits alone are not a reliable sign of this, because ordinary host names can contain them too. The built-in filter seems to be accurate when validating URLs that have an IP address as the host.
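The blending idea can be sketched as follows. This is only an illustration under stated assumptions: blendedValidate() is a made-up name, and $regexValidator is a hypothetical callable that wraps validating function #1, which is not reproduced in this article. IP hosts are detected with parse_url and FILTER_VALIDATE_IP rather than by looking for digits.

```php
<?php
// Blend two validators: IP-address hosts go to FILTER_VALIDATE_URL,
// everything else goes to the regex validator in lower case.
// $regexValidator is a hypothetical callable wrapping validating
// function #1; it receives a URL string and returns true or false.
function blendedValidate(string $url, callable $regexValidator): bool
{
    $host = parse_url($url, PHP_URL_HOST);

    if (is_string($host)) {
        // parse_url keeps the brackets around IPv6 hosts; strip them
        // before asking the IP filter.
        $bareHost = trim($host, '[]');
        if (filter_var($bareHost, FILTER_VALIDATE_IP) !== false) {
            return filter_var($url, FILTER_VALIDATE_URL) !== false;
        }
    }

    // Lowercase a copy for validation only; the caller's URL is untouched.
    return (bool) $regexValidator(strtolower($url));
}

// Example with a deliberately simple placeholder regex (an assumption,
// not one of the six functions tested above).
$placeholder = function (string $url): bool {
    return preg_match('#^https?://[a-z0-9.-]+\.[a-z]{2,}(/.*)?$#', $url) === 1;
};

var_dump(blendedValidate('HTTP://WWW.EXAMPLE.COM/', $placeholder));
var_dump(blendedValidate('http://192.168.0.1/', $placeholder));
```

Passing the regex validator in as a callable keeps the sketch independent of any particular regex, so any of the tested functions could be plugged in.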

Second, PHP has a function called parse_url: http://php.net/manual/en/function.parse-url.php. You can pass validated URLs to it to retrieve the host name and other components for further processing, which saves time compared with using string manipulation functions.
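As a brief illustration, here is what parse_url returns for a sample URL. The URL itself is made up for the example.

```php
<?php
// A sample URL that has already passed validation (made up for this example).
$url = 'http://www.example.com:8080/path/page.php?id=7#top';

// With no second argument, parse_url returns an associative array
// containing only the components that are present in the URL.
$parts = parse_url($url);

echo $parts['scheme'], "\n";   // http
echo $parts['host'], "\n";     // www.example.com
echo $parts['port'], "\n";     // 8080
echo $parts['path'], "\n";     // /path/page.php
echo $parts['query'], "\n";    // id=7
echo $parts['fragment'], "\n"; // top

// A single component can also be requested directly.
$hostOnly = parse_url($url, PHP_URL_HOST);
```

Note that parse_url only splits a string into parts; it does not validate the URL, which is why it is applied after validation here.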

To decrease the possibility of overjudgement, you can also pass the URLs that the first validating function rejected to the parse_url function to see whether it can retrieve the domain name. You can then check whether that domain exists. You’ll find a script for this here: http://psoug.org/snippet/Check_If_Domain_Exists_31.htm
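One possible shape for this second-chance check is sketched below. It does not reproduce the linked script; it uses PHP's checkdnsrr for the DNS lookup instead, and the function names are hypothetical. The lookup is injectable so that the logic can be exercised without network access.

```php
<?php
// Return the host part of a URL, or null if parse_url cannot find one.
function hostFromUrl(string $url): ?string
{
    $host = parse_url($url, PHP_URL_HOST);
    return (is_string($host) && $host !== '') ? $host : null;
}

// Check whether a host resolves. By default this asks DNS for an A record
// through checkdnsrr; a different lookup can be passed in for testing.
function domainResolves(string $host, ?callable $lookup = null): bool
{
    if ($lookup === null) {
        $lookup = function (string $h): bool {
            return checkdnsrr($h, 'A');
        };
    }
    return (bool) $lookup($host);
}

// Second chance for a URL that the regex rejected: accept it only if a
// host can be extracted and that host resolves.
function rescueRejectedUrl(string $url, ?callable $lookup = null): bool
{
    $host = hostFromUrl($url);
    return $host !== null && domainResolves($host, $lookup);
}
```

A typical call would be rescueRejectedUrl($url) for each URL that validating function #1 marked invalid. Keep in mind that a DNS lookup adds network latency for every rejected URL.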

You can also check whether the URL itself actually exists by requesting it and looking at the header status of the response. For example, if the status is 404, then the URL does not exist. See the tip provided by vernon at kesnerdesigns.net here: http://php.net/manual/en/function.file-exists.php
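A header-status check can be sketched with PHP's get_headers function. This is not the code from the linked tip, which is not reproduced here, and the helper names are hypothetical.

```php
<?php
// Extract the numeric status code from a status line such as
// "HTTP/1.1 404 Not Found". Returns null for any other header line.
function statusCodeFromLine(string $line): ?int
{
    if (preg_match('#^HTTP/\d+(?:\.\d+)?\s+(\d{3})#', $line, $m) === 1) {
        return (int) $m[1];
    }
    return null;
}

// Request the URL and return its final status code, or null if the
// request failed. get_headers follows redirects by default, so several
// status lines may appear; the last one describes the final response.
function urlStatusCode(string $url): ?int
{
    $headers = @get_headers($url);
    if ($headers === false) {
        return null;
    }

    $code = null;
    foreach ($headers as $line) {
        $candidate = statusCodeFromLine($line);
        if ($candidate !== null) {
            $code = $candidate;
        }
    }
    return $code;
}
```

A result of 404 from urlStatusCode($url) would indicate that the URL does not exist, while null means the request itself failed, for example because the host could not be reached.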

In summary, relying on a validating function alone to verify the integrity of a URL is not a complete solution; other, related functions need to be added to complete the check.