Get hostname from a URL using JavaScript

Sometimes we have strings that contain URLs, and we want to retrieve the hostname from each URL, for example to gather statistics. Given the URL http://www.example.com/aboutus.html, we may want to extract www.example.com. How? Use a regular expression. Here is an example in JavaScript. If you want to check whether a string is a URL at all, refer to Detect URLs in a Block of Text.

In JavaScript, we can write a regular expression like this:

var pattern=/(.+:\/\/)?([^\/]+)(\/.*)*/i;

This regular expression can be used to get the hostname. The pattern contains three pairs of parentheses. Each pair groups part of the pattern, and when the pattern is run against a target string, the substring matched by each group is remembered and returned in an array. We can then retrieve the hostname from that array. The first group matches the protocol of the URL, such as http://, https://, ftp:// or file://. A URL can contain zero or one occurrence of the protocol. The second group matches the hostname: everything before the first '/' after the protocol belongs to the hostname. If there is no '/', the whole string after the protocol is the hostname. The third group matches everything after the hostname.

For example, if we have a URL string

var url="http://www.example.com/aboutus.html";

After we run

var arr=pattern.exec(url);

The returned array arr contains 4 elements. arr[0] is the whole matched string, http://www.example.com/aboutus.html. arr[1] is http://, the substring matched by the first group; arr[2] is the hostname www.example.com, matched by the second group; arr[3] is /aboutus.html, matched by the third group.
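Putting the pieces above together, the walkthrough can be run as one short script. This is only a sketch that reuses the pattern and URL from the text; the console.log calls are added for illustration.

```javascript
// The pattern and URL from the text above.
var pattern = /(.+:\/\/)?([^\/]+)(\/.*)*/i;
var url = "http://www.example.com/aboutus.html";

// exec returns the whole match followed by one entry per group.
var arr = pattern.exec(url);

console.log(arr.length); // 4
console.log(arr[0]);     // http://www.example.com/aboutus.html
console.log(arr[1]);     // http://
console.log(arr[2]);     // www.example.com
console.log(arr[3]);     // /aboutus.html
```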

What if the URL doesn't start with http://? We can still use this pattern, and it still returns an array of 4 items. The only difference is that arr[1] is undefined, since the first group matched nothing. The same applies if the URL has no /index.html or similar path appended: in that case arr[3] is undefined.
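The two cases above can be checked with a small sketch. In JavaScript, a group that takes no part in the match appears in the result array as undefined rather than as an empty string. The variable names noProtocol and noPath are made up for this illustration.

```javascript
var pattern = /(.+:\/\/)?([^\/]+)(\/.*)*/i;

// No protocol: the first group does not take part in the match.
var noProtocol = pattern.exec("www.example.com/aboutus.html");
console.log(noProtocol.length); // 4
console.log(noProtocol[1]);     // undefined
console.log(noProtocol[2]);     // www.example.com

// No path: the third group does not take part in the match.
var noPath = pattern.exec("http://www.example.com");
console.log(noPath.length); // 4
console.log(noPath[2]);     // www.example.com
console.log(noPath[3]);     // undefined
```

When testing for a missing protocol or path, compare against undefined instead of "".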

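So for any URL of this form, we can get the hostname with arr[2]. Note that arr[2] also includes the port number when the URL has one, as in www.example.com:8080. Hope this helps when you want to know which host a URL belongs to. The pattern can also be used in other programming languages that support this regular expression syntax.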
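For repeated use, the lookup could be wrapped in a small function. The sketch below assumes a helper named getHostname, which is a hypothetical name and not part of the original example; it returns null when the pattern does not match at all.

```javascript
// Hypothetical helper built around the pattern from this article.
function getHostname(url) {
  var pattern = /(.+:\/\/)?([^\/]+)(\/.*)*/i;
  var arr = pattern.exec(url);
  // exec returns null when nothing matches, for example on an empty string.
  return arr ? arr[2] : null;
}

console.log(getHostname("https://www.example.com/aboutus.html")); // www.example.com
console.log(getHostname("www.example.com"));                      // www.example.com
console.log(getHostname(""));                                     // null
```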