Off the top of my head, I believe <host> can consist of alphanumeric characters, hyphens, and dots.

Looking back at that RFC, that's about right. At least, it was until recently, when internationalized TLDs with non-Latin characters were introduced. Luckily, even with internationalized TLDs, the syntax of a URL still follows that original pattern. So to account for them, I'll just have to use a more generic pattern for the <host> than I normally would, and make sure whatever regular expression engine I'm using can work with multi-byte character sets.

"Anything other than a port separator (a colon), a path separator (a forward slash), or any kind of whitespace, three or more times" should do it. It's pretty generic, but combined with the beginning and end of the pattern to anchor it, there shouldn't be many false positives, and any misses will err on the side of removal.

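As a rough sketch, that description of the <host> could be written like this in Ruby. The method names host_pattern and host_of are made up for illustration, and the "[a-z]+://" prefix is only an assumed stand-in for the beginning of the full pattern.

```ruby
# Sketch only: host_pattern and host_of are hypothetical names, and the
# "[a-z]+://" scheme prefix is an assumption, not the real start of the pattern.
def host_pattern
  # Anything other than ":", "/", or whitespace, three or more times.
  %r{[^:/\s]{3,}}
end

def host_of(url)
  # Anchor at the start of the string, skip the assumed scheme, capture the host.
  m = url.match(%r{\A[a-z]+://(#{host_pattern})})
  m && m[1]
end
```

Because the character class only excludes the colon, the slash, and whitespace, a host written in a non-Latin script is captured the same way as an ASCII one, and a host shorter than three characters yields nil.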
Now, the path separator and path are also optional. However, there isn't always a path when there's a path separator, yet there is always a path separator when there's a path. So what I'll do is wrap them both in a "zero or one time" sub-pattern. Then, since a question mark is what marks the next section of a URL, I'll use an "anything other than whitespace or a question mark, zero or more times" after the path separator.
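Here is a minimal sketch of that optional sub-pattern in Ruby. The helper name path_of is hypothetical, and the scheme and host portion in front of it is an assumed stand-in for the earlier part of the pattern.

```ruby
# Sketch only: path_of is a hypothetical helper; the scheme and host portion
# ("[a-z]+://" plus the generic host class) is assumed for the sake of a runnable example.
def path_of(url)
  # (/[^\s?]*)? : an optional group holding the path separator followed by
  # zero or more characters that are neither whitespace nor "?".
  m = url.match(%r{\A[a-z]+://[^:/\s]{3,}(/[^\s?]*)?})
  m && m[1]
end
```

With this, "http://example.com" still matches but has no path (nil), "http://example.com/" gives just the separator "/", and "http://example.com/a/b?x=1" gives "/a/b", stopping at the question mark.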

Now, one thing that's not included in that RFC is a mention of the <hash> (http://domain.tld/path?searchpart#hash), probably because the hash is used by the browser only and never actually sent to a server. The <hash> works similarly to the <searchpart>, but either of them can exist without the other being there.

The pattern as-is will catch the <hash> already, but only if there's a question mark before it. Since the part of the pattern catching the <searchpart> is so generic, I can swap out that "\?" with a "question mark or pound symbol" and have it catch a query string and/or a hash.
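To illustrate the swap, here is a small Ruby sketch comparing the two forms of that trailing piece in isolation. It assumes the generic <searchpart> part looks like "\?\S*", and the method names query_only and query_or_hash are hypothetical.

```ruby
# Sketch only: both method names are hypothetical, and "\?\S*" is an assumed
# form of the existing <searchpart> piece of the pattern.
def query_only
  # Before the swap: must start with "?", then any run of non-whitespace.
  /\A\?\S*\z/
end

def query_or_hash
  # After the swap: may start with either "?" or "#".
  /\A[?#]\S*\z/
end
```

The first form accepts "?a=1" and "?a=1#top" but rejects a bare "#top". The second accepts all three.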

Luckily, I caught an unescaped forward slash before I posted this. I don't know if you can use alternate delimiters in Ruby, but if you can use something other than the traditional forward slash when working with a URL and regular expressions, you should. I normally use the pound symbol when working with regular expressions, but since there's one in my pattern this time, I'll use a tilde instead.
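For what it's worth, Ruby does allow this through its %r literal, which accepts other delimiters such as braces or a tilde. The sketch below writes the same fragment three ways. The method names are hypothetical, and the "https?://" prefix is only an assumed example, not the full pattern.

```ruby
# Sketch only: the three method names are hypothetical, and "https?://" is an
# assumed example prefix used to show the escaping difference.
def with_slashes
  # Traditional delimiters: every "/" inside the pattern needs a backslash.
  /https?:\/\/[^:\/\s]{3,}/
end

def with_braces
  # %r{...}: forward slashes can be written as-is.
  %r{https?://[^:/\s]{3,}}
end

def with_tildes
  # %r~...~: the tilde delimiter, again with no escaping of "/".
  %r~https?://[^:/\s]{3,}~
end
```

All three match the same text. Given "see http://example.com/x", each one extracts "http://example.com".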