5 Ways URLs Go Wrong

A tour of the traps and potholes that inhibit us from getting URL naming right

Every programmer knows the adage that there are only two hard things in computer science: cache invalidation and naming things. Spinning our spotlight straight to the second difficulty, that of naming things, let’s enquire about what exactly programmers mean by “things”. As I’ve harped on about elsewhere, once upon a time I only considered the problem of naming to pertain to “pure”-code concepts, e.g. parameter names, function names, class names, and module names. Little did I realise that features outside the realm of “pure” code are also deserving of the same heightened care and vigilance when naming. What features do I have in mind? The names of files that interact with programs, the names of various online advertising tracking profiles, and (as the article title alludes to) URLs. What follows is a tour of the traps and potholes that inhibit us from getting URL naming right.

The Necessity of Stability

For website owners lucky enough to have cracked SEO, each individual landing page URL has a financial worth that can be roughly quantified by measuring the revenue generated by customers who first encountered the business through that very page. If, last year, 4,000 students landed on your web page about the anatomy of the human heart, and if these students bought, in aggregate, 750 pounds’ worth of anatomy notes, then said URL pays a dividend of 750 pounds per year.

SEO is worth a lot of money, so it’s no surprise that every website owner on this planet (and most website owners on others) is looking for the cheat codes that’ll blast them to the highest highs of Google search rankings. But their efforts are doomed to fail unless they ground their strategy on the most fundamental tenet of good URLs: stability.

The society in which your website finds itself is becoming densely populated. In it are people, bots, APIs, RSS feeds, advertising and data platforms, search engines, the NSA, and god knows what else. In order for any of these external agents to interact with your site (for good or for bad), they first need to locate the content they want. They do this through the URL system, which they assume, more or less, to be an indelible permanent record. As such, any time you modify the public surface area of your website’s URLs, you risk severing your connections to wider internet society. For example, a blogger linking to one of your articles doesn’t have any reasonable way of knowing you’ve moved your articles to a subdomain. As a result, their outbound links will be broken after the URL change; now, traffic arriving from that blog will be greeted with a 404 error page. Not exactly impressive or reassuring…

The bandage for these bruises consists of this: Program your server to intercept requests for the old URLs and then HTTP redirect them to wherever that content now resides. This has the effect that the old surface area of your website remains intact and constant from society’s point of view, if only as a conduit. This solution assumes both that you remember to redirect all your modified URLs and also that it’s possible for you to redirect without clobbering or compromising the new URL structure. Having been bitten by these problems before, I’ve come to the conclusion that “I wish I had”. More specifically, I wish I had spent more time thinking carefully about my URL structures upfront, in order to avoid the whole mess I kicked up.
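To make the redirect idea concrete, here is a minimal sketch in Python using only the standard library. The paths, the MOVED and LIVE tables, and the resolve function are all invented for illustration; a real site would more likely express this in its web framework or server configuration.

```python
from http import HTTPStatus

# Hypothetical map of retired paths to their current locations.
MOVED = {
    "/articles/heart-anatomy": "/blog/heart-anatomy",
}

# Hypothetical set of paths the site currently serves.
LIVE = {"/blog/heart-anatomy"}


def resolve(path):
    """Return (status, location) for a requested path."""
    if path in LIVE:
        return HTTPStatus.OK, path
    if path in MOVED:
        # 301 signals a permanent move to browsers and crawlers.
        return HTTPStatus.MOVED_PERMANENTLY, MOVED[path]
    return HTTPStatus.NOT_FOUND, None
```

Note that the sketch only works if every retired path actually makes it into the map, which is exactly the "remember to redirect all your modified URLs" assumption above.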

The Five Ways URLs Go Wrong

Everyone likes a story, especially one where the writer makes himself look like a tool. Here’s a story about me completely botching up my critical product URLs, despite having the best of intentions and believing, at the time, that I was following the best best practices. I hope this will be illustrative of what not to do.

Let’s begin with a look at my original URL structure and the reasons I had for adopting it: The typical customer of a notes-selling website like mine has three burning questions about any given package: What topic does it cover? At what university was it written? And how long ago was it written?

What better idea, I thought, than answering all these questions within the URL itself? Not only would the URL be a clear, self-contained description of its contents, but it would be well-suited for marketing purposes: I believed that URLs containing topic, university, and exam year would cause the website to sweep up the longest of search query tails and bring me mountains of traffic.

Before we go any further, I need to explain something about how my notes-selling platform works: Authors our team have already approved have the power to upload their documents through special dashboards. As part of this publishing process, these authors need to provide us with the key trifecta of data about topic, university, and exam year. This data first gets shunted into our database and eventually winds up as the URLs the web application generates. TLDR: Our URLs are a direct consequence of data input by our users.

What could go wrong? Ha!!

1. Users change previously input information.

One year in, an anatomy notes author decides that the wordier “Oxford University” sounds classier than “Oxford” alone, so she logs in, opens up the edit form, changes her university field, and hits save. What are the consequences for us? The previous URL (“/notes/oxford/2013/anatomy”) now returns everyone’s favourite HTTP gremlin: 404. Any external websites (e.g. Google AdWords, Facebook Ads, random blogs) linking to this old URL now drive traffic to a dead page. What’s more, the steady stream of search engine traffic we once took for granted suddenly dries up as Google boots our non-respondent asses out of its index.

How does one mitigate the risk of users changing information that appears in URLs?

Most trivially, you could avoid placing changeable (read: database-backed) information into URLs in the first place. But what if that info is too important for SEO to leave out?

How about this simple solution then: Deny website end users (and perhaps administrators too) their ability to modify already-entered information used within URLs. Authors can initially enter their university, but after they click save, no further changes are permitted: That initial entry, no matter how flawed, is permanent. (Perhaps there could be a grace period of a few days, since it’s unlikely any significant SEO will build up so quickly.) This is a sort of permalinking strategy.
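A rough sketch of that rule in Python follows. The three-day window and the function name are my own assumptions for the example (the text above only says "a few days"), and a real application would enforce the check wherever it handles the edit form.

```python
from datetime import datetime, timedelta

# Assumed grace period; pick whatever "a few days" means for your site.
GRACE_PERIOD = timedelta(days=3)


def can_edit_url_fields(created_at, now):
    """Allow edits to URL-bearing fields only within the grace period."""
    return now - created_at <= GRACE_PERIOD
```

Fields that never appear in a URL (a product description, say) would not need to pass through this check at all.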

Yet another possibility, this one for the completionist website owners who want to preserve all the old URLs and also retain the future possibility of end users modifying these URLs: Create a database table of every historical URL ever, and then, whenever an unrecognised URL is encountered, don’t 404 until you’ve searched the table for possible redirects. For a price paid in website performance and engineering time (/library-finding time), your website will effectively have its cake and get to eat it too.
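Here is one way the historical-URL table might look, sketched with Python's built-in sqlite3 module. The table names, column names, and helper functions are hypothetical, not taken from my actual platform. One design choice worth flagging: the history row points at the product's id rather than at its next URL, so a page that has been renamed several times still redirects in a single hop.

```python
import sqlite3


def open_history_db():
    """Create an in-memory database with hypothetical tables."""
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE notes (id INTEGER PRIMARY KEY, path TEXT UNIQUE NOT NULL);
        CREATE TABLE url_history (
            old_path TEXT PRIMARY KEY,
            note_id  INTEGER NOT NULL REFERENCES notes(id)
        );
    """)
    return conn


def rename(conn, note_id, new_path):
    """Change a product's path while remembering the old one."""
    (old_path,) = conn.execute(
        "SELECT path FROM notes WHERE id = ?", (note_id,)
    ).fetchone()
    conn.execute(
        "INSERT OR REPLACE INTO url_history (old_path, note_id) VALUES (?, ?)",
        (old_path, note_id),
    )
    # A path that becomes live again should no longer count as historical.
    conn.execute("DELETE FROM url_history WHERE old_path = ?", (new_path,))
    conn.execute("UPDATE notes SET path = ? WHERE id = ?", (new_path, note_id))


def lookup(conn, path):
    """Return (200, path), (301, current_path), or (404, None)."""
    if conn.execute("SELECT 1 FROM notes WHERE path = ?", (path,)).fetchone():
        return 200, path
    row = conn.execute(
        "SELECT n.path FROM url_history h "
        "JOIN notes n ON n.id = h.note_id WHERE h.old_path = ?",
        (path,),
    ).fetchone()
    if row:
        return 301, row[0]
    return 404, None
```

The extra query only runs for paths that miss the live table, so ordinary page views pay nothing for it.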

2. Two users have more or less identical data.

Say that another Oxford-educated author with anatomy notes created in 2013 signs up: What should the URL for their notes be, given that “/notes/oxford/2013/anatomy” is already taken? My initial “solution” was to append the term “author-N” to the end of the URL, giving me “/notes/oxford/2013/anatomy-author-2”. But this naming strategy failed to memorably (or even clearly) communicate the differences between the note-taking styles and focus of each author. Neither customers nor staff could remember which product was which, let alone distinguish them within on-site product listing pages or off-site search results. Confusion resulted.

What can we do to prevent this problem? Unfortunately, the answer lies outside the current reach of automation. You need to have a human being manually giving unique descriptive names to uploaded products based on each package’s contents. Maybe the first anatomy notes product would be “anatomy-notes-flowcharts” and the second one would be “anatomy-diagrams-collection”. Through this approach, you’ll make your products more distinct and memorable to consumers, and you’ll also ease intracompany communication.

A variant of this idea is available if you don’t have the resources to hire someone to generate unique names: You could delegate this task to the author on your platform creating the product. This could be done through a combination of programmatic checks on the given names (e.g. “Are they unique?”) and inline advice given to uploaders. That said, this approach isn’t ideal, since there’s the very real risk the authors do a lacklustre job.
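The programmatic checks could be as small as the sketch below. The specific rules, the messages, and the check_name function are assumptions of mine for the sake of the example; the messages double as the inline advice shown to the uploader.

```python
import re

# Assumed format rule: lowercase words and digits joined by single hyphens.
SLUG_RE = re.compile(r"^[a-z0-9]+(?:-[a-z0-9]+)*$")


def check_name(name, existing):
    """Return a list of problems with a proposed name (empty if it passes)."""
    problems = []
    if not SLUG_RE.match(name):
        problems.append("Use lowercase letters, digits, and single hyphens only.")
    if name in existing:
        problems.append("That name is already taken.")
    if re.search(r"author-\d+$", name):
        problems.append("Describe the contents rather than numbering the author.")
    return problems
```

For instance, with “anatomy-notes-flowcharts” already taken, “anatomy-diagrams-collection” passes cleanly, whereas “anatomy-author-2” gets flagged. No check like this can tell whether a name is actually descriptive, which is why the human-curated approach remains the safer one.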

3. Keywords fall out of date.

With the exception of wines, whiskeys, and cheeses, people generally prefer the new. I found this to be true in my business, whereby students searching for notes in Google often appended the current year. Thus in 2013, history students searched for “2013 Battle of Stalingrad notes”, and in 2014 they searched for “2014 Battle of Stalingrad”, etc. They did this even if the content hadn’t changed in decades.

Let’s say I had 2013 notes stored at the URL “/notes/oxford/2013/stalingrad”. While this would have rocked SEO rankings in 2013, it would have taken a hammering in 2014, even though the content would have been just as relevant to the end consumer.

(I should mention too that many of my authors update their notes as the years pass, further complicating matters with URLs.)

Herein lay the problem: By including perishable components (i.e. years) in my URLs, I found myself in a dilemma. Should I keep these URLs the same as time passes, or should I update them?

The argument for keeping the 2013 URLs in 2014/2015/etc. would be that I get to preserve existing SEO built up in Google’s index by that particular URL. The downside of this approach would be that (1) I won’t rank as highly for Google queries that include the current year; and (2) I risk confusing consumers who might see 2013 in the URL for a product which might well have been updated for use in 2016 and have on-page H1 and title elements saying the notes are for 2016.

This issue could have been avoided if I had refrained from placing transient information in my URLs. When first building the website, I should have asked myself: “Will any information I plan on placing in the URLs eventually fall out of date?”

How does this restrictive approach affect my legitimate desire to accrue search engine results for search queries containing specific years, like “Stalingrad notes 2014”? Perhaps negatively, but I would do my best to compensate by placing the current year within other on-page elements, like the title tag, the uppermost H1 tag, or my internal links.

4. Nested URLs multiply the number of broken URLs upon change of parent.

In line with the previous examples, let’s say “/notes/oxford/2013/anatomy” is my URL for an anatomy notes product. Now let’s say that this notes package contains ten chapters about individual body parts, and my software gives each of these chapters its own dedicated landing page. I now have URLs like “/notes/oxford/2013/anatomy/chapters/arm-structure” and “/notes/oxford/2013/anatomy/chapters/heart-structure” etc. What happens when the author changes their exam institution from “Oxford” to “Oxford University”? All the chapter URLs, because of their dependence on the stability of the parent, will break, annihilating any SEO they may have built up over the previous years.

What can we do to soften the contagion potential of changes to parent URLs? I gave a few options earlier (e.g. using permalinks or building a database of historical URLs for automatic reroutes). Another option would be to minimise the “surface area of change” within parent URLs: Instead of having three factors prone to change (university, exam year, and subject name), the parent URL, at least when nested, might be restricted to having only one item prone to change (e.g. the subject name). An alternative option would be to avoid nesting at all: The individual chapters might be available for browsing at “/chapters/heart-structure” instead of at “/notes/oxford/2013/anatomy/chapters/heart-structure”.

5. Multi-attribute URLs slow performance.

URLs that depend on many factors to uniquely identify a resource (e.g. “/notes/{university}/{exam-year}/{topic-name}”) are more likely to need slow, complicated SQL calls for retrieving the correct record from the database than simpler URLs like “/notes/{topic-name}”. A URL like “/notes/anatomy-notes” only requires that the software search the database for the notes product with TOPIC_NAME=anatomy-notes, which is a small, simple call. By contrast, a URL like “/notes/oxford/2013/anatomy/chapters/heart-structure” depends on four factors: the university (“Oxford”), the exam year (“2013”), the subject name (“anatomy”), and the chapter name (“heart-structure”). The corresponding SQL call is much more complicated and may be very slow, since it likely needs to reference multiple database tables. Just imagine how many joins would be needed if we had a notes_products table, a universities table, and a chapters table… it would be glacial.

The solution to this particular woe is guaranteeing that your software only needs to search through one database table column in order to find the record corresponding to a particular URL. This doesn’t necessarily mean that the URL is restricted to containing only one category of information. For example, you could still program your website to create permalinks that are concatenations of the university name, exam year, and subject name. The permalink would simply be “oxford-2013-anatomy”, a pre-built string that would get stored in a special permalink column in the notes_products table. Now, only this column needs to be referred to when retrieving the record, which makes for a comparatively simple call.
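A minimal sketch of the permalink column, again using Python's built-in sqlite3 module, might look like this. The notes_products table and permalink column come from the paragraph above; the title column, the slug rule, and the function names are assumptions for illustration.

```python
import re
import sqlite3


def make_permalink(university, exam_year, subject):
    """Concatenate the three fields into one lowercase, hyphenated string."""
    raw = f"{university} {exam_year} {subject}".lower()
    return re.sub(r"[^a-z0-9]+", "-", raw).strip("-")


def open_products_db():
    """Create an in-memory database with a hypothetical products table."""
    conn = sqlite3.connect(":memory:")
    conn.execute("""
        CREATE TABLE notes_products (
            id        INTEGER PRIMARY KEY,
            title     TEXT NOT NULL,
            permalink TEXT NOT NULL UNIQUE  -- UNIQUE also gives SQLite an index
        )
    """)
    return conn


def find_by_permalink(conn, permalink):
    """Fetch a product with a single-column lookup and no joins."""
    return conn.execute(
        "SELECT id, title FROM notes_products WHERE permalink = ?",
        (permalink,),
    ).fetchone()
```

The permalink is computed once, when the product is saved, so serving a request for “/notes/oxford-2013-anatomy” boils down to one indexed lookup on one table.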