TL;DR: Approximately 10% of 1.5M randomly selected unique links in the March 2015 data dump are unavailable. To be more precise, that is approximately 150K dead links.

Motivation

I've been running into more and more dead links on Stack Overflow, and it's bothering me. In some cases I've spent the time hunting down a replacement, in others I've notified the owner of the post that a link is dead, and, more shamefully, in others I've simply ignored it and left only a downvote. Obviously that's not good.

Before making sweeping generalizations that there are dead links everywhere, though, I wanted to make sure I wasn't just finding bad posts because I was wandering through the review queues. Utilizing the March 2015 data dump, I randomly selected about 25% of the posts (both questions and answers) and then parsed out the links. This works out to 5.6M posts out of 21.7M total.
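Roughly, the extraction step looked like this. This is a simplified sketch, not the exact script I ran; it assumes the data dump's Posts.xml layout, with one row element per post and the rendered HTML in the Body attribute:

    import random
    import re
    import xml.etree.ElementTree as ET

    SAMPLE_RATE = 0.25  # roughly 25% of posts, as described above
    HREF_RE = re.compile(r'href="(http[^"]+)"')  # links in the rendered post body

    def sample_links(posts_xml_path):
        """Return the unique URLs found in a random ~25% sample of posts."""
        unique_urls = set()
        # iterparse keeps memory usage flat on a multi-gigabyte Posts.xml
        for _, row in ET.iterparse(posts_xml_path):
            if row.tag != "row" or random.random() > SAMPLE_RATE:
                row.clear()
                continue
            unique_urls.update(HREF_RE.findall(row.get("Body") or ""))
            row.clear()
        return unique_urls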

Of these 5.6M posts, 2.3M contained links, and 1.5M of those links were unique. I sent a GET request to each unique URL, with a user agent mimicking Firefox [1]. I then retested everything that didn't return a successful response a week later. Finally, anything that still failed got a third test a week after that. If a site was down in all three tests, I considered it dead for this analysis.
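The per-URL test can be sketched like this (simplified; the Firefox user-agent string below is only illustrative, and in the real run failing URLs were queued for the two later weekly passes rather than handled in a single loop):

    import requests

    # Illustrative Firefox-style user agent; not the exact string I used.
    FIREFOX_UA = "Mozilla/5.0 (X11; Linux x86_64; rv:38.0) Gecko/20100101 Firefox/38.0"

    def check_url(url, timeout=20):
        """Return True only for a plain 200; anything else gets retested later."""
        try:
            resp = requests.get(url, headers={"User-Agent": FIREFOX_UA}, timeout=timeout)
            return resp.status_code == 200
        except requests.RequestException:
            # DNS failure, refused connection, timeout, redirect loop, ...
            return False

    # unique_urls comes from the extraction sketch above
    still_failing = {u for u in unique_urls if not check_url(u)}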

Results [2]

By status code

Good news/bad news: the majority of the links returned a valid response, but roughly 10% still failed.

(Image: pie chart of the top status codes returned.)

The three largest slices of the pie are status 200 (site working!), status 404 (the server responded, but said the page couldn't be found), and connection errors. Connection errors are sites that gave no proper server response at all: the request to access the page timed out. I was generous with the timeout and allowed a request to live for 20 seconds before failing a link with this status. The 4xx and 5xx slices are status codes that fall in the 400 and 500 ranges of HTTP responses; these are the client and server error ranges, so they count as failures. The 2xx errors (of which there are a number in the low triple digits) are pages that responded with a success code in the 200 range, but not a plain 200. Finally, there were just over a hundred sites that hit a redirect loop that never seemed to end; these are the 3xx errors, and I failed a site in this category if it redirected more than 30 times. A negligible number of sites returned status codes in the 600 and 700 range [4].
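For concreteness, that bucketing can be expressed roughly as follows (a sketch; FIREFOX_UA is the illustrative user agent from the earlier snippet, and the requests library happens to give up and raise TooManyRedirects after 30 redirects by default, which matches the cutoff I used):

    import requests

    def classify(url, timeout=20):
        """Bucket a URL into the categories shown in the chart above (a sketch)."""
        try:
            resp = requests.get(url, headers={"User-Agent": FIREFOX_UA}, timeout=timeout)
        except requests.TooManyRedirects:
            return "3xx redirect loop"      # more than 30 redirects
        except requests.RequestException:
            return "connection error"       # no usable response within 20 seconds
        code = resp.status_code
        if code == 200:
            return "200"
        if code == 404:
            return "404"
        if 200 <= code < 300:
            return "2xx (non-200 success)"
        if 400 <= code < 600:
            return "4xx/5xx"
        return "other (e.g. 6xx/7xx)"       # the handful of non-standard codes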

By most common

As expected, many of the URLs that failed appear frequently in the sample set. Below is a list of the top 50 [3] URLs that appear in posts most often but failed in all three tests over the course of three weeks.

Discussion

What can we do with all of this? How do we, as a community, solve the issue of 10% of our outbound links pointing to places on the internet that no longer exist? Assuming that my sample was indicative of the entire data dump, there are close to 600K (150K broken unique links x 4, because I took 1/4 of the data dump as a sample) broken links posted in questions and answers on Stack Overflow. I assume a large number of links posted in comments would be broken as well, but that's an activity for another month.

We encourage posters to provide snippets from their links just in case a link dies. That definitely helps, but the resource behind the link, and the (presumably) expanded explanation it contained, are still gone. How can we properly deal with this?

Footnotes

1. This is how it ultimately played out. Originally I sent HEAD requests, in an effort to save bandwidth. That turned out to waste a lot of time, because plenty of sites around the internet return a 405 Method Not Allowed when sent a HEAD request. The next step was to send GET requests, but with the default Python requests user agent; a lot of sites returned 401 or 404 responses to that user agent.

2. Links to Stack Exchange sites were not counted in the above results. The failures seen are almost 100% due to a question/answer/comment being deleted. The process ran as an anonymous user, so it had no reputation and was served a 404; a user with the appropriate permissions can still visit the link. I verified a number of 404'd links to Stack Overflow posts, and this was the case.

3. The 4th most common failure was to localhost; the 16th and 17th most common were localhost on ports other than 80. I removed these from the result table, since they shouldn't be accessible from the internet anyway.

4. There were seven URLs in total that returned status codes in the 600 and 700 range. One such site was code.org, with a status code of 752. Sadly, that code isn't even defined in the joke RFC.

As soon as you come up with a solution to handle links that are temporarily dead, you can pick up the project Sam Saffron left behind when he tried to solve the dead-link issue. Sam left to do Discourse with Jeff Atwood, so the project is still up for grabs!
– Martijn Pieters ♦ Aug 6 '15 at 12:54


They should be fixed one at a time. With the ones that are actually useful first, it is entirely automatic when SO users run into them. And the ones that nobody cares about anymore ignored. Because nobody cares. Needs no more help than that.
– Hans Passant Aug 6 '15 at 13:00


@HansPassant, I understand that organically stumbling upon broken links is a good approach, but that's not the only way low-quality content is identified today. The Low Quality queue has heuristics for dumping a post in there without a user flag, and the First Posts queue is entirely automated. Both provide a preemptive way of improving quality or removing low-quality posts.
– Andy ♦ Aug 6 '15 at 13:10


You could probably come up with a good age/views threshold by looking for links that have been edited in the past. I'd be willing to bet that the ones that are getting fixed organically are on posts with more votes and views on average.
– Bill the Lizard Aug 6 '15 at 13:11


Per my observations, in posts that provide summaries of the linked content, the harm of link rot appears tolerable. As for the posts most likely to suffer substantially (link-only answers), it seems the Stack Exchange team has already implemented an automated bot that discovers such answers and pushes them into the LQ review queue.
– gnat Aug 6 '15 at 14:12


@MartijnPieters, I've added some more details based on your feedback. Roughly 2/3 have a score other than 0, and 25% of posts with broken links have fewer than 200 views.
– Andy ♦ Aug 6 '15 at 16:31


@Andy: can you add an age/views ratio, please? Just views or age, on their own, carry very little meaning. If an old post with a lot of views has a broken link, that's worse than a new post with only a few views. An old post with a few views is also not as interesting as a new post with the same number of views.
– Martijn Pieters ♦ Aug 6 '15 at 16:44


I've found a couple of statistics. A Harvard study showed that 49% of links mentioned in Supreme Court opinions are gone. For example, this link was cited. Pinterest has reported roughly a 5% rate of link rot per year - search for "5%" to find the relevant quote - making the distinction that they are saving bookmarks, which are inherently something you want to save, not just random links.
– Andy ♦ Aug 6 '15 at 18:25


@Andy: thanks, that aligns with my expectations; most dead links are in posts no one looks at, so they don't get fixed or flagged.
– Martijn Pieters ♦ Aug 6 '15 at 18:47


I like the link review queue idea; why did it die? It seems like one could prepare lists of dead links by post popularity and topic and just put them up in an available working area. I'm sure some people would correct them, for the combination of edit reputation and the curiosity of crawling around the more popular links in a topic. I would certainly do a fair number of Python-related links before getting bored.
– Ezekiel Kruglick Aug 6 '15 at 21:57


I think the link review queue wouldn't be terrible either. While there may be 150,000 links to review at the inception of the queue opening, I highly doubt that it gets very many on a daily basis and given enough support the community could easily work through that many after a few months. The new ones to review once the total was brought in line should be minimal in my opinion.
– Travis J Aug 6 '15 at 22:03


If you can provide me with an ARFF file, or at least a CSV file of your data, I can run it through some machine learning algorithms to see if a computer can effectively identify posts which are likely to have broken links. Then an algorithm could identify those posts and take action, like check the link and, if it is broken, notify a moderator.
– djhaskin987 Aug 6 '15 at 22:52

@RowlandShaw, example.[com|net|org] are all valid domains. They resolve and return a 200 message. localhost doesn't resolve on the internet, which is why I excluded it.
– Andy ♦ Aug 7 '15 at 12:28


Note that 81% of the broken links come from the top 4 links, so if we can fix those we would go from roughly 10% broken links down to about 2%. The most obvious solution would be for SE to agree to run a batch job that updates those links. Another way could be to use Documentation: every time you see a broken link, you edit in a link to the relevant Documentation, or create it (or ask some knowledgeable people to create it).
– Walfrat Jul 28 '17 at 12:26

12 Answers

I really think that, at least at this point, there isn't a problem. To the extent it is a problem, it is difficult to fix.

Stack Overflow is meant to be a Q&A site, not a repository of links. Encountering a dead link is an annoyance, but it doesn't instantly invalidate the answer, and often barely has any impact at all. This site has a policy of encouraging answers consisting of more than links exactly for this reason: so even if the link dies, the answer still survives and remains meaningful. If an answer consists of just a link, then this is the problem, not the dead links. I'd go as far as to say the question hasn't really been answered.

Many of the links are dead simply because the resource they pointed to has been moved to a slightly different location that any user could discover with a tiny bit of effort (for example, typing the name into Google). Take the link http://www.eclipse.org/eclipselink/moxy.php for example. Even though I don't trust casual users to actually fix the link, I do trust them not to be total idiots and just google eclipse moxy and follow one of the top three results to the new location.

In other cases, it's simply impossible to fix a link at all, except by a person who is familiar with the subject. This is a more significant problem, but unfortunately not one that is fixable automatically.

For example, take the link http://www.db4o.com, to the object database db4o. db4o hasn't existed for a while now and is no longer supported by the developer. You might be able to find the source code or the binaries, but I would not fix the link to point to them, because I would not recommend it to anyone (since it's dead). The problem is not really that the link is dead, but rather that the product has ceased to exist, and the answer that recommends it is no longer valid. It can only be fixed by posting a new answer, voting, and comments. These things might already exist on the questions you looked at.

Also, a major problem with any automatic scheme to fix dead links is the potential for error. A link that points to something else, or to something that is no longer a valid answer, is a lot worse than a dead link, in the same way that misinformation is a lot worse than a lack of information. It really might confuse users, or have them using outdated software.

If the bulk of the dead links continues to grow, and if popular answers get hit as well, I really would like to do something about it, largely because it makes the site look dated and unprofessional. As it stands, an attempt at fixing it would be nice, but not something I think is important. Personally, I have encountered very few dead links as a casual user.

If a link is incidental it doesn't matter too much if it goes bad, but I've encountered enough link-only answers to think that SO needs to be much stricter about rejecting link-only answers, whether the link is dead or not. At the moment it's not uncommon to have a 'Not an answer' flag on a link-only answer rejected.
– Ian Goldby Aug 7 '15 at 12:02


@IanGoldby: The opposite problem also exists: answers that are not link-only in any real sense are flagged and, not infrequently, deleted. LQP reviewers are just not all that accurate.
– Nathan Tuggy Aug 8 '15 at 1:57

The world wide web's sole purpose was to link relevant documents together. With no (working) links, there's no web.

So I think every effort that can be undertaken to fix broken links, is a good effort.

We shouldn't rely on users fixing their own post. We have way more inactive than active users.

Perhaps there could be something like a "broken link queue", where users can report a broken link (A) and suggest a replacement (B). Then when agreed upon by reviewers and/or moderators, the system (Community user) could replace all instances in all posts of link A with link B.

Of course this is very spam-sensitive, so the actual implementation details need to be worked out pretty tightly.

I don't fully agree with you, and proposed a different answer that combines your answer with Jan Doggen's. If we step beyond the original posters, we might lose critical knowledge that could be used to fix these links. Therefore I think it would be best to inform the posters that their links are dead. If no action results from this, your option should be implemented.
– Luuklag Aug 7 '15 at 11:55

My answer is aimed towards often-used links like links to blog posts and documentation. I don't care that much for links that occur in exactly one post.
– CodeCaster Aug 7 '15 at 12:05


Of course priority needs to go where the most benefit is to be gained. However, we also want people to research their questions and hopefully find answers before they need to write their own question. Any broken link that prevents this is worth salvaging, if you ask me.
– Luuklag Aug 7 '15 at 12:07


+1 for the idea of a broken link queue, but I don't think automatic replacement of all links is the way to go, at least unless it's simply a matter of the URL having been changed. For other cases, blocks of 10-20 questions might be a better way: first ask for alternative links, then ask for second opinions on which link is best. A similar approach could be used to test for outdated questions, though that queue would be generated automatically. Then again, a broken link queue could probably be auto-generated as well.
– Nuclearman Aug 9 '15 at 10:54

Don't spam the queue. Put a broken-link warning in the post body somewhere, like Wikipedia does with any issues on a page that need fixing. Make sure it draws enough attention without being overly obtrusive. Those who know the area will fix the links if they happen to be there. Those who came to visit will not be frustrated as much. Why? Because if the question is not popular, fixing and even reporting broken links has little value.
– Neolisk Nov 18 '16 at 2:14

The broken link queue should focus on editing and fixing the links in a post (as opposed to closing it). It would be similar to the suggested edits queue, but focused on correcting links rather than spelling and grammar. This could be done by only allowing the user to edit the links.

One possibility I envision is presenting the user with the links in the post and a status showing whether or not each link is available. If a link is not available, give the user a way to change that specific link. Utilizing this post, I have a quick mock-up of what such a review task could look like:

All the links that appear in the post are on the right hand side of the screen. The links that are accessible have a green check mark. The ones that are broken (and the reason for being in this queue) have a red X. When a user elects to fix a post, they are presented with a modal showing only the broken URLs.

With this queue, though, I think an automated process would be helpful as well. The idea is that this would operate similarly to the Low Quality queue, where the system can automatically add a post to the queue if certain criteria are met or a user can flag a post as having broken links. I've based my idea on what Tim Post outlined in the comments to a previous post.

Automated process performs a "Today in History" type check. This keeps the fixes limited to a small subset of posts per day. It also focuses on older posts, which were more likely to have a broken link than something posted recently. Example: On July 31, 2015, the only posts being checked for bad links would be anything posted on July 31 in any year 2008 through current year - 1.

Utilizing the Wayback Machine API, or a similar service, the system attempts to change broken links into an archived version of the URL. This archived version should probably be from "close" to the time the post was originally made. If the automated process isn't able to find an archived version of the link, the post should be tossed into the Broken Link queue.
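The Wayback Machine has a public availability endpoint that could drive this lookup; here is a rough sketch, leaving the snapshot-selection policy and error handling as open details:

    import requests

    def archived_version(url, post_date):
        """Ask the Wayback Machine for a snapshot near the post's creation date.

        post_date is a 'YYYYMMDD' string; returns an archive URL or None.
        """
        resp = requests.get("https://archive.org/wayback/available",
                            params={"url": url, "timestamp": post_date},
                            timeout=20)
        closest = resp.json().get("archived_snapshots", {}).get("closest")
        if closest and closest.get("available"):
            return closest["url"]
        return None  # nothing archived: hand the post to the Broken Link queue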

When the Community edits a post to fix a link, a new Post History event is utilized to show that a link was changed. This would allow anyone looking at revision history to easily see that a specific change was only to fix links.

Actions performed in the previous bullets are exposed to 10K users in the moderator tools. Much like recent close/delete posts show up, these do as well. This allows higher rep users to spot check (if they so desire). I think this portion is important when the automated process fixes a link. For community edits in the queue, the history tab in /review seems sufficient.

If a large percentage of a post consists of a link (or links) and those links were changed by the Community user, the post should have further action taken on it in some queue.

Example:

A post where X+% of the text is hyperlinks is very dependent on the links being active. If one or more of the links are broken, the post may no longer be relevant (or may be a link-only post). One example I found while doing this was this answer.

I don't think that this type of edit from the Community user should bump a post to the front page. Edits done in the broken link queue, though, should bump the post just like a suggested edit does today. By preventing the automated Community edits from being bumped, we prevent the front page from being flooded, daily, with old posts and these edits. I think that the exposure in the 10K tools and the broken link queue will provide the visibility needed to check that the process is working correctly.

Queue Flow:

Automated process flow:

The automated link checking will likely run into several of the problems I did. Mainly:

Some sites respond to a HEAD request with a 404 instead of a 405. My solution to this was to issue GET requests for everything.

Sites don't like certain user agents. My solution to this was to mimic the Firefox user agent. To be a good internet citizen, Stack Exchange probably shouldn't go that far, but providing a unique user agent that is easily identifiable as "StackExchangeBot" (think "GoogleBot") should be helpful in identifying where traffic is coming from.

Sites that are down one week and up another. I solved this by spreading my tests over a period of 3 weeks. With the queue and automatic linking to an archived version of the site, this may not be necessary. However, immediately converting a link to an archived copy should be discussed by the community. Do we convert the broken link immediately, or do we try again in X days and, if it's still down, then convert it? It was suggested in another answer that we first offer the poster the chance to make changes before an automatic process takes place.

The need to throttle requests so that you don't flood a site. I solved this by only querying unique URLs, but that still issues a lot of requests to certain popular domains. It could be addressed by staggering the checks over a period of minutes or hours rather than spewing hundreds to thousands of GET requests at midnight daily.
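As a rough illustration of that last point, the checks could be paced per host instead of fired in one burst; the one-second delay here is an arbitrary placeholder, and check_url is the function from the sketch in the question:

    import time
    from collections import defaultdict
    from urllib.parse import urlparse

    def check_politely(urls, delay_per_host=1.0):
        """Pace requests so no single host sees a burst of checks."""
        last_hit = defaultdict(float)   # host -> time of our previous request to it
        results = {}
        for url in urls:
            host = urlparse(url).netloc
            wait = delay_per_host - (time.monotonic() - last_hit[host])
            if wait > 0:
                time.sleep(wait)        # back off if we hit this host very recently
            results[url] = check_url(url)
            last_hit[host] = time.monotonic()
        return results

In practice the URLs would probably be interleaved by host rather than slept on, but the idea is the same.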

With the broken link queue, I feel the first two would be acceptable. Much like posts in the Low Quality queue appear because of a heuristic, despite not being low quality, links will be the same way. The system will flag them as broken and the queue will determine if that is true (if an archived version of the site can't be found by the automated process). The bullet about throttling requests is an implementation detail that I'm sure the developers would be able to figure out.

Very nice answer. My only objection to your proposed review queue would be the ability to only update a link. While that might result in fewer posts being bumped to the front page, it prevents people from incorporating other edits that could improve the post. For example, I am currently working on this: meta.stackoverflow.com/questions/256623/what-to-do-about-macros and retagging a lot of old questions. I also edit the word macro inside the text to something regarding VBA. Besides that, I tend to remove noise like "Thank you" etc. at the same time, so improving more than just tags.
– Luuklag Aug 7 '15 at 12:42


Like @Luuklag, I think having an option (perhaps a separate review action) to do more thorough edits would be quite desirable. If this is a 2k queue or higher, it shouldn't make any difference; alternatively, if it's a 500 or 1k queue, only allow those with 2k to make freeform edits in the queue (and allow anyone to fix links only).
– Nathan Tuggy Aug 8 '15 at 2:00

I agree with the "this post needs more edits" function as well. If the link-review queue is for <2000 rep, then it's acceptable (in my opinion) to obfuscate that option; otherwise it can be displayed as normal. But otherwise, this is a decent suggestion.
– Draco18s Jun 12 '17 at 18:20

We don't; at least, not on the scale you're proposing. If a link is important enough that a user followed it to find it was broken, then it's important enough for that user to make an effort to find the current correct location for that link, and edit the offending post to correct it (or remove the link). That's just good etiquette, and honestly how much of an effort is it to plug a URL into the Wayback Machine and see what comes up? Five minutes?

Not all problems need solving. Not all problems need solving with automation.

If there are hundreds of thousands of broken posts with a handful of links to well-known resources, perhaps someone on the crew could perform a database update to fix these links to their new location.
– CodeCaster Aug 7 '15 at 10:58


I kind of agree with you, in that this isn't a problem that needs solving. However, your description of the responsible user that goes and looks for a replacement link is kind of fantastical. That's not what's going to happen. The links would just not get fixed, unless they appear in very popular questions.
– GregRos Aug 7 '15 at 11:22


"user", in this case, assumes that it is someone that is familiar enough with Stack Overflow and not a random Googler that happened upon a post.
– Andy ♦ Aug 7 '15 at 12:08

I never knew there was an archive of the whole web. So useful! Thanks for that.
– Ian Jun 12 at 10:04

While doing this analysis, I informed some posters about broken links. The responses I got back ranged from "Thanks, I've updated it" to "That resource appears to be gone, I can't find a new one" to "You can fix it" to simply being ignored.
– Andy ♦ Aug 7 '15 at 12:07


@Andy Very noble of you to inform the original posters. Too bad even a user as reputable as yourself gets such degrading answers. That's why it's perhaps best to automate this process of notification, as I proposed in my answer. That way, people who are willing to update their links are given the chance to do so, and people who don't want to do anything can just delete the notification and be done with it. Nobody gets demeaning messages and we can all live happily ever after.
– Luuklag Aug 7 '15 at 12:27


Actually, just update it. That's one of the reasons we have editing. @Luuklag: You only have to be sure it's a good replacement, and you give a good edit-comment.
– Deduplicator Aug 7 '15 at 22:50

We need a link-checking mechanism that works the way your test did, plus a new 'Dead links' review queue.

The only proper way to handle dead links on the SE sites (which strive for high quality) is human intervention: find a replacement link if we can (best), otherwise edit the post and remove the link.

This has the advantage that you have a systematic approach to correcting links by people who want to, instead of leaving it to the whims of the person who stumbles upon the dead link, or of the original author (if notified).

I think combining some of the answers I read here would be most promising. My answer combines the answers of CodeCaster and Jan Doggen.

If the SE servers automatically check, at some regular interval, whether links are still alive, a follow-up action can then be automated.

Assume that the original poster of a link has the best knowledge available to replace it when it breaks. The most promising first step, then, is to inform the original poster that their link has died and request that they replace it or edit their post.

If this does not resolve the problem within a given time (two weeks, maybe a month), it would be safe to assume that no action will be taken by the original poster, either because they don't want to or because they are inactive.

Once that has been established, the post could be added to a dead link review queue for other users to resolve. This queue, however, should only be open to highly respected, and therefore trusted, users, as the chance of someone adding spam or even malicious links is rather large.

"the original poster of the link has the best knowledge available to replace the broken link" - what does this mean? Why should we care about the meaning of the original poster? If there's a link to a specific documentation page or blog post that was poorly moved (no "permanent redirect" response but simply "not found"), then we can easily let the link be replaced automatically.
– CodeCaster Aug 7 '15 at 11:58


Well, if it is obvious that the content has been moved to a different location, and that is the reason it is no longer accessible, of course we should automate that. I personally encountered a bunch of uploaded images that were pasted as hyperlinks and no longer exist. In such a case the most potential lies with the original poster of the hyperlink. On a side note, there was also a discussion about images that no longer exist; perhaps we should merge these two discussions. meta.stackoverflow.com/questions/300862/…
– Luuklag Aug 7 '15 at 12:03

Maybe we should create another review queue especially for posts which have broken links or content, and allow people to vote by using the existing 'flag' functionality with an additional option to report that 'the post contains broken links', so people can flag those kinds of posts?

Then the reviewer can investigate and decide whether the link should be replaced with the proper updated one or an Internet Archive copy, or removed altogether.

If it's not worth creating a separate queue for that, this could be placed under 'low quality' with an additional explanation in the description:

This answer has severe formatting or content problem ... or it contains broken links

Then, when the reviewer sees the post has been marked as low quality, we can inform them that the post has been detected to have broken links (tested beforehand by a script), the same way we indicate that a post is potential spam.

The community user looks at and bumps old questions. Why not let that user throw a HEAD request at all the links contained in that Q&A like you did?

If one link is dead, notify the author.

That will produce a constant stream of detected dead links. The rate of found links can be increased by increasing the activity of the community user.
The benefit would be that there's no new gigantic 150k review queue that pops into existence. Additionally, not all links are repeatedly re-checked as in your big check. Traffic is still not free these days and checking all links on SO in short periods might not make new friends.

Then let this happen for a while and do a big scan on all links every 2-3 months (or longer period) to get a bit of an idea how the system behaves.
How many links died in that time? How many dead were fixed?

If it doesn't help, additional actions could be:

creation of a review queue that the community user adds posts with dead links to

if a link remains dead and the author takes no action, let the community user cast a downvote

I'm not sure how your answer is much different from this one?
– rene Oct 21 '15 at 15:35

@rene I have to admit that I only skimmed the post. My suggestion would be to first take a look at how this apparently huge number of 150k dead links changes over time, to determine how much effort should be put into fighting them. Maybe a growing awareness of the problem will solve it, and making the community user run some checks could be implemented more easily than adding another review queue (just a guess on my part). Let's first get some statistics on this; we have the first data point.
– null Oct 21 '15 at 15:57

In some cases the original answerer cannot fix the problem. I posted one answer that had a link, was accepted, and now the link is dead. Because my answer was accepted I cannot delete it. (I get a trickle of downvotes, not that that really matters).

Then an algorithm could identify those posts and take action, like check the link and, if it is broken, notify a moderator. – djhaskin987

I like this idea, but the last thing moderators need is an extra queue to deal with. Automate dead link detection, yes, but notify the two groups of people that can really do a thing about it.

The first group of people to notify is the initial contributors of those links to Stack Overflow, back when those links were live to begin with. If they found a resource once, they're the best people to find it a second time, or to find an alternative, or to judge whether the link can be deleted safely, or whether their original post containing the link is no longer relevant and can be deleted entirely. These notifications could simply be sent to their inboxes.

Receive too many of these notifications, and this may even keep members from wanting to contribute too many outbound links, if they're going to be the ones responsible for maintaining them in the long run.

The second group of people to notify is anyone who pulls up a page with dead links on it. And since those people are presumably looking at the page in question, it's probably best to show them that the links are dead within the context of the page itself. Maybe a small crossed skull could be placed next to each dead link, or perhaps a different graphic or color.

My point is that Stack Overflow could mark those links somehow. This would save some time for users who do not want to load 404s, and it would encourage users who wouldn't have clicked the link in the first place, because they already knew that particular resource very well, to actually try to correct the dead link to a good one if they can.

This isn't a perfect solution for each and every dead link out there, and many dead links will still remain. So the final thing I would do is slightly lower the priority/ranking in search results of answers that have contained dead links for too long. This would be done for Stack Overflow's own internal search through indexing, and for Google searches through sitemap.xml.