Since Pinboard has collected a lot of bookmarks at this point, I thought it would be interesting to actually run the numbers on link rot - the depressing phenomenon in which perfectly healthy URLs stop working just a few years after appearing online.
Link rot in my own bookmarks is what first inspired me to create Pinboard, a personal archive disguised as a social bookmarking site. As I've shilled before, Pinboard is the only website that will store full page content for the kind of champagne-swilling fat cats willing to pay us a $25/year fee.
But while link rot motivated me to build the site, until recently I did not have enough user data to actually quantify the problem. I was particularly curious to see whether link rot would be linear with time, or if links would turn out to have a half-life, like plutonium. Here's what I found:
To make the pretty graph, I wrote a script that pretended to be a recent version of Safari (with cookies enabled) and sampled 300 URLs at random for each year between 1997 and 2011. Pinboard has only existed since 2009, but the site preserves the stated bookmark creation date on import, and many of our users have been bookmarking since the days of clay tablets and gopher. I excluded all private bookmarks from the data set.
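The sampling step can be sketched roughly as follows. This is an illustration, not the original script: the function name `sample_by_year`, the `(url, year)` input shape, and the `seed` parameter are all assumptions made for the example.

```python
import random
from collections import defaultdict

def sample_by_year(bookmarks, per_year=300, first_year=1997, last_year=2011, seed=None):
    """Pick up to `per_year` random URLs for each bookmark year.

    `bookmarks` is assumed to be an iterable of (url, year) pairs
    holding public bookmarks only. Returns a dict {year: [urls]}.
    """
    rng = random.Random(seed)

    # Group URLs by their stated creation year, ignoring out-of-range years.
    by_year = defaultdict(list)
    for url, year in bookmarks:
        if first_year <= year <= last_year:
            by_year[year].append(url)

    sample = {}
    for year in range(first_year, last_year + 1):
        pool = by_year.get(year, [])
        # Early years may hold fewer URLs than requested; take what exists.
        sample[year] = rng.sample(pool, min(per_year, len(pool)))
    return sample
```

Sampling without replacement within each year keeps a single heavily bookmarked year from dominating the results.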
Along with the pretty graph, I've published the detailed results by year here. Links appear to die at a steady rate (they don't have a half-life), and you can expect to lose about a quarter of them every seven years.
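To make the difference between the two decay models concrete, here is a small sketch. It takes the "quarter every seven years" figure at face value and plugs it into both a linear model and a half-life (exponential) model. The function names are made up for the example, and neither function is fitted to the actual data.

```python
def surviving_linear(years, loss=0.25, period=7.0):
    """Fraction of links still alive if a fixed share of the ORIGINAL
    set dies every `period` years (steady, linear loss)."""
    return max(0.0, 1.0 - loss * years / period)

def surviving_half_life(years, loss=0.25, period=7.0):
    """Fraction still alive if the same share of the REMAINING links
    dies every `period` years (exponential decay, i.e. a half-life)."""
    return (1.0 - loss) ** (years / period)

for t in (7, 14, 21, 28):
    print(t, surviving_linear(t), round(surviving_half_life(t), 3))
```

Both models agree at seven years and drift apart afterwards: at 14 years the linear model leaves half the links alive, while the half-life model leaves about 56%. Over a span as short as 1997 to 2011 the two curves stay fairly close, so telling them apart takes clean data.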
Now for some methodological hemming and hawing. Measuring link rot with a computer program is tricky because URLs have a number of different decay products:
- Truly dead links. These give an HTTP error code and are easy to catch.
- Page gone, site lives on. Many sites use a boilerplate 'Not Found' page that returns a valid (200) HTTP status code. To the naive crawler, this looks like a live link. To catch these dead links, I performed an additional check for the number '404' or the phrase 'not found' (case insensitive) appearing in the page title.
- Redirected for your convenience. Some sites redirect their dead link traffic to a landing page they think will be useful to you or lucrative to them. Newspaper sites seem to be particularly fond of this approach. The only way to check for this is by actually clicking each link.
- Dead domains. When a whole domain dies, it tends to end up as a 'parked domain' page stuffed with ads. This looks superficially like a normal web page and returns a successful (200) status code.
Lacking the time to check links by hand, I was only able to catch dead links in the first two categories.
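The two automated checks described above (an HTTP error code, or a '404' / 'not found' page title) could look something like this. It is a minimal sketch that assumes the status code and page body have already been fetched. The `classify` function, its labels, and the regex-based title extraction are illustrative choices, not the original script.

```python
import re

# Crude <title> extraction; a real crawler might prefer an HTML parser.
TITLE_RE = re.compile(r"<title[^>]*>(.*?)</title>", re.IGNORECASE | re.DOTALL)

def classify(status, html):
    """Label a fetched URL as 'dead', 'soft-404', or 'alive'.

    `status` is the HTTP status code, or None if the request failed
    outright (assumption: connection errors count as dead).
    `html` is the response body as a string.
    """
    # Category 1: truly dead links that return an HTTP error code.
    if status is None or status >= 400:
        return "dead"

    # Category 2: a 200 response whose title admits the page is gone.
    match = TITLE_RE.search(html or "")
    title = match.group(1).lower() if match else ""
    if "404" in title or "not found" in title:
        return "soft-404"

    # Everything else looks alive, including redirects to landing
    # pages and parked domains, which this check cannot detect.
    return "alive"
```

For example, `classify(200, "<title>Page Not Found</title>")` returns `"soft-404"`. A page whose title legitimately contains "404" would be miscounted as dead, so the title check is only a heuristic.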
Another problem with my data sample is that it does not include URLs that people pruned from their bookmarks, either when they imported into Pinboard or later on. So there's some survivor bias here making links look more durable than they are.
Since we don't have more than a few thousand URLs from before 2000, I would put less faith in the numbers for older links. In more recent years I was able to draw from a pool of tens of thousands of URLs, so I have more confidence in those results.
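One way to put a rough number on the sampling noise alone is the textbook margin of error for a proportion. The sketch below assumes a simple random sample and a normal approximation. The function name is made up for the example, and the 25% dead rate is only a placeholder input.

```python
import math

def margin_of_error(p, n, z=1.96):
    """Approximate 95% margin of error for an observed proportion `p`
    in a simple random sample of size `n` (normal approximation)."""
    return z * math.sqrt(p * (1.0 - p) / n)

# With 300 sampled URLs and a quarter of them dead, the estimate is
# good to roughly plus or minus five percentage points.
print(round(margin_of_error(0.25, 300), 3))  # 0.049
```

This only covers noise from drawing 300 URLs. It says nothing about whether a small pool of early bookmarks is representative of the web at the time, which is the bigger worry for the older years.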
In addition to the pretty graph, I've put up detail pages showing the URL data for each year, as well as a raw dataset of URLs and their associated HTTP codes if you'd like to do your own analysis. Please let me know if you do and I'll link it from this blog.
I would also be very interested to see this kind of analysis from other sites that collect user-submitted links, particularly ones that have been around for a while. The Wikipedia page on link rot links to a number of research papers on the topic (download them quick, while they're still accessible!). Unfortunately many of these are marred by either focusing on an overly specific species of link ('scholarly publications' or other .edu crap) or covering a short span of time.
Here are some of my own open questions regarding link rot:
- How many of these dead URLs are findable on archive.org?
- What is the attrition rate for shortened links?
- Is there a simple programmatic way to detect parked domains?
- Given just a URL, can we make any intelligent guesses about its vulnerability to link rot?
Please catch me on email or Twitter if you can shed any light, or if I've made big mistakes in the data analysis.
—maciej on May 26, 2011