
Pinboard Blog

Recent Bounciness And When It Will Stop

Over the past week there have been a number of outages, ranging in length from a few seconds to a couple of hours. Until recently, Pinboard has had a good track record of uptime, and like my users I find this turn of events distressing.

I'd like to share what I know so far about the problem, and what steps I'm taking to fix it.

The outages come in two flavors. In one, the server sees a big spike of load, and bogs down. It is not obvious what overwhelms the server, but the load average rockets up to scary levels, the machine runs out of physical memory, and processes start to get killed.

The effect of this on the site is a little bit like a time warp. Everything works, but thousands of times more slowly, and user connections start to time out. From the users' perspective, the site becomes completely unresponsive.

The second flavor of outage is a database problem where MySQL sees too many incoming connections. The database eventually throws up its hands and stops communicating with the world until I manually reset it. In this case, users see the site downtime page (which loads super fast!).

In both cases, the database continues to work normally, and user data is not at risk. The API and RSS feeds, which are on another machine, also work normally, so users who interact with the site via apps may not even notice a problem. But the website itself goes offline.

To fix the problem, I need to answer these questions:

  1. Are the two types of outages related, and do they share the same underlying cause?
  2. What is causing the spikes in load?
  3. What is causing the excessive database connections?
  4. Is the cause extrinsic or intrinsic?

Finding these answers will require better monitoring. While I have good visibility into application performance (how much traffic there is, how many bookmarks are being saved), the information I collect about server health and status has not been adequate.
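As a rough illustration of the kind of server-health data that has been missing, here is a minimal sketch in Python that samples the load average and memory use and formats them as one log line. It assumes a Linux host with /proc/meminfo; the function names and the log format are hypothetical, and this is not the tooling actually running on Pinboard.

```python
import os
import time


def parse_meminfo(text):
    """Parse /proc/meminfo-style text into a dict of integer values (kB)."""
    info = {}
    for line in text.splitlines():
        key, sep, rest = line.partition(":")
        if not sep:
            continue
        fields = rest.split()
        if fields and fields[0].isdigit():
            info[key.strip()] = int(fields[0])
    return info


def health_line(load1, meminfo, now):
    """Format one timestamped log line from a load average and memory stats."""
    total = meminfo.get("MemTotal", 0)
    # MemAvailable is missing on older kernels, so fall back to MemFree.
    avail = meminfo.get("MemAvailable", meminfo.get("MemFree", 0))
    used_pct = 100.0 * (total - avail) / total if total else 0.0
    stamp = time.strftime("%Y-%m-%dT%H:%M:%SZ", time.gmtime(now))
    return f"{stamp} load1={load1:.2f} mem_used_pct={used_pct:.1f}"


def sample():
    """Take one reading from the running machine (Linux only)."""
    with open("/proc/meminfo") as f:
        meminfo = parse_meminfo(f.read())
    load1, _, _ = os.getloadavg()
    return health_line(load1, meminfo, time.time())


if __name__ == "__main__":
    print(sample())
```

Run from cron once a minute and appended to a file, something like this would leave a record of what load and memory looked like in the minutes before a spike, which is the information needed to tell the two kinds of outage apart.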

There may be one or more episodes of trouble ahead while I add tools to capture this data, but once they are in place I should have the information to fix (and blog about) root causes. I'm also making an effort to keep a closer eye on the site, to minimize the impact when an outage does occur.

There are some additional steps I plan to take:

  • Automate the status page, which right now is ad-hoc and rudimentary.

  • Create an 'admin lite' page that users can see, with graphs and stats and all the other information about the health of the site.

  • Create a separate Twitter account for Pinboard status, so people don't have to wade through my trademark sassy, in-your-face brand identity to find out what is wrong with their bookmarks.

  • Move linkrolls to a separate server. Right now these are served from the main site server for no discernible reason; moving them will reduce the number of things that break if the main site goes down.

Finally, a caveat. The part of the site with the worst uptime record is me. When I am asleep, or out in the world, or if (as happened this morning) my home Internet connection breaks, the site has to fly on autopilot. There can be a long delay between the time a problem shows up and the time I can show up to fix it.

At the same time, as I have demonstrated to myself in the past, if I try to remain online and on-call around the clock, I will go insane.

I look forward to the day when I can hire full-time help for operations. Unfortunately, in boom times like these, serious sysadmins are worth their weight in gold, and sysadmins tend not to be light people. This is a real drawback of a one-person site like mine.

If you can't stand not having an ops team with headsets sitting in a darkened room looking at a wall of screens 24/7, please consider switching to Diigo or Delicious, both of which have a couple of dozen employees and presumably a full-time ops staff. Send me email, and I will close your account and refund your signup fee.

For those willing to tough it out, I'm optimistic we can soon get back to the pleasant days of uninterrupted service. I'd especially like to thank all the people who have volunteered their advice and help. I've received several good ideas about how to improve monitoring and uptime, and will be putting them into action this week.

As always, come ask questions/yell at me on Twitter.

—maciej on June 14, 2012



Pinboard is a bookmarking site and personal archive with an emphasis on speed over socializing.

This is the Pinboard developer blog, where I announce features and share news.



