Pinboard Blog

Thoughts on Colocation

After a week of slogging servers around northern California, I thought a brain dump on colocation might be useful to readers, and to future me.

I wrote about the difference between colocation, leased servers and other kinds of hosting in an earlier post. This one is strictly about colocation, the 'Condo' approach where you own a bunch of hardware and need a place to put it.

What you are after is not complicated:

  • A physical enclosure
  • Some kind of Internet connection
  • Power
  • Security guards
However, buying it is a pain in the neck.


It's typical to sign a one- or three-year contract. Right off the bat this introduces an element of pressure, since you're making a fairly binding, long-term decision without knowing what you're doing. Unless you live in the Bay Area, colocation is a cost on par with your rent or mortgage.

Worse, I've found that the cost for identical configurations can vary by a factor of three or more. You really do need to shop around. And you need to be especially careful of punitive terms for things like overstepping bandwidth or power requirements. These are things you have to dig out of the fine print of contracts at a moment when you just want the whole process to be over.


Renting colo space feels like buying a car. You typically decide on a specific configuration you want, and then ask for quotes. For the salespeople involved, this is just the start of a long conversation they want to have with you. They'll be very curious about your budget, and want to talk about the hosting equivalent of an underbody clearcoat (various "hybrid cloud solutions") and extended warranty.

Although colo space is a commodity, salespeople become tetchy if you treat it as such. They will insist on talking to you over the phone and bristle at the suggestion that their job could be replaced by a web form. It is a good idea not to think about how much their salary or commission adds to your costs.

The reason I say colo space is a commodity is not because all facilities are the same, but because small-time clients will have no practical way of assessing their quality. There are certainly some facilities that are obviously bad, but most data centers have sane policies, look nice if you visit, and talk eloquently about uptime. The only way to evaluate a data center is to go through a series of small and big outages together, but by then you're already in a multi-year contract. So in practice, there is not enough information to pick a "better" data center, just a bunch of anecdotes on message boards. I believe you're better off getting space in two cheap places than trying to pick one high-end one.


A lot of colo space is resold through intermediaries. This is often the only way to get a smaller amount of space than a full cabinet. Someone rents a bunch of cabinets at a data center and then parcels them out by the slice to clients, in pieces as small as 1u.

There are two things to watch out for in this arrangement:

  1. A bunch of people will have access to your physical equipment. Good resellers will take pains to introduce the various people in a cabinet to one another (or at least provide contact info).
  2. If your provider gets in a dispute with the data center, you may not be able to physically take out your hardware. It may even be seized without advance warning.


Power is the great bugbear in hosting. You need to know how much your equipment uses, and how much you're likely to need. It is also often the limiting factor.

A useful rule of thumb in my case has been 1 rack unit = 1 amp. However, it is quite difficult to estimate the power consumption of a server before you buy it. You end up having to plug it into a Kill-A-Watt meter and measure it under normal load.

A full cabinet is somewhere around 42 units, but a typical full cabinet power allotment is 20 amps. So you can't just fill a cabinet with servers.

The power situation is even worse than it sounds. That 20 amp figure is strictly theoretical, since you aren't allowed to use the full amount. You are limited to 80% of this figure, so a "20A" cabinet has 16A of usable power, enough for eight Pinboard-style servers.
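The arithmetic above is easy to sketch in a few lines of Python. The function names and the 2 A per server figure are assumptions for illustration (the 2 A is simply 16 A divided by the eight servers mentioned above); measure your own hardware before trusting any of it.

```python
def usable_amps(breaker_amps, derate_percent=80):
    """Amps you may actually draw from a circuit under the 80% rule."""
    return breaker_amps * derate_percent / 100


def servers_that_fit(breaker_amps, amps_per_server,
                     rack_units=42, units_per_server=1):
    """Servers a cabinet can hold, limited by power or space, whichever runs out first."""
    by_power = int(usable_amps(breaker_amps) // amps_per_server)
    by_space = rack_units // units_per_server
    return min(by_power, by_space)


# A "20A" cabinet, assuming roughly 2 A per server (hypothetical figure).
print(usable_amps(20))          # usable amps on a 20 A circuit
print(servers_that_fit(20, 2))  # servers that fit at 2 A each
print(servers_that_fit(20, 1))  # with the 1U = 1A rule of thumb
```

With the 1U = 1A rule of thumb, power runs out after 16 of the 42 rack units, which is why power rather than space tends to be the limiting factor.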

Finding Datacenters

Finding information is another pain in the neck. The place of choice is an awful forum called WebHosting Talk. There is an open business opportunity to anyone who can stick a front-end interface on this site that lets you enter number of servers, bandwidth, physical location, and spit out a list of offers.

Another business opportunity is to make an authoritative directory of Bay Area data centers, since there is a bewildering assortment of sellers, re-sellers, and re-re-sellers offering the same physical space. Conversely, some large providers maintain multiple facilities.


Nobody likes to talk about earthquakes. But anything in the Bay Area or Seattle is going to come crashing down at some point. Another thing that has proven impossible is finding out what facilities are at highest risk. It's one thing to go offline when the Big One comes (half the Internet will be down with you). But losing a rack full of hardware into the maw of the earth is worse.

So here's my plea to hackers: figure out where stuff is physically hosted, correlate it with seismic hazard maps, and make a nice web form that lets people shop for specific power/bandwidth/space configurations without talking to salespeople. Charge money for it! I will pay you! Others will pay you!

—maciej on August 03, 2013

Seeking Bay Area Colo Space

I successfully moved my backup servers to Sacramento this week, but I'm still looking for a Bay Area colo for the main site, which has outgrown its current home.

Here's what I'm hoping to find:

  • Half or full cabinet
  • 100 Mbps capped bandwidth
  • 20 A @120 V
  • One or three-year contract
  • Full 24/7 physical access, within 100 km of San Francisco

The best offer I have in hand right now is from HE at their Fremont 2 facility, who are asking $600 per month for a full cabinet, but with skimpy power (15A) and a $200 setup fee to install square-hole posts.

Make me a better offer and I'm yours! Email me at

—maciej on August 02, 2013

Upcoming API downtime

I've found out on short notice that I must vacate my current hosting facility before the end of the month. This will mean physically moving about eighty pounds of Pinboard machinery from San Jose to a new home in Sacramento.

The servers involved run the Pinboard API. I'm going to try switching all API traffic to the web server (which is hosted elsewhere and will not be affected by the move) but if it turns out to be too much load, I will need to take the API offline.

In the pessimistic case, the API will be down from early Monday evening California time until Wednesday morning.

The outage will also affect archiving. About half of Pinboard users won't be able to reach their archives during the outage. I will extend affected archiving accounts by one week as compensation.

The website and RSS feeds will remain up and running.

Nothing makes me feel more alive than a midsummer motor car ride through California's gorgeous Central Valley. I apologize to my users for this self-indulgence, and promise not to make it a regular habit.

—maciej on July 28, 2013

Pinboard Is Four Years Old

Today marks four years since I opened the creaky gates and started charging customers money for Pinboard.

Here are some site stats for this year, compared with the three previous years:

                   2010     2011     2012     2013
bookmarks         3.5 M     27 M     53 M     76 M
tags               11 M     76 M    135 M    178 M
active users      2.8 K     16 K     23 K     23 K
bytes archived    200 G    3.0 T    5.9 T    8.8 T
downtime            6 h     29 h     22 h    12 h*
unique URLs       2.5 M     16 M     32 M     48 M

* this is site downtime; API downtime was much worse, perhaps 48 hours in all

The site has continued to grow at a steady clip, adding as many bookmarks and tags this year as last year. Total revenue from signups and archiving has been far steadier than I expected from a web project, which by nature tends to be spiky. This comes as a considerable relief, since it means I don't have to hunt for a new brand of champagne or truffle oil every other month. The number of active users has found a steady state, with as many people joining the site as dropping off of it in any given month. Before growing the site much further, I would like to get better at handling support requests at this level.

It's been an active year behind the scenes. On top of the usual code gardening, I spent some time working to better secure the site, introducing API tokens for password-free authentication, moving everyone to TLS without breaking everything, and adding various cookie flags and HTTP headers to make the site a little more resilient to bad people in public places.

The major new features this year were tag bundles, privacy lock, major improvements to search, and the beginnings of a bulk tag editor.

I gave talks on Pinboard at Brooklyn Beta, CUSEC, and InfoShare Gdansk, and in the process got to meet Pinboard users in Osaka, Warsaw, Paris, Lyon, Stockholm, London and Berlin. This was an enormous amount of fun for me, and really helped to get me out of the house.

Finally, I made my first foray into venture capital with the Pinboard Investment Co-Prosperity Cloud. The six winners have been hard at work, and I look forward to writing about what they've achieved in the coming weeks.

Thanks to everyone who has helped me with this project over the past four years, in big and small ways. July 9 is one of the happiest days of my year, and I owe it all to kind people from across the Internet.

—maciej on July 09, 2013

Persuading David Simon

[June 19 update: David Simon has been kind enough to respond at length here. He points out that I falsely stated that collecting call records requires a warrant; I have corrected that statement in the post below.]

I read with interest David Simon's recent blog post in which he responds to revelations that the NSA has been collecting the call records of all American mobile phone users.

David Simon, of course, created The Wire, a television series where institutions take on lives of their own and defy attempts by well-meaning people to reform them from within. So it came as a real shock to find Simon criticizing pundits who have objected to the extent of NSA surveillance, and accusing them of wilful ignorance about the nature of police work.

Mr. Simon pointed out that law enforcement agencies have been allowed to capture call records for decades, including in cases where the information harvested includes calls from people who are not under suspicion. In other words, there's nothing new going on to get worked up about.

Having labored as a police reporter in the days before the Patriot Act, I can assure all there has always been a stage before the wiretap, a preliminary process involving the capture, retention and analysis of raw data. It has been so for decades now in this country. The only thing new here, from a legal standpoint, is the scale on which the FBI and NSA are apparently attempting to cull anti-terrorism leads from that data. But the legal and moral principles? Same old stuff.

Seeing no difference in principle, only a difference in degree, in the NSA's surveillance program, Simon expresses annoyance with Americans who demand total protection from terrorism and then purport to be shocked when their government takes their requests seriously.

Mr. Simon cites the specific example of an investigation he covered as a police reporter in Baltimore in the 1980's. Criminals were using pay phones and pagers to evade detection, and tracking them down required indiscriminately recording numbers dialed from those pay phones, with the goal of sifting through the data later to find the pager numbers.

He argues that this kind of investigation, which targeted pay phones, was in some ways more invasive than the kind of tracking the NSA is accused of, since people expect to be anonymous when using a pay phone in a way that doesn't apply when they're calling from their own cell.

There is certainly a public expectation of privacy when you pick up a pay phone on the streets of Baltimore, is there not? And certainly, the detectives knew that many, many Baltimoreans were using those pay phones for legitimate telephonic communication. Yet, a city judge had no problem allowing them to place dialed-number recorders on as many pay phones as they felt the need to monitor, knowing that every single number dialed to or from those phones would be captured. So authorized, detectives gleaned the numbers of digital pagers and they began monitoring the incoming digitized numbers on those pagers — even though they had yet to learn to whom those pagers belonged. The judges were okay with that, too, and signed another order allowing the suspect pagers to be “cloned” by detectives, even though in some cases the suspect in possession of the pager was not yet positively identified.

I think Simon's fundamental argument, “same old stuff”, is mistaken in a number of important ways, and that some of this reflects our failure as technologists to communicate what modern surveillance can do.

First, there is the scope of the order. The Baltimore operation, and others like it, were limited to a specific criminal investigation. They were obtained under a subpoena setting limits on what would be collected, and for how long.

The NSA program is universal and appears to be open-ended. Information is collected in aggregate. The program operates under the authority of a secret court order, not a warrant. It is not clear whether the Administration even believes this type of surveillance requires a court order.

Second is the nature of the body carrying out the surveillance. In Simon's example, this was a municipal police force, overseen by a local court.

In our case, it's the NSA, a Federal agency whose job has traditionally been to collect foreign signals intelligence. The operation is overseen by a secret court system called FISC.

Third is the nature of the data being collected. When the Baltimore investigation took place, it collected a simple list of telephone numbers dialed from the monitored phones.

Modern call records contain much more data, reflecting the fact that almost all of us carry cell phones. A call record now includes unique device identifiers, routing information, cell tower IDs, and a wealth of additional information about the circumstances and location of the call. The location data is particularly powerful, turning mobile phones into de facto tracking devices whenever they are turned on.

Fourth is the question of oversight. The evidence used in the Baltimore case was collected by municipal police and presented (I'm assuming) in open court. Those against whom it was used had the chance to mount a defense, appeal the verdict to state and Federal courts, and enjoyed the presumption of innocence guaranteed to them by the Constitution.

The NSA call data is collected and used in secret. The agency is overseen as part of the very large national security establishment by a small, overworked group of legislators and senior government officials who have the requisite security clearance.

So I contend that the parallel Simon makes is false. The NSA is not a law enforcement organization, and intuitions from police reporting don't carry over.

But even if we grant the analogy, I think there's a more dangerous argument in Simon's essay, which is the contention that two programs that differ only in degree are necessarily "the same old thing". I believe this is not a safe assumption to make when talking about computers and their use in domestic surveillance.

In the portion of his essay that excited the most comment, Simon appears to express disbelief that the NSA can make broad use of the data it gathers:

When the government grabs every single fucking telephone call made from the United States over a period of months and years, it is not a prelude to monitoring anything in particular. Why not? Because that is tens of billions of phone calls and for the love of god, how many agents do you think the FBI has? How many computer-runs do you think the NSA can do — and then specifically analyze and assess each result?

Well, of course, the answer is "you would not believe how many 'computer-runs' the NSA can do". I believe this part of the essay especially caught tech people's attention, since it suggested that Simon might be naive about the capabilities of a modern datacenter. It's certainly the part Clay Shirky pounced on in his rebuttal.

But Simon is not a fogey who doesn't understand how powerful computers have become (though I feel that there are such people in positions of oversight in the House and Senate). I believe his error is in assuming that the analysis of these 'computer-runs' is any kind of bottleneck. There are powerful techniques for surfacing interesting features in any comprehensive list of interactions between human beings. I've written in the past about my distaste for the 'social graph' and the perverse worldview it imposes on our projects, but part of the appeal of that worldview is the real power of mathematics applied to exactly this kind of data. The analysis can be automated, and no good comes of it.

In a beautiful worked example, Kieran Healy has shown how a precocious British intelligence service could have identified Paul Revere as a person of particular interest based only on a set of membership lists of organizations he belonged to.
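To give a flavor of how little it takes, here is a minimal sketch of one way to do this kind of analysis: count how many groups each pair of people shares, then rank people by how connected they are. The names and memberships are invented toy data, and the scoring is a deliberately crude stand-in, not a reproduction of the worked example mentioned above.

```python
from collections import Counter
from itertools import combinations

# Hypothetical membership lists: person -> organizations they belong to.
memberships = {
    "person_a": {"club_1", "club_2", "club_3"},
    "person_b": {"club_1"},
    "person_c": {"club_2"},
    "person_d": {"club_3"},
    "person_e": {"club_3"},
}


def shared_groups(memberships):
    """Map each pair of people to the number of groups they have in common."""
    links = {}
    for p, q in combinations(sorted(memberships), 2):
        n = len(memberships[p] & memberships[q])
        if n:
            links[(p, q)] = n
    return links


def connectedness(memberships):
    """Score each person by their total shared memberships with everyone else."""
    score = Counter({p: 0 for p in memberships})
    for (p, q), n in shared_groups(memberships).items():
        score[p] += n
        score[q] += n
    return score


# The person who bridges the most groups floats to the top of the list.
print(connectedness(memberships).most_common(3))
```

Nothing here looks at what anyone said or did. The only input is who belongs to what.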

The point is, you don't need human investigators to find leads, you can have the algorithms do it. They will find people of interest, assemble the watch lists, and flag whomever you like for further tracking. And since the number of actual terrorists is very, very, very small, the output of these algorithms will consist overwhelmingly of false positives.
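The false positive claim is just base-rate arithmetic, and a rough sketch makes it concrete. Every number below is a made-up assumption chosen for illustration (population, number of real targets, and both accuracy figures); none of them come from any real program.

```python
def flagged_breakdown(population, real_targets, hit_rate, false_alarm_rate):
    """Return (true hits, false hits, share of flagged people who are real targets)."""
    true_hits = real_targets * hit_rate
    false_hits = (population - real_targets) * false_alarm_rate
    precision = true_hits / (true_hits + false_hits)
    return true_hits, false_hits, precision


# Hypothetical inputs: 300 million people, 3,000 real targets, and an
# algorithm that catches 99% of targets while wrongly flagging 0.1% of
# everyone else.
true_hits, false_hits, precision = flagged_breakdown(
    population=300_000_000,
    real_targets=3_000,
    hit_rate=0.99,
    false_alarm_rate=0.001,
)
print(f"real targets flagged:    {true_hits:,.0f}")
print(f"innocent people flagged: {false_hits:,.0f}")
print(f"share of list that is real: {precision:.2%}")
```

Even with an implausibly accurate algorithm, the innocent vastly outnumber the guilty on the resulting list, because there are so many more innocent people to misclassify.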

It's at this point that Simon's logic starts to work in the other direction. Given a long list of potential leads, investigators are going to focus on vetting the most likely, rather than taking any steps to clear false positives out of the system. The penalty for missing a real terrorist is catastrophic, while the penalty for falsely accusing someone (when not only the accusation, but the very existence of the program, is secret) is nonexistent, even if the secret accusation ends up doing real harm. Limits on manpower won't constrain the investigation; they will only reduce its overall quality.

This isn't an abstract argument. We are all familiar with the tenebrous no fly list, a document that prevents several thousand people from traveling by air, and condemns thousands more to intrusive security measures each time they want to get on a plane. After 2001, this list rapidly expanded to thousands of names, with no avenues of appeal and no way to even check whether your name appeared in the document, to the point where the government finally had to improvise a 'redress' policy for travelers who found themselves living out a Kafka novel.

Characteristically, proposals for fixing the no-fly list and similar watch lists now call for collecting even more information, to help disambiguate people who share a name but not a date of birth with someone on the watch list. The basic problem—that lists of suspects are generated without accountability, without oversight, and with no incentive to avoid mistakes—persists.

There's also a more dangerous institutional problem to consider. When a system like this exists, it creates pressure for its own use. What is the point, after all, of having a very elaborate, extremely expensive database if you are only ever going to use it in exceptional cases? It is the nature of law enforcement to want to go after bad guys with all available tools. We saw a vivid demonstration of this in the years after the 2001 attacks, when the administration attempted to blur the lines between the 'War on Drugs' and the 'War on Terror', arguing that the proceeds from narcotics sales paid for terrorism.

Consider, too, a technique that has become standard in Federal investigations. It is a felony to make false statements to a Federal agent, and investigators routinely make use of this fact to gain leverage over a witness or suspect. People tend to be nervous when they talk to police, and unless they know better are liable to give inconsistent answers during questioning. Good interrogators can convert each of these inconsistencies into a felony count. Imagine how much more potent this tactic becomes when investigators can gain access to a database of your movements and contacts for the past decade.

The security state operates as a ratchet. Once you click in a new level of surveillance or intrusiveness, it becomes the new baseline. What was unthinkable yesterday becomes permissible in exceptional cases today, and routine tomorrow. The people who run the American security apparatus are in the overwhelming majority diligent people with a deep concern for civil liberties. But their job is to find creative ways to collect information. And they work within an institution that, because of its secrecy, is fundamentally inimical to democracy and to a free society.

I can't believe that David Simon, of all people, doesn't see the danger inherent in a permanent domestic surveillance program. I doubt that he would support a government initiative for all Americans to wear tracking devices in the name of fighting terrorism. Yet the NSA data collection program, whose output is functionally identical, seems not to trip the same alarm bells with him.


In public statements, the NSA director has defended domestic surveillance as a vital tool in preventing terrorism.

The term 'terrorism' is a magic word, unlocking government powers we normally associate with wartime. The current and previous Administration have, at various times, asserted the right of the government to conduct invasive and open-ended surveillance on people it suspects of terrorism, detain suspects in terrorism cases indefinitely without trial, 'render' them to countries for interrogation and torture, kill people it considers terrorists, including American citizens, with giant flying robots, or keep such people alive against their own will.

This is total power over human life. The authorities assure us that numerous checks exist to prevent abuses of this power, but of course the checks are also classified. The government is promising that the secret police won't put innocent people in the secret prisons because the secret courts would never allow it.

This system puts enormous pressure on a small group of fallible human beings. For the secrecy to work, the number of people in on the secret must be small. But this group is all part of the same hierarchy, subject to the same pressures, and unable to communicate its concerns outside the same closed circle.

Talk of secret prisons, indefinite detention, and force-feeding can sound tendentious (though it's all uncontested public record!). Americans have a deep faith in the rule of law and have not proven receptive to the argument that truly innocent people will find themselves placed in the "terrorist" category by accident.

There is a tendency among those who grew up under the rule of law to treat it like the Rock of Ages, an immovable substrate in which all the institutions of the state are forever anchored. And so even ordinarily skeptical people tend to assume that the government obeys its own laws when no one is looking. To an astonishing extent, and to the great credit of American civic life, this is actually true.

But I think a better metaphor for the rule of law is that it is the soil in which democratic institutions take root. Like the soil, it can be depleted. And once depleted, it is not easily replenished.

Secrecy erodes the rule of law because it makes democratic accountability impossible. Secrets can't be held too broadly, so secrecy concentrates responsibility and asks too much of human nature. That is why every intelligence agency, unless given rigorous outside oversight, commits terrible excesses.

I think Simon agrees about the perniciousness of this secrecy. In a later rebuttal he's called for a modern-day version of the Church Committee, a group of people from outside the security establishment with top-secret access and the power to compel testimony.

And I agree with Simon that the current state of affairs is the "inevitable consequence of legislation that we drafted and passed."

American politics since the Cold War has operated under the conceit that national security must transcend partisan differences. And so we have seen large bipartisan majorities voting for pre-emptive war and domestic surveillance even though both of those policies were highly controversial outside Congress.

This tradition has created a vast space beyond political accountability. When both political parties pursue a nearly identical policy, there are no electoral consequences when the policy proves disastrously wrong. Who do you vote against?

People have good intuitions about the danger of indiscriminate collection and retention of their data. They're not being hysterical. For the last decade, we've been concentrating on how to regulate the way this data gets used in the private sector. But now that the coercive power of the state has entered the picture, the stakes are much higher, and we have an opportunity to politicize the debate. David Simon tells us to resign ourselves to the consequences of technological change:

"The question is not should the resulting data exist. It does. And it forever will, to a greater and greater extent."

But I think that is wrong. Whether the data should exist, and for how long, is exactly the question. The answer is not a technological inevitability, but a political choice.

I believe a world in which everything is recorded and persists forever carries the seeds of something monstrous. It is in the nature of computer systems to remember things indefinitely, but there's nothing difficult about programming machines to forget. It just requires laws to do it. We can't treat it as a technical problem. And to get the laws passed, we need to politicize the issue.

Still, these barricades are going to seem awfully lonely if we can't even get David Simon up there with us. The man should be a natural ally, and the fact that he sounds so exasperated troubles me. The fact that he seems resigned to a future of total information retention troubles me. The fact that we are talking past each other troubles me most of all.

Simon also mentions the FBI, but it's unclear to me that this agency has anything to do with the accusations of widespread call monitoring.

The expensive part is keeping everything secret, and staffing it with people cleared for such access. The database itself is likely quite modest in size.

The Washington Post has estimated the number of people with Top Secret clearance at 854,000. The number of people with full knowledge of all secret programs is much smaller, as this information is carefully compartmentalized.

Except Pinboard archives. Those are great!

—maciej on June 15, 2013

Berlin Meetup Aftermath

Our meetup in Berlin yesterday proved to be the best-attended one yet, with fifteen Pinboard users and one extraordinarily patient child fighting their introvert instincts on a beautiful spring day.

We had a diverse group of people from across the US and Europe, ranging from designers and web developers, to game developers and even a mobile app developer! Luckily everyone found a common language.

In order not to drink beer on an empty stomach, I stopped in a Bavarian restaurant before the meetup and ordered the item below, about which I would like to make the following observations:

  1. Wrapping this thing in bacon would actually make it healthier.
  2. It appears to be made from the heart of the last person to order it.
  3. Clicking 'enhance' on this photo crashes iPhoto and turns its icon into a ham hock.
  4. It's hard to see, but the parsley is hovering a few millimeters above the meat field.
  5. If I ever visit Bavaria, back up your bookmarks.

I sincerely thank everyone for coming and hope to see you in Berlin again soon!

—maciej on June 09, 2013

Berlin Meetup

I'll be in Berlin on Saturday, June 8, and invite all Pinboard users in the area to meet me at the PraterGarten at 16:00 for sausages and light chat.

Some people have asked what goes on at these Pinboard meetups. Topics typically discussed are the local beer, whether to order more of it, and who everyone is and how they are living their lives so far.

I try to avoid computer talk unless it's clearly necessary. Of course, if there are bugs or flaws in Pinboard that you find particularly oppressive, this is the perfect chance to hold me accountable.

Please RSVP if you would like to come, so I have a rough idea of head count and can warn you if plans change.

—maciej on May 31, 2013

Stockholm Meetup

Warm thanks to Massimo, Stella and Erik for coming out on short notice for a nice Pinboard lunch in Stockholm. That makes two librarians and a complexity theorist, which seems like a representative sample of users.

It's a big treat for me to meet Pinboard people while I travel; it not only makes the world feel like a welcoming place, but also helps make the whole project less abstract.

The lunch brought with it a heavy blow (the kitchen was out of the famous kötteboller, or Swedish meatballs), but my new friends helped me cope.

—maciej on May 30, 2013

Seeking a Summer Pintern

I'm looking for someone to work with me on Pinboard from June 15 to September 15, 2013.

This is a remote position. You'll set your own hours and coordinate with me online.

The pinternship pays a modest stipend of 6,000 USD. More valuable is the chance to develop your skills as a generalist web developer. ‘Generalist’ may not sound exciting, but it's actually a rare person who understands a production web app from top to bottom.

You'll spend three months learning every aspect of running Pinboard, a reasonably complex website with about 20,000 active users. Each layer of the ‘web app stack’ will come to feel like a close and trusted friend. Then that friend will betray you, and you'll practice extracting the knife from your back as you hide your tears from an uncaring world.

This skill, once properly developed, will prove useful in landing high-paying computer jobs.

What You'll Do

Part of your job will be to help me with the day-to-day operation of the site, including troubleshooting and finding ways to automate time-consuming work.

The rest of the time you'll spend on a series of projects in areas of your choice. We'll spec out the projects together, you'll build them, and they'll go live on the site. Then we will scramble to fix them.

Here are some examples of things you could work on:

  • Care and feeding of a production MySQL database
  • Caching, including memcache, varnish and pound.
  • Helping me implement version 2 of the Pinboard API.
  • Hacking on the Sphinx-based search engine.
  • Writing a better web crawler.
  • Making improvements to the job queue and scheduling system.
  • Writing tools for text parsing and analysis.
  • Machine learning and classification.
  • Building custom UI components in Javascript.
  • Writing browser plugins for Chrome, Firefox, or Safari.
  • Finding security holes and patching them.
  • Deployment scripts and emergency checklists
  • Improving server monitoring tools

What you work on will depend in large part on what you want to learn.

Over the course of the summer, you'll have a chance to get intimately familiar with the different components of a modern web application (hardware, operating system, database, network, application, and cache) and how they fail to work together.

You'll have the satisfaction of building things that benefit real people.

Ideally, you'll gain useful experience and earn a small amount of money while not completely destroying my livelihood.


I don't care where you live, but you must be at least eighteen.

You must be highly autonomous and good at muddling through problems. If you are easily frustrated, you will not enjoy working on Pinboard.

You should know your way around a Linux system and be proficient in at least one programming language.

You have to be somewhere where I can meet with you in person before June 15. This means Paris, Warsaw, Gdańsk, or the San Francisco Bay Area.

You should have a strong work ethic and lots of enthusiasm. Enough for two people.

How To Apply

Send me a link to a webpage with the following info:

  • Who you are, and where you live.

  • What you're good at already.

  • What you'd like to become good at this summer.

  • A feature that you'd like to see added to Pinboard.

  • Any project you've worked on that you're particularly proud of.

  • For super extra credit, your solution to problem #17 in the Matasano crypto challenges.

The webpage should be served over TLS and include a custom HTTP header called 'Pinternship', with its value set to an emoticon of your choice.

If I really like your application, I will ask you to give me a couple of personal references. I'll also ask you to meet me for an in-person interview.

Thank you!

—maciej on April 28, 2013

The Matasano Crypto Challenges

I recently took some time to work through the Matasano crypto challenges, a set of 48 practical programming exercises that Thomas Ptacek and his team at Matasano Security have developed as a kind of teaching tool (and baited hook).

Much of what I know (or think I know) about security has come from reading tptacek's comments on Hacker News, so I was intrigued when I first saw him mention the security challenges a few months ago. At the same time, I worried that I'd be way out of my depth attempting them.

As a programmer, my core strengths have always been knowing how to apologize to users, and composing funny tweets. While I can hook up a web template to a database and make the squigglies come out right, I cannot efficiently sort something for you on a whiteboard, or tell you where to get a monad. From my vantage point, crypto looms as high as Mount Olympus.

To my delight, though, I was able to get through the entire sequence. It took diligence, coffee, and a lot of graph paper, but the problems were tractable. And having completed them, I've become convinced that anyone whose job it is to run a production website should try them, particularly if you have no experience with application security.

Since the challenges aren't really documented anywhere, I wanted to describe what they're like in the hopes of persuading busy people to take the plunge.

You get the challenges in batches of eight by emailing cryptopals at Matasano, and solve them at your own pace, in the programming language of your choice. Once you finish a set, you send in the solutions and Sean unlocks the next eight. (Curiously, after the third set, Gmail started rejecting my tarball as malware.)

Most of the challenges take the form of practical attacks against common vulnerabilities, many of which will be sadly familiar to you from your own web apps. To keep things fun and fair for everyone, they ask you not to post the questions or answers online. (I cleared this post with Thomas to make sure it was spoiler-free.)

The challenges start with some basic string manipulation tasks, but after that they are grouped by theme. In most cases, you first implement something, then break it in several enlightening ways. The constructions you use will be familiar to any web programmer, but this may be the first time you have ever taken off the lid and looked at the moving parts inside.

Here are the cryptographic topics covered:

Going into the challenges, I worried that my math wouldn't be up to the task. My impression of Serious Crypto was that it required all kinds of group theory, abstract algebra, elliptic curves, vector spaces, and other scary stuff. But while this may be true, the math content for the practical challenges was much gentler:

While the math concepts weren't hard, getting a real feel for them took work (and this was the point of the exercise).

If you're an experienced programmer, the Matasano challenges are also a terrific excuse to try a new programming language. It's always much more fun to solve real problems than it is to write a Manager object that inherits from Employee.

Here are the language features I found myself using most:

  • string manipulation (ranges, substrings)
  • bitwise operators
  • lookup hashes
  • conversion between string and number formats
  • big integer operations
  • packing and unpacking binary data
  • pattern matching
  • url manipulation
  • client/server interaction over a socket

Altogether it took me about three weeks to do the full cycle, working pretty intensively. Skilled programmers will find the going much faster, especially if you're comfortable with bit twiddling. Very few of the problems were downright hard, though some required several hours of work. I spent most of my time stepping through algorithms in pursuit of bugs, and in the process really got a feel for the moving parts in various cryptographic constructions.

I would compare the experience to having only ever read cookbooks and watched cooking shows, and then being asked to fry an egg. You know exactly what to do... in principle.

Some of the challenges have a payoff, in that you decrypt a short bit of secret text. This is incredibly fun. Seeing a cracked message come up on the screen after an evening of bug chasing reminded me of how it felt to be a kid in front of my Apple ][, finally getting it to beep or draw a circle or print DONGS all over the screen. Some of the later challenges even display the answer 'Hollywood style', where you get to see it decrypt one letter at a time in a cascade of print statements.

While the rules don't stipulate it, I think it's a good idea not to look at anyone's code if you try the challenges. The goal here is to convert message-board levels of understanding into actual knowledge, and the only way that works is if you bang your head on the task without seeing how anyone else has done it. Sean was a great help in navigating difficult spots, and the challenges are not set up to intentionally trick you. But you will need the kind of graph paper with the small squares.

What surprised me most:

  1. How practical these attacks were. A lot of stuff that I knew was weak in principle (like re-using a nonce or using a timestamp as a 'random' seed) turns out to be crackable within seconds by an art major writing crappy Python.

  2. There is no difference, from the attacker's point of view, between gross and tiny errors. Both of them are equally exploitable. In at least three challenges, the mere fact of getting distinguishable error messages was enough to recover the entire message.

  3. This lesson is very hard to internalize. In the real world, if you build a bookshelf and forget to tighten one of the screws all the way, it does not burn down your house.

  4. Timing attacks are much more effective than I imagined.

  5. Someone who can muck with your ciphertext is halfway to reading it, possibly with your secret key for dessert.

  6. Some mistakes are incredibly non-obvious. I had no idea you had to super-carefully pad RSA, for example.

  7. Even on a laptop, in 10 minutes you can do a terrifying amount of computation. It really is 2013.
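To make the first point concrete, here is a minimal sketch of why a timestamp makes a poor 'random' seed. It is not taken from the challenges. It assumes a hypothetical token generator that seeds Python's standard `random.Random` with a Unix timestamp in whole seconds, and an attacker who sees two outputs and knows roughly when the seed was chosen. The function names, the one-hour window, and the timestamps are all made up for illustration.

```python
import random
from typing import Optional, Tuple


def outputs_from_seed(seed: int) -> Tuple[int, int]:
    """Return the first two 32-bit outputs of a PRNG seeded with `seed`."""
    rng = random.Random(seed)
    return (rng.getrandbits(32), rng.getrandbits(32))


def guess_seed(observed: Tuple[int, int], now: int, window: int = 3600) -> Optional[int]:
    """Try every second in the last `window` seconds as a candidate seed.

    Returns the first candidate whose outputs match what was observed,
    or None if nothing in the window matches.
    """
    for candidate in range(now, now - window - 1, -1):
        if outputs_from_seed(candidate) == observed:
            return candidate
    return None


# Hypothetical scenario: the "secret" seed is a timestamp about
# twenty minutes before the attacker starts guessing.
NOW = 1366300000
secret_seed = NOW - 1234
observed = outputs_from_seed(secret_seed)

recovered = guess_seed(observed, now=NOW)
print(recovered == secret_seed)
```

The search space here is only 3,601 candidates. Finer timestamp precision makes the loop longer but does not change its shape, as long as the attacker can bound when the seed was chosen.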

I mentioned earlier that I thought every web programmer should try their hand at these. It is very illuminating to look at your own web app from the vantage point of an attacker actually writing code. At the very least, you will never be confused about cipher block modes again, or have to worry that someone will ask you to explain how a public key works in an interview. And there is a whole slew of dumb mistakes you will now avoid (replacing them with smarter mistakes that will become the subject matter of challenges 48-96).

The best part, from a web app developer's perspective, is that you never once write a SQL statement or HTML tag.

Here are some specific lessons from the challenges that I will apply to my own work:

  1. Keep meaningful data out of tokens (like cookies) that I hand out to clients. Use random values keyed against a database, memory store, or wherever.

  2. If I have to put data in tokens, include an integrity check, and pay a real crypto person to vet it.

  3. I must never seed a PRNG with a timestamp. I used to do this with microsecond precision thinking I was being clever. Then I went ahead and wrote a script that guessed the seed value in just a few seconds, and now I will never do that again.

  4. Use constant-time string comparisons when testing incoming data against some target value for authentication purposes. This is easy enough to do in most languages that it makes for cheap insurance.

  5. Anything related to authentication should only fail in one way. I must not provide distinguishable errors to the user.

  6. If possible, find a way to log the fact that someone is making a lot of weird queries against my site. For extra points, try not to make the logger itself hackable.

  7. No third-party javascript. I hated it already, now I hate it more.

  8. Cut off one of my fingers each time I re-use a nonce.
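Lessons 1, 2 and 4 can be sketched in a few lines of Python using only the standard library. This is an illustration under stated assumptions, not how Pinboard does it: the in-memory `SESSIONS` dict stands in for a real database or memory store, `SERVER_KEY` is a placeholder for a properly managed secret, and every function name here is hypothetical. Per lesson 2, anything like `sign` and `verify` should still be vetted by a real crypto person before it goes near production.

```python
import hashlib
import hmac
import secrets

# Hypothetical stand-ins: a real site would use a database or memory
# store, and would load the key from secure configuration.
SESSIONS = {}
SERVER_KEY = secrets.token_bytes(32)


def new_session(user_id: int) -> str:
    """Lesson 1: hand the client an opaque random token, keep the data server-side."""
    token = secrets.token_urlsafe(32)
    SESSIONS[token] = {"user_id": user_id}
    return token


def lookup_session(token: str):
    """Return the stored session data, or None for an unknown token."""
    return SESSIONS.get(token)


def sign(data: bytes) -> str:
    """Lesson 2: compute an HMAC-SHA256 integrity tag over the data."""
    return hmac.new(SERVER_KEY, data, hashlib.sha256).hexdigest()


def verify(data: bytes, tag: str) -> bool:
    """Lesson 4: compare the expected and supplied tags in constant time."""
    expected = sign(data)
    return hmac.compare_digest(expected.encode(), tag.encode())


token = new_session(42)
tag = sign(b"user=42")
print(lookup_session(token), verify(b"user=42", tag), verify(b"user=43", tag))
```

`verify` returns a bare True or False and nothing else, which also fits lesson 5: the caller gets no hint about how a check failed.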

Having read this post, you can go to Hacker News and comment in Talmudic detail about what is right or wrong in the conclusions I drew. But a much better idea is to just email Sean and have a crack at the challenges yourself. You will have a good time!

One final observation. Crypto is like catnip for programmers. It is hard to keep us away from it, because it's challenging and fun to play with. And programmers respond very badly to the insinuation that they're not clever enough to do something. We see the F-16 just sitting there, keys in the ignition, no one watching, lights blinking, ladder extended. And some infosec nerd is telling us we can't climb in there, even though we just want to taxi around a little and we've totally read the manual.

Doing these challenges is a great way to 'shake your sillies out', as Raffi might say, without hurting yourself or your users. You get to put on the flight suit, climb into the simulator, and crash that plane in every conceivable way.

I would like to sincerely thank Thomas and Sean and everyone at Matasano who worked on these challenges, and implore people in other technical fields to consider offering something similar. It's the most fun I've had programming in years!

—maciej on April 18, 2013


Pinboard is a bookmarking site and personal archive with an emphasis on speed over socializing.

This is the Pinboard developer blog, where I announce features and share news.

How To Reach Help

Send bug reports to

Talk to me on Twitter

Post to the discussion group at pinboard-dev

Or find me on IRC: #pinboard at