Earlier this week we had a service outage. The proper chain of events should have been:

  • 00:00:00 Server problem
  • 00:00:03 Monitor processes notice problem, send page to admin’s phone
  • 00:00:10 Phone rings with new message
  • 00:00:30 Admin logs in to server, fixes problem
  • 00:01:00 Problem resolved

But what happened was:

  • 00:00:00 Server problem
  • 00:00:03 Monitor processes notice problem, send page to admin’s phone
  • 00:00:10 T-Mobile doesn’t deliver the message

This started around 3am California time, which is why none of the PBwiki team noticed it independently of the SMS alert mechanism. What should have been an isolated transient, simple to resolve and not user-visible, turned into a cascade of unpleasant timeouts that caused the service to slow and eventually halt. We’ve done an extensive internal examination of what happened, and we’re changing some technology, adding some automated checks, and doing a few procedural things more intelligently.

The main process change is something that is probably old hat for old-school ops people: the absence of a page alert is not an indication of systemwide health. We’ve deployed a lot of new infrastructure in the last few weeks, and I’d been getting occasional pages for a while, but none for the prior day or two. I’ve set up the daily equivalent of the Tuesday-at-noon air raid siren test: a test page goes out every morning, and if it fails to arrive, that absence is itself a problem. We’ve also got independent Nextel phones for on-call ops folks, so there are now several routes for the alarm pages to take, plus that funny push-to-talk thing so we can annoy one another at all times.
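
For the curious, the "silence is itself an alarm" idea can be sketched in a few lines of Python. This is only an illustration, not the check we actually run: the function names, the 25-hour allowance, and the timestamps are all made up for the example.

```python
from datetime import datetime, timedelta

# Hypothetical allowance: one test page per day, plus an hour of slack.
MAX_SILENCE = timedelta(hours=25)


def heartbeat_is_stale(last_test_page, now, max_silence=MAX_SILENCE):
    """Return True when the expected daily test page is overdue."""
    return now - last_test_page > max_silence


def pager_status(last_test_page, now):
    """Treat a missing test page as a failure of the paging path itself."""
    if heartbeat_is_stale(last_test_page, now):
        return "ALARM: no recent test page; assume the paging path is broken"
    return "ok: a test page arrived recently"


if __name__ == "__main__":
    # Made-up timestamps, purely for illustration.
    now = datetime(2000, 1, 2, 9, 0)
    print(pager_status(now - timedelta(hours=2), now))   # ok
    print(pager_status(now - timedelta(hours=30), now))  # ALARM
```

The catch is that a check like this only helps if it runs somewhere that does not depend on the paging path it is testing. Otherwise the alarm about the missing page gets lost the same way the original page did.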