Whither pagers?

17 Mar

Earlier this week we had a service outage. The proper chain of events would have been:

  • 00:00:00 Server problem
  • 00:00:03 Monitor processes notice problem, send page to admin’s phone
  • 00:00:10 Phone rings with new message
  • 00:00:30 Admin logs in to server, fixes problem
  • 00:01:00 Problem resolved

But what happened was:

  • 00:00:00 Server problem
  • 00:00:03 Monitor processes notice problem, send page to admin’s phone
  • 00:00:10 T-Mobile doesn’t deliver the message

This started around 3am California time, which is why nobody on the PBwiki team noticed it independently of the SMS alert mechanism. What should have been an isolated transient, simple to resolve and not user-visible, turned into a cascade of unpleasant timeouts that slowed the service and eventually halted it. We’ve done an extensive internal examination of what happened, and we’re changing some technology, adding automated checks, and doing a few procedural things more intelligently.

The main process change is something that is probably old hat for old-school ops people: the absence of a page alert is not an indication of systemwide health. We’ve deployed a lot of new infrastructure in the last few weeks, and I’d been getting occasional pages for a while, but none for the prior day or two. I’ve set up the daily equivalent of the Tuesday-at-noon air raid siren test, in which a test message goes out every morning and its absence is itself treated as a problem. We’ve also got independent Nextel phones for on-call ops folks, so there are now several routes for the alarm pages to take, plus that funny push-to-talk thing so we can annoy one another at all times.
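For readers who want to see the shape of the idea, here is a minimal sketch of the two checks described above: a "siren test" that treats a missing daily test page as a failure, and a fan-out that tries several delivery routes. It is not our production code. The function names, the 25-hour grace window, and the route callables are all hypothetical, chosen only for illustration.

```python
from datetime import datetime, timedelta


def heartbeat_overdue(last_seen, now, max_gap=timedelta(hours=25)):
    """Return True when the daily test page has not arrived within max_gap.

    last_seen is the time the most recent test page was received, or None
    if none has ever arrived. The 25-hour default is an assumed grace
    window for a once-a-day test, not a value from the post.
    """
    if last_seen is None:
        return True
    return now - last_seen > max_gap


def deliver(message, routes):
    """Send message over every route; return the names that reported success.

    routes maps a hypothetical route name to a callable that returns True
    on success. A route that raises is treated as failed, so one broken
    carrier cannot stop the others from being tried.
    """
    delivered = []
    for name, send in routes.items():
        try:
            if send(message):
                delivered.append(name)
        except Exception:
            continue
    return delivered


if __name__ == "__main__":
    last_test_page = datetime(2007, 3, 15, 9, 0)
    now = datetime(2007, 3, 17, 3, 0)
    if heartbeat_overdue(last_test_page, now):
        # Stand-in senders; real ones would talk to an SMS or pager gateway.
        routes = {"carrier_a": lambda m: False, "carrier_b": lambda m: True}
        print(deliver("daily test page missing", routes))
```

The point of the first function is that silence gets checked on a schedule instead of being read as good news. The second returns the list of routes that worked, so an empty list is itself a signal that nobody was reached.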

4 Responses to “Whither pagers?”

  1. bcoho July 26, 2007 at 7:24 am #

    7/26, 11:15 EST I keep getting the “slow down” message for robots – at this time, I simply can’t access our wiki at all. Is there an “event” at pb wiki?

  2. Nathan Schmidt July 26, 2007 at 10:03 am #

    I’ve replied over email as well but here’s some data for reference —

    Your browser sends us this User-Agent string, which our software classifies as being a likely robot:

    “Mozilla/5.0 (000000000; 0; 00000 000 00 0; 00000; 0000000000) 00000000000000 000000000000000”

    Do you have any idea why that would be sent instead of something more common such as:

    “Mozilla/5.0 (Macintosh; U; Intel Mac OS X; en-US; rv:1.8.1.4) Gecko/20070515 Firefox/2.0.0.4”

  3. Negara Islam May 27, 2014 at 12:38 pm #

    Howdy! This is my first comment here so I just wanted to give a
    quick shout out and tell you I really enjoy reading your articles.
    Can you recommend any other blogs/websites/forums that deal with
    the same subjects? Thanks!

Trackbacks/Pingbacks

  1. Power Outage at Rackspace Brings Down Laughing Squid Servers | Laughing Squid - November 12, 2007

    [...] 2: 37signals is hosted at the DFW data center and was down as well. photo via PBwiki Blog Related PostsRackspacePower Outages In San Francisco Bring Down Major WebsitesLaughing Squid in [...]
